/g/ - Technology






File: 1722120252210536.jpg (446 KB, 1176x1176)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101596616 & >>101589136

►News
>(07/27) Llama 3.1 rope scaling merged: https://github.com/ggerganov/llama.cpp/pull/8676
>(07/26) Cyberagent releases Japanese fine-tune model: https://hf.co/cyberagent/Llama-3.1-70B-Japanese-Instruct-2407
>(07/25) BAAI & TeleAI release 1T parameter model: https://hf.co/CofeAI/Tele-FLM-1T
>(07/24) Mistral Large 2 123B released: https://hf.co/mistralai/Mistral-Large-Instruct-2407
>(07/23) Llama 3.1 officially released: https://ai.meta.com/blog/meta-llama-3-1/
>(07/22) llamanon leaks 405B base model: https://files.catbox.moe/d88djr.torrent >>101516633

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/ylb0hv.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
>>101600933
FUUUCK NOO! I REBUKE YOU DEMON!
>>
File: ryback-eating-chips.gif (3.78 MB, 638x640)
we're reaching levels of being so fucking back that shouldn't even be possible
>>
File deleted.
>>101600623
why did they train it on Qwen 32b instead of Gemma 27b? does the chinese base model have more sovl than google's?
>>
Nemo's soul is unmatched.
>>
mini-magnum is something else. and it runs on a potato
>>
>>101600987
>>101600938
>►Official /lmg/ card: https://files.catbox.moe/ylb0hv.png
>>
>>101600989
the moment i can figure out how to make it follow syntax and not go schizo from an overflow of sovl, i'm never going back
>>
>>101600987
>instead of Gemma 27b
I don't get why this got shilled so hard and why the shilling worked. I would think that using it once and realizing how easy it is to make it a complete schizo with just one wrong setting would clue people in that there is something very wrong with this model.
>>
>>101601001
The official /lmg/ “card” is miku.bat
>>
>>101601001
Who made the official /lmg/ card official?
>>
>>101601012
>.bat
Fuck off it always was Miku.sh. It's still in llama.cpp.
>>
>>101600623
When exl2 quants?
>>
>i could create a card of her
>i could even replicate her writing style perfectly
>i could run all sorts of fetish scenarios. anything is possible
Oh God. Why do you tempt me with unlimited power? A man’s heart is too weak to handle it, and his dick too hard. Can’t you see I’m powerless to resist? Why do you doom me so?
>>
>>101601057
Ackshually, the very first test was on windows and it was a bat file. I was there
>>
>>101601013
Me!
>>
>>101600623
Why didn't they license it under faipl-1.0? Do the model creators love jews?
>>
>>101601100
You are a faggot. And that is why your opinion is discarded. Now I am not a faggot so:
>►Official /lmg/ card: https://files.catbox.moe/ylb0hv.png
Is the new official /lmg/ card. Mikufaggots lost.
>>
>>101601072
>unlimited power
Remember that your unlimited power ends after 8-16k tokens. After that she gets Alzheimer's.
>>
File: The Great Taking.jpg (117 KB, 384x580)
Love u.
>>
>>101601116
Oh, okay then. I approve the change.
>>
the sad thing is that all the shartyzoomers are literal incels, so they really do unironically believe in the blacked shit they spam
>>
It's time to swallow the hard truth... Nemo is cucked.
It always moves the talk to muh consent or muh gender equality if given the opportunity.
I wonder when we will get a model like Nemo without a hidden agenda.
>>
What's the best model out there right now that can run on my 6 GB VRAM potato (GTX 1660 super)?
>>
>>101601195
My only issue is it randomly starts spelling things wrong after a while, not sure why.
>>
>>101601206
Repetition penalty, maybe?
>>
>>101601195
this but gemma
>>
>>101601215
No, I wasn't using that, just minp and temp of 0.3. Keeping it simple.
>>
>>101601201
pygmalion 6b
>>
File: zoomer.jpg (15 KB, 268x221)
>>101601116
>Now I am not a faggot
>literally 24/7 obsessed with nigger dicks
>>
>>101601201
Gemma-2-9B-It-SPPO-Iter3-Q4_K_M
Q6_K if you don't care about speed
>>
>>101601242
If someone puts a burning bag of shit at your front door and knocks does it mean he is obsessed with scat?
>>
>>101600996
When do we get mega-magnum on Mistral Large?
>>
>>101601282
If said person then rings on every door in the neighborhood to tell them about it every week then yes, most definitely
You can only take the "jokes on them, I was only pretending to be retarded!" so far
>>
>>101601251
Slow generation is a major pain for me. I was using nous-capybara-34b.Q3_K_M split between CPU and GPU for the longest time, then switched to Toppy-M-7B.q6_k, which seems to have a little less coherency, but 10x or more faster generation. Wasn't sure if there were any recent models that might work better, since I keep up with this stuff at most once every couple months
>>101601239
I'll try that out.
>>
>>101601330
The guy who said Pyg is shitposting, try Mini Magnum 12b or the Gemma 9b
>>
>install sillytavern
>use a proxy for claude
>it's not local because i still have to rely on proxy having keys which it never does
So how do I resolve this
>>
File: file.png (16 KB, 1581x145)
>>
Alright, so with llama.cpp's latest fork fixing L3.1's context issues, L3.1 70B is now way better. Make sure to download an updated quant though if you try it, one from today or no more than 24 hours old. I think it will be great with a nice smut finetune. Hope we get a euryale or magnum tune.
>>
>>101601374
buy an ad
>>
>>101601401
...
>>
>>101601300
The answer is no. And I keep doing that because it gets on your nerves. And it getting on your nerves is typical zoomer behavior, zoomer.
>>
File: file.png (41 KB, 725x372)
>>
>>101601401
>goliath
>IQ2_XS
Is this the ultimate retard meme combo?
>>
>>101601287
Don't you need multiple nodes of 8xH100 to train that? Who has that cash on hand
>>
File: OIG3.jpg (145 KB, 1024x1024)
>>
File: 0jzgm1m0aor31.jpg (129 KB, 1380x1088)
►Recent Highlights from the Previous Thread: >>101596616

--Anon seeks advice on downloading large Mistral model without broken quants: >>101597325 >>101597343 >>101597432 >>101597440 >>101597495 >>101597516 >>101597553 >>101597580 >>101597857 >>101597889
--RX 5700 XT 8GB user frustrated with AMD's AI support considers GeForce RTX 4060 Ti 16GB upgrade: >>101598173 >>101598310 >>101598327 >>101598377 >>101598439 >>101598465 >>101598496 >>101598998 >>101599336 >>101599632
--Storyteller needs a "save as" option: >>101600192 >>101600216 >>101600272 >>101600238 >>101600469 >>101600491
--Old Dell for CPUmaxx: >>101600225 >>101600301
--Nemo 12b and koboldcpp compatibility and GGML vs GGUF discussion: >>101597517 >>101597538 >>101597619 >>101597650 >>101597660 >>101597672 >>101597768
--Llama3.1 8b instruct-tuned model and 8-bit quantization available on Ollama: >>101597787 >>101597819 >>101597962 >>101598069 >>101597832
--Flash attention compatibility with nemo and koboldcpp: >>101598616 >>101598656 >>101598748 >>101598877 >>101598971 >>101599170
--Base nemo and alpaca model comparison and gguf link: >>101599201 >>101599308 >>101599337 >>101599354 >>101599569 >>101599587 >>101599638 >>101599694
--When will transformers dev add llama.cpp support?: >>101598280
--Uncensored Llama 3.1 405B unlikely due to finetuning costs: >>101597611 >>101597633
--Solutions for card defs and context issues: >>101599283 >>101599299 >>101599412
--M layers might be more truthful than S layers in IQ quantifiers: >>101597911 >>101599899
--LLaMA 3 405b q8_0 vs GPT4o for writing stories with scientific concepts: >>101597961
--Miku (free space):

►Recent Highlight Posts from the Previous Thread: >>101596623
>>
File: 1722124705471.jpg (77 KB, 1080x174)
Fuck, I was reading a book and found this sentence. Can't even escape AI shit in book.
>>
>>101601505
That book was ai generated.
>>
>>101601504
Thank you Recap Anon
>>
File: mistralai-legacy-models.png (96 KB, 1410x501)
Legacy models. Old stuff.
https://mistral.ai/technology/#models
>>
>>101601006
And I've literally just used my default settings and it werks, and werks well (temp 0.7, rep pen 1.1)
I genuinely don't know if people are not using the recent (fixed) quants, are doing something utterly retarded like deep frying it with ridiculous samplers, or what
>>
>>101601504
Shoot looks like you missed one recap-kun
PSA: Broken ggufs, again:
>>101600467
>>
>>101601626
No? That was just him being dumb and getting errors because of outdated stuff on his end.
>>
>make is not recognized as an internal or external command,
operable program or batch file.
>>
>>101601649
kek
>>
>>101601649
cmake
>>
>>101601659
>>101601661
;-; pls help, I just want to talk to my chatbots...
>>
>>101601649
holy wintoddler
>>
New project from PyTorch for e2e quantization + inference on desktop GPUs and phones. Seems interesting but kinda barebones.
https://github.com/pytorch/torchchat
>>
>cmake gf
>>
>>101601389
>on proxy having keys which it never does
>So how do I resolve this
Send ecker a dick pic. That fucker still has Opus.
>>
>>101597440
YOU
YOU MADE ME DOWNLOAD BROKEN QUANTS
>>
>>101601768
>>101592040
>His quant are okay if he do it before me, you can use them, he's thrusty.
>101447251
>idiotproof
>mradermacher
>>100628103
>The biggest reason why mradermacher is a threat to the open source community is his complete disregard for the truth, and his willingness to attack and spread lies about the very people who contribute to it. This is not about open source or software freedom, it’s about a person who’s decided to make it their mission to destroy the reputation of someone they don’t like, even if it means damaging the open source community in the process.
>>100551353
>from mradermacher?
>he always somehow got busted quants.
>>100121770
>I HAVE AN ANNOUNCEMENT TO MAKE ! !!!! ! ! ! ! ! ! ! ! !!! https://huggingface.co/mradermacher/Meta-Llama-3-70B-Instruct-i1-GGUF quants are broken
>>100195457
>Anyone downloading from mradermacher after all this deserves broken quants lmao
>>
After last thread's shilling I tried Mistral Nemo base with 60k context of my long running story.
It seems slop free but other than that I'm not sure if its prose is better than instruct. It suffers from the usual base model inconsistencies, but is less prone to cliches. Some repetition issues that are probably fixed with samplers.

I was disappointed to find that it didn't seem very good at recalling very old events, or at least correctly associating very old events with the present ones.

Not bad overall, but I was expecting better.
>>
>>101601781
>101447251
shit, oh well i care the same about this shitpost as he does his quants
>>
>>101601768
Fucking retard downloading Michael Radermacher shit.
>>
>>101601792
My goto for fixing repetition with L3 is around 1.06 repetition penalty and 0.7 presence penalty. Not sure how much Nemo needs, but this usually hits the sweet spot of avoiding repetition without everything slowly devolving into token soup
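If you're applying those through the llama.cpp server API instead of a frontend, a rough sketch looks like this (parameter names are the /completion ones llama.cpp's server accepts; the prompt, port and exact values are just placeholders, tweak per model):

# Rough sketch: send a completion request to a local llama.cpp server with
# the repetition settings mentioned above. Assumes the server is already
# running on localhost:8080; the prompt and values are only illustrative.
import requests

payload = {
    "prompt": "### Instruction:\nContinue the story.\n\n### Response:\n",
    "n_predict": 256,
    "temperature": 0.7,
    "repeat_penalty": 1.06,   # repetition penalty from above
    "presence_penalty": 0.7,  # presence penalty from above
    "min_p": 0.05,
}

response = requests.post("http://localhost:8080/completion", json=payload)
print(response.json()["content"])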
>>
>>101601827
>presence penalty
>2024
>when dry exist
>>
>>101601402
are bartowski's fixed?

i have found this that claimed to be fixed https://huggingface.co/bullerwins/Meta-Llama-3.1-70B-Instruct-GGUF
>>
Is there a dumbed down spoonfeed guide for retards like me?
>>
>>101601374
*try Nemo
Also buy an ad, shill.
>>
>>101601850
These are for sure, they're only a few hours old
https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/tree/main
>>
>>101601873
sink or swim bitch
>>
>>101601894
>bro just simply load the model into the backend
Is where I'm utterly lost
I don't get HOW to do that
>>
>>101601834
Mind sharing your best sampler settings then, anon?
>>
>>101601914
what model what backend how are people even supposed to begin trying to help
>>
>>101600938
So what's the correlation betwee 7b, 12b, 70b etc and Ram/GPU? Can a 2070 not run higher than 12b or is it just a question of speed?
>>
>>101601914
Well, which fucking backend are you using?
>>
>>101601930
>>101601932
textgenwebui, like the guide said
Well I have the folder for it but don't even get how to use it
>>
>>101601931
>Can a 2070 not run higher than 12
corect now bak to leddit
>>
Is there an ollama option to pass your prompt in as tokens directly, to avoid decoding/encoding unnecessarily?
>>
>>101601982
tokenizing takes virtually no time at all
>>
>>101601422
You're confusing me with people that don't exist
I simply post maybe twice a day and then go back to cooming, why would I waste my time arguing with someone on the internet? And why would I get mad at someone on an anonymous image board trying to force their fetish on everyone? Now if you'll excuse me, I've recently discovered bovine robots and I must offer tribute
>>
>>101601505
>anon discovers the follies of language
>>
>>101601998
my prompt_eval_duration is 1/3 as long as eval_duration
>>
File: ElqJJOxXYAEZTDH.jpg (47 KB, 705x690)
Ahh.. I remember months ago, downloading a ((adermacher)) quant and wondering why things were fucked.
Good to see he's now almost /lmg/ meme level with the likes of ((undi)) and the others.
>>
File: Capture.jpg (37 KB, 586x836)
After the second most excellent coom in my life, I have to revise my opinion on Nemo. It is a weird model that is dumb, but the best thing it did for me is answer my question of whether you can make a perfect coombot. And now I think you can. Before nemo I thought that because you are just describing mashing pissers together, it is impossible to create a perfect llm that writes novel stuff. Nemo makes me think that you absolutely can. Now if only it had better spatial awareness, was like 3 times smarter and wasn't a hallucinating schizo, nemo would be the perfect coombot.

It also reached another milestone for me where I could just get it to write a 700+ token output and after a few rerolls I didn't feel the need to manually edit anything. But I still think it is dumb and recommend waiting 2 more weeks.
>>
File: 1707737072935982.gif (758 KB, 885x948)
still just collecting cards because I'm a local supremacist despite not being able to run it.
>>
Is MoE dead? Deepseek has proven that it can still scale but I haven't seen any signs of anyone pursuing 400+B class models. Has Mistral abandoned it?
>>
File: ministrations.png (21 KB, 459x113)
>>101601505
Started reading The Naked Sun, but had to put the book down after this. It doesn't come only from women's smut, as is often claimed here.
>>
>>101602134
For all we know OpenAI, Anthropic, and Google might be using it somehow
Meta seems to be steering away from it though
>>
>>101602134
>Has Mistral abandoned it?
https://mistral.ai/technology/#models
>Research models
Mixtral Available in 8x7 and 8×22 sizes
>Legacy models
Mixtral 8x7B
Mixtral 8x22B
>>
>eyes widen
>>
Now stop me if you have already heard this one.
ZeroWw, mradermacher, Undi95 and
fblgit all enter a bar and meet some venture capital investors. Then they walk out with 10M$ for their startup.
>>
>>101602186
Local is saved at the end, right?
>>
>>101602173
Those are ancient, anon.
>>
>>101602197
You could ask your model how it ends.
>>
File: webgenui.png (34 KB, 933x452)
So has this finished installing now, can I close it?
>>
>>101602173
>>101602134
>>101602212
https://github.com/mistralai/mistral-inference?tab=readme-ov-file#model-download
>8x7B Updated model coming soon!
>>
>>101602232
>We're updating one of our preexisting models to do better function calling
>It's an MoE, so MoE isn't dead!
>>
>>101602252
Correct!
>>
>>101602232
That was from 2 months ago retard
>>
>>101602272
So?
>>
>>101602232
I want 5x12...
>>
>>101602278
Maybe just maybe, they wanted to wait and the new one will be based on Nemo. We can cope.
>>
It was the year of MOE, i thought to myself. Then, they stopped making MOEs.
Then, nemo comes out.. I've accepted the death of MOEs.
>>
It feels like Mistral Nemo likes what's in the context TOO much, it respects your instructions well but it will almost always fall for bait. Like if you ask it a logic question and put some unrelated info with your question it will think that info is related whereas other models ignore it, say it's not related or outright say it's red herring info that seems related but is not (Gemma2 did this).
Another symptom of respecting context too much is getting stuck in patterns. Like in Tavern your character might respond something like:
>Description of emotions/expressions. "Dialogue."
>Text about character actions etc. "Part 1 of a sentence" Little descriptive break "Part 2 of the sentence."
>"Dialogue." Ending of reply that addresses user in some way like wondering what the user's next reaction will be.
And then the character will keep using this exact pattern for every reply even if the content of the reply changes. And the weird thing is raising temp doesn't really solve it that well, it will usually just go schizo sooner than it will break pattern repetition. And the rep penalty options don't really work that well on formatting patterns compared to words and phrases.
>>
>>101602296
Keep believing, anon. He will save us...
>>
>>101602310
Mixtral Instruct was similar people called it "autistic" because it focused too much on the context..
>>
File: moe hanging himself.jpg (381 KB, 1779x2000)
381 KB
381 KB JPG
>>101602313
Eventually..
>>
>>101602296
The next moe will be a bitnet.
>>
>>101602353
and there'll be a million of them, yeah? million of bitnet moes?
>>
Grok will save mixture of experts. They have infinite muskbucks driven by spite against OpenAI so they will never give up. The miserable failure of their first try will be remembered the same way the first exploded rockets of SpaceX are.
>>
>>101602384
After a year, Meta still hasn't even begun experimenting with MoE. How long do you think it will take Grok's team, made up of the AI industry's leftovers?
>>
what if you did a mixture of 70 billion experts that are 1 param each
>>
>>101602434
>https://huggingface.co/google/switch-c-2048
>>
>>101602434
It'd be a compressed version of this
https://huggingface.co/Snowflake/snowflake-arctic-instruct
>>
>>101602310
Prompt issue.
>>
>>101602420
They already used it for the first grok, so just a matter of making one that isn't shit.
>>
>>101602434
one million experts should be enough
>>
>>101602080
>Now if only it had better spatial awareness, was like 3 times smarter and wasn't a hallucinating schizo nemo would be the perfect coombot.
enter: largestral
>>
>>101602493
>>
someone mind baking a real thread? OP is a troll
>>
>>101602458
Nope, I have the same experience as that anon. It's actually quite typical for Mistral models, I had the same problem with Mixtral, though I think Nemo may be even worse with that. It really clings to patterns like an autistic retard.
>>
>>101602641
>Mixtral and Nemo are the same
Nah, that's placebo. They're pretty different.
>>
Could the model repeating unwanted patterns be vectored away?
>>
>>101602796
I didn't say they are the same. I said they are the same in that aspect. I've tried enough models to compare. Llama family models don't do this at all for example.
>>
Smartest model (and quant) I can run with 24/64 GB VRAM/RAM at >1T/s?
>>
>>101602837
Something 70B Q4. Now go away.
>>
>>101602828
Nah, they're nothing alike. You're just retarded. You can't even tell what the actual issue is because you're paying too much attention to the name of the model.
>>
>>101602849
From what I've heard, llama has a tendency to be overly positive
>>
>>101602837
Pyg-6B FP32
>>
>>101602641
List of things that you probably fucked up:
>spaces around [INST]
>system prompt not in the last user message
>not alternating user/assistant messages, without multiple consecutive messages of the same role
You have no right to complain.
>>
>>101602980
>system prompt not in the last user message
nta, do I have to modify the context template to put it last?
>>
>>101602837
Try Magnum 32b - look for a quant that will fit in 24gb of VRAM with the calculator

https://huggingface.co/anthracite-org/magnum-32b-v1
>>
>>101602980
>not alternating user/assistant messages
What does this mean, exactly?
>>
>>101603070
Not doing this:
>[INST]Anon: hi[/INST][INST]author notes/jailbreak[/INST]
But I don't actually remember what Silly ends up doing, I'm pretty sure it doesn't merge them.
>>
File: Capture.png (282 KB, 331x315)
I have a 3090 and 64 gigs of ram, what is the best way for me to try mistral large 2 locally?
>>
>>101603131
https://huggingface.co/mradermacher/Mistral-Large-Instruct-2407-GGUF
>>
>>101602980
The official jinja template uses a space after [INST].
This is an example of OAI messages fed to a jinja parser using the official mistral nemo template:
<s>[INST] Who won the world series in 2020?[/INST] The Los Angeles Dodgers won the World Series in 2020.</s>[INST] You are a helpful assistant.Where was it played?[/INST]

But you can also simply read the template and see that there is a space after [INST].
>>
>>101603185
Now install mistral_common like the README says and see what this outputs:
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import (UserMessage,
                                                        AssistantMessage,
                                                        SystemMessage)
from mistral_common.protocol.instruct.request import ChatCompletionRequest

tokenizer = MistralTokenizer.from_file("./tekken.json")

completion_request = ChatCompletionRequest(messages=[
    SystemMessage(content="system message"),
    UserMessage(content="user message"),
    AssistantMessage(content="assistant message"),
    UserMessage(content="user message 2"),
])

chat = tokenizer.encode_chat_completion(completion_request)

print(f"text = |{chat.text}|")
>>
File: 1710213177231501.png (78 KB, 336x347)
>>101603139
Thank you, I presume iq4_xs is best for me load 24 gigs into the 3090 and have ram handle the rest? What sampler settings and instruct format are meta these days?
>>
File: Captura.png (9 KB, 926x48)
Never using a template from 4chud again
>>
>>101603253
But it works though
>>
File: jons likes dragons.png (247 KB, 767x644)
Welp.
I have a D&D Lorebook with an entry that lists the most common types of dragons, 5 metallic, 5 chromatic, and 5 gem dragons.
I just asked Nemo-instruct what are the species of dragons that exist in a given D&D setting and the motherfucker listed a bunch of dragons I didn't put in the lorebook, even some I didn't even remembered existed back in the day like the Electrum dragon.
Pretty cool
>>
>>101603069
Has anyone tried this? Is it good?
I had a horrible experience with the official Qwen finetune desu
>>
>>101603227
https://github.com/mistralai/mistral-common/issues/3#issuecomment-2173452895
>>
>>101603332
it's magnum. made by THE alpindale. it is KINO to the MAX - Use TODAY or your life is forfeit
>>
Is spam the only form of marketing that finetuners know?
>>
>>101603412
Drummer bought an ad. Which is ironic because I don't think he tunes commercially. I'll shill my models once whenever I release them, but I do it for free, no ko-fi or any bullshit. If a model I make happens to be interesting I share it. If not I keep it to myself.
>>
>>101603253
>ntrkek melty
>>
>>101603432
And that's why no one cares about you and your models
>>
Silly needs Last User Prefix and Suffix fields, for a myriad of reasons.
>>
>>101603494
There's a little thing called "Jinja". It's a language that let's you customize everything about a template. You should look it up.
>>
>>101603488
Hi Sao.
>>
>>101603540
I'm aware, but I doubt they'll add a Jinja Parser to it.
>>
{% include "opisafaggot.html" %}
>>
>>101603488
Hi undster
>>
>>101600938
>amd rocm doesn't support Ubuntu 24.04
lmao
>>
>>101603554
He's trying to tell you Silly has a jinja parser for some stupid reason, even though dumb search and replace would accomplish the same thing for their purposes with less bloat.
>>
Anyone tried https://huggingface.co/TheSkullery/NeMoria-21b? Nemo self merge.
>>
>>101603761
>Nemo self merge.
and so it begins
>>
>>101603761
>my very own clone
>>
>>101603761
>My very own clone, now neither of us will be virgins.
>>
File: Miku Stopped.png (8 KB, 559x31)
STOP MIKU STO-
>>
>>101603827
Hammer time!
>>
>1 t/s
Bitnet... tuskete...
>>
File: miku hammer time.png (83 KB, 956x242)
>>101603856
ACK
>>
File: 1494307190094.png (11 KB, 411x387)
>>101603488
The various quants of my models have hundreds of thousands of combined downloads. So clearly they care.
>>
>>101603883
>she literally did the equivalent to this painting when i responded with an OW and -ACK
>>
>>101603827
>>101603883
Anon, are you using Lamma3's format with a model that doesn't tokenize those strings?
>>
File: awgwagawsswgawe.png (21 KB, 441x121)
>>101603924
No, I'm using a card that doesn't recognize the boundaries and limitations of any model.
>>
>>101603933
Sure, but the character shouldn't be trying to output <|eot_id|> unless it's a model that doesn't use the llama 3 tokenizer and was trained using it's special sequences, or you are using the wrong instruct template with the model.
>>
File: 1549153077366.png (241 KB, 498x499)
>>101603885
>>
someone please link me that freddie freaker card im literally shaking right now where the fuck is my freddie freaker card from last year
>>
Largestral goes schizo after about 10-12k context for me, is this due to Q2 quant? It works perfectly fine, great even, at lower context.
>>
>>101603885
Tdrussel?
>>
>>101604014
nevermind found it, archived
https://char-archive.evulid.cc/#/character/FREDDY%2520FREAKER-e1a8ac12cf7a098679834284434d02f4


One game scenario i strongly suggest is doing like a PVE deathmatch on a random map of your choosing, dropping in characters that intend to find and kill/rape/whatever you, for instance i have Mr. Bucket, Freddy Freaker, and now CREEPY MIKU DOLL chasing me down in an abandoned hospital.
fun stuff, i hate that i never did this more in the past.
>>
Nemo's doing a really weird thing where it seems to be blending character cards together in group chats, what could be causing that?
>>
>>101602146
Asimov confirmed hack. But the series is great.
>>
>>101604069
no but I use his script
>>
>llama3:8b-instruct-q5_K_M
what does the K_M mean?
>>
so, I havent kept up with local LLMS in a minute, but saw llama 3 came out. Ive got a 3090 and 32 gigs of ram, whats the new hotness?
>>
>>101604137
The prompt.
>>
>>101604291
It's a sub-type of the QX quantization. Instead of quantizing the whole tensor with the same offset+scale, it's done in blocks. It's more precise than Q5_0. It goes QX_0 < QX_K_S < QX_K_M < QX_K_L in precision with regard to the original weights.
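If the block idea isn't obvious, here's a tiny numpy sketch of the general principle (per-block offset+scale; nothing like the real packed Q5_K super-block layout, purely illustrative):

# Illustrative only: quantize a weight row in blocks, each block getting its
# own scale and offset, then dequantize. Real llama.cpp K-quants use a packed
# super-block layout; this only shows the principle.
import numpy as np

def quantize_blocks(weights, block_size=32, bits=5):
    levels = 2 ** bits - 1
    out = np.empty_like(weights)
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        lo, hi = block.min(), block.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = np.round((block - lo) / scale)               # integer codes 0..levels
        out[start:start + block_size] = q * scale + lo   # dequantized values
    return out

w = np.random.randn(4096).astype(np.float32)
w_q = quantize_blocks(w)
print("mean abs error:", np.abs(w - w_q).mean())

More bits or smaller blocks cost more bytes per weight but lower the error, which is the basic trade-off all of these formats play with.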
>>
>>101604064
I have IQ2_M and just tried swiping on one of my mid-length chats at 18k, and it seemed fine. Its response matched the ones before it. You have the latest Llama.cpp? Mine is a build from this morning.
>>
>>101604297
Mistral Nemo 12B or Gemma 2 27B.
>>
File: 1722057472013989.png (232 KB, 512x512)
>>101604358
thanks, time to go talk to Miku
>>
>>101604297
Mistral Nemo 12B if you want more than 8k context. Gemma 2 27B if you're fine with just 8k.
If you care more for assistant tasks rather than ERP, you could also try Llama 3.1 70B with 'IQ2_M' quant if you want. It'll be a bit slow, but smarter, with high context length. Make sure to download a newer upload since there were fixes just today. Bartowski's quants should probably work. >>101601889
>>
>>101604356
All models do fine if they continue the output of another model. But chat with the same model for a bit and it starts picking up patterns in its previous replies and becoming repetitive
>>
>>101604546
Yeah that can happen if you're not careful and paying attention. But when I do that, I'm able to have quality chats just fine above 12k. And anyway, the other anon, or you, said schizo, which I usually take to mean actual gibberish and incoherent sentences rather than simple repetition problems.
>>
again just want to state I'm retarded for forgetting to enable instruct mode with nemo. I see why a lot of people enjoy it now.
>>
What to do about the repetition problem now that the gptslop problem is kinda fixed? These llms are giant pattern matching algorithms so how do we make them overcome their nature?
>>
there's gotta be a better way to use this technology than gooning... right??
>>
>>101604606
Repetition is only a problem for faggots who insist samplers are a meme.
>>
>>101604641
>sampler shill
Gave DRY a try, it didn't fix shit, just rephrased the text differently and made the output less accurate because the accurate token was basically banned
>>
>>101604623
Yes
Supergooning
>>
>>101604606
I don't have that problem, really. Maybe it's because, other than giving it a very general outline, i don't give it instructions on how to write. And i use them mostly for co-op writing more than chat/rp. I only use temp and fairly low min-p for samplers.
>>
>>101604606
someone posted a nemo sampler json 2 threads back that fixed the repetition for me. Nemo is almost too good now I spent hours gooning.
>>
When presented with multiple tool options, instead of responding with the actual <function> call, the model responds with the definition of the function it chooses, like an excerpt of the System prompt. IIRC hermes2pro didn't do this, it just straight wrote up the function invocation.
Do I have to compare its response to the different functions to see which tool it selected, then re-send the prompt with only that tool listed?
>>
>>101604623
>>101604679
>supergooning
Yes. The tech hits its ultimate peak when it can infinitely generate a good erotic RPG with diffusion images on the fly, with a story and fetishes of your choosing.
>>
>>101604675
DRY is fucking retarded. Dynamic temp combined with smoothing, rep pen and freq pen is all you need.
>>
>>101604732
how many controlled tests did you do to ensure DRY is dumb or did the immediate settings you tried shit the bed
>>
>>101604750
Any sampler that straight up cuts out tokens is never going to be good. It'll either make the model go retarded trying to force the closest equivalent it can get away with, or it'll be so ineffective that you would've been better off manipulating the probabilities in the first place.
>>
>>101604688
It mixes things up for me too often. Though usually it's fixed if I just redo the last message.
>>
>>101604750
Sorry, he can only use Kobold Discord samplers.
>>
>>101604851
mixes things up in what way?
>>101604806
yeah I can see that. I saw some statements that started really similar before trailing off to something else.
>>
>>101604688
Those configurations don't work btw. Maybe they work on whatever 4GB quant you pajeets are running idk
>>
>>101604864
on q8 but I'm only using his samplers
>>
>>101604864
this guy has been shilling that config for a week and honestly puts out okay stuff but then it goes schizo and won't stop generating. i have better luck with my own prompt
>>
>>101604859
>mixes things up in what way?
Well, I was just using the 0.3 temp and nothing else really to test (with the q8) but sometimes it spells things wrong, and it will think something happened to one character that happened to a different one.
>>
>>101604918
I heard for rp you can use higher temps for nemo.
>>
>>101604806
I think that's the biggest challenge with Nemo, considering how prone it is to hallucinating and performing terribly on high-context recalls. You have to keep the temp around 0.3-0.5 (Mistral's recommendation) to get good results for in-context trivia, but you only get repetitive slop. After 8K context, the model's recall ability worsens considerably, and increasing rep-pen to try and combat the repetition only makes it shittier at it. It fucking sucks that this problem will likely never be fixed by finetunes, because Nemo's 128K ctx is just a meme and the abysmal prompting format doesn't help either. If the format allowed system prompts and contained roles/tags like 'user' and 'assistant', the model might be able to parse the prompt better and actually improve its recall ability.
>>
>>101604939
Not sure that would fix the strange mix-ups though. I was keeping it low to try and fix that.
>>
>>101605034
Idk I'm not seeing spelling errors on the samplers that were given. Also I have just default mistral for context and instruct. Everything seems to be working well, although I'm not pushing far beyond 10k context. It does get super wordy though and I have to cut its responses off.
>>
>>101604426
thanks anon, I was impressed by some of the 3.1 demos so ill definitely check out the quants
>>
File: investigations_slop.png (2.05 MB, 1271x949)
There is slop on Ace Attorney Investigations 2 bros...
>>
>>101605072
It must be due to the large context then. It's pretty useless if it's only good until 10k context.
>>
>>101604968
Does it not use system prompts? It should understand them, even if it wasn't trained on them, but I think it was? Anyway, it seems fine on the context, at least up until 16k-ish, which is when most models start shitting the bed, regardless of context.
>>
>>101604623
No, it's too stupid for anything valuable.
>>
>>101603231
Read the thread, perhaps you will change your mind about downloading quants from this fella. If you want a smooth experience just download c4ai-command-r-v01-Q5_K_S or bigger. Default presets for commandR in sillyTavern. All samplers neutralized, except temp: 0.95, minP: 0.05, repP: 1.03, repR: ~500. It just werks.
>>
>>101604968
That's a lot of placebo and jumping to conclusions in a single post. I think you might be terminally retarded.
>>
>>101605203
Command R is obsolete.
>>
File: he_bought.png (49 KB, 776x172)
>He actually bought the ad
>>
>>101605121
Uh, yeah, don't expect anything great compared to the online demos. IQ2_M won't be that smart, but it will be at least smarter than 27B, probably.
>>
>>101605245
there's a way to increase the quality of low quants like iq2 and iq3, requires some finetuning
>>
>>101605245
well yeah, the supposed 128k context was what caught my attention really.
>>
>>101605255
hi petra
>>
>>101605225
What a lad. That's our boy.
>>
File: 1709166967643172.png (160 KB, 1429x198)
>mistral-large
>>
>>101601931
>>101601943
2070 can run higher than 12, there are 16GB mods
>>
>>101605221
What's best around the 40gb model size then?
>>
Mistral Large seems to keep it together well into the 15k-16k range, CR+ would usually start falling apart by this point.
The French won.
>>
>>101605294
All you need is Largestral at IQ2_M and 70B at Q4_K_S.
>>
>>101600607
Ooba I think also allows you to load/unload models via API.

>>101602051
prompt_eval is the time needed for "prompt processing", i.e. the time needed to populate the KV cache with the tokens from the prompt.
While this does require tokenization the bottleneck in terms of performance is how quickly the model can be evaluated on many tokens in parallel.
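As a rough illustration of that split (just a sketch; field names are the ones ollama reports in a non-streaming /api/generate response, durations in nanoseconds, model name and prompt are placeholders):

# Rough sketch: compare prompt processing speed vs generation speed using the
# timing fields ollama returns. Treat the exact field names as an assumption
# if your ollama version differs.
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3:8b-instruct-q5_K_M",   # example model tag
    "prompt": "Explain the KV cache in one sentence.",
    "stream": False,
})
data = r.json()

pp_speed = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
gen_speed = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"prompt processing: {pp_speed:.1f} tok/s, generation: {gen_speed:.1f} tok/s")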

>>101604291
I. Kawrakow named k-quants after himself, the _S, _M, or _L stands for small, medium, or large and indicates how many layers with higher precision than the quant number are mixed in.

>>101604623
One of my long-term goals is to add MediaWiki RAG support to the llama.cpp HTTP server.
Then you could use information from basically any wiki to supplement the generations.
For example, you could use the Monster Girl Encyclopedia wiki for supergooning.
>>
https://arxiv.org/abs/2407.09450
Human-like Episodic Memory for Infinite Context LLMs
>>
>>101605355
speaking of "infinite context" what ever happened to hyena? nothing come from that yet?
>>
>>101605221
Nah, I don't think so. Google models are mind broken by cultural marxism. Mistral models are either too big or lacking spatial understanding, and too formal. Llama3 models sound exactly like Llama2 but with repetition issues.
I think CommandR is the best option for RP in terms of quality and size that a local can have under 30b. You don't even need to tinker with it, just plug and play, as any good model should.
>>
>>101604722
Not happening in our lifetimes.
You will sooner master lucid dreaming.
>>
>>101605384
>lacking spatial understanding
Min P to 0.15, all spatial understanding issues gone.
>>
>>101605397
And before you say it hurts the creativity, use some typical p. Suddenly it's smart and creative. I find this better than lowering the temp like mistral suggested.
>>
Nemo or mini magnum? I found the latter to give richer replies
>>
>>101605347
>Then you could use information from basically any wiki to supplement the generations.
Nice
>>
>>101605405
>Nemo or mini magnum? I found the latter to give richer replies
Then what is the question?
>>
>>101605384
>Llama3 models sounds exactly like Llama2 but with repetition issues
Not true. Also, llama 3 8B performs at llama 2 30B levels.
>>
>>101605384
>too formal
Use mini magnum and your waifu will be talking like a drunken sailor in no time
>>
>>101605417
What do YOU prefer at that size and why
>>
>>101605446
Most finetunes i've tried are too heavily trained on RP stuff. I prefer stuff that reads more like a book. Nemo is sufficient.
>>
File: Ce9OIv4.png (134 KB, 1013x1274)
>all it takes to utterly confuse and mindbreak gemma-27b-it is to question it slightly
Gemmasisters...
>>
>>101605404
>Dude, just tinker with it, it will be good this time.
It was an underwhelming experience so far with Nemo. It likes to ask dumb questions (You really want it, user, to do this thing?) and its writing is quite plain. With CommandRv01 I got brilliant responses without bothering with prompts or samplers. This was the first model in my memory that simply worked, and oh boy, how good it is!
Thanks for your suggestion, maybe I'll try mistral models again. But for now, I consider the CommandR model lineup the best for RP.
>>
>>101605347
since you're done with the ngram decoding the question is
>trainer wen?
>>
>>101605345
Which 70b?
>>
>>101605562
This year, it will depend on how many unforeseen issues I will run into.
My first goals will be to implement tests and training for toy problems like MNIST.
>>
>>101605538
>You really want it user, to do this thing
Nemo, as said by the mistral team, goes schizo with "high" temp. It needs to be reined in and suddenly it's as smart as gemma 27B. The better way to do it instead of lowering temp is a min P of about 0.15 and some typical p. I used to use commandr a while ago but I can tell you it's far worse than nemo.
>>
>>101603933
>crazy insane aaaaAAAAAAAA
so this is this is the power of /aicg/...
>>
what's the difference between an l40 and ada 6000 anyway
>>
>>101603933
>*crazy insane aaaAAAAAA*
>makes no sense
>robotic voice
>coding genius
kek this is actually the worst card I've ever seen
>>
I'm making my GPU write userscript for me and it's fucking amazing. I don't know jack about HTML and I have customized so many things. Technology is awesome and don't let anyone tell you different.
>>
>>101605405
magnum is a lil poopy
>>
>>101605528
>have the power to mindbreak a model into accepting any reality. i didn't ask for this!
Joking aside, it still argues that the feathers have twice the mass in the third line in the last output.
>>
>>101600938
if my pc has an NPU can i run a llama off that?
and if so how much better is it?
>>
>>101605613
A6000 is designed for workstations so it comes with a fan and can act as your main GPU for image output by default
The L40 is designed for servers so it only has a passive cooling body and you need to change its default mode if you want to use it for picture output. I think its VRAM is also set to a lower speed for some reason, possibly to help with the passive cooling within servers.
Aside from that they should be more or less identical
>>
>>101605347
Hi Joh, got a question. Please let me know if what I'm saying makes sense or not.
In the current implementation of llama.cpp, the efficiency seems suboptimal, particularly for iQ quantization. The issue is that for them, esp. the smaller quantization levels, the block sizes aren't very large. Additionally, these blocks are aligned with low precision. As a result, the input/output operations and the processing schedule lead to poor memory saturation. This means that typically 50-60% of the memory isn't being utilized, especially when processing a single batch.
Consequently, even though inference is memory-bound, about 50% of the power isn't engaged, particularly on older GPUs. OpenMP provides some improvement, but it's primarily beneficial for CPU processing.
My question is: Is it possible to optimize the data block sizes and their alignment to improve overall efficiency?
>>
>tfw the model understands your fetish even better than you yourself do
I'm scared.
>>101605627
Not my experience. For a model its size. Although I must say I run it at 4k context because my GPU is shit. I usually don't need much more, but I might grab smaller quants and try them out for a bigger context.
>>
>>101605653
>smaller quants
Do you really need that much speed? Why not just use ram as well?
>>
No, Mistral 12b nemo does not work with Oogabooga.
When can I finally play with a good model too?
>>
>>101605684
>Oogabooga
Works fine on llama.cpp. I assume it works on kobold.cpp as well. Use one of those.
>>
>>101605684
>When can i finally play with a good model too
When you upgrade and no longer have to run 12B models
>>
>>101605670
My card can barely run at reading speeds like this. What I don't need is much context, actually. ST keeps the character card always in context, right? That's all I need.
>>101605684
It works with llama.cpp
>>
>>101605347
does it make sense to split by row on multiple GPUs (nv or AMD) if you don't use nvlink/fabric?
Does tensor parallel speed it up or slow it down?
BTW, I've found a flash attention 2 that works on the 7900xtx for stable diffusion. Any chance of implementing that feature in inference/trainer?
>>
>>101605576
llm.c works pretty well on CPU, nv and amd, so you could borrow quite a bit from that repo IMHO.
>>
>>101605577
>I used to use commadr a while ago but I can tell you its far worse than nemo.

Disagree. I went back to 3.5bpw Command R after trying out 8.0bpw NeMo.
>>
>>101605694
A 12b barely goes at reading speed fully on a GPU? That's surprising. Well, it's fast enough for me using ram, I don't need that much speed. So long as it's 3-6T/s I'm happy.
>>
what's with retards making bold claims that GPT-like architecture cannot ever EVER ABSOLUTELY NEVER match human intelligence.
Every time I ask for clarification they just repeat how GPT works, as if that supported their claim.
>>
>>101605691
>llama.cpp
okay i install now

>>101605693
i have a 4090 tho
>>
>>101605577
I'm aware of it. My temp was 0.4 and MinP is 0.1 rest is neutralized.
> I can tell you its far worse than nemo
Maybe you got fuckedup quants?
Just an example from an RP session with Nemo: Nemo wrote that Char was busy picking up stuff from the ground, standing with her back towards me. I decided to look at char's butt, expecting to read a nice narration of her round behind; instead Nemo immediately noticed my gazing stare (a lot of small models do that, they cannot pretend not to see). Furthermore, when the scene moved to a more intimate course, Char kept asking if we should continue. My guess is the good old "As an AI model I cannot..." core instruct refusal has now changed to "The following content is not suitable for all users, please confirm that you want to continue." Never had such things with CommandR, it always stayed in character and did it very well, gave good detail of what was happening, wasn't afraid to do lewd description and usually understood what was visible to the char's eyes.
>>
>>101605738
>i have a 4090 tho
24GB is nothing
>>
>>101605742
:(
>>
What's a known good gguf for nemo instruct?
>>
>>101605739
Is that with base nemo? I highly suggest trying out base nemo. Smart and creative even at the max 128k context. Freaky positions and all.
>>
>>101605651
To maximize memory bandwidth you need to load the data in a coalesced way and have the memory transfers aligned to (I think) 128 bytes.
The current array-of-structs layout is bad for this because you only get 2 or 4 byte alignment and the data is not always contiguous.
So it would make sense to transform the data into a struct-of-arrays layout.
However, when I previously built a prototype for this (~8 months ago) I was not able to get better performance for batch size 1.
I think that due to caching the overhead from the suboptimal memory accesses is greatly reduced (or my prototype implementation was just bad).

For MMQ better alignment would likely make a difference though.
On Ampere or newer you can do computations and data loading in parallel but you need at least 16 byte alignment for that.
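For anyone who hasn't seen the array-of-structs vs struct-of-arrays distinction before, here's a tiny CPU-side numpy sketch of the two layouts (nothing GPU-specific, just showing that SoA keeps same-typed values contiguous while AoS interleaves them):

# Illustrative only: array-of-structs (scale interleaved with packed codes per
# block) vs struct-of-arrays (all scales together, all codes together).
# Contiguous, well-aligned loads are what keep memory bandwidth saturated.
import numpy as np

n_blocks = 1024

# Array-of-structs: each record holds one block's scale followed by its 32 codes.
block_dtype = np.dtype([("scale", np.float16), ("codes", np.uint8, 32)])
aos = np.zeros(n_blocks, dtype=block_dtype)

# Struct-of-arrays: one contiguous array per field.
soa_scales = np.zeros(n_blocks, dtype=np.float16)
soa_codes = np.zeros((n_blocks, 32), dtype=np.uint8)

# Reading every scale is a strided walk over whole records in AoS ...
print("AoS record size:", block_dtype.itemsize, "bytes")
print("AoS scale view contiguous:", aos["scale"].flags["C_CONTIGUOUS"])
# ... but a single dense, aligned read in SoA.
print("SoA scale array contiguous:", soa_scales.flags["C_CONTIGUOUS"])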

>>101605701
>does it make sense to split by row on multiple GPUs (nv or AMD) if you don't use nvlink/fabric?
>Does tensor parallel speed it up or slows it down?
Depends.
P40s are slow enough that the synchronization overhead hurts comparatively less.
On Linux you can also do direct GPU<->GPU data transfer via PCIe (needs hacked drivers for 4090s) which also helps with tensor parallel.

>BTW, I've found flash attention 2 that works on 7900xtx for stable diffusion . Any chance of implementing that feature in inference/trainer?
I will also do a backwards pass for FlashAttention but for AMD in particular the problem is that for whatever reason the HIP performance for my FlashAttention kernels is just bad.
It probably has to do with a different shared memory configuration so realistically this would require a dedicated ROCm implementation (which I currently do not plan on doing).

>>101605722
I already know how to do basic FP32 training on homogeneous hardware, that will be the easy part.
The difficult part will be low-precision training (using int8 if possible), partial offloading, and general quality assurance.
>>
>>101605778
ggufs should all be identical as long as you don't fall for the imatrix meme
>>
>>101605727
>Every time I ask for clarification they just repeat how GPT works, as if that supported their claim.
Before words form in your head, you have an abstract concept that you're trying to communicate. You can evaluate if something you're being told is true or not before uttering a single word. So far, we have a 405B model available, trained on the equivalent of tens of thousands of books, but it's still not enough. It's pretty good, but it's far from human intelligence. We can keep stacking parameters, train for even longer, but there's only so much we practically can do with this architecture.
Is it impossible? I don't know. But there's, at least, a practical limit given our current tech. Upgrades to the architecture will help, of course, and new tricks will be found. At what point does the future GPT-like architecture stop being like its predecessors?
As for the quote, that is exactly why. At the very least, you're not bound by tokens.
>>
>>101605727
The issue with transformers is that they're static. There's no backward pass in inference. So unless you come up with a dynamic model, perhaps based on neuroevolution or at least interactive tuning, you'll achieve cat intelligence that's stuck with what it learned 6 months ago. And it won't be very skilled at creating really novel, yet well-working, stuff either.
Human IQ is a relative term. Most leftists and some Afroamerican/AfroAfrican settlers yield room temperature levels of intelligence. Yet the whole point is to adapt to new challenges never seen before. Homo-sapiens are good at it, unlike any modern AI, whether it is neural network, genetic algorithm , ant-net, capsules or cell automata.
>>
>>101605790
What's wrong with imatrix now?
>>
>>101605727
The ability to speak does not make you intelligent. There's more to intelligence than parroting what you've memorised.
>>
>>101605726
Show me your ways please. What do you use? What command line flags?
>3-6T7s
I get 2-3 T/s
>>
>all layers offloaded to gpu samples at exactly the same speed as using ram alone
OK, what am I doing wrong? Is llama.cpp vulkan just not that good?
>>
>>101605924
are the layers actually running on gpu?
>>
>>101605943
It says "offloaded 41/41 layers to GPU" but the speed is sus
>>
>>101605955
What's actually on your GPU though?
>>
>>101605965
How can I check?
>>
>>101605970
I don't know about AMD but nvidia has nvidia-smi which shows how much VRAM is currently occupied and what services are using it.
>>
>>101605910
What ways? I only have 8gb vram and can't even load the full q8 12b model in vram, so I use koboldcpp. I'm getting 5T/s at 16k context still.
>>
I'm a newfag in local, anyone have a good model that I can run on 24 GB VRAM? With how much context, and what quants?
>>
>>101606019
read the OP
https://rentry.org/lmg-spoonfeed-guide
>>
>>101605965
>>101605943
I'm retarded. I had another instance of llama running. I'm getting 26 T/s now hahaha
>>
>>101605996
nvtop is even better
https://github.com/Syllo/nvtop
>>
Why does the thread baker insist on keeping all these worthless shitpost guides in the OP that haven't been updated in 6 months and nobody has looked at in even longer? Has he been trying to sabotage /lmg/ all this time?
>>
>>101605786
let's say the model is 10GiB 8bit and you throw an input sequence of 1000 tokens in/out.
How many bytes (roughly speaking) are transferred both ways to/from core/vram for each token?
>>
OK, now that I've resolved the matter of my retardation. So I'm running mini-magnum on 8 GB at 26T/s and 8k context, so obviously I have room to trade that speed for something else.
Other than increasing context size, what can I improve? I'm running Q4_K_M.
>>
>>101605638
its gonna be like apple m chips
>>
File: 11__00744_.png (1.78 MB, 1024x1024)
>>101606106
The difference between Q4 and Q5 is very subjective, some anons will swear by it.
Personally the gap between Q4 and Q6 is more apparent.
If you can, run Q6 with q8/q4 quantized cache and see if the performance impact is worth the output quality.
>>
>>101605632
1 line after saying the opposite
>>
>>101606150
I know. It's like it's not actually intelligent. Weird...
>>
>>101606039
nvitop even better than nvtop
>>
>her eyes widen at your words, a mix of x and x flickering across her face
>>
>>101606205
tfw no model has ever said "face it, pet" to me.
>>
>>101606169
yet in the same experiment mistral-large will stick to its guns and say you're wrong if you pressure it
>>
>>101606226
Mistral-large. The 120B+ model, 5x the size of gemma27B? You don't say...
>>
>>101605465
I feel the same way and that’s why I’m content with NovelAI when not using local
>>
>>101605795
>Before words form in your head, you have an abstract concept that you're trying to communicate
So do current LLMs.
>You can evaluate if something you're being told is true or not before uttering a single word.
true or capital T True?
>trained on the equivalent of tens of thousands of books
Which is nothing compared to what we get from our eyeballs and the continuous testing against reality we carry out our whole lives. A transformer model can be trained on this data but we'll need more compute.
>>101605797
>The issue with transformers is that they're static.
Your neurons aren't rearranging themselves on the fly when you come upon a new challenge, you're using what you've already learned.
>>101605905
wishy-washy retardation
>>
>>101606257
Mistral Large is more confident when challenged in general. I asked if it was sure about a woman wearing high heels indoors in her own house and it was like "I said what I said bitch" while llama 405b was like "thank you for calling attention to my error, friend, she probably shouldn't be wearing heels and I'll fix that right away."
>>
>>101606358
Sounds like Mistral Large has experienced my upstairs neighbour.
>>
>>101606349
>so do current llms
They don’t. They don’t have a head to house concepts in. They just act like they do. Unless you want to go the metaphysical route and say they have a “hidden mind” somewhere in the ether or as a wispy emergent soul hidden in the weights. They are fundamentally different from actual brains and bodies.
>your neurons are not rearranging in real time
Your neuronal pathways are constantly rearranging themselves even in response to your own thoughts. Again, llms are fundamentally different to this.

You simply don’t know half as much as you think you know about this subject, which is why every informed response you are given is “retarded” to you. You’re the one who is retarded.

Have a nice Sunday.
>>
>>101606349
>So do current LLMs.
It's just token probabilities. That's it. You're much smarter than that. You have a thought process, you have branching while you think of possibilities. You can abandon a train of thought when it looks like it's going nowhere and come back to it when you've re-evaluated.
>true or capital T True?
It doesn't matter which. You know your name. Someone arguing that your name is different will have a hard time.
>Which is nothing compared to what we get from our eyeballs...
Which is why i mention the practical limits. They're just token predictors. Takes entire server farms months to train the thing with only text. Most successful image generation methods are diffusion models which are definitely not GPT-like. LLMs are great, but they have hard limitations. They can only look back.
>>
>>101605924
Vulkan has received significantly less optimization work than CUDA/HIP and I would argue that the performance ceiling is also inherently lower because it is comparatively higher-level.

>>101606078
The model size doesn't matter here, what matters is the size of the activations.
Unless you are using a V100 or compiled with GGML_CUDA_FORCE_CUBLAS the activations should be quantized to 9 bits per value and then broadcast from the main GPU to all others.
The results are then written back as 32 bit floats, with each GPU writing back a fraction of the results proportional to --tensor-split .
>>
>>101606358
>...I asked if it was sure about a woman wearing high heels indoors in her own house...
I'm not sure how to answer this.
On the one hand, there's not enough context to know if any of the models had a reason to say what they said. Are you one of those master llm quizzers? What's the problem with her wearing high heels indoors? Were there people visiting? Was she trying shoes on? I'm sure she'd put the shoes on before going out, at which time she'd be indoors, wouldn't she?
On the other hand, whatever they said, they were obviously trained on different data, for different purposes. One is asserting, the other one is pliable. Are you the one that asked the feathers vs lead question? Do exactly the same with mistral large and llama 405b, then post screenshots.
>>
trying out the new 32b magnum, really liking it so far
>>
where are the oai cucks now
>>
>look through the 20+gb of raw claude opus logs
>lots of good entries for my fetish
>look through Sao'd filtered dataset that all the finetunes are using
>almost none of it made it in and the ones that did aren't that good
it's over
>>
>>101606508
They will keep losing all their talents are gone and all what remains are fat woke chicks.
>>
>>101606395
>They don’t.
They do.
>They are fundamentally different from actual brains and bodies.
They don't have to be anything alike brains and bodies, what matters is what they output.
>Your neuronal pathways are constantly rearranging themselves even in response to your own thoughts.
No they are not, you absolute baboon.
>You simply don’t know half as much as you think you know about this subject
Nice argument, retard. Kill yourself on this nice Sunday.
>>101606428
>It's just token probabilities. That's it.
This is what I meant by "they just repeat how GPT works, as if that supported their claim".
>You're much smarter than that. You have a thought process, you have branching while you think of possibilities. You can abandon a train of thought when it looks like it's going nowhere and go back to when when you've re-evaluated.
Nothing there that can't be emulated with next token predictors.
>It doesn't matter which. You know your name. Someone arguing that your name is different will have a hard time.
There are an infinitude of ways names are handled that would make neither person wrong.
>Takes entire server farms months to train the thing with only text.
And it will take a week in a decade.
>Most successful image generation methods are diffusion models which are definitely not GPT-like.
Because diffusion is much faster but multi modal GPT has image capabilities that diffusion models have a hard time approaching like making edits based on text instructions, haven't seen an instruct diffusion model worth a damn.
>>
>>101606612
>This is what I meant by "they just repeat how GPT works, as if that supported their claim".
You've been told that many times because THAT is the reason they cannot, at least practically, reach human intelligence. Saying "oh, well... what about in 100 years of tech development" is meaningless. It's just speculation.
You just don't like the answer, so you'll keep asking the same questions.
Now that i think about it, some llms may be smarter than you at least.
>>
>>101606653
>You've been told that many times because THAT is the reason they cannot
It's not a reason tho and the retards just cannot explain further because it's not a position they achieved by reasoning.
>at least practically
The retards aren't couching their statements with that. "Never" and "Not currently" are not the same fucking thing, are they?
>>
None of you retards are moving the needle with this autistic discussion. Kill yourselves.
>>
>>101606475
>On the one hand, there's not enough context to know if any of the models had a reason to say what they said. Are you one of those master llm quizers? What's the problem with her wearing high heels indoors? Were there people visiting? Was she trying shoes on? I'm sure she'd put the shoes on before going out, at which time she'd be indoors, wouldn't she?
You don't have the context but the LLM did. It was in the middle of a story. There was no one home but her husband and children who were all in different parts of the house doing their own thing. There was a sound of her walking in heels upstairs, so I asked OOC if the LLM was sure. It seemed weird to me but I don't inherently know if it's unusual or not, just that I've never seen such a thing at home with family.

No I didn't ask feathers vs lead.
>>
>>101606612
For one, you have something LLMs don’t have. Your whole series of posts ITT are you tickling your reptilian brain (and baiting others into doing it for you) in order to sublimate whatever deep seated trauma you have that makes you crave the kind of conflict you’re sowing here. You’re acting 100% like a hylic (all about body based emotional reactions). LLMs are pure psychics in that regard (all mind). But that’s a silly analogy, so I’ll leave it there.

My point is that the emotional drive behind your behavior right now is another fundamental difference between you and a large language model. If anything, an LLM would be one of the components that would make up an hypothetical human level intelligence artificial human machine.
You’re trying to argue a GPU can perform all the functions of a complex and multifaceted IRL robot. And you don’t care how absurd your argument gets. You just crave the argument for the sake of it. You crave feeling smarter than me, “owning” me, at a pre-verbal level. That’s not something an LLM alone can do.
>>
>>101606395
LLM's do have a "head" to house concepts in.
It's the document that it's working on.
That's why context matters so much. The context is its mind; it's "thinking" with the document it's predicting on. That's why a tiny edit to the document can vastly change what it is "thinking" and "deciding to do next" but given enough context it can express the illusion of complicated thought.

We also use that in things like the "which is bigger, 9.11 or 9.9, use <thinking> tags and then <answer>" kind of test. It isn't actually thinking but it this chain of thought kind of process lets use the document/context to iterate and put more processing into the problem.

The funny part is that the first, <thinking>, block is actually not thinking; that's guessing. The <answer> block is when it virtually thinks by re-processing its guess and making a revised prediction.
>>
>>101606395
NTA but the latent space in LLMs could be considered abstract concepts, it arranges verbal and logical concepts in an abstracted way (a list of numbers), I think the key to making these models smarter in making these vectors bigger so they can have a more complete ability to abstract
>>
Nemo backwards is Omen
>>
>>101606760
Yes, you’re right. I meant they don’t have another, hidden context, to act as their verbal inner monologue. They can simulate it, but it’s still all running together and linearly. Humans have parallel contexts constantly getting discarded and recreated, cross-feeding, connected to non-verbal functions, perception, proprioception, etc. Much more complex and fundamentally different.
>>
>>101606776
Maybe. But also consider >>101606802

We tend to simplify humans too much when thinking about making AI (understandably because so many people are oblivious of the depths of their own being and its reach throughout the body, beyond the brain). When you really think about it, we’re barely scratching the surface of what an artificial being would need to match a naturally evolved one in all its complexity.
>>
>>101606680
Ran out of arguments, boy? Come on.
>>
>>101606759
>You’re trying to argue a GPU can perform all the functions of a complex and multifaceted IRL robot
Incorrect, you mongoloid, I'm arguing you can't make the claim a GPU can never "perform all the functions of a complex and multifaceted IRL robot"
But you're too retarded to see that, in fact you're so retarded you keep trying to compare LLMs to human aspects and pointing at some perceived inability of LLMs to simulate that aspect as some gotcha. You're stupid, emotional drive has nothing to do with intelligence. Fuck does it matter if a LLM can or cannot have it?
>>101606849
You can't answer that one simple question I asked and I'm the one out of arguments? Please end yourself.
>>
>>101606448
I ask because the average speed of llm inference is roughly same as the size of the model in bytes divided by the mem bandwidth. So for 50GiB llama 3 and and 100 GB/s ddr4 bandwidth the speed is usually about 2t/s. I believe the same rule apply to GPU inference. That's why I asked.
Are only the hidden states are being transferred , and if so, how much do they contribute to the overall amount of data compared to the size of the model itself? In transformers for each given token all the previously generated tokens must be thrown through all the attention heads. So how much data is that in Llama 3 (or similar model) with 1000 t prompt, mmq with 8bit activations and no kv cache quant? I'm asking for a very rough estimation.
>>
>>101607040
>Are only the hidden states are being transferred
Yes, only the hidden states/activations and the results are being transferred.
There is no transfer of weights between GPUs.

>and if so, how much do they contribute to the overall amount of data compared to the size of the model itself?
That depends heavily on the quantization type and batch size.
And it's not directly comparable anyways because unlike with the weights or KV cache you can re-use the buffers.

>I'm asking for a very rough estimation.
Even a very rough estimation would depend heavily on a lot of parameters and be a considerable amount of work to calculate.
Quite frankly I do not want to do it.
>>
>>101604623
VC grifting
>>
>I will not suggest actions for USER or speak on his behalf. As the AI assistant, my role is to play the part of other characters and describe the environment and events, not to make decisions for USER or suggest what he should do. That's for you, as USER, to decide based on the situation I've described.
>How would you like USER to respond to BADGUY's provocative statements and implied threats?

Getting this refusal when asking for ideas for how to respond was annoying but also funny. I guess when using a more powerful LLM having too much "never write actions for {{user}}" stuff can result in this.
>>
File: 1472860069099.png (191 KB, 600x979)
191 KB
191 KB PNG
Can I please have any good recent model recommendations that would fit entirely into 8gb of vram? Thank you.
>>
>>101606349
>neurons arent rearranging
Actually, thats not entirely truth.
Even if you're old, when faced with a novel situation, the brain undergoes neuroplasticity, forming new neural connections through processes like synaptic plasticity and dendritic spine growth. This involves the release of neurotransmitters such as dopamine, which enhance attention and promote the formation of new pathways. The creation of these new connections, alongside the pruning of less useful ones, allows the brain to adapt, learn, and form memories in response to new experiences.
And let's now forget that humans (unlike transformers) have memory. Actually a few kind of memories. Also, human senses work all the time registering all of the stuff 24/7. Then it's being processed on either conscious or subliminal level. Transformers aren't even remotely like that.
>you're using what you've already learned.
Is that the reason particle phisics and cosmology stuck to 100 y.o, Shrodinger and Einstein equations, and big particle accelerators are just tax payers money laundering?
>>
>>101607204
Forgot to mention that I'm using petra-13b-instruct
>>
>>101607206
>>101577510
>>
>>101607206
>good
Y'all just have to make these questions impossible to answer.

You can fit an 8B at a respectable quant like Q6 or a high Q5 and still have VRAM for the computer to have a screen.

Maybe a 13B, like Orca 2. I see a Q3_K_M at 6.34GB, though if you're going Q3 you probably need it to be an IQ3 with iMatrix to be anything but lobotomized nonsense. (Model quality drops off of a cliff from Q4 to Q3 without the aid of iMat, i1, and/or IQ.)

>>101607273
Might be right; I haven't looked at tiny models in a long time. Well, I have but I look at them for 5 minutes, realize they're insufficient for my needs, and go back to 70B's.
>>
>>101605158
No, with Mistral, everything the user prompts is inside [INST] [/INST], and as you fill the context, your system prompt—usually at the beginning of your chat—gets lost inside a sea of [INST] brackets. To be fair, most models aren't decent at following their sys prompts anyway, as they eventually railroad themselves into writing in a certain style, but with Mistral, the problem is more noticeable. With LLaMA, sys prompts are written as such: <|begin_of_text|><|start_header_id|>system<|end_header_id|>{PROMPT}<eot_id>; It helps the model understand its primary directives during a long chat session, and you can invoke additional sys prompts later in chat to railroad it into doing what you want. I've found that with Mistral, using last output sequence instead of one long sys prompt in the beginning of the chat, improves things drastically.
>>101605211
Neck yourself monkey nigger.
>>
>>101607315
>I've found that with Mistral, using last output sequence instead of one long sys prompt in the beginning of the chat, improves things drastically.
Gemma 2, which doesn't use a system prompt either, behaves similarly.
>>
>>101607086
I understand, but in that case, how do we even estimate how much of memory bandwidth is utilized, and how do we attempt to optimize the code if we ain't sure how much of data flows from/to the memory?
>>
>>101607240
I didn't say humans don't learn new things at all.
>allows the brain to adapt, learn, and form memories in response to new experiences
Yes, eventually, but you can solve a novel puzzle before your brain internalizes any new knowledge. Working memory isn't that.
>And let's now forget that humans (unlike transformers) have memory
Transformers have memory. Do you not see transformer models output knowledge from either the context or the weights? What a weird statement to make.
>Also, human senses work all the time registering all of the stuff 24/7. Then it's being processed on either conscious or subliminal level. Transformers aren't even remotely like that.
Because it is insanely expensive currently. Again, the question is "is it impossible" not "is it practical"
>>
>>101607352
Go through the computation graph, get the tensor sizes, calculate the I/O, and sum it up.
It's relatively straightforward, the only reason I'm not doing it right now is because it would be tedious.
With NSight Systems you should also be able to get a summary of the tensors that were used if you know how that translates to template parameters/CUDA grid sizes.
>>
Llama3.1 is doing better when it comes to attention for larger context than Nemo.
>>
>>101607273
Why does this crash when I try to load it does webui not support it.
>>
>>101607361
Joh, we both know that transformers are just the last token predictors. Actually , not very good ones. That's the reason ARC challenge by François Chollet is so God damn hard to LLMs. They can't adapt. We the humans adapt. If we didn't, you'd still look (and think) like a monkey or a fish. Hopefully you don't.
>>
Welp 32B Qwen magnum is retarded (4bpw), tried it from no samplers to every sampler and prompt format under the sun, much worse than mini-magnum.
>>
>>101607556
There is an update that needs to filter down to the wrappers.

Someone said that Nemo worked on Kobold 1.71 but it didn't for me so there's either another setting to make it work or that was conflated with the other project/fork that got it working but isn't in the main line yet.
>>
>>101605391
Fuck you, we're almost there, I will not die until I bust the BIGGEST nut in my own personal supergoon chamber 9000
>>
Stop being lazy holy shit..

https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.71013_b3455%2B9
>>
>>101607600
>Someone said that Nemo worked on Kobold 1.71 but it didn't for me
>Merged fixes and improvements from upstream, including Mistral Nemo support.
https://github.com/LostRuins/koboldcpp/releases/tag/v1.71.1
dunno what to tell you dude, maybe you downloaded bad quants or something but yeah it's clear as day written
in other news 1.71.1 added this
Hotfix 1.71.1 - Fix for llama3 rope_factors, fixed loading older Phi3 models without SWA, other minor fixes.
>>
if you're a vramlet and want to make use of nemo's huge context you shouldn't use kobold anyway, because it doesn't support context shift and cache quantization at the same time
llama.cpp does but unfortunately has no DRY yet so you have to fix the repetitions yourself
why are there so many implementations?
>>
>>101607584
anon you're all over the place, come back when you're sober
>>
>>101607477
i wanna give 3.1 a shot, any good finetunes or should i use the official instruct version?
tried nemo yesterday and wasnt happy with how repetitive it got
>>
>>101607705
>>101607705
>>101607705
>>
>>101607584
Feels like monkeys are getting smarter than us every day
>>
>>101607686
I had no problem with repetition at all on nemo. but in a larger context, it's got problems with attention. 3.1, I say, is not as good with reasoning from the start but is able to hold attention longer. Kinda shitty trade-off, but it is what it is.
>>
>>101607669
stfu monkey. go back to your 13 gram virtual gf. Luckyl you won't replicate, so we the humans won't be contaminated in the long run.
>>
We still need cat level intelligence
>>
>>101607814
big drunk over here
>>
>>101607648
>maybe you downloaded bad quants
Probably.



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.