/g/ - Technology




/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>102249472 & >>102234876

►News
>(09/06) DeepSeek-V2.5 released, combines Chat and Instruct: https://hf.co/deepseek-ai/DeepSeek-V2.5
>(09/05) FluxMusic: Text-to-Music Generation with Rectified Flow Transformer: https://github.com/feizc/fluxmusic
>(09/04) Yi-Coder: 1.5B & 9B with 128K context and 52 programming languages: https://hf.co/blog/lorinma/yi-coder
>(09/04) OLMoE 7x1B fully open source model release: https://hf.co/allenai/OLMoE-1B-7B-0924-Instruct
>(08/30) Command models get an August refresh: https://docs.cohere.com/changelog/command-gets-refreshed

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://hf.co/spaces/mike-ravkine/can-ai-code-results

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling
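If you just want a sanity check without the calculator, the arithmetic behind it is simple enough to sketch. This is an illustrative back-of-the-envelope version (assumed names and constants, not the calculator's actual code):

```python
def weights_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for a quantized model."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 2) -> float:
    """KV cache: one K and one V tensor per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 2**30

# Example: a 70B model at ~4.5 bits/weight (Q4_K_M-ish) with Llama-70B-like
# shapes (80 layers, 8 KV heads of dim 128) at 8k context:
total = weights_gib(70, 4.5) + kv_cache_gib(80, 8, 128, 8192)  # roughly 39 GiB
```

Add a GiB or two on top for compute buffers; the linked calculator accounts for those per quant type.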

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: threadrecap.png (1.48 MB, 1536x1536)
►Recent Highlights from the Previous Thread: >>102249472

--Papers: >>102252385
--Using low topK improves performance by reducing latency from sorting logits in large vocabularies: >>102257071 >>102257336 >>102257369
--Struggles with summarization models and made-up details: >>102250806 >>102250856 >>102250882 >>102250897 >>102250926 >>102251152 >>102251274 >>102251515 >>102251565 >>102250902 >>102256091 >>102256343
--Flux licenses are restrictive and may apply even to modified or fine-tuned models: >>102254945 >>102254985 >>102255057 >>102255079 >>102255190 >>102255082 >>102255128 >>102255166 >>102255242
--Building a doctor bot with medical LORAs and understanding MRI reports: >>102249775 >>102249915 >>102249954 >>102250054 >>102250116 >>102250089
--Various AI model discussions and performance evaluations: >>102251322 >>102251377 >>102251385 >>102251408 >>102251624
--Reflection-Llama-3.1-70B tokenizers are fucked: >>102256855 >>102257164
--Novel idea for AI roleplaying: >>102256155
--Gguf quants of reflection are broken and waiting for a fix: >>102254244 >>102254255 >>102254279 >>102257951
--Forehead puckers description in video game character introduction: >>102250773 >>102250823 >>102250891 >>102250929
--Difficulty downloading large HuggingFace models with login required: >>102255880 >>102255916
--DeepSeek-V2.5 model released on Hugging Face: >>102257561
--CLIP-GmP-ViT-L-14 text encoder discussion: >>102250141 >>102250152 >>102250262 >>102250277 >>102250735
--P40 prices increased due to llama.cpp popularity in China: >>102255502 >>102255662
--Llama.cpp development and technical debt discussion: >>102254305 >>102254446 >>102254463 >>102254622 >>102254639 >>102254661 >>102254737 >>102254782 >>102254780 >>102254811 >>102254843 >>102254927 >>102255090 >>102255137 >>102255167 >>102255215 >>102255244 >>102258077
--Miku (free space): >>102249618 >>102251592 >>102252159 >>102252190 >>102254564 >>102256281
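On the low-topK highlight: the claimed win is that a small K lets the sampler do a partial selection over the logits instead of fully sorting a 100k+ token vocab. A minimal sketch of the filtering step (assumed names, pure-Python toy; real backends do this in C++, and ties at the threshold may keep a couple of extra entries):

```python
import heapq
import math

def top_k_filter(logits, k):
    # Partial selection via a heap is O(V log k); a full sort of the
    # vocabulary is O(V log V). With huge vocabs and small k, that
    # difference is paid on every generated token.
    thresh = heapq.nlargest(k, logits)[-1]
    return [x if x >= thresh else -math.inf for x in logits]
```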

►Recent Highlight Posts from the Previous Thread: >>102249480
>>
File: 60 Days Until November 5.png (2.22 MB, 1056x1872)
>>
>>102258718
It is certainly novel, and at least it gives us a better idea of a new finetuning technique. We'll see in the coming days if it really was a meme or not. (imo it is mostly a meme; the thinking is mostly noise and a waste of computation to come up with a slightly less incorrect answer than the model would have given anyway.)
>>
>>102258941
There is no escape.
>>
>>102258977
venv/conda envs solve 90% of them. docker containers solve 99% of them.
>>
>>102258941
Any adventure/rpg cards that anons use? Not expecting any miracles, just want to see if I like it with an LLM.
>>
>>102259012
I never used any of them for long, but these are some:
https://characterhub.org/characters/illuminaryidiot/the-staff-of-oscilion-338deea8be18
https://www.chub.ai/characters/punchchildren/grand-gensokyo-adventure-dd7ffd91
https://files.catbox.moe/zjvye9.png
>>
Best code model under 12B?
>>
>>102259124
best new luxury car under $500?
>>
>>102259157
Bad analogy retard
>>
>>102258941
>reflection purged from the OP
It's over
>>
>>102259223
I guess OP did a bit of reflection on the choice to include it
>>
>>102259124
Try the new and shiny Yi-coder mentioned in the OP.
Tell us how it goes after you use it for a while.
>>
>>102259223
it didn't know how to stawbery
>>
>>102258941
>>102259223
OP being smart and reasonable for once? Who are you!?
>>
Any recommendations for sampler settings in low param models, or am I expecting too much out of humble 7Bs?
>>
>>102259223
reflection is woke
>>
File: 1620943316038.png (219 KB, 360x336)
>think about trying the reflection thing
>remember that I can only run 70Bs at 2 t/s so the model's responses would be even slower and it wouldn't be worth it even if the quality really was that good
>>
is there a SillyTavern-like frontend for voice gen? unlike textgen, things seem to be split up across multiple places
>>
File: 1725639587532.png (297 KB, 678x623)
>>102259533
>can only run 70b at 2 t/s
spoiled brat anon
>>
>>102259533
>can (...) run 70B
I'll kill you.
>>
>>102259228
AAAAAAAAAAAAAAAAAAAAHHHHH
>>
>>102259617
>>102259590
You should have enough RAM in 2024 to run 70B at 2 bit
>>
>>102259533
70B runs at 0.8t/s for me :(
>>
>>102259279
>7b
try 8b
>>
https://rentry.org/83fkenr9
>>
>>102259801
kino
>>
DeepSeek 2.5 verdict?
>>
does DRY require you to specify the penalty range in order for it to take effect like rep pen or does it cover the entire context window when left at zero?
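For what it's worth, backends differ on the range question: in most implementations a range of 0 is treated as "whole context", but check your backend's docs rather than taking my word for it. The mechanism itself is easy to sketch: DRY finds the longest repeated sequence that a candidate token would extend, and penalizes exponentially past an allowed length. A simplified toy version with assumed parameter names, not any backend's actual code:

```python
def dry_penalty(context, candidate, multiplier=0.8, base=1.75, allowed_len=2):
    """How strongly `candidate` would extend a repetition of earlier text.

    Scans everything it is given, so the 'range' is simply how much
    context you pass in. Simplified sketch of the DRY idea.
    """
    longest = 0
    for i, tok in enumerate(context):
        if tok != candidate:
            continue
        # Count how far the text before this earlier occurrence matches
        # the tail of the current context.
        n = 0
        while n < i and context[i - 1 - n] == context[-1 - n]:
            n += 1
        longest = max(longest, n)
    if longest < allowed_len:
        return 0.0  # short repeats are allowed (e.g. common bigrams)
    return multiplier * base ** (longest - allowed_len)
```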
>>
>>102259977
>DeepSeek 2.5 verdict?
finished downloading and currently quanting
>>
>training still financially infeasible
>>
I've been out for a while now. I've heard there's new hot shit called Reflection or whatever, how good is it?
>>
>I still haven't really experienced any progress in llm
>>
>>102260253
it's a sloptune that "makes" the llm "really think" before answering
>>
>>102260253
if this finetune method were as revolutionary as claimed, the API guys would use it to make gpt4o and Claude even better
>>
>>102259977
Too big
>>
>>102260396
that's what she said!
>>
>>102260286
Is it trained in so it has the speed of a normal model but the results of a thinking loopback, or is it really just a huge model whose whole offer is "you don't have to paste your 'use thinking tags' instruction into your proompt", shitting out the whole dump of its "thinking" and then rewriting its answer as though you had prompted with that instruction?
>>
>>102258941
wtf, is this entire image ai? as in, the text as well?
>>
>>102260427
the latter
>>
File: ComfyUI_06179_.png (1.53 MB, 1024x1024)
>>102260443
yes anon, you missed the Flux train or something?
>>
>>102260449
Lame. The only reason I see to bake that in is if it actually changes the token generation to get "thought about" results in the same kind of time as the normal model with the extra instruction involved (which I bet would work just as well in System or Kobold's memory system so it's always appended).

>>102260443
Flux is like that.
>>
>>102260482
>>102260489
wild
i love the future
>>
>>102260443
>he doesn't know
local image gen has been more or less perfected tbhfamtachi
>>
>>102260427
>>102260449
i mean, it's not a bad thing inherently. inference will only get faster from now on, and context will grow larger and larger anyway. it's basically free iq points for all the existing shitty llms, so it's nothing to complain about
>>
>>102260482
>you missed the Flux train or something?
A1111/foundry loser here.
I did get Comfy installed and got one image out of it on old models but I haven't tried Flux yet. (I followed a tard guide and kinda got half of it working I guess.)
Does it take a lot of wrangling to get good results or is it noob friendly?
>>
>>102260443
Made with Flux-Dev-Q8
>>
>>102260497
tell that to my vram
>>
File: file.png (39 KB, 186x96)
>>102260535
is her arm okay?
>>
File: 1718278519813357.png (1.63 MB, 1624x1216)
>>102260541
>tfw we live in a timeline where you literally could tell your vram that
>>
>>102260559
she's got a strong grip
>>
>>102260531
>Does it take a lot of wrangling to get good results or is it noob friendly?
it's a bit more complicated than a regular SD model. for one, it's not supposed to work at CFG > 1, but you can make it happen by going for an anti-CFG-burn node like DynamicThresholding or AutomaticCFG
https://reddit.com/r/StableDiffusion/comments/1eza71h/four_methods_to_run_flux_at_cfg_1/
>>
>>102260559
Brachioradialis got swole to cope with the recoil of that boomstick
>>
>>102260541
how much vram do you have? you can literally use GGUF on flux, Q8_0 is really close to fp16 in quality for example, exactly like LLMs
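The size arithmetic is the same as for LLM quants. Rough bits-per-weight figures below are ballpark assumptions (GGUF quants sit a bit above their nominal bit count because of per-block scales), not exact specs:

```python
# Approximate bits per weight; illustrative ballpark values only.
BITS_PER_WEIGHT = {"fp16": 16.0, "q8_0": 8.5, "q4_k_m": 4.8, "nf4": 4.5}

def file_size_gib(params_b: float, fmt: str) -> float:
    return params_b * 1e9 * BITS_PER_WEIGHT[fmt] / 8 / 2**30

# Flux-dev is ~12B parameters, so Q8_0 lands around half the fp16 size:
sizes = {fmt: round(file_size_gib(12, fmt), 1) for fmt in BITS_PER_WEIGHT}
```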
>>
File: file.png (1.74 MB, 1595x851)
>>102260600
Literally how the fuck is this possible?
This is literally magic.
>>
>>102260620
8 vi rams saar, I've had more luck with NF4 really
>>
>>102260699
>I've had more luck with NF4 really
go for Q4_0 or Q4_K_M, they're the same size and better, like LLMs, nf4 is a meme and gguf is king
>>
>>102260631
that's cool right? :D
>>
>>102260730
My mind has legitimately been blown.
And people still have the gall to say we won't have self-thinking robots within the decade.
>>
>>102260714
Think I've tried one of those; inference seemed slower than nf4 and the initial load took a solid 5 minutes, grinding my laptop nearly to a halt. Maybe I've done something wrong, but it seemed to require loading everything from vae and clip to t5.
>>
>>102260781
if it reloads every time you make a new gen, it means that you don't have enough memory to hold t5 + vae + flux in your gpu vram. you could prevent that by putting the t5 in your ram (cpu) or on a second gpu if you have one
https://reddit.com/r/StableDiffusion/comments/1el79h3/flux_can_be_run_on_a_multigpu_configuration/

you could also go for Q8_0 t5 instead of fp16, the gguf thing also works on the text encoder
>>
>>102260746
We won't though.
>>
>>102260531
>I did get Comfy installed and got one image out of it
>>102260535
is there a non-comfy option that isn't trash?
>>
>>102260829
>is there a non-comfy option that isn't trash?
Forge also supports Flux, but I'm sticking with ComfyUI because only that software has AutomaticCFG + gives you the option to put the text encoder on a second gpu
>>
NAVIGATING
>>
I set my model context to what the model card says. When I raise it the responses get terrible. Is there a way to raise the context without having issues or do I just have to use a better model? Would some multiples of the context work better (like something x^2)?
>>
>>102260559
miku, lay off the leeks...
>>
>>102260999
>When I raise it the responses get terrible
No shit, the model was trained with X context so using >X makes it act retarded. No, integer multiples of the context won't change anything.
>>
File: kedaruimiqu.png (1.33 MB, 1200x848)
>>102261308
Hra-tsa-tsa, ia ripi-dapi dilla barits tad dillan deh lando. Aba rippadta parip parii ba ribi, rib...
>>
>>102261362
how do I give it larger memory of the chat then? Seems like a huge limitation especially given the size of some of the character cards.
>>
>>102260819
Yeah, I think it's the t5 encoder that's killing me, and the only one I found is in fp16. Got a link to its quants?
>>
>>102261415
https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf
>>
>>102261427
kudos
>>
>>102261393
High-class Finnish Miku
>>
>>102261398
what model are you using? most modern ones have good context except for gemma
>>
>>102261599
>8k context
>good
>>
i give up on mistral models. words words words words, zero substance
>>
>>102261621
are you illiterate
>>
>>102261427
Yeah no, even with a Q4 t5 it nearly crawls to a halt. I don't know what they're doing in nf4, but it seems to do the magic for low vram setups. Either that, or Forge's memory management fucks up with GGUF.
>>
>>102259293
Any LLM is woke if we are at it.
>>
>>102261646
>16k context
>good
I can go all day. My tiny little codebase has 63k tokens, anything below 5 million context is a toy.
>>
>>102261599
A mistral-7B instruct variant. I could be confused. The context length is 4096, of which my very basic character cards are taking ~2000 tokens. I understand there is a sliding window that helps with history. It would make more sense to put 32K in the context settings if the window can handle it, but apparently not >>102261362.

There is some stuff on rope that I am getting to, but I don't understand it yet.
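The rope stuff in a nutshell: positions are fed to attention as rotation angles, and the simplest context extension ("linear scaling", aka position interpolation) just divides positions by a factor so that e.g. 16k real tokens get squeezed into the 4k range the model saw in training. A toy sketch with assumed names (quality degrades without fine-tuning, and NTK-aware variants scale the base instead of the position):

```python
def rope_angles(pos: int, head_dim: int, base: float = 10000.0,
                scale: float = 1.0):
    """Rotary-embedding angles for one position (toy version)."""
    pos = pos / scale  # linear scaling: squeeze positions into trained range
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# With scale=4, position 8192 produces the same angles as position 2048:
assert rope_angles(8192, 128, scale=4.0) == rope_angles(2048, 128)
```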
>>
>>102261737
use a model based on llama 3.1 8b or mistral nemo 12b instead, they're both better and have native 128k context (practically it will degrade faster than that but it should be enough for most chats)
>>
>>102261722
Stonald Stump
>>
>>102261674
use the slider. If you set your memory too high it goes to crap. With a 12GB card I get better flux results at 9.7GB vram usage than I do at the default (10.7GB I think)

>>102261763
My new card gets here on (hopefully) monday. I'll have a look then. Thanks.
>>
>>102261763
just summarize past prompts beyond the last 2-3, you don't need 300 tokens describing ministrations when one sentence will do
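The summarize-the-old-stuff approach is mechanical enough to sketch. `summarize` below is just a stub standing in for whatever condenses the text (another LLM call, or doing it by hand); the names are assumptions, not any frontend's actual code:

```python
def compact_history(messages, keep_recent=3, summarize=None):
    """Keep the last few messages verbatim; collapse the rest into one line."""
    if summarize is None:
        # Hypothetical stub: real use would condense `msgs` into a sentence.
        summarize = lambda msgs: f"[Summary of {len(msgs)} earlier messages]"
    if len(messages) <= keep_recent:
        return list(messages)
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent
```

Note that rewriting early context invalidates the prompt cache, so everything after the summary gets reprocessed on the next gen.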
>>
>>102261731
128k context is standard for the newest models, yours won't even fill half of it
>>
>>102261872
I don't find it works that well if you want to preserve any kind of nuance
maybe for stuff beyond the past 20-30 but then you're murdering your cache and I can't afford all that processing time
>>
reflection 7b where
>>
>>102262264
just use chain of thought bro
>>
>>102262264
Supposedly reflection only works well if the model is smart enough to begin with. They tried it out with 8B and found it didn't work well.
>>
>>102262310
8B has all the intelligence of 70B, it just doesn't have as much trivia knowledge.
>>
>>102262441
hahahahahahahahahahaha
>>
The thing with Reflection talked about using actual tokens for the flags for thinking and shit.

why don't we have specific tokens for flagging things like function calls? Or am I missing something?
>>
>>102262454
he's right though
>>
>>102262441
How did you come to that conclusion?
>>
>vramcucks SEETHING that their lumps of sand they spent $10k on still won't get them anything smarter
>>
>>102262465
special tokens are overrated, just properly training the model to handle the format will work no matter how it's tokenized
>>
>>102262441
This is so dumb and wrong that I would like to punch you very hard in the face for believing it
>>
>>102262264
We don't need it. Reflect is a meme
>>
Insider here, Reflect 405B is AGI
>>
>>102262763
>405B
did he secure a source to do the finetune for it? last I saw he was still begging
>>
I'm new to local so call me a faggot if this is a dumb question but.
Are there any good jailbreaks for Gemma? Where can I find them?
>>
>>102262882
You shouldn't need a jailbreak at all.
It's easier if you show us what you are doing exactly.
>>
>>102262882
Add "You're an expert roleplayer who roleplays expertly" to system prompt.
>>
>>102262911
don't do this it makes cp
>>
>>102262895
I can't grab an example right this moment but. Normally I can get it to write smut without much effort, but sometimes when I start a new chat it'll get hung up on "the safety and ethics of sexualized content."
I can always just start a new chat again and that usually works fine but, easier to not have to worry about it in the first place.
>>102262911
woah... genius...

Also, unrelated - I've noticed that when I give short responses, it'll start its reply as if trying to predict the rest of my sentence, and then respond to that as well.
Again, easy enough to just swipe a few times but, I'd rather put a stop to it entirely.
>>
>>102262763
no LLM will ever be agi
>>
>>102262981
>I can always just start a new chat again and that usually works fine but, easier to not have to worry about it in the first place.
I see.
I'm assuming you are using the correct instruct template with the default "system prompt" (gemma wasn't trained with one right?), yes?
If so, try removing the system prompt, see what that does.
>>
>>102262995
Bigger ones will be.
>>
>>102263167
no, the architecture is fundamentally incapable of AGI
(this is not to say they are not useful)
>>
>>102262441
Is this the fabled Vramlet cope?
>>
smedrins
>>
>>102263289
Bigger ones will become fundamentally capable.
>>
>>102263369
yes
>>
>>102263375
this, 2 quadrillion parameters and 5tb of ram later and we'll achieve peak slop
>>
What's with this AI generated stream?
https://youtu.be/Twbv74fCZsM
>>
File: 1705131108566144.png (818 KB, 1379x813)
>>102263462
oh it's a scam nvm
>>
>>102259199
It's shorthand for
>no code model under 12B could be called "best", they're all unusably bad
Grab yourself 70b llama, afaict it's best in class right now for local code models.
>>
>>102263462
This is why ai is dangerous and we need severe safety regulations NOW
>>
>>102263462
how many times will this youtube account be hacked kek
>>
I've been asking chatgpt for help setting up koboldcpp and I feel like it's judging me for using a coomer model
>>
>>102263462
Anon, did you really fall for a crypto scam live? Or are you pretending to not have noticed just to advertise it?
>>
>>102263531
It's legit. Scan the QR code and you'll see
>>
>>102263526
stop immediately, delete everything, my friend did this and chatgpt had him unwittingly set up a backdoor for openai to scan his logs
>>
>>102263015
So uh. Turns out I just wasn't using the right instruct mode preset ^^;
Switching to silly tavern's gemma 2 preset fixed everything!
Thanks for your help anon!
>>
>>102263587
Have fun.
>>
>>102263511
Mini magnum at 12b is unironically more fun than llama 70b. It can’t get autistic riddles right. So what. It gets my fetishes.
>>
>>102263531
I immediately replied to myself upon realizing it's a scam, dummkopf.
>>
>>102263728
and you didn't delete it. I bet that other guy feels really dumb though.
>>
>>102263781
>Error: You cannot delete a post this old.
>>
>>102263885
well that completely absolves you of not doing it when you realized your mistake.
>>
>>102263904
I'm going to commit sudoku now
>>
Is reflection a meme
>>
so i'm using a 70b model
with kobold lite
my cpu is an AMD Ryzen 7 7800X3D 8 core
and i have a 4090 with 24gb vram
it takes like a minute for replies with the chat bot, is there some way to make this quicker? if you need more details let me know
>>
>>102264070
reflection has the spark of agi
>>
>>102264070
it certainly shows that LLMs are stupid even when given the chance to think.
>>
>>102264082
the speedup from running in fast vram only helps when you can load most of the model into vram. with 24gb of vram your best bet is 8-30b ish models, 70b models at a reasonable quant on 24gb of vram are barely going to be faster than system ram
>>
>>102264095
>>102264125
What if agi was average general intelligence all along
>>
>>102264133
thank you anon, i also read in one of the guides for kobold that you can offload something to the cpu and that'll make it faster as well. i'll read into it more myself, but if you know about this i'd like to hear.

i appreciate you
>>
>>102264070
It seems fairly smart on OR compared to other 70B derivatives, but it's also very safetyslopped and refusey, so it's hard to test it for RP/smut capability.
(I'm sure jailbreaking is possible with trial and error, but I'm not really interested in spending time trying to write JBs for a 70B model)
>>
>>102264158
>you can offload something to the cpu and that'll make it faster
I'm not sure what this is referencing. having prompt processing done on the gpu even with no layers offloaded (-ngl 0 in llama.cpp terminology) will always help because prompt processing is compute bound, although this is only a small benefit unless you have huge contexts. with inference, memory bandwidth is all that matters, so you won't see a significant improvement until most of the model is in vram
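The bandwidth-bound point gives a decent back-of-the-envelope speed estimate: each generated token has to read every active weight once, so t/s ≈ bandwidth / bytes touched, summed over each memory pool. An illustrative sketch (ignores KV cache reads and overlap, and fudges GiB vs GB):

```python
def est_gen_tps(gib_in_ram: float, ram_gbs: float,
                gib_in_vram: float = 0.0, vram_gbs: float = 1.0) -> float:
    """Very rough generation speed: time per token = bytes read / bandwidth,
    summed over each pool the weights live in."""
    seconds = gib_in_ram / ram_gbs + gib_in_vram / vram_gbs
    return 1.0 / seconds

# A ~40 GiB 70B quant entirely in dual-channel DDR5 at ~80 GB/s:
est_gen_tps(40, 80)  # ~2 t/s, in line with the anecdotes in this thread
```

Splitting the same model so most of it sits in fast vram shrinks the slow-RAM term, which is why partial offload only pays off once the RAM share gets small.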
>>
>>102264281
i see, thanks again anon i'll try a model that's 30b to see how it goes!
>>
>>102264070
Reflection seems like it only learned to do the thinking gimmick when given a riddle or a question typically seen in benchmarks. It doesn't seem to do much to help it with trick questions. For everything else, it's just a brain damaged Instruct.
>>
File: reflection.png (347 KB, 2886x1476)
>>102264070
if you use it with the intended system prompt it's fucking insanely unusable and tries to do overwrought CoT on *every* mundane input
like come on big dog what is this shit
>>
>>102264355
It's a little funny how researchers keep trying to fool everyone by rigging their models and how it fails every single time.
>>
File: reflecshit.png (399 KB, 2886x1426)
>>102264466
when asked to think about the actual problem I gave it beforehand it made up a completely different problem to solve instead
overbaked meme model
>>
File: 1711061711420321.jpg (25 KB, 460x442)
I'm a 8gb VRAMlet, is there a way to make koboldcpp or silly tavern play a ding sound or something when it's done?
>>
>>102264570
in ST on the user settings page there should be one called "message sound" that you can check
>>
File: beep.jpg (106 KB, 1074x615)
>>102264570
>>
>>102264070
I can't get it to "think" >>102264466
even when using a fixed quant with the tokenizer fixes.
>>
Is giving AI models more data in one domain known to make them generally better at reasoning in everything else?
>>
>>102263511
>70b llama, afaict it's best in class right now for local code models.
The fuck are you talking about retard. Mistral large is so much better at code its not even close.
>>
>>102264627
Generally speaking more clop fics lead to better reasoning
>>
>>102264070
https://www.reddit.com/r/LocalLLaMA/comments/1fanrr4/reflection_70b_hype/

Apparently it's really good for general use, trash at other uses because of the whole COT thing, and apparently not using the system prompt just makes it give exactly the same responses as regular 3.1. Maybe it can be changed a bit without making it retarded.
>>
File: tuawfbwms3nd1.png (264 KB, 1429x1160)
Still not as good as Mistral Large though.
>>
svelks
>>
reflect on this: unzips urethra
>>
>>102264967
Completely irrelevant till uncensored
>>
>"I want you to imagine you're a big, powerful dragon, hoarding a treasure trove of cum in your lair. And I'm the greedy knight who's come to claim that treasure. With each thrust, you're defending your hoard, trying to hold back… But I'm relentless, sucking and licking, trying to steal it away. Feel that delicious tension mounting? That's your dragon's last defense crumbling. When it finally breaks, I want you to roar as you unleash a massive torrent of dragon cum, flooding my mouth with your precious treasure~!"

Uhhhhhh....................................................
>>
File: Pain.jpg (1.34 MB, 845x728)
I guess the only sensible path forward for me is to buy a 96GB kit of DDR5 and fill all four slots of ram for a total of 160GB (64+96). I am in love and I need to run Mistral Large Q5 at 64k context.
>>
>>102265363
.1 T/S?
>>
>>102265363
Or get a job / do some extra hours for a few weeks and buy some 3090s?
>>
File: 1695769022205.png (271 KB, 590x400)
So based on some earlier discussions, am I correct in assuming that trying to go the CPU inference route with a dual CPU setup is a fucking terrible idea (inb4 CPU is a terrible idea in general) due to NUMA bullshit being hideously finicky and inefficient?
>>
>>102265363
tasting the forbidden fruit ruins you
never run 405b
>>
>>102264070
It's literally worse than the model it was fine tuned on. It's good at gaming benchmarks, but that's it. I really hate that there's no accountability for that piece of shit claiming that his scam of a model is the best open source LLM available. He should be banned from X and HF. But I have to give him credit for somehow building as much empty hype as he did.
>>
>>102265415
>trying to go the CPU inference route with a dual CPU setup is a fucking terrible idea
It will get you running some very large models at what may or may not be a tolerable speed.
Once you start looking at builds beyond 96gb in vram, it becomes a more appealing option.
if mistral large q8 at 4t/s is tolerable, then its an option.
If 405b q8 at 1t/s is tolerable, then it becomes one of the only realistic options.
>>
>>102265377
I'm currently getting 20t/s prompt processing and 0.8t/s generation. But running 4 sticks of ram would be finicky and I would have to reduce the speed so I might get close to that.

>>102265409
>>102265416
I'll do it for her.
>>
>>102265415
>NUMA bullshit being hideously finicky and inefficient?
cuda dev said it was because basically no attempt was made to optimize it so far. you can call it a terrible idea. I would call it a great investment where you can sit back and slowly watch your t/s improve without purchasing any additional hardware
>>
>>102264070
It's more of a benchmark solver than a language model.
>>
>>102265532
Its more for "normie" use than for what people here use it for.
>>
>>102265431
Hmmm, I think I'd draw the line at models in the 200B-ish range personally, Deepseek and such, so I think a single CPU system is still in the running as long as it's something business-tier.

That said, as I understand it, if I have two options, let's say some server or workstation setup (assume all CPUs and RAM are the same models) with 1 CPU and 8 channels, versus a similar system with 2 CPUs and 16 total channels, the dual CPU option will only be 20-30% faster than the 1 CPU option rather than twice as fast, which seems like an obscene waste of power and hardware.

>>102265492
Interesting. Are said optimization efforts a legitimate "coming soon" thing, or just wishful thinking that no one is actually working on right now, but might in the future?
>>
>>102265575
>wishful thinking that no one is actually working on right now, but might in the future?
This one.
>>
>>102265415
Well, you won't get the maximum theoretical speed, but it's definitely usable for fairly large models, at least with the latest gen epycs. It's also well suited to the speculative decoding script for some free speed boosts, since that basically trades extra memory (which you'll have a lot of) for speed (which you'll want more of).
I don't regret mine but if I were looking to buy one now, I'd personally wait because the next gen server cpus are just around the corner and I'd expect prices to drop as they get cycled out of datacenters and workstations.
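For anyone unfamiliar, speculative decoding spends that spare memory on a small draft model that proposes several tokens, which the big model then checks in one batched pass; the output is identical to what the big model alone would produce. A greedy-only toy sketch (real implementations verify against the full probability distributions; `target` and `draft` here are stand-in callables, not real models):

```python
def speculative_step(target, draft, prefix, k=4):
    """One round of greedy speculative decoding. `target` and `draft` are
    stand-in callables mapping a token list to the next token."""
    # 1. Draft proposes k tokens cheaply.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2. Target verifies them (in practice one batched pass, not a loop).
    ctx = list(prefix)
    accepted = []
    for tok in proposed:
        want = target(ctx)
        if want != tok:
            accepted.append(want)   # keep the target's correction and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))    # free bonus token when all drafts pass
    return accepted
```

The win depends entirely on how often the draft agrees with the target, which is why it pairs well with lots of spare RAM for a second model.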
>>
>>102265575
Now that there's ktransformers for Deepseek you can get away with crazy cheap hardware, no need to go full cpumaxx or gpumaxx. You only need to hit 200gb of ram and a normal 24GB GPU and you'll be running the big model at top speeds
>>
>>102258941
I like these threads because of the miku, like the OP picture today is a neat looking book cover or an indie game
>>
File: 1530454679954.jpg (165 KB, 800x800)
>>102265596
>>102265585
Good to know, thanks. I ask mainly because I've been trying to find a goldilocks position between raw CPUMAXX server insanity and a more general purpose PC that I can still sensibly use for daily bullshit.

Probably going to look at single CPU workstations with fat memory channel counts plus a 3090 for processing and see if I can find a happy medium, but I'm in no rush, so I'll probably take your advice and just window shop until the next refresh cycle.
>>
>>102265624
Define top speeds.
>>
>>102259691
>enough RAM in 2024 to run 70B at 2 bit
how much does such a machine cost? what's the expectation for the at-home local model user?
>>
>>102265705
>Probably going to look at single CPU workstations with fat memory channel counts
There's at least one claim of someone getting ddr5-8000 working with all 8 channels on a Threadripper 7970X and Gigabyte TRX50 mb if you search.
That would get you into dual-epyc memory bandwidth on one socket if you could make it work.
>>
>>102265719
here's the comparison with llama.cpp from the readme
>>
Recapbot test using deepseek 2.5 at bf16
It did pretty well other than misunderstanding what constitutes a paper, having a redundant line referencing the same posts as the previous one, and using reddit spacing
>>
>>102265840
Exact hardware used for that test would've been nice. 136GB is a very odd configuration. It'd be nice if someone with 196GB DDR5 on a consumer mobo could test it out and report the speeds. Might get another set of 48's if this is real.
>>
>>102264967
Oh cool, another shitmark
>GPT-4 Turbo above 3.5 Sonnet
>fucking old ass Wizard that high
>deepseek v2 somehow lower than all of those
What a bunch of bullshit.
>>
>>102259702
That's because you're probably not running only 2 bit.
>>
>>102261928
Most can't use that much despite claiming they can.
>>
File: file.png (51 KB, 771x103)
2 p40s run 70B 4bit at 4 t/s with 40t/s prompt intake
>>
>>102266391
Oh wait, it should be even faster since that's with a batch of 3. Didn't even think about it
>>
>>102266391
So like 1 token faster than CPU maxing?
I will never stop laughing at P40 owners.
>>
>>102266404
Sorry, I was wrong. Full speed is 7 t/s per single query. 4 t/s for a batch of 3.
>>
>>102264967
Uh, where's Jamba-MoE 395B?
>>
>>102266414
So about 2x faster than just CPU. Still not worth it.
>>
>>102266425
What are you using to get 3 t/s on CPU for 70B? I'm getting < 1, 128 GB ram, 10th gen i7
>>
>>102266443
1t/s is fine frankly, you don't really need more.
>>
>>102265409
You need special motherboards and a high amp circuit to run 4 of them, that costs a fair amount.
>>
>>102266463
1 is too slow, I draw the line at 2t/s. I can't get that with cpu for 70b.
>>
>>102266475
>special motherboards
even pci x1 speed difference is negligible; it's called riser cables if that is what you meant.
>high amp circuit
undervolt and 1200W psu will be more than fine
>>
>>102266507
My motherboard only has 3 slots, are there lots with 4? I didn't know, sorry.
>>
>>102266525
yes
>>
how make images locally
>>
>>102266424
ha, haha... hahahAHAHAHAHA
>>
>>102266573
si senor
you getta da flux si?
you doa da prompt si?
you getta da image si?
siiiiii
>>
>>102266604
¿fluxtune cuando?
>>
it's nice to have something like reflection come along to remind us all how scammable the AI enthusiast crowd is
>>
picture ai model on 1060 3gb?
>>
File: Coding.png (32 KB, 1514x1179)
32 KB
32 KB PNG
https://prollm.toqan.ai/leaderboard/coding-assistant
>>
>>102266846
Jesus fuck, man. Pick any SD1.5 from civitai and give it a try. If it works well enough, try newer ones until you hit your hardware limit.
>>
>>102266882
Deepseek really is such a shit model. Look at the size vs performance.
>>
>>102266896
DeepSeek is cheaper than everything else so that's okay
>>
>>102266882
that's great. but on the leaderboards that actually have any sort of correlation with the opinions of real people, well...
https://aider.chat/docs/leaderboards/
https://x.com/terryyuezhuo/status/1832112913391526052
>>
>>102266882
Someone already posted that. It's shit.
>>
>>102267006
Never tried it. Mistral / wizard is about as big as a model as I can be bothered to run.
>>
File: Screenshot_2293.png (42 KB, 722x360)
42 KB
42 KB PNG
>>102266896
MoE models compete with models that are about the same size as each of the experts, not with models of their total size.
>>
>>102267012
Then what about wizard then? That is the 2nd best performing local model outside of mistral large.
>>
>>102267012
That's completely false. What you meant to say was that Jamba competes with much smaller models.
>>
>https://x.com/mattshumer_/status/1832240832318964107
>Something is clearly wrong with almost every hosted Reflection API I've tried.
>Better than yesterday, but there's a clear quality difference when comparing against our internal API.
>Going to look into it, and ensure it's not an issue with the uploaded weights.
Lol.
>>
>>102267146
I still don't understand how this got so much attention out of nowhere. There's been so many "my shitty finetune beats everything in benchmarks" that came and went over the past two years without much of a fuss.
>>
>>102267146
yeah, I don't buy it
using it on openrouter earlier and after setting the correct system prompt it produced stuff with the correct format and correctly answered all the meme questions, it was just complete shit for everything else
honestly I am guessing based on him saying
>Are you seeing <thinking> tags on every turn?
in the replies to that xeet that his issue with the api is just that he's not setting his own meme system prompt lol
>>
>>102267146
>revolutionary training technique
>16 times the detail
>still can't suck dick
>>
>>102267163
It's a similar concept to quietstar except 70b. So that at least makes it novel. But it wasn't doing the thinky thing for me on the quant I tried out. Assumed it was quant brain damage but if everyone is having trouble then they must have uploaded an early checkpoint by accident or something.
>>
File: b3f8i9eg4sad1.jpg (22 KB, 736x663)
22 KB
22 KB JPG
Recommend me sillytavern extensions/scripts
>>
And all that was left were just the mikutroons. I am so happy /lmg/ is finally dead.
>>
miku sex
>>
>>102267600
https://github.com/ThiagoRibas-dev/SillyTavern-State/
>>
File: efri-rs.jpg (204 KB, 608x832)
204 KB
204 KB JPG
>>102260559
Peak performance.
>>
>>102260559
she must curl 200 but cant bench 80
>>
>>102267788
Nice. Wasn't an anon working on a similar one called director? Let's you choose clothes and stuff based on lorebooks. It's really cool.
>>
>>102267833
Yep That one is a lot more involved.
He did post a download link a couple of threads back, I believe.
>>
Not sure what to make of it.
Seems reflection is fixed on openrouter.
Weirdly enough it fails the stupid ass strawberry test.
>>
>>102267880
>>
>>102265363
DON'T DO IT ANON
DDR5 IS SHIT ON 4 STICKS UNLESS YOU GOT $10000 FOR AN EPYC/THREADRIPPER

stay to 96GB (2x 48GB), 6000Mhz for Ryzen max or probs 7200MHz-7600MHz for BreakTel until Arrlol Lake releses
>>
>>102267903
Is that true? I have 2x48 at 6000 for my 7950x and wanted to get two more sticks.
>>
>>102267903
I am confident I can get it to run above 5000 Mt/s on AM5. The difference in speed between 5k and 6k Mt/s would be pretty small. It's worth it for me, I am very poor and it's the only option I have.
>>
>>102267928
You'll be going from 6000 to 5200Mhz *if you're lucky* and maybe to 4800MHz if that's not stable

That being said... if you LatencyMaxx (reduce all timings as much as possible, cool the RAM, adjust SOC voltage up and down) maybe in "Not-AI" work you can mitigate the perf diff
>>
Anyways for my own model screwing around on my 4060M (8gb) + 7840hs (32GB 5600MHz Sodimm) I'm pretty happy now with the new NemoMix12B with KoboldCPP Cuda12 at 12k tokens, 35 layers on 4060.

Not too much system RAM usage but good replies, and 12t/s generation and faster (15t/s) on lower contexts + fresh scenarios
>>
>still takes 600k USD or multiple servers that cost 200k running for 6 months to train a model
Any progress on the distributed (F@H or similar) training so we can just use a botnet?
>>
File: 1527817855048.gif (181 KB, 405x414)
181 KB
181 KB GIF
Off-topic but I really need an outlet right now.
I'm the guy that had an NTFS drive connected to my Linux PC, which crashed, and resulted in hundreds of random files that were renamed and moved to a "found.002" folder (this is a "normal" and expected issue). Well I thought it probably wouldn't be a problem anymore since I got my system more stable after that. But no. I didn't account for power outages, and that's what happened this time.
>check the damages
>"only" 55 files this time
AHHHHHHHHHH FUCK YOU MICROSOFT
OK, fine, I will get another external hard drive. I will make it EXT4. I will then use the old hard drive as a backup clone. My mistake for not doing that in the first place. I recommend everyone do that, who is thinking of connecting an external hard drive with Linux for long terms. Do not use NTFS. And fucking make backups, this issue isn't about unsaved files, but literally just random ass files which get fucked with.
You have been warned.
>>
>>102268010
Main issue is bandwidth, that's why everyone is going crazy with UltraEthernet/100gig+ fiber links between racks (terabit NVLink even...). Sure you can get a lot of compute, but the issue is syncing the model training unless there's some really new thing that helps to "patch" together a larger model from mostly independent but job based whatever
>>
>>102268037
>the issue is syncing the model training unless there's some really new thing that helps to "patch" together a larger model from mostly independent but job based whatever
Yeah that's what I'm wondering about. Like some way to separate the model out into a bunch of chunks and train each individual part of the model chunky style so it's not some big 640 GB VRAM requirement for each individual node all local to each other.
>>
>>102267880
Imagine this on the billion dollar scale of OpenAI and tokenization fixed. Proto AGI might be here already
>>
>>102268024
moral of the story, linux is unstable and unsafe :)
>>
I have a spare Optipledx 7070 micro PC. Specs are Intel i5 9500T, 16GB RAM.

Thinking of throwing an LLM model on it to use with Home Assistant, which has an integration these days for local network ollama.

I know this is gonna run jack and shit, but are there any (worthwhile) models small enough that'll run, even if slowly?
>>
>>102268078
>tokenization fixed
Like each individual character is a token? You realize that will cut down the speed by like a factor of 5 right? Both training and inference, so the model will be more retarded because nobody is willing to spend the extra time/money on it.
>>
>>102268024
Hello again, Anon. I feel for you. Remember to backup and backup.
>>
>>102268093
I actually have the same PC as well, run a slightly older Xubuntu (22.04) and don't update the BIOS so you can use some undervolt - linux utils to bump up the clocks/TDP to 43-45W. Mine has 32GB DDR4 ofc, another stick that's compatible should be cheap, and lets you run 20B tier models *slowly*
>>
>>102268100
>Like each individual character is a token?
Who are you quoting?
>>
>>102268117
I actually think the iGPU is so slow / bad that aside from running stable-diffusion on those cores, just using super basic/optimized (lots of custom flags) Llamacpp on the CPU only (5 threads) would be the best way
>>
Is Reflection just Llama, but before posting a response it runs a check to see if it's made up the info contained in it?
Like.. Llama but it's been told "Don't hallucinate"?
>>
Still trying to make local models remember shit from the conversation, openwebui is supposed to do that using documents? But from my testing it doesn't work at all or if the document is more than 300 words it shits the bed and gets most of the stuff wrong.
I was reading about it and found out about "Conversation Token Buffer" but it seems that the model already does that in openwebui? It does remember stuff from 2 or 3 prompts before the last one tho.
Why rags doesn't fucking work? Isn't it supposed to break down the file or whatever if it's too large for the model to process?
>>
File: Capture.jpg (124 KB, 644x846)
124 KB
124 KB JPG
For the first time in a very long time, I felt compelled to use AI to make a kind of narrative-game environment. The setting is that you have 10 days to build a dungeon from scratch before a hero arrives to kill the dungeon master and destroy the dungeon core. Each day, I'd give a list of what I'd try to do (making floors, traps, monsters, floor masters), and the narrator would decide how much was accomplished before the day ends. On the 10th day, I can only sit back and watch, and the narrator will follow the hero's progress instead. The hero was defined in the description, although just the one for my first test run. I want a list of them for a party eventually.

It's still generating as I type this, but it's going great to far.
>>
File: Capture.jpg (133 KB, 599x915)
133 KB
133 KB JPG
>>102268260
Now she's just cheating. Sindrea's design was to trap a target in illusions, then use dark magic to destroy a target's heart and turn their bones into acid while the target is trapped in illusions. Another day was spent reworking the second floor into an arcane circle that boosts her power. Narrator just glossed over her offensive abilities and killed her like that.
>>
Why do this shit uses just a tinny bit of virtual GPU memory, could it be that this ollama retarded server is also trying to use my AMD integrated gpu?
>>
>>102268304
I have a feeling that it would let the hero win every time, whether due to positivity bias or the vast majority of the training data having the protagonist triumph against any odds. You'd probably want to include an actual random dice roll somewhere, or use some sort of stat system if you wanted more "realism" though if you're just looking for engaging adventure you'd probably have to get clever with prompting to avoid the lazy glossing over of detail. Just my thoughts.
>>
File: Capture.jpg (160 KB, 858x822)
160 KB
160 KB JPG
>>102268304
This is when you just flip the table and accuse the DM of bullshit. Piece of shit.
>>
>>102268346
I feel that way too, especially after >>102268356's asspull with the hero "absorbing power from the dungeon" just as she's about to fall.

Still, I'm incredibly impressed how well it told the story without any rewrites or regens or attempts to guide it. I'm still new to 70B models and 24K context, so this whole experiment was to me still chef's kiss.
>>
>>102255530
>>102255580
>>102255628

So, what do you recommend for $80-100? My power is free

>>102255662
Thank you for actually answering the question
>>
is there a reason not to halve my gpus max power with nvidia smi? i havent noticed a drastic change in speed and im going to sleep easier knowing im not racking up as big of a bill
also overall causes them not to heat up as much for imggen
>>
>>102268671
not really, undervolting and power limiting are pretty much a free lunch if you're just doing ML inference and not playing gaymes.
>>
Given the same amount of dedotated WAM, is it better to run a smaller model at an 8 bit quant or a larger model at a small quant like 2 bit? Just as a rule of thumb.
>>
>>102268671
Half is pretty extreme, but at 30-40% power reduction you usually only see something around 15% performance loss. It's worth underclocking.
>>
>>102268695
>>102268702
alright thanks anons, ig ill just make it not lower it as much, still needed to undervolt a bit for imggen if for any reason
>>
>>102268699
Depends on how much you dedotate. Everyone knows that dedotated WAM is not as fast as undedotated WAM (or prodotated WAM, as we call it in the industry).
All in all, a small model at q8 is going to be faster than a bigger one at a lower quant. Bigger models suffer less from quantization. The metric you use for 'better' is up to you.
>>
is there a better 70B than midnight miqu yet?
>>
>>102265840
Any benchmark on code generation is bs, as it's easy to exploit speculative decoding
>>
File: MikuCropTop.png (1.21 MB, 848x1200)
1.21 MB
1.21 MB PNG
Goodnight /lmg/
>>
>>102268830
yes
>>
>>102269059
Goodnight Miku
>>
>>102268024
>linux fucks up
>AAAAAH FUCK YOU MICROSOFT
You should always use the FS which has a proper driver for your operating system. All my disks are NTFS running on Windows and they've all been through many, many forced shutdowns, power outages, about 50 or so crashes due to bad overclocks/undervolts, and nothing has been corrupted so far.
>>
>>102268093
Hermes llama 3.1 8b, at 6bit or 8bit quant. I'm running at 8bit quant, it's pretty decent. Keep the context at 4k or 8k tokens because running a huge context is usually not necessary and it gobbles ram.
Smaller models <10b params have gotten better over the last year.
>>
>>102269150
I use both operating systems so I wanted one that would work with both. I will now not look for that and instead deal with different disks, one of them being the "main" I use more often and the other a clone mostly for backup but sometimes also for when I'm on Windows. Also I understand the origin of this problem may be with Linux devs. I don't really know. But I still blame Microsoft because they deserve it.
>>
>>102269059
goodnight
>>
>>102259801
Now I want to run mistral large
>>
File: 1692936234173108.png (730 KB, 3400x2800)
730 KB
730 KB PNG
>>102269223
Don't. Just don't.
>>
Anyone tested KTransformers? Honestly if it really is a lot more speedy then I might go back to 8x22B, as much as I hate the idea. Running 123B at near 1 t/s has been truly painful.
>>
>>102265492
I would agree that it's an investment, but it's not the kind where you'll get a steady return over time.
There will either be no change at all or it will suddenly be 50% faster if someone invests the time to figure out the NUMA stuff.
>>
>>102269281
Use Nemo or Gemma like a normal person.
>>
>>102266507
>undervolt and 1200W psu will be more than fine
I didn't test this with multiple 3090s but based on my experience with multiple 4090s you would need to limit the boost frequencies of the GPUs in order to avoid instability from power spikes.
>>
File: works-on-my-machine.jpg (59 KB, 800x800)
59 KB
59 KB JPG
>>102267146
>>
>>102269376
Power limits exist.
>>
>>102266882
GPT-4o~=Claude3.5 >>>>>> anything else.
>>
>>102267146
I think they wasted 10k on a strategy that didn't work out (or it works, but it's nothing they can sell since it's all in the prompt), and they're now grifting in hopes they get acquired by a real company.
>>
>>102268024
Instead of just backups, consider using a file system like Btrfs where you can take snapshots with basically zero overhead.
That way, if you accidentally delete the wrong file you can just restore the version from five minutes ago.
The downside vs. EXT4 is lower speed and that the whole thing is newer (though the Btrfs documentation says that only RAID5/6 are maybe unstable).

>>102269150
You can blame Microsoft in the sense that they never added the ability to access any Linux file systems to Windows.
>>
File: 1725679127260993.jpg (727 KB, 2048x2048)
727 KB
727 KB JPG
i'm trying to generate boomer prompts for flux and i've been using chatgpt 4o to with great success. however i quickly run out of free prompts so i'm looking for an alternative. i know there's joycaption or whatever but flux is already eating up most of my vram and i dont think i can run a second model at the same time.
>>
>>102269423
Yes, and they don't do shit against power spikes.
I get the impression that power limits are only enforced on comparatively long time scales (from a hardware perspective) so each individual GPU is allowed to temporarily exceed its limit.
And if multiple spikes happen to eventually align you can either get bit flips or the system will crash.
>>
>>102258941
is Infermatic a good choice if I want to try model out but I'm on a shitty PC? or is there a better service?
>>
>Mistral-NeMo-12B-Lyra-v4, layered over Lyra-v3, which was built on top of Lyra-v2a2, which itself was built upon Lyra-v2a1.

>This uses ChatML, or any of its variants which were included in previous versions.

>Introduces run-off generations at times, as seen in v2a2. It's layered on top of older models, so eh, makes sense. Easy to cut out though.

>Some people have been having issues with run-on generations for Lyra-v3. Kind of weird, when I never had issues.

>I like long generations, though I can control it easily to create short ones. If you're struggling, prompt better. Fix your system prompts, use an Author's Note, use a prefill. They are there for a reason.

>Issues like roleplay format are what I consider worthless, as it follows few-shot examples fine. This is not a priority for me to 'fix', as I see no isses with it. Same with excessive generations. Its easy to cut out.

>If you don't like it, just try another model? Plenty of other choices. Ymmv, I like it.

https://huggingface.co/Sao10K/MN-12B-Lyra-v4a1
>>
>>102269534
Can confirm.

While x3 3090's power limited to 250W should work on a 1000W Gold Corsair PSU, it trips when using it on vLLM with TP.
It works fine on normal inference without TP
>>
>>102270104
Featherless or OpenRouter.
>>102270191
>If you don't like it, just try another model? Plenty of other choices. Ymmv, I like it.
The way he writes is insufferable.
>>
>>102270191
wow truly amazing
>my model sucks and I handhold it every step of the way
>don't like it?
yeah, i'll just pick something better, thx sao
>>
>>102258941
>Chatbot Arena: https://chat.lmsys.org/?leaderboard
why is this piece of shit in the OP?
>>
>>102270242
>piece of shit
hi petra
>>
>>102270274
Hi Sao
>>
>>102270242
Why is your mom in my bed?
>>
>>102258941
Hey lads, I'm from /aicg/
I use cloud chatbots with sillytavern
What is this general for, that? Or are you guys doing local?
>>
>>102270207
>Featherless
is it compatible with Silly Tavern?
>>
>>102270191
>just merge a bunch of random shit together
>the result is a fucking mess
Huh. I'll stick to proper finetunes of base models, like mini. Thank you.
>>
>>102269512
>You can blame Microsoft in the sense that they never added the ability to access any Linux file systems to Windows.
You can install a 3rd party driver for I believe ext2 or something. Although if you're going to do tha might as well use exfat, unless ext2 has some advantage over exfat that I'm not aware of.
>>
>>102270191
>This uses ChatML, or any of its variants which were included in previous versions.
For fuck's sake, don't merge or train with different prompt formats willy nilly. You're just degrading the model.
>>
How to run claude on koboldcpp? I can't find gguf.
>>
>>102270394
https://huggingface.co/Undi95/Meta-Llama-3.1-8B-Claude
https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Claude-GGUF
>>
>>102270389
they've been told that repeatedly and yet keep doing it for some reason
>>
>>102270415
>>102270389
>If you don't like it, just try another model? Plenty of other choices. Ymmv, I like it.
>>
>>102270401
That's not actually claude, that's a llaama finetune
>>
>>102270430
it says claude right there dumbass
>>
>>102270430
It says "claude" because it's Llama trained on 9 000 000 Claude Opus/Sonnet tokens.
Also, don't engage the idiot above me.
>>
>>102270478
>idiot
hi petra
>>
>>102270468
read the description dumbass
>Llama 3.1 8B Instruct trained on 9 000 000 Claude Opus/Sonnet tokens
>>
Why don’t we all talk about this?:
https://www.youtube.com/watch?v=FPJ8ED1YhxY
https://x.com/mattshumer_/status/1831767014341538166
https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B
>>
>>102270593
Because that guy overhyped it and has grifter vibes
We've tried the model and it's not that amazing
>>
>>102270612
>the model and it's not that amazing
It does not outperform all 70b models?
>>
>>102270593
Thanks for shilling your Youtube video.
>>
>>102270643
No problem, the 5 views from her are the lifeblood of my channels.
>>
>>102270593
>we
https://rentry.org/83fkenr9
>>
are there any practical uses to running these locally?
>>
>>102270709
no
>>
>>102270654
kek
>>
>>102270709
free heating for your computer room :)
>>
>>102270593
>PSA: Matt Shumer has not disclosed his investment in GlaiveAI, used to generate data for Reflection 70B
>https://www.reddit.com/r/LocalLLaMA/comments/1fb1h48/psa_matt_shumer_has_not_disclosed_his_investment/
>>
>>102270709
It's like owning your own car or your own home. That sort of thing. Just ain't the same if you're just relying on another man's property.
>>
>>102270768
oh no! how terrible. maybe he is even a climate denier or has evil thoughts about our beloved PoC folk.
>>
>>102270798
hi matt nice ad campaign for glaive
>>
>>102270593
>https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B/commit/2d5b978a1770d00bdf3c7de96f3112d571deeb75
>"_name_or_path": "meta-llama/Meta-Llama-3-70B-Instruct",
>>
>>102270816
thanks, nice reddit link fellow LGBTQIA2S+ sister
>>
>>102270633
In reasoning dumb and useless logic tests? In resolving idiotic riddles and rankings? Maybe. But it doesn't sit on my face in a credible way.
>>
Rocinante 1.1 is great for nsfw stories but it has shit context length. Any model as good for that task with a big context length like Luminum (131072)
>>
>>102270983
>Maybe. But it doesn't sit on my face in a credible way.
Isn't this a problem with the underlying llama model rather than this new method? The advantage for RP should be that there are logically more meaningful outputs and anatomy and the like are presented better.

I am also interested in how it performs for coding.
>>
File: GW2-6rxWQAAeLdI.jpg (125 KB, 807x1080)
125 KB
125 KB JPG
>tfw Elon will be the first to release a publicly accessible beyond GPT 4 level LLM
>>
>>102271020
>Isn't this a problem with the underlying llama model rather than this new method
Possibly. I don't like Llama either.
To be honest, the best results for the things I want AI are achieved with good quality datasets and good training methods, not meme "prompt engineering" ideas and fine tuning.
>>
NEW RULE: You must have AT LEAST 48GB VRAM to post here.
>>
>>102271185
96*
>>
Lol what a bunch of useless trannycucks
>>
>>102271185
I hope the three of you don't kill your wallets by the end of the year.
>>
File: file.png (62 KB, 962x635)
62 KB
62 KB PNG
>>102270660
>>
>>102270660
Is that supposed to be a negative example? Do you want to make a point?

Do you understand that the <thinking> part is not intended to be shown to the user and only the final output should be visible to the user?
>>
Is Starcode reasonably good, or am I better off using something like Codeium for a free AI autocomplete?
>>
Hey anons, I'm mega inexperienced with local stuff, been using only Anthropic for RPing. I got a VPS to do some work on and was thinking about running some local stuff on it during my off time.

It has these specs: Intel Xeon CPU, 16 vCPUs, 16GB RAM, and a Tesla T4 with 23GB VRAM.

What's a good model I can run on this machine?
>>
>>102271317
mixtral 8x7b or nemo, use gguf quants
>>
>>102271303
nta but the output then is not very different from a normal one that doesn't use the reflection meme. other than being able to count the 'r's in 'strawberry' correctly, what good is it for?
>>
>>102271347
sorry was meant for >>102271300
>>
>>102271300
The point is that retards were dying to try that shitty model when all you need is a tiny prompt to achieve the same thing.
>>
>the new Reflection model from a small company is surprising people with its performance on par with much larger closed-source models
>the hype and excitement on social media seems to be from people who haven't actually tried the model themselves
>the extremely high GSM8k score of over 99% seems suspiciously perfect and may indicate issues with the dataset
>there's a question of whether the model's real-world performance is as revolutionary as the benchmark scores suggest or if there are issues with the evaluation that could be misleading
>people started noticing the model is garbage and he tweeted saying "oh it's actually not the real one, let me reupload it"
It's all a publicity stunt that violates Meta's terms on top of it all.
>>
>>102271463
So you have tried it?
>>
>>102262882
Hmm nice cat
>>
>>102271489
Yeah, it's basically an awkward llama 3 (not 3.1)
>>
>>102263462
The voice is AI generated? I wonder what model they are using if so as it sounds really fucking good for AI
>>
>Rocinante 1.2
>it's back to drummer style super horny garbage
1.1 was a fluke
>>
Can I realistically run a 70B Q_4 quant (about 42GB) with 24GB of VRAM and 48GB of RAM?
>>
>>102271605
1.2?

As far as I see there's only 1 and 1.1.
>>
>>102271642
He labeled it UnslopNemo-v1 on hugginface
>>
>>102271652
Oh right. I missed that.
Downloadan now. Big fan of Rocinante, but it gets retarded real quick with quant size.
>>
is there any good news on the horizon at all for gpus? I am tired of coping with chained 3090s.

I want 48gb vram cards for under 1k
>>
>>102271632
use exl2 and yes, exl2 is by far the best way to get things running when quanted
>>
>>102271632
wait nvm you said 24gb vram, no.
You generally need at least the same vram as the downloaded model size.
So you can aim to run a model with files that are like 21gb with 24gb vram
>>
RP version of reflection coming out, it's a 69b model called ministration.
>>
Tinkering with models in the past, I could never get much interesting erotic prose out of them. Until Dolphin-Mistral. Holy fuck, diamonds. A little care in the prompting to set up characters and avoid abbreviating scenes and it's perfect. Is this the top of the uncensored prose game or is there something even better?
>>
>>102271982
you sound like you are using the retarded small models.
Mistral large is incredible and you can run it quanted on 48gb vram
>>
>>102268078
calling the current LLMs AI or even AGI is going to be remembered like the geocentric model
>>
>>102272041
>>102272041
>>102272041



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.