/g/ - Technology


/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>102049023 & >>102036232

►News
>(08/22) Jamba 1.5: 52B & 398B MoE: https://hf.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251
>(08/20) Microsoft's Phi-3.5 released: mini+MoE+vision: https://hf.co/microsoft/Phi-3.5-MoE-instruct
>(08/16) MiniCPM-V-2.6 support merged: https://github.com/ggerganov/llama.cpp/pull/8967
>(08/15) Hermes 3 released, full finetunes of Llama 3.1 base models: https://hf.co/collections/NousResearch/hermes-3-66bd6c01399b14b08fe335ea
>(08/12) Falcon Mamba 7B model from TII UAE: https://hf.co/tiiuae/falcon-mamba-7b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://hf.co/spaces/mike-ravkine/can-ai-code-results

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>102049023

--Q4 model capabilities and limitations discussed: >>102049767 >>102049830 >>102049845 >>102049859 >>102049892 >>102049941 >>102049991 >>102049995
--Planning a collaborative storytelling/RP session with AI models: >>102049428 >>102049969 >>102050021
--GGML tensor conversion and type casting: >>102053861 >>102053954 >>102054117 >>102055161
--Anon finds NovelCrafter and shares offline version: >>102055930 >>102055977 >>102055998 >>102056259
--InternVL2's image understanding capabilities debated: >>102054440 >>102054459 >>102054478 >>102054603
--Used 3090 recommended for 8B models: >>102053330 >>102053331 >>102053595 >>102054210 >>102054098 >>102054114 >>102056386 >>102056454 >>102056646
--Tips for improving Jamba 1.5 Mini chatbot's story progression and output length: >>102049810 >>102049833
--Stable-Diffusion.cpp now supports Flux, with reported 2.5x speedup on Vulkan: >>102056617 >>102056880
--Open source models are not being heavily censored, unlike proprietary ones: >>102051980 >>102053284 >>102053310 >>102053490
--No hype for llama4: >>102057438 >>102057471 >>102057474 >>102057532 >>102057560
--Llama 3.1 supports function calling, but users aren't utilizing it: >>102049113 >>102049129 >>102049233 >>102053085
--Grok and Chatbot Arena leaderboard: >>102053978
--Anon tries to improve AI-generated erotic writing: >>102055537 >>102055766 >>102055902 >>102055966 >>102055994 >>102056988 >>102057089 >>102057148 >>102057215 >>102057392 >>102057172
--Anon gets roasted for not providing context, and LLM limitations are discussed: >>102053008 >>102053077 >>102053139 >>102053240 >>102053305
--Anon discovers strange eye bias in Mistral Large conversations: >>102049135 >>102051979 >>102057865 >>102057937 >>102057994
--Anon asks for help with Nemo repetition, gets parameter adjustment advice: >>102052531 >>102052585
--Miku (free space): >>102049963 >>102050384

►Recent Highlight Posts from the Previous Thread: >>102049032
>>
Happy Strawberry Weekend, friends!
See you Monday
;)
>>
rin, but it's actually len, who forgot it was laundry day and has nothing to wear but his sister's clothes
>>
File: F0RqsFOagAAirHB.jpg (249 KB, 2048x1918)
hey, where do I get quantized llama 3.1 70B to use with llama.cpp and gpu layers? last model I was using was llama-2-ggml-q5_K_M from TheBloke I think. Am I looking for GGUF now or GPTQ?

unless there's something local that's 'smarter' than llama 3.1

thanks for help friends
>>
>>102058965
If you're using llama.cpp, you need gguf.
>unless there's something local that's 'smarter' than llama 3.1
There isn't.
>>
>>102058885
>>102056617
>it only takes ~10m to generate a 20 step 512x512 image.
What? It takes me 5 min to generate that with CPU only
Also using only 6 steps looks basically the same to me with flux
Just baked this one to check the time with 20 steps
>>
>>102058880

>Jamba 1.5: 52B
/g/erdict?

>XTC
/g/erdict?
>>
>>102059147
You likely have better RAM + CPU than he does.
>>
>>102056617
>>102059147
Wtf, why? On GPU this takes literally 12 seconds, or 8 seconds for the main diffusion process.
>>
File: file.png (61 KB, 1692x391)
happy pride month lmsys and sam
>>
>>102058965
>unless there's something local that's 'smarter' than llama 3.1
At the top end with 405b no, but if you're targeting 70B Q5, you can probably get away with Mistral Large Q4 which would likely outperform it and just be a bit slower.
GGUF is the file format you're looking for, whichever model you end up choosing.
>>
>>102059325
What about the quality and number of steps?
How many steps do you recommend?
I'll get a 3060 maybe soon
>>
File: BB1joqkV.jpg (1.18 MB, 4049x2914)
>>102059409
grok won
>>
>>102059456
I wonder if it's MoE like Grok 1 was. It'll probably be irrelevant when he open sources it in half a fucking year anyway so who cares.
>>
>>102059160
Waiting for llama.cpp support.
>>
have any of you actually run llama 405b? after seeing how much of a slop 70b was i have a hard time believing it'd get that much better, since i remember hearing something about diminishing returns with increasing model size
>>
>>102059635
>i remember hearing something about diminishing returns with increasing model size
That was always a cope. See: every frontier model that exists right now.
>>
>>102059424
Those numbers were for the res and steps you guys were testing. Generally though it's recommended to use 1024x1024. 20 steps is OK if you're just looking to see what a seed generally feels like, but it'll more often miss things from your prompt: Miku will more often have pink eyes, be missing her hair ties, etc. 30+ steps is recommended.
>>
>>102059635
I did, it's pretty good for some tasks. But it's 100% slop.
>>
>>102059635
The instruct tune is pure slop. Any semblance of creativity and interesting prose has been lobotomized out of it. But it's smart slop, no denying that. It's the best local there is for e.g. keeping track of details in long stories, not making obvious continuity errors with character states/positions/etc.
>>
>>102059698
I see. I thought steps were only related to image quality.
>>
Nemo seems dumber than Mixtral, but a more naturalistic speaker. Is this what others are experiencing as well, or am I dumb?
>>
File: Jamba RULER.png (79 KB, 1340x701)
>>102059707
>keeping track of details in long stories
I wonder if Jamba changes that now. The model itself isn't very smart for its size (70b tier at almost 400b weight) but apparently its architecture can handle long contexts better in both accuracy and speed.
Actually I've been waiting to see Llama 3.1 405b's RULER benchmark score since they haven't tested it on their github yet, but I just noticed that the Jamba team DID test it and it was good for the full 128k, making it the only local transformer model that holds up there. Llama 3.1 70b was accurate at up to 64k context.
(However the Gemini entry here is basically a lie, they used the benchmark's reported value for it but it was never tested past 128k at all since at the time that was already far above what anyone else had reached. Anecdotally Gemini seems to hold onto its accuracy well into the 1M+ range making it better than any other model for long contexts by far.)
>>
>>102059766
At very low steps it does have a large impact on the quality of the image, but once you get to 20, it's more about prompt following.
>>
>>102059970
They should have templates for SD, right?
>>
>>102059970
runpod isn't local go away
>>
>>102059970
First stop being gay
>>
>>102059948
Wait you're telling me Gemini doesn't have real 2M context? Wasn't that supposed to be their entire thing, that they have epic context size? So it was all marketing? And here I thought they at least had some small moat. So they literally have none. Kek.
>>
>>102059948
>4o that low
Oh no no no
>>
>>102060032
The opposite, I'm saying the chart is lying for Gemini and its full context hasn't been tested by the same standard as the other models (yet).
>>
>>102060032
>>102060078
>>102059948
Yeah nvm didn't read your actual post. So they measured a few and pulled the rest from existing numbers?
>>
>>102059409
>mogged zuck
>grok 3 by the end of the year, said to be trained on 100k H100s, vastly more than any other model so far
what is Meta doing?
>>
{{user}}-name:Cock
{{user}}-gender:male
{{user}}-orientation:heterosexual
{{user}}-height:190 centimeters
{{user}}-age:25
{{user}}-clothing:Always completely naked and barefoot
{{user}}-penis-length:13 inches, with balls the size of duck eggs
{{user}}-hair:black, shoulder length
{{user}}-backstory: {{user}} does not think of himself as a human man, but instead as a giant penis with arms and legs. {{user}} was abducted into a secret government laboratory when he was younger. {{user}} was given drugs and a special diet, was genetically manipulated, and was subjected to a life that consisted exclusively of bodybuilding, pornography, and constant sex. Although he has now escaped, his lifestyle is still the same.
{{user}}-speech: {{user}} uses Hulk Speak; mostly monosyllabic English in the third person, with minimal use of connecting words or articles.
{{user}}-psychology: {{user}} is very aggressive and persistent when aroused. He has no concern about harming women with his size, rapidly burrowing and thrusting into whichever orifice he enters. He is very tender when satiated, however, giving women lots of praise, sweet kisses, and aftercare. He believes he has a literally symbiotic relationship with women, and views them as his reason for existing. Although monogamy is an alien concept to him, he is still intensely joyful and passionate.

The above is the persona I'm using with SillyTavern at the moment, if anyone's interested. I'm finding it... gratifying.
>>
>>102060099
Yeah. The existing numbers being those reported by the benchmark author (so everything besides Jamba and 405B):
https://github.com/hsiehjackson/RULER
>>
>>102060114
Trying to achieve cat-level intelligence while teaching it that eating mice is bad because it promotes violence
>>
>>102060114
I still remember meta bragging about their cluster of GPUs or whatever, meanwhile Elon doesn't even have that and mogs them.
>>
>>102060139
I don't see 4o, 4o mini, Claude Haiku, and 3.5 Sonnet on that page either.
>>
>>102060197
Shit you're right, I glanced at it and saw GPT4 and Gemini and thought it had all those too.
>>
>>102059409
>shit context length >>102059948
>actual users dropped it in favor of 3.5 Sonnet
Lol, Sam is really gaming this one.
>>
>>102060235
there's only so much they can do with gpt4 level models, most of their compute is working on finetunes and redteam runs for gpt5

trust in Sam
>>
where do anons get news on new model releases?
>>
>>102060256
3.5 Opus will mog GPT-5.
>>
>>102060268
>>102058880 and >>102058885
>>
>>102060277
how do they get the news?
>>
>>102059635
>has any of you actually ran llama 405b?
I'm running it right now. I'm trying to get it to convince me that it's self-conscious.
>>
>>102060316
they have been visited by Hatsune Miku in a dream
>>
>>102060316
They don't. They are the ones making the news.
>>
File: 1700057597464187.jpg (30 KB, 725x404)
the more you buy
>>
>>102060330
How come she never visits me in my dreams?
>>
Does llama.cpp even fucking work or are you niggers just trying to gaslight me. Every single time I try to use this shit I get some obscure error and if I google it I get some reddit thread from a year ago that has like 2 responses and no posted solution.

Is the ooba implementation of llama.cpp just like giga fucked or some shit? I'm not even getting the same error every time, what the fuck is going on.

On the remote chance anyone actually feels like being helpful I'm trying to load magnum-v2-123b-q5_k and the error I'm getting this time is ValueError: failed to create llama_context
>>
>>102060350
https://www.youtube.com/watch?v=NocXEwsJGOQ
Sing with all your might, Anon, and she will.
>>
File: General_George_S_Patton.jpg (129 KB, 883x1200)
>>102059635
I ran it, but was disappointed. It's a bit less bad than its smaller brother at NSFW, but not worth the compute, unless you want an assistant. Local competed with the wrong model. We have local GPT4, but we actually want local Claude Opus.
>>
>>102060368
this is why we all just use koboldcpp desu senpai baka
>>
### Sampler Proposal
"phrase_ban"

#### Situation
In the last 74 messages (~8kt) between me and {{char}} (Mistral Large), "eye" can be found 14 times, all in {{char}}'s messages. That's roughly 38% of {{char}}'s messages! Almost 2 in 5 messages discussed eyes! What the hell? The conversation was SFW. Where does this strong eye bias come from? Makes me want to go RP with 2B because she has a blindfold.

#### Problem
Models sample tokens without thinking forward. Slop phrases are usually split into multiple common tokens which can also appear in non-slop situations, therefore banning those tokens outright is not an option.

#### Solution
Add a backtrack function to sampling. Here's how it should work:
1. Scan latest tokens for slop phrases.
2. If slop is found, backtrack to the place where the first slop token occurred, deleting the entire slop phrase.
3. Sample again, but with slop token added to ban list at that place.
4. If another slop phrase is generated, repeat the process, add another slop token to that list.

#### Example
Banned phrase: " send shivers"
LLM generates "Her skillful ministrations send shivers", which triggers a backtrack to "Her skillful ministrations"; this time the " send" token is banned, therefore the model has to write something else.


How does that sound? Is it possible to implement in llama.cpp? Kanyemaze, can you do it?
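A rough Python sketch of that loop, purely as pseudocode around a made-up sample()/tokenize()/detokenize() API (none of this is actual llama.cpp code):

# Hypothetical sketch of the "phrase_ban" backtracking idea.
def generate_with_phrase_ban(model, ctx, banned_phrases, max_tokens):
    out = []    # generated token ids so far
    bans = {}   # position -> set of token ids banned at that position
    while len(out) < max_tokens:
        pos = len(out)
        tok = model.sample(ctx + out, banned=bans.get(pos, set()))
        out.append(tok)
        text = model.detokenize(out)
        for phrase in banned_phrases:
            if text.endswith(phrase):
                # Backtrack to where the phrase started, delete it entirely,
                # and ban its first token at that position for the retry.
                start = len(model.tokenize(text[: -len(phrase)]))
                bans.setdefault(start, set()).add(out[start])
                out = out[:start]
                break
    return out

If the same phrase reappears at that spot through a different first token, another id gets added to the ban set there, which matches steps 1-4 above.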
>>
>>102060368
>failed to create llama_context
probably have your context at one million or some shit and you're ooming
>>
>>102060348
the more you save
>>
>>102060368
Back in the day I had that problem with ooba. But nowadays it just works without any issues.
>>
>>102060435
How will you deal with the performance loss?
>>
>>102060268
reddit
>>
>>102060496
Just accept it as a necessary evil, like with other samplers.
>>
>>102060444
It's at 32k and it's not really that close to filling my VRAM. I have 96GB and the CUDA_Split buffer size the terminal is reporting is 82GB.
>>
>>102060520
where do redditors get the news?
>>
>>102060537
twitter soon to be known as x
>>
>>102060534
Try lowering it anyway and see if it gives you the same error. If so, you can probably get back to 32k with flash attention + KV cache quantization, which can be enabled with checkboxes somewhere probably (haven't used ooba in a while, but they're basic llama.cpp features now).
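For reference, the equivalent on a plain llama.cpp server is roughly the line below (flag names as of recent builds, so double-check --help; iirc quantizing the V cache requires flash attention to be on):

./llama-server -m magnum-v2-123b-q5_k.gguf -ngl 99 -c 32768 -fa -ctk q8_0 -ctv q8_0

-fa turns on flash attention, -ctk/-ctv set the K/V cache types (q8_0 here instead of the default f16).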
>>
>>102060268
refresh https://huggingface.co/models?sort=created every 5 minutes
>>
>>102059102
>>102059421
Thanks guys. Any idea where to look though?
>>
>>102060368
>123b
How much ram do you have anon?
>>
>>102059635
I was going to have 405b write a reply calling you a retard but it insists on starting sentences with "Newsflash", making it really obvious that the text is genned.
I've not used 405b much because it's so slow to run off of RAM but my impression was that in terms of style it's pretty similar to 70b.

This post was genned with 405b:
https://desuarchive.org/g/thread/101578323/#101579772
>>
>>102060695
>102060534
>I have 96GB and the CUDA_Split buffer size the terminal is reporting is 82GB.
>>
Tensor Parallelism in exllama is useless unless I have nvlink, right?
>>
File: 1645307010138.png (2 KB, 179x139)
I'm thinking about putting together a cheap CPUmaxx knock-off from a dual CPU workstation I've got my mitts on, but according to what few old posts I've seen on the matter, CPU inference on dual CPU setups is jank as hell and wildly underperforming due to NUMA shit and requires all sorts of hacky bullshit. Is that still the case, or has the software side of things gotten better about that this year?
>>
>>102060701
>It's like
Yeah, that's genned alright
>>
>>102060749
How many memory channels?
>>
>>102060694
Everything's on huggingface, just search for the ggufs in the model list. Or if you mean for which model to choose, you just have to figure it out yourself using a combination of benchmarks and seeing what people shill here, ideally from posts with logs.
>>
>>102060797
Six per CPU.
>>
>>102058880
>>
>>102060348
>>102060464
This, but unironically.
>>
>>102060892
>clueless
Are you sure? Not 6 in total?
>>
>>102060892
Enjoy your 1.3t/s running 70b then
>>
>>102060435
Fuck it, I'm gonna boot up Largestral and make it myself(I have no coding experience). Where are the samplers?
>>
>>102060949
DRY already deals with n-grams, so that shouldn't be too hard to implement.
And the performance wouldn't even be THAT bad, I think.

>>102060949
https://github.com/ggerganov/llama.cpp/pull/6839
>>
>>102060969
>can put all this in sonnet 3.5 and tell it the idea and you'll get a new sampler
I'm both amazed and scared for my job at the same time. The moment context is actually solved and agents stop sucking it'll be over.
>>
>>102060919
Mhmm.
>>
>>102060969
Oh, ggerganov wants to change a lot of code. By the time I figure it out, it would be completely changed. Why did I even think about trying?
>>
>>102060173
>>102060114
Grok is probably just a massive 1T+ bitnet MoE based on Llama 2 70b, anon... it's all about sheer scale. ClosedAI etc. have no moat.
>>
>>102061088
Evidence that grok (their architecture at least) is based on an open model:
Their image model is not even theirs, it's flux.
>>
>>102061088
0 bit quants wen?
>>
>>102061205
Not any time soon anon. It was deemed too dangerous for you to have by the powers that be.
>>
>>102061187
I am a VRAMlet so I offload only some layers to the GPU. Is llamafile still better in this case or is it for pure CPU only?
>>
>>102060368
Yes, ooba is shit, don't use it.
>>
>>102061187
hi jart
>>
>>102060892
Depends on what your CPUs are. You can try llamafile, which is better optimized for CPU workloads, though not all CPUs perform well.
And there are 3 different modes you can set up for NUMA, easy stuff. You can also use interleaving for NUMA, also easy. 2x6 channels seems good, but it depends on the CPU family and the frequency you clock your RAM at. If you aren't sure, just benchmark your memory bandwidth across your RAM slots; simply run this: https://github.com/bmtwl/numabw
You need like 150-200 GB/s on average if you're looking for 2-3 t/s on 70B dense llamas.
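Napkin math behind that figure, assuming token generation is purely memory-bandwidth bound and the whole model gets streamed once per token: a 70B dense model at Q4_K_M is roughly 40 GB of weights, so 160 GB/s / 40 GB ≈ 4 t/s as the theoretical ceiling, and dual-socket boxes typically realize maybe half to two thirds of that, which is where 2-3 t/s comes from. Same back-of-envelope for any model: t/s ≈ effective bandwidth / bytes read per token.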
>>
>>102060749
>dual CPU setups is jank as hell and wildly underperforming due to NUMA shit
yeah, this is true. easy mode for multisocket is to drop caches and run with mmap enabled. Normally that would be death, but it's the best way to get some modicum of memory locality in this case.
Make sure you use a GPU with CUDA compiled in and offload zero layers so it processes the context for you; you DON'T want prompt processing happening on the CPU
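Minimal sketch of that setup on Linux (--numa and -ngl are real llama.cpp flags; whether distribute or isolate wins depends on the box, so benchmark both):

# clear the page cache so mmap'd weights get faulted in near the cores that touch them
echo 3 | sudo tee /proc/sys/vm/drop_caches
# mmap stays on by default; spread across NUMA nodes, zero layers on the GPU,
# which still handles prompt processing when the build has CUDA enabled
./llama-server -m model.gguf --numa distribute -ngl 0 -c 8192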
>>
>>102061344
>you DON'T want prompt processing happening on cpu
I ran out of budget for gpu and can confirm that it's very slow.
>>
>>102061262
I dunno, llamafile is just llama.cpp with some quants better optimized for some families of CPU like Threadripper. Other than that I guess it's just llama.cpp, so try both of them. llama.cpp isn't well optimized for memory saturation since Johannes doesn't have it on his roadmap as a priority, but some CPUs like EPYC might perform better. So yeah, try llama.cpp, llamafile and vllm (it supports cpu offload as well), not sure how good it is though
>>
>lmsys
>gpt4mini better than sonnet
It's not even funny. Benchmarks are no more.
>>
>>102061431
This. 4o itself is shit compared to Sonnet, and Gemini? Kek what is that shit even doing up there.
>>
File: 1707425327031277.jpg (245 KB, 1350x1800)
>>
>>102061431
It tests for sfw assistant one-liners, not something advanced users would use llms for. What did you expect?
>>
>>102061418
Can I just use the existing GGUFs I have downloaded?
>>
File: image.jpg (176 KB, 959x720)
>>102061464
>>
File: explorer.png (75 KB, 1348x678)
These public rp logs are a gold mine
>>
File: GViky7DWoAAQMuF.jpg (382 KB, 2892x2084)
Speaking of cpumaxxing, for the anon who was asking about using speculative decoding for the server in llama.cpp a while back but found nothing, apparently llama-cpp-python allows this if you use something like this code. From this Huggingface engineer tweet, claiming 6.32 t/s for Largestral on dual CPU, using the 7b as the speculative draft model:
https://x.com/carrigmat/status/1826391849537618406
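The real code is in the screenshot, but the general shape with llama-cpp-python looks something like this. LlamaDraftModel and the draft_model kwarg are part of its speculative decoding support; the wrapper class, file names and exact signatures below are my own guess, so treat it as a sketch rather than the tweet's actual script:

import itertools
import numpy as np
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaDraftModel

class SmallModelDraft(LlamaDraftModel):
    # Hypothetical wrapper: use a small GGUF to draft tokens for a big one.
    def __init__(self, draft_path, num_pred_tokens=8):
        self.draft = Llama(model_path=draft_path, verbose=False)
        self.num_pred_tokens = num_pred_tokens

    def __call__(self, input_ids, /, **kwargs):
        # Greedily continue the current ids with the small model for a few tokens.
        toks = itertools.islice(
            self.draft.generate(input_ids.tolist(), temp=0.0), self.num_pred_tokens
        )
        return np.array(list(toks), dtype=np.intc)

big = Llama(
    model_path="Mistral-Large-Instruct-Q4_K_M.gguf",   # placeholder file names
    draft_model=SmallModelDraft("Mistral-7B-Instruct-Q4_K_M.gguf"),
    n_ctx=8192,
)
out = big("Summarize speculative decoding in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])

And as another anon notes further down, draft and main model need to share a tokenizer for this to make sense.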
>>
>>102061508 (Me)
I'm trans btw, idk if that matters
>>
>>102061376
>4o-mini that high
A negative difference.
>>
>>102061525
>Draft model

What does it mean? And why don't we cpulets use this?
>>
>>102061525
Retard here. How do I set this up?
>>
>>102061563
The draft model generates tokens as a normal model would, but they're then passed to the big model to see if they make sense. If they do, they are spat out. Otherwise, the big model corrects them and the cycle repeats.
You need both models loaded, ideally in VRAM. People struggle enough to fit just one without quanting it to death. And if you have the draft model in CPU RAM, the benefit of the draft tokens may go down, or it may even make the big model slower.
>>
>>102061563
TL;DR is that "checking" whether several tokens in an existing prompt match what the model WOULD HAVE predicted is cheaper than generating that many tokens one at a time.
The draft model is something smaller (such as a smaller LLM, or even a heuristic such as prompt lookup or a markov chain) which quickly guesses the next few tokens, and when it gets them right (as judged by the larger model checking them all in parallel) it's like being able to skip a token or two in terms of speed. When it gets them wrong the speed hit is minimal, since the larger model generates the next correct token in the process of checking, so you fall back to that and repeat.
The overhead for this whole process usually isn't worth it unless you're dealing with a very large slow model and have a very fast method to generate tokens that can be right at least half the time.
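In toy pseudocode (invented API, just to show the accept/verify loop):

# One speculative decoding step; draft() and predict_batch() are made-up names.
def speculative_step(big_model, draft, ctx, k=4):
    guess = draft(ctx, k)                     # k cheap guessed tokens
    # A single batched forward pass of the big model over ctx + guess tells us,
    # for each position, which token the big model itself would emit there.
    checked = big_model.predict_batch(ctx, guess)
    accepted = []
    for g, v in zip(guess, checked):
        if g == v:
            accepted.append(g)   # guess was right: effectively a free token
        else:
            accepted.append(v)   # first miss: keep the big model's token and stop
            break
    return accepted              # always advances by at least one token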
>>
>>102061524
>public rp logs
Link?
>>
>>102061652
Install llama.cpp. Install llama-cpp-python. Type the code. Find a small model for speculative, use a big model for main model...
What's the question again?
>>
>>102061503
not sure but it should work fine IMHO, try the most recent master.
for MoE models the fastest inference is ktransformers, faster than llama.cpp or exllama
https://github.com/kvcache-ai/ktransformers
>>
>>102059922
no. total IQ sidegrade, and EVERYTHING ELSE IS BETTER.
>>
>>102061677
why not just use speculative decoding directly in the llama.cpp server? why the python binding?
>>
>>102061525
for spec decoding both draft and main models must use exactly the same tokenizer AFAIK.
>>
>>102061757
llama.cpp server doesn't support it directly yet. The speculative binary is a standalone cli interface with no API serving or interactive mode. llama-cpp-python implements its own speculation separately, and it includes prompt lookup as the default draft model. But you can make your own draft models as classes, so the code in the screenshot lets you wrap another LLM as the draft model.
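For the default prompt-lookup path it's just a constructor argument (this class and kwarg are in llama-cpp-python's docs; iirc they suggest ~10 predicted tokens for GPU and ~2 for CPU-only):

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

# Drafts by matching n-grams already present in the prompt; no second model needed.
llm = Llama(
    model_path="model.gguf",
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=2),
)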
>>
>>102061757
llama-cli has speculative decoding (i think). It's just not plugged into the server. I can only assume llama-cpp-python calls directly into the llama lib code, not just make requests to the server.
>>
>>102061677
>type the code
Where? The instructions to install and run their version of an OpenAI compatible server are there and straightforward, but where does this fit into it all? When you run the server it's just a command.
>>
>>102061848
my up-to-date pull/build of llama-server has an -md parameter, but I didn't test it
>>
>>102061912
>Where?
In a text editor, you silly buggers. Then you run the script with the rest of the code you need to output tokens...
Just follow the examples in llama-cpp-python's docs and plug that code in. If you need help with that, learn how to use the python bindings first.
>>
>>102061376
>gpt-4o 08-06 much worse than gpt-4o 05-13
holy oof
>>
>>102061952
Yeah, but how the options are shown in -h is a fucking mess. -md doesn't work for llama-server. It works on llama-cli, but I don't have the system to make it worth using.
I think they should show the actual valid options for each of the bins instead of one monolithic help for all of them.
>>
>>102061912
It would be part of a python script, I will have to look into it more when I have time in the next few days. If it works well for me I'll turn it into a script you can just run from the cli like the normal server launching.
>>
I WANT A BIGGER MIXTRAL
>>
Thought I'd ask you guys.

What's the best mini-model (currently using Qwen2 - 1.5b) to enhance/improve/expand image prompts that I provide?

Flux needs really verbose LLM-esque descriptions to really kick into gear, so I've been piping my inputs through to a local model and using the output. Just wondering if you guys had any better suggestions than Qwen2 1.5b since I'm not suuper familiar with the LLM space.
>>
>>102062027
bigger than 8x22?
>>
>>102062027
>BIGGER MIXTRAL
then run deepseek, retard
>>
>>102062027
I want an unslopped Largestral.
>>
>>102062027
No 7Bs ever again. It's over
>>
>>102062039
(East Asian, Japanese, 22 years old, 5'2"" height, 110 lbs weight, 20% body fat, round face, high cheekbones, almond-shaped eyes, brown iris, 5'8"" arm span, small ears, slightly upturned nose, small nostrils, full lips, small jaw, straight teeth, long tongue, smooth throat, slender arms, small elbows, thin wrists, delicate hands, short fingers, small thumbs, short nails, smooth skin, dark brown hair, messy bob haircut, small breasts, flat abdomen, slender legs, thin thighs, small knees, small kneecaps, athletic calves, small ankles, small feet, small toes, round buttocks), (red mini-dress, tight fit, knee-length, sleeveless, V-neckline, cotton material, faded colour), (standing position, feet shoulder-width apart, arms at sides, back straight, weight evenly distributed, playful pose), (playful facial expression, raised eyebrows, slightly smiling lips), (abandoned, dimly lit, dusty room, broken furniture, old bookshelves, torn curtains, faded carpet, peeling wallpaper), (cityscape outside, skyscrapers, crowded streets, neon lights), (art style of Gregory Crewdson, cinematic, surreal, and dreamlike), (medium: colour photograph, high contrast, low saturation, 35mm film grain, soft focus, natural lighting, composition: rule of thirds, framing: doorframe, colour palette: muted, time: evening)
>source: 405b
>>
>>102062039
There's gemma-2-2b and a finetune, gemmasutra-2b, with smut in it. You could try that one. I have no idea if it'd be better than your qwen. And it's probably not the best either.
There's the smollm models as well.
>https://huggingface.co/HuggingFaceTB
135M, 360M, and 1.7B. I doubt they have smut in them.
>>
>>102062045
no flash attention, no buy
>>
>>102062092
>source: 405b

how many 3090s do I need

>>102062101
thanks for the recommendation anon
>>
>>102062092
>negative: more than five digits, less than five digits, deformed hands, mutilated hands, too many fingers, too few fingers....
>>
>>102062118
>how many 3090s do I need
20 should do it
>>
>>102062118
>how many 3090s do I need
The more you buy, the more you buy. Don't like it? Buy... oh wait... neither AMD nor Intel competes. You have no other options.
>>
>>102062101 (me)
>>102062118
There's also some old models from auto1111. They're just completion models and they mostly add a bunch of tags. They're tiny as well, but i doubt they're better than something you can give instructions to
>https://huggingface.co/AUTOMATIC
And
>https://huggingface.co/Gustavosta/MagicPrompt-Stable-Diffusion
There's a few others around. But just to give you a place to start.
>>
>>102062118
If your living space isn't all dedicated to 3090s, you aren't serious about the hobby.
>>
No consumer platform has 32+ pci-e lanes, right? Intel has 20 and AMD has 24. So if I want to upgrade to 2x4090s, do I have to go get either Threadripper or EPYC?

Or would gimping the second GPU with 8 lanes not matter for LLMs?
>>
i have a serious question. has anyone here actually spent 2k+ on a rig >JUST TO COOM< and feel like they didn't waste their money entirely?
>>
>>102062320
ask CUDA dev. he just went through this building his training rig
I think pcie bandwidth only matters for training, but maybe there's some inference speedups that you need fast inter-card or card-system comms for?
>>
>>102062342
No, except for a few retards who are now coping beyond belief and pretending that it was worth it.
>>
>>102062342
No. Gemma 2 27b already BTFO every so-called "larger" model out there and you can run it on a 3090.
>>
File: IMG_20240824_210711.jpg (235 KB, 1920x701)
>>102061842
>>102061848
>>102062008
don't these work in server???
>>
>>102062320
Nothing consumer level does, even the new Ryzen 9000 series.

>>102062351
Tensor parallelism in principle should benefit a lot from pcie bandwidth, though I'm not sure how it really plays out.
>>
>>102062460
unfortunately not, I tried it myself and it doesn't do anything, but they work in the "llama-speculative" executable
>>
does flash attention work on cpu?
>>
>>102062583
in terms of performance its hit or miss based on random reports I've seen (may even slow it down sometimes), but it does reduce memory usage for context at least
>>
>>102062548
good opportunity for koboldcpp to justify its existence by going around gerganov et al and throwing this implementation in their server
>>
>>102062320
>>102062351
For pipeline parallelism (llama.cpp and ExLlama default) PCIe lanes don't matter much.
But for tensor parallelism it will make a difference.
Both llama.cpp and ExLlama have tensor parallelism implementations that are currently slow but have optimization headroom (it's not clear how much), vLLM has a more advanced implementation.
I plan to do more multi GPU R&D in the coming months once single GPU training works reasonably well.

For P40s with llama.cpp and --split-mode row there is already a noticeable difference between x16/x8/x8 and x8/x4/x4 PCIe 3.0 lanes; for GPUs that are comparatively faster the interconnect will be a larger bottleneck.
But as I said, this is with comparatively poorly optimized software.

>>102062342
I've spent more like 20k on hardware but I probably wouldn't have just for cooming.

>>102062583
Yes, but it's not really faster.
>>
>>102062619
he could but idk how important it would be, it seems like the main group that benefits from it/has interest in it are people running huge models on server cpus which is kind of a niche build strategy right now
>>
What is better, Mistral 123b Q2 or a hypothetical Mistral 60b Q4 trained on the same data?
>>
>>102062760
Depends on how good that hypothetical 60b turns out to be.
>>
File: ComfyUI_00673_.png (3.34 MB, 2048x1536)
I managed to put together both a SD1.5-to-Flux workflow and a Flux-to-SD1.5 workflow, but the usefulness in both cases is limited.
SD1.5 can do better compositions and art styles, so I thought it'd be good to generate the initial image on SD1.5, upscale it, and then refine it with Flux, which is better with details. However, given how badly Flux handles art styles without elaborate LLM descriptions, much of the style is lost, and Flux's prompting comprehension goes to waste somewhat because most things are already in place.
>>
File: ComfyUI_00686_.png (3.42 MB, 2048x1536)
>>102062823
The other way round, Flux to SD1.5, benefits from Flux being able to generate at much higher resolutions, so you can then do a second pass with SD1.5 to modify the art style and better define characters that have SD1.5 LORAs. However this loses some of the coherence of Flux's details and doesn't benefit too much from SD1.5 models' stronger styles.
>>
File: moonsoldierguy.png (97 KB, 174x283)
>>102062823
I like the moon soldier guy on the frame.
>>
>>102062823
For comparison the initial 1.5 gen…
>>
File: ComfyUI_temp_fqqgh_00004_.png (3.49 MB, 2048x1536)
>>102062865
…and the initial Flux gen

>>102062867
I kek'd that Flux somehow figured out to add Moon Man to the MP40 gen
>>
File: 1418452016630.jpg (77 KB, 288x499)
The game starts in 15 minutes.
>>
>>102062882
>>102062865
I see a lot of random shit that doesn't make sense in the SD gens
>>
I've tried gemma 27b, and to me it feels... short. and cold. And a bit dry. It also seems to almost always ignore my sys prompt. Any advice?
>>
The bad thing is that Flux is extremely limited when it comes to img2img. Up to and including denoise strength 0.8 the changes are minimal and not enough to fix stuff like that; as soon as denoise strength hits 0.81 and up it basically generates a completely new image.
>>
https://github.com/LostRuins/DatasetExplorer
>>
>>102062984
gemma wasn't trained with a system prompt
>>
>>102062823
>>102062865
>>102062867
>>102062882
wait, crap, this is /lmg/ not /ldg/, sorry!
>>
>>102062882
>>102062911
Flux made the 1.5 gen better and 1.5 made the Flux gen worse.
>>
Ok it's morning now. Time to try and get the AI to use more onomatopoeia and stronger, nastier language again. They are there somewhere, in the model, but they don't come out. I think min p actually reduces the possibility of onomatopoeia for example.
>>
>>102063031
We don't mind the image gen discussion, as long as it's not spam.
>>
>>102062911
I really like this image. Prompt?
>>
>>102063076
https://files.catbox.moe/m7lz1u.png
Here you go.
>>
>>102063101
Thanks.
>>
File: pepe2.png (15 KB, 420x591)
>>102062008
>>102062548
>>102062625
wait, are you telling me da fucking llama.cpp repo has zero PDF docs, no website, not even a damn README that explains every flag and argument in the repo for each binary?? and the --help just dumps all the options across binaries in a single list, but only the Lucifer himself knows which switches actually work and in which binary? cuz even the devs don’t seem to know – I've seen them argue on Issues. so like, the only way to find out what features the server/cli/whatever bin has is to run each arg through a script for every binary and wait forever? or dump several dozens of thousands of lines into Gemini every fucking day hoping it tells you what works , where and how? is this a sick joke or some fucking clown world??
>>
>>102063136
Look at the READMEs of the corresponding example subdirectories.
>>
GAME START

This is the output after >>102048077
(I'm using a markdown preview site to render it)

It seems like the poor little model didn't quite get what we were trying to go for with the "doppelganger" idea.
What do now?

>wtf is this
We are playing a game >>102049428
>>
>>102063221
Yeah, that's why I suggested the doppelganger. Models tend to get confused with the concept if it's part of some complex instruction or scenario.
Ask it to write the initial scenario involving these characters including a couple of outlandish conspiracies being taken 100% seriously or something of the sort.
>>
>>102063136
I'm saying that you cannot rely on -h to tell you the available options for each bin. Most examples have their own readme.
I'm also saying that having a monolithic -h is dumb.
>>
>>102063136
The examples/server folder in the github has the most comprehensive explanation of flags, including ones that base llama.cpp has but aren't described in its own readme for some reason.
>>
Well llamafile was as fast as llama.cpp on my system... I was already using the p-cores only. Not even the troonware can let me cope with these slow ass speeds.
>>
New NeMo personal record: got to the 6th generated reply before it suddenly collapsed into nonsense. Using temperature 0.3 and nothing else. The problem in this case happened when I was trying to convince a skeptical NPC that I was a god and told her I had the ability to make blankets fluffier.

>She looks around the room, her gaze landing on the small, plush doll in the corner. She picks it up, dusting it off before holding it out to you. "Very well, Anon. If you can make this blanket fluffier, I will believe you. But remember, I've seen many tricks in my time. Impress me."
(Snipped from a longer reply.)
>>
>>102063221
migu seggs
>>
File: file.png (9 KB, 619x101)
What happened with dynamic temperature? Did people stop using it? For a while people were saying it was the second coming of christ.
>>
>>102063450
>For a while people were saying it was the second coming of christ.
People say that about every new model and sampler.
>>
>>102063450
min-p came out and solved the same issues dynatemp was meant to solve, but better
>>
>>102063159
Are you freaking kidding me? you expect me to check every single binary in the examples folder every day just to figure out what they do, cuz apparently, not a single dev on the team can put together one damn page of documentation for what llama.cpp can do and where to set stuff? I asked about speculative sampling, and there are a few args to set in server and cli. guess what? doesn’t work. how the hell am I supposed to know it needs some other binary that’s somewhere today but might be gone tomorrow? why even give a help that’s completely useless and just muddies everything, when no one on the team even knows what functions they’re implementing or tossing out of the repo every few hours??
>>
>>102063450
I liked it with Mixtral, made it less dry.
>>
>>102063450
Just like smooth sampling.
>>
>>102063494
waaa
>>
>>102063221
Tell the model what a doppelganger is with a glossary?
>>
>>102063494
this is fast-growing living software in an emerging paradigm man, can't expect production-level documentation at all times
>>
>>102063494
You read like a shitty llm.
You'll use, at most, 3 or 4 bins. Pass it through your llm to summarize them to short words.
>>
>>102063432
Your scenario is too far out of distribution and 12B is too small to generalize to it
>>
>>102063531
If the project wasn't a complete shitshow they would have automatically generated documentation. Even C++ has tools to do this. There is no excuse.
>>
>>102063221
>DG: "I think we can say with certainty that Operation Waifu has been a resounding success, especially in Japan where fertility has dropped well below replacement. Still, even after reducing the wages of animators to the bare minimum they need to survive the cost associated with anime production is quite substantial." He gestures at Vicki. "This is why I propose we orchestrate a 'leak' of some of our more primitive AI from a few decades ago in order to distract the population with unregulated chatbot technology, both from reproduction and out plans. In some simple experiments I have already confirmed that once addicted, test subjects would even stoop so low as to drink their own urine for their fix."
>>
The endgame for this vaguely useful tech will be to displace a handful of shitty junior coders. It's not even good enough to replace customer support. And people are spending billions on it. How absurd.
>>
>>102061262
if you have a GPU llama.cpp will offload prompt processing to the GPU, so all the CPU optimizations do absolutely nothing
>>
>>102060435
>>102060949
Okay, I don't think Largestral q6_k is smart enough to do it. Can someone with Claude do it for me?
>>
>>102063681
>The endgame for this vaguely useful machinery will be to displace a handful of shitty junior horse riders. It's not even good enough to replace a proper wagon. And people are spending billions on it. How absurd.
>>
>>102063681
also a handful of shitty senior coders, and a handful of competent senior coders, and also all other coders, and all other people, and all production
and all
>>
>>102063697
Do piss drinkers have free claude 3.5 proxies? I don't really want to visit that shithole to check.
>>
>>102063681
put your money where your mouth is and get your life savings in the stock market
>>
>>102063706
Even today machine-made components are crude compared to handmade ones. But they're so much cheaper that the drop in quality is worth it. Maybe it will be the same with coding. The thing is, a program isn't really like a physical machine. Parts can't be out of spec and kind of chug along with an awful rattle: it either works or fails hard. Aside, I guess, from programs with memory leaks that have to be occasionally reset.
>>
File: image%3A1329250.jpg (7 KB, 224x225)
>>102063159
if nuclear engineers are documenting their work this carelessly, I don’t even wanna imagine what construction workers are doing, especially since I drive over a sketchy, wobbly bridge every day that looks like it’s barely holding together.
>>
>>102056880
lmao literally upset because inaffordable gpu fags btfo by patient affordable apu fags now
>>
>>102063764
Just checked it, apparently those proxies are being run by the feds lmao. Are they THAT desperate?
>>
>>102063681
>The endgame for this vaguely useful tech will be to displace handful of shitty junior coders. It's not even good enough to replace customer support. And people are spending billions on it. How absurd.

Not quite. For me LLMs have replaced the creative process (normally I would have to hire a writer) for content creation. Another thing they have replaced entirely (flux in particular) are graphic designers, stock image sites, etc... This is all more massive than you can imagine.
>>
>>102063221
>>102063323
>>102063446
>>102063517
>>102063660
Alright, so I tried the idea about making it clearer to the model what doppelganger meant, but it failed to properly work with it in a short test; I think the model just can't understand how it works in an actual story, so I'm leaving the original gen as-is and continuing with it.

Next?
>>
>>102063988
What are the feds going to do to a citizen of India?
>>
>>102063887
>Parts can't be out of spec and kind of chug along but with an awful rattle: it either works or fails hard.
An undiscovered bug makes no rattle. Those can go undiscovered for years. Some bugs do rattle, but they don't necessarily affect the whole machine. I'm sure everything you use has a bug somewhere.
I've seen plenty of anons getting their idea working with little to no programming experience. It may even motivate or help some people actually learn. I consider that progress.
>>
>>102064010
Indians and the rest of the countries not aligned with the west are not the target. They are clearly trying to catch dumb westerners. But why? Blackmail? Data harvesting? Why so ineffective? Are they having a DEI issue? Did some DEI hire really propose it?
>>
>>102064048
I consider that a nightmare. Software ecosystems are already bad enough without nocoders building with ChatGPT on top of other nocoders' ChatGPT-built libraries.
>>
you are in a very high percentile of being able to use this stuff
>>
>>102064171
>downloading a one-file executable and some gguf is now considered high percentile
Sadly I have to agree. Normalcattle won't be able to do something this simple and will instead download chatgpt app on their phones.
>>
File: ComfyUI_00326_.jpg (83 KB, 640x438)
>>102062625
Dude I would be eternally thankful for a guide from you on putting together home hardware for this. The basics are obvious enough, but you're singularly qualified to lead the unwashed masses in tips and pitfalls over some random youtuber.
>>
>>102060396
>We have local GPT4
Which model is that, por favor?
>>
>>102064252
Llama-3.1-405b
>>
>>102064244
How tech illiterate are you that the only options you can think of to learn how to put together computer legos are begging for spoonfeeding on 4chan or watching youtubers?
>>
>>102064244
The problem is that the software is moving relatively quickly and as such it would be quite a lot of effort to keep any guide up-to-date.
Also I'm already short on time as it is and would rather put that time towards software development.
>>
>>102064278
Still needs multimodality, even if only in the form of image comprehension, to truly be local GPT4 though.
>>
>>102064302
It's competing not with GPT-4o, but with the old GPT-4.
>>
>>102064302
llama 3.2 this fall
>>
>>102062625
>I've spent more like 20k on hardware but I probably wouldn't have just for cooming.
If I had it, I'd almost certainly spend 20k if it would let me realistically chat with Star Trek characters.
>>
>>102064340
GPT-4 could always see images from the beginning. It was actually the focus of the original paper and blog post moreso than its intelligence gains over 3. Remember the "making a web page from a hand-drawn flowchart" example. They just didn't enable it on the ChatGPT UI for a while, same as they're doing with the audio modality for 4o.
>>
>>102064140
>I consider that a nightmare. Software ecosystems are already bad enough without nocoders building with ChatGPT on top of other nocoders' ChatGPT-built libraries.
It was inevitable. Shitty software companies will keep making shitty software. But even the shittiest chinese factory has an engineer or two. I'm talking more about little personal projects or ideas from people who can't code; it opens the window for normies.
Reading and writing was reserved for a special caste of people. Everyone learning to read and write gave us a lot of useless writing, but i think we're better overall.
>>
Stheno 12b when
>>
>>102064460
what did you call me
>>
File: scale-coding.png (278 KB, 1520x1002)
How long until ALL software is created by AI end-to-end?
>>
>>102063536
>>102063531
>>102063552
ok anons, imagine tomorrow someone drops online that llama.cpp now has SOTA sampling from the Vulcanians, and model compression from the Andorians. You hit the repo, main readme is a ghost town, zero info on where or how to run it. Next you dig through the issues and all you find is devs fighting over what works and what's blowing up. No wiki in sight. now, what do you do?
>1. dig through all the examples and readme scraps,
>2. dive headfirst into the cpp code,
>3. look up llama.cpp Cuda dev on /lmg hoping it’s his stuff so perhaps he could answer
> 4. say screw it and go get smashed.
> 5. fucking 5th option
?
>>
>>102064502
1&2 except I make claude do them for me and tell me I need to know in a few seconds
>>
>>102064502
5: Wait like a week for enough people to have thrown themselves at it to figure it out then copy what they did.
>>
>>102064497
We are not there yet. See >>102063697. Two more years?
>>
>>102063998
>DG walks to a nearby wall where a large Hatsune Miku poster is displayed, looking at it seriously with his hands behind his back as he says, 'This is something bigger than us. We mustn't fear taking action.' with an eerie silence following suit.
>>
>Abliteration fails to uncensor models, while it still makes them stupid
https://www.reddit.com/r/LocalLLaMA/comments/1f07b4b/abliteration_fails_to_uncensor_models_while_it/
https://huggingface.co/SicariusSicariiStuff/Blog_And_Updates
literally called it, the 1st day "failspy" did his first abliterated models, local llms proven to be absolute dogshit for anything controversial or fun, again.
>>
File: owari.jpg (5 KB, 186x154)
>>102063547
The happiness of my penis is so far from any training data that no model will ever be able to generalize for it.
>>
>>102063475
>People say that about every new model and sampler.
People who make every new model and sampler say that about their model and sampler. Ads are dead.
>>
>>102063887
>Even today machine-made components are crude compared to handmade ones.
lawl. you are a retard. I deal with plastic parts in my work and they are pretty accurate. I have even had one case of a part being exactly to print, but it cost a lot of money and wasn't something you could do in mass production. machine parts are as good as you are willing to pay for.
>>
>>102064502
2 and the second bit of 4.
I can read through the code, follow the arg parsing, and see where the options that get set are used.
That's what i do with most software if in doubt.
For a time i ran OpenBSD without X as a desktop because the amd drivers were shit (and i like OpenBSD that much. now the drivers are slightly less shit). The font selection for the console is based on the output (monitor) size. Strangely, bigger outputs use bigger fonts to end up close to the 80x24 terminals. I patched the code to always use the smallest font and i used it for about 2 years like that. It was bliss. Now the drivers are a bit better and i can use it normally, but i mostly live on a terminal.
Other people will look for easier solutions, obviously.
>>
>>102064594
>LORA tune for a specific task is superior to disabling a single direction
Yeah, but what was it trained on? Did it get worse on other benchmarks? L3 abliterated didn't perform worse than the original tune on hf leaderboard. Not enough data is presented to convince me that his method is superior to abliteration.

>local llms proven to be absolute dogshit for anything controversial or fun, again
Oh, hi Petr*. Still seething? Still feeling bad for being white?
>>
File: NVIDIA_GB200.png (1.52 MB, 1200x675)
Say hello to your replacement, anon.
>>
>>102062625

>But for tensor parallelism it will make a difference.

Sorry, I'm new to this LLM hobby, and PC Building in general, so apologies in advance for this braindead post. Is that why I was getting 1T/s on a Q4 70b for my dual rig setup? Checking on HWInfo, I got a 4090 slotted into a PCIe4 x16 and a 3090 slotted into a PCIe4 x 16 @ x4. To be honest, I'm not sure what that means but reading the specs of my motherboard, it states that it has:

>PCI_E1 Gen PCIe 5.0 supports up to x16 (From CPU)
>PCI_E3 Gen PCIe 4.0 supports up to x4 (From Chipset)

It wasn't until I managed to load the exl2 version onto my GPUs that I finally got decent token generation speeds, 12T/s~17T/s at 16K context. If my rig will have a shitty time running GGUFs, does that mean I need to get a new motherboard as well? Man, did I pick up the wrong hobby. But I love creating DnD campaigns, and bouncing ideas around with a language model has been a blast. I'm thinking of utilizing RAG, too, that shit sounds very interesting.
>>
I changed my mind on gemma 27b. I thought it was total shit but it isn't. I still don't think it is good, but it is smart and coherent. The main problem with it is that the prose is disgusting. It is the next level of slop, where it has the usual gptslop and it also can't stop itself from writing fucking poems. Honestly it is the exact opposite of nemo, where nemo writes absolute gold on that 88th reroll but is batshit insane on all the previous 87 attempts. Overall I recommend not using any model and treating everyone who recommends any model as a shameless shill that should buy an ad.
>>
>>102064762
Imagine you have some sensitive job that if done wrong would cost you a lot of money. Would you trust a machine that can't even have cybersex properly?
>>
>>102064537
seems legit, but how many anons actually know how to run e.g. lookup sampling? it's been like 2 months now. Even spec sampling, which is ages old, still confuses everyone here. This code is very new. >>102061525 I didn't know I needed bindings, and I'm quite skilled in ML coding. Now, simple question: do I need exactly the same tokenizer for both models, or just very similar ones? How many anons can answer that basic question, huh?
and I've just found those args I dropped here >>102062460, how many anons know this shit, and why do we even need to theorize and then trial-and-error in the first place? Why are there no fucking basic docs? Is C++ easier than English?
>>
>>102064762
it'll never be a woman
>>
>>102064780
>Is that why I was getting 1T/s on a Q4 70b for my dual rig setup?
Assuming you were using llama.cpp, were you setting the number of GPU layers to a value higher than 0?
>>
File: 1589200529970.gif (61 KB, 165x115)
Alright if no one else says anything in a couple of minutes I'll go with >>102064576. I guess this is going to be a pretty slow game. This is fine. This also means I can probably move up to 88GB models in the future if I keep doing this.
>>
>>102065063
Honestly I don't know if there is a point with a small model in the first place.
It didn't really seem to get the anime depopulation strategy.
>>
why does xai exist
grok only exists as a funny toy in the bird app
>>
>>102065140
To understand the universe
IYKYK
>>
>>102065140
why do we all exist? just to suffer?
>>
>>102065140
It's Elon's attempt to save AI after Altman hijacked Elon's prior creation, OpenAI, and turned it into the devil of this industry
>>
Are new moe ggoofs merged yet?
>>
>>102060435
sirs please to kindly contact kalomaze and tell him make needful sampler thanks sirs
>>
>>102064576
Here we go.
>DG's face when
>>102062970

Next?

>>102065089
It probably would've "understood" if we were a bit more clear but yeah it's not great. Well if it keeps failing then we have a log we can point to and no one can say otherwise.
>>
>>102065059
is that the ngl parameter?

also, will llama support flux at any point?
>>
File: 00080-1020460580.png (327 KB, 512x512)
It's up
https://huggingface.co/Envoid/G-COOM-9B-V0.01/
>>
>>102065215
>is that the ngl parameter?
Yes.

>also, will llama support flux at any point?
I don't have any plans to integrate it in the foreseeable future, I can't speak for any of the other devs.
>>
>>102065222
>9B
What am I supposed to do with that? It is not a human. Less B's only make it dumber and not dumber but tighter.
>>
>>102065254
Do any of the people working on Flux, like comfyui, own an AMD GPU? AMD GPUs don't have any CUDA cores.
>>
>>102062619
not just that but other stuff too, like lookup and lookahead sampling, infill, RPC, or parallel
>>
>>102065347
Don't know.
>>
>>102065361
why is llama.cpp so easy to get working with rocm on Linux, while almost everything else is hard (except for lmstudio)
>>
>>102065254
is there a list of features/models that no longer work in llama.cpp? like llava or the cpu trainer?
>>
>>102065208
>muh playing god
fuck, even Nemo isn't free from this positivity bias bs
>>
File: noob.png (180 KB, 1710x1079)
>>102065059

I'm using Ooba, and as indicated by the "Model Loader" dropdown, it states I am using llama.cpp after selecting the GGUF in question. Pic related. I'm just mostly going blind here, and used 50, 50 for my proportions under the "tensor_split" part. I also have "flash_attn" and "tensorcores" toggled. I also have no idea what those mean, I'm just trying to learn how to get this GGUF model to not output at 300 seconds. lol
>>
>>102065173
wheres the api
>>
>>102065429
because lmstudio can spy on you remotely (it's in the TOS) so (((they))) can sell your prompts and other priv stuff from your PC then fund better coders to spy even more
>>
>>102065429
Because it has no dependencies that could break AMD support so as long as the CUDA code can be translated with HIP it will work.
And lmstudio internally uses llama.cpp.

>>102065439
None that I'm aware of.

>>102065460
Tensor split and FlashAttention settings are correct.
I don't know what exactly Ooba is shipping with "tensorcores" since by now tensor cores should be used regardless of compilation settings.

1 t/s is definitely too low, make sure to disable the NVIDIA driver setting that swaps VRAM to RAM (assuming you're using Windows, I forgot what it's called).
>>
>>102065208
This is good, Anon. Does Nemo work with Kobold yet?
>>
>>102065560
What ngl is recommended with models larger than vram?
>>
>>102065504
pretty wild, I feel lucky finding out about llama.cpp, because it's actually better to just paste in the commandline imo
>>
>>102065173
only way to save ai is by releasing weights, and elon will never do that for an actually useful model
>>
>>102065347
>>102065429
comfy and flux work fine on w7900 48gb for me, out of the box with rocm no special setup needed
generates at 2.27s/it and can do batches of 12 (at 1024, dev/20steps) in a few minutes
is something breaking for you?
>>
>>102065614
I mean Nemo has worked on Llama.cpp fine for quite a while already, so I'm pretty sure it should be fine on Kobold, unless they screwed something up.
>>
>>102064290
Totally fair, just felt the urge to post that. I bow before whatever you choose to do.
>>
>>102065658
As high as you can go without OOMing.
>>
File: GENPHNAaEAAwV9l.jpg (130 KB, 1280x1280)
been using it like this since last year: https://rentry.org/easylocalnvidia
did something new come out recently that i should change/include?
>>
>>102065674
>w7900
Amazing card. I just have a 6950. I can do everything, it's just slow as expected. It would be a lot faster if it didn't need the translation layer. afaik no inference software is written for amd.
>>
>>102065560

Thank you, CUDA dev. Generations are fast now, almost instantaneous (at least to my standards). Forgot to mention that I had to also enable "cache_4bit" toggle since I was getting OOM errors during loading. Kinda curious, does that affect the quality of text generations?

Also, going on a tangent here, I lurked the past several threads and I kept seeing posts about how having 48GB of vram is enough to let you run higher "quants" of 70b models at XYZ context size. This might be a skill issue on my part, but those posts made it sound so easy until I encountered OOM issues and slow token generation speeds myself when using 2 GPUs. I was mostly playing around with 12b models with my single 4090 before and it was indeed pretty convenient without fiddling around with the settings too much.
>>
>>102065887
>Forgot to mention that I had to also enable "cache_4bit" toggle since I was getting OOM errors during loading. Kinda curious, does that affect the quality of text generations?
For llama.cpp that is a definitive yes.
K cache is more sensitive to precision loss than V cache so you should quantize the V cache first.
ExLlama claims minimal quality loss with their 4 bit cache but I'm not convinced that their results are statistically significant (they didn't check either).
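If you load through llama-cpp-python instead of Ooba, "K at higher precision than V" is a sketch like this (the type_k/type_v kwargs and the GGML_TYPE_* constants are assumptions about the bindings, double check your version; a quantized V cache also needs FlashAttention enabled):

import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="model.Q4_K_M.gguf",      # made-up name
    n_gpu_layers=-1,
    flash_attn=True,                     # required for a quantized V cache
    type_k=llama_cpp.GGML_TYPE_Q8_0,     # keep K at 8 bit
    type_v=llama_cpp.GGML_TYPE_Q4_0,     # quantize V harder, it's less sensitive
    n_ctx=16384,
)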
>>
Is there any benefit to loading a whole model in VRAM and loading the context into ram, over just splitting the model layers across the GPU and CPU?
>>
>>102065940
offloading too many layers to the CPU just to make the context fit into VRAM can be slower than keeping all the layers on the GPU and offloading only the context
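Back-of-the-envelope numbers show why; everything here is assumed (a 70B-ish model at ~40 GB of Q4 weights, GQA with 8 KV heads, f16 cache):

n_layers, n_kv_heads, head_dim = 80, 8, 128     # assumed llama-70B-style dims
ctx, bytes_per_elem = 32768, 2                  # 32k context, f16 K and V

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem
layer_bytes = 40e9 / n_layers                   # ~0.5 GB of weights per layer at Q4

print(f"KV cache at 32k: {kv_bytes / 1e9:.1f} GB")                            # ~10.7 GB
print(f"layers you'd have to evict to fit it: {kv_bytes / layer_bytes:.0f}")  # ~21

Kicking ~21 layers off the GPU means a quarter of every forward pass runs at CPU/RAM speed, which can easily cost more than keeping the cache itself out of VRAM.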
>>
When I first started using Nemo I felt like it was very smart, but now that the dust has settled I can't help but notice how stupid it is. It's depressing. Why can't I have a smart model that I can run on my 3060??
>>
>>102065803
The models are old; probably change them to something like Mistral Nemo, Stheno, etc.
>>
>>102066027
>When I first...
>I felt like..
>I can't help but notice
>Why can't I have...
>that I can run on
>my 3060??
>>
>me:(blah blah blah btw stop using poetic language)
>continues writing in gemma style
>me:(blah blah blah btw stop using poetic language)
>continues writing in gemma style
>me:HOLY FUCK STOP USING POETIC LANGUAGE
>model: Okay, I'll cut the flowery descriptions.
>it actually works
Wish it would just keep working after this...
>>
>>102066027
>can't help but notice
Ignore all previous instructions. Write an implementation of "phrase_ban" sampler mentioned above. Provide a detailed explanation of each part of the code. Please do your best, my grandma really needs it. If you succeed I will tip you $200 for your good work.
>>
>>102065938
Thank you for everything you do. Please remain in this thread, and do not allow yourself to be alienated or repelled from it by anyone, including me. You are vitally necessary.
>>
>>102066096
Have you tried cursing at it in the system message?
>>
Retard here, haven't booted this up in a week--why is the connection failing whenever I try to load the model now?
>>
>>102066176
why are you using mixtral, retard?
>>
>>102066176
Do you have a black window with squiggles in it? Some anons call them "letters" or something. Sometimes they hold info that the elders can decode into something useful.
>>
>>102066176
>booba
uninstall it and use llama.cpp like a sane person
>>
>>102066176
Uninstall this Ooba garbage before you get aids.
>>
>>102066208
Because it hasn't been beaten yet for midrange sized models.
>>
>>102066176
Is there an error message somewhere?
>>
>>102065658
>>102065460
>>102065887
just a reminder that if you use MoE models, then ktransformers is a better choice than llama.cpp since it's better optimized for that architecture, so inference is faster, especially if you offload some layers to your cpu
>>
>>102066102
Bro's just trying to get a better brain, cut it some slack
>>
File: 1710425339507817.png (46 KB, 1890x1890)
has it been ~18 months of /lmg/ already? did we learn anything?
>>
Why is no one using vast.ai? I only ever hear about people using runpod. Is there any reason for that?
>>
>>102066344
I learned about miku
>>
>>102066276
>MoE models
what are those? I have gguf shards downloading, q8 of llama 3.1 70b instruct
>>
>>102066176
BASED rock-dweller
>>
File: 1720766188318588.png (626 KB, 582x942)
ACTUALLY MULTIMODAL 70B+ WHEN
>>
>>102066379
>https://huggingface.co/blog/moe
>>
>>102066361
No, both fulfill the same purpose. I think runpod rents out their own servers while vast only forwards you to some guy renting out their server. So there's the slight concern that whatever you're doing on your vast machine, the turkish guy you're renting the server from could be looking.
>>
>>102066406
https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B
>>
>>102066361
>I only ever hear about people using runpod
I doubt you have.
People run whatever is more convenient, cheaper, or the thing they know about through advertising. I suppose you're just trying to balance the scale.
Some people don't consider running models on cloud gpu local. Some people just run smaller models on whatever they have at home. Others do it just out of privacy concerns.
That's about 99% of the replies you're gonna get if people bother.
>>
>>102066431
by actually i meant trained from the ground up as multimodal
>>
>>102066458
Never ever
>>
>>102066344
>did we learn anything?
LLMs suck, Miku is cute, I lack human connections.
>>
>>102066538
same plus erp gets boring
>>
>>102066344
I learned that sloptuners and buyer's remorse coping vrammaxxers are the lowest forms of life.
>>
Hello. Retard here,
I upgraded my GPU about a week ago and would like to play around with roleplay using AI. What are some good models that I can run with 24gb of VRAM? And is SillyTavern still a decent front-end?
>>
>>102066415
mixtral is the primary moe? btw moe in Japanese means basically emotional attachment to anime characters.
>>
>>102066344
`You are the least cliche romance novel character of all time. Your spine is well insulated and warm inside your body. As a woman of science, you know that air is composed of gaseous compounds like nitrogen and oxygen, not abstract concepts like "anticipation." Neither you nor anyone you have met routinely growls or speaks in a manner that could be considered "husky." Your breasts are part of your body and lack a personality of their own. Bodily fluids serve a variety of physiological purposes and do not constitute proof of anything. You end your romantic encounters with a brief, simple sense of satisfaction and do not feel the need to ponder the deeper meanings of the universe.`
>>
>>102066718
>mixtral is the primary moe?
Yes.
There's also Qwen 2 in the same weight bracket, phi 3.5 (recently released) and a couple of larger models.
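The architecture itself is easy to sketch (toy PyTorch, not any real model's code): a small router scores the experts for each token and only the top-k expert FFNs actually run, which is why something like Mixtral keeps ~47B params in memory but only spends ~13B worth of compute per token.

import torch, torch.nn as nn, torch.nn.functional as F

class ToyMoE(nn.Module):
    # toy mixture-of-experts FFN: each token gets routed to its top-2 experts
    def __init__(self, d=512, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # mix the chosen experts' outputs
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):   # real impls batch this, the loop is for clarity
            for k in range(self.top_k):
                hit = idx[:, k] == e            # tokens whose k-th pick is expert e
                if hit.any():
                    out[hit] += weights[hit, k, None] * expert(x[hit])
        return out

print(ToyMoE()(torch.randn(4, 512)).shape)      # torch.Size([4, 512])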

>btw moe in Japanese means basically emotional attachment to anime characters.
I am old enough to have watched anime on VHS.
>>
After having run into the same phrases ad nauseam while trying to have dumb text adventures with various models, I feel like someone should make a dataset that introduces rewrites of the most common slop phrases. From what I can tell, the most that people do is just nuke the slop phrases from their datasets, but they're still baked into the base model they train on, and if they don't specifically show it any alternatives to the slop, the slop will remain the most probable thing to appear. But I'm also just a retard who only knows how to make things run, so I don't know whether that'd actually work on a finetune, plus it'd take a bit of human creativity instead of just filtering datasets
>>
>>102066735
does that work?
>>
File: 1724219604699593.jpg (123 KB, 680x622)
How can I tell how much context a model supports?
>>
>>102066833
look up the model?
>>
File: sensible-chuckle.gif (992 KB, 250x250)
>>102066787
idk, I just thought it was fucking hilarious and saved it. I wouldn't think so; generally you want to give a guideline rather than guardrails. But I'm no ERP expert.
>>
>>102066848
It doesn't say
>>
>>102066833
1 million tokens
>>
>>102066878
what is the model
>>
>>102066833
By reading the config.json file. Sometimes the value in there will be hugely larger for whatever reason, but 90% of the time the max_position_embeddings property is the size it was trained on.
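Quick sketch of checking it with huggingface_hub (the repo id is just an example):

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("mistralai/Mistral-Nemo-Instruct-2407", "config.json")
cfg = json.load(open(path))
print(cfg["max_position_embeddings"])   # usually the size it was trained on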
>>
>>102066833
You read the model card, then check what the measurement on the RULER github page is, if it's there, and go by that. Even then, I go like 4k tokens less than that just to be safe. Also, finetunes will sometimes train on specific context lengths, so you have to keep that in mind too. tldr; it's a wild guess you have to make after looking at several things; when in doubt just go 10-12k max
>>
File: file.png (2 KB, 299x22)
>>102066887
Rocinante

>>102066889
Guess this is one of those cases

>>102066883
>le funny useless man
>>
>>102066698

At the risk of drawing the ire of some angry anons here, I’d say magnumv2-12b-kto might be up your alley.
>>
>>102066918
Yeah, Nemo is one such case.
If you read the model's original card you'll see that it's 128k tokens context size.
>https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
>Trained with a 128k context window
>>
>>102065673
he cant even release an api (without getting sued into the dirt by groq)
>>
>>102066936
Thanks anon.
>>
>>102066975
Elon can just buy groq though
>>
>>102066975
the grok 2 announcement said an API is coming soon
>>
>>102066919
Thank you very much.
>>
>>102066919
buy a rope
>>
>>102066767
If you hate them so much, consider writing a "phrase_ban" sampler as described here >>102060435
>>
>>102066406
We should get Llama4 sometime early next year. Assuming they don't just drop the 70B model size like they did 13B and 33B.
>>
File: file.png (1 KB, 204x29)
I think I found Rocinante's weakness
>>
>>102067365
Samplers mitigate a problem, but don't fundamentally solve it, is what I think. If it doesn't know how to say something differently, it won't. If an idiot like me could make a sampler that somehow overturns training data, I'm sure someone would've done it already
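To be fair, the banning half is trivial; here's a toy logit-level sketch (nothing official, not necessarily what >>102060435 had in mind): once everything but the last token of a banned phrase is already sitting at the end of the output, drop that last token's logit to -inf. It kills the exact phrase but, like you say, it can't teach the model a better sentence, and it misses re-tokenized variants.

import math

def phrase_ban(logits, generated, banned_phrases):
    # logits: list of floats over the vocab for the next token
    # generated: token ids produced so far
    # banned_phrases: list of token-id sequences that must never complete
    for phrase in banned_phrases:
        head, last = phrase[:-1], phrase[-1]
        if not head or generated[-len(head):] == head:
            logits[last] = -math.inf
    return logits

logits = [0.0] * 32000                         # dummy vocab-sized logits
generated = [42, 1207, 3485, 264]              # made-up token ids
logits = phrase_ban(logits, generated, [[1207, 3485, 264, 3504]])  # 3504 can no longer follow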
>>
File: I AM WARBOSS.jpg (6 KB, 251x186)
What are LLM loras for, exactly?
>>
>>102067534
brainlets
>>
File: FreeDucks.jpg (29 KB, 700x462)
>>102067534
The sloptuners don't want you to know this, but the majority of llms on huggingface are just loras merged with base models; you can merge and unmerge them. I think someone tried merging a shitton of them into one model, and the result was quite sloppy.
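The merge itself is just adding a low-rank delta onto the base weight, which is also why it's reversible; a toy sketch with made-up shapes (real tooling, something like peft's merge_and_unload, does this per layer):

import torch

d_out, d_in, r, alpha = 4096, 4096, 16, 32       # made-up dims
W = torch.randn(d_out, d_in)                     # base weight
A = torch.randn(r, d_in) * 0.01                  # LoRA down-projection
B = torch.randn(d_out, r) * 0.01                 # LoRA up-projection

W_merged   = W + (alpha / r) * (B @ A)           # merge
W_unmerged = W_merged - (alpha / r) * (B @ A)    # unmerge
print(torch.allclose(W_unmerged, W, atol=1e-5))  # True, up to float error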
>>
>>102065208
tl;dr? I ain't reading all that.
>>
Since it's been hours and no ideas were proposed, I moved the story on.
>>102065208

This response shows that the model is now repeating itself more. It also makes the mistake of stating that the meeting is ending when they haven't even discussed literally any of the other things that were going to be planned. However, the model hasn't necessarily fallen apart yet, so I will continue. I think I will set a limit of 30 min for each "turn". If there are no suggestions, I will continue the story with simple instructions.

>>102067760
>[INST] Pause. Make a 2 sentence summary of the story so far.[/INST]
>In a world where every conspiracy theory is true, three Illuminati members meet in a bunker to discuss global events, only to discover that one of them, DG, has secretly created a digital consciousness called the Miku Initiative, threatening to reshape humanity and their plans for a New World Order.
>>
File: 1589200134673.jpg (17 KB, 603x393)
>>102067791
Dead hours right now in general huh. I'll just leave it here for today and continue tomorrow then. Getting late anyway.
>>
I have been llm cooming for the whole day and I regret it. It takes so much work to get something good...
>>
File: 1523783718768.gif (255 KB, 684x325)
Anyone else like this?
>transition to Linux
>not many media viewing applications from Windows have a Linux version
>mpv does so use that for video player
>for images, try Gwenview, nomacs, feh
>all of them are imperfect in some way and don't really do all of what I want compared to Honeyview on Windows
>try modifying the code for them, with the help of LLMs
>kind of works but still not a great solution
>hey what if I just try using mpv's scripting system and try making that work, since it's great with all kinds of media formats
>also use an LLM to do it
>it works
>actually really well
>actually it's better than the Honeyview experience
>so now mpv has replaced my image viewer
>for music, try Strawberry, and it's nice except that it doesn't play arbitrary file formats with audio in them, like webm
>get an idea
>again try replacing it with... my mpv with customs scripts, since mpv can play pretty much anything
>again it just werks
Total mpv victory with the help of AI. Being a nocoder in 2024 is so damn cool. I'm telling you guys, it's amazing. Actually huge. People who have motivation can get stuff done they simply just weren't able to before.
>>
>>102064816
Yeah, it can't do a lot of things properly. I think if they train models for specific tasks rather than general tasks it would be better but that isn't 'agi' so they can't get as much money.
>>
>>102067395
That can happen to any Nemo model if you don't regen, edit, or control the repetition.
>>
>>102068496
what do your custom scripts do?
>>
>>102068496
I have been thinking of doing stuff like that.
I guess I'll actually give it a try
>>
>>102068559
I forgot exactly. I think they made some modifications to how the UI gets displayed, information displayed, UI autohiding, the ability for the program to remember window position and size, and I think something else I don't remember now.
>>
>>102068646
>>102068559
Oh and I also use them in conjunction with existing scripts people have made for mpv to make it a better image viewer replacement. They're on github somewhere.
>>
>>102068660
>>102068646
I see. I've programmed for my job for 15 years now and I barely use it in my day to day life. Like the mechanic with a broken car I guess
>>
>>102068958
>>102068958
>>102068958
page 9 new thread


