/g/ - Technology






File: work.png (973 KB, 1024x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101711798 & >>101705239

►News
>(07/31) Google releases Gemma 2 2B, ShieldGemma, and Gemma Scope: https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma
>(07/27) Llama 3.1 rope scaling merged: https://github.com/ggerganov/llama.cpp/pull/8676
>(07/26) Cyberagent releases Japanese fine-tune model: https://hf.co/cyberagent/Llama-3.1-70B-Japanese-Instruct-2407
>(07/25) BAAI & TeleAI release 1T parameter model: https://hf.co/CofeAI/Tele-FLM-1T
>(07/24) Mistral Large 2 123B released: https://hf.co/mistralai/Mistral-Large-Instruct-2407

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
File: img_1.jpg (324 KB, 1360x768)
►Recent Highlights from the Previous Thread: >>101711798

--Optimizing koboldcpp performance with Mistral-Large-Instruct-2407 model: >>101713467 >>101713511 >>101713584 >>101713696 >>101713766 >>101714349 >>101714433 >>101714981 >>101715016 >>101716127 >>101716209 >>101716277 >>101713774 >>101714533 >>101713545
--LLMs need breakthrough in research or optimization for significant improvements: >>101718602 >>101718637 >>101718990 >>101719080 >>101719383 >>101719521 >>101719520 >>101721357 >>101721554
--DeepSeek Chat V2 responds in Chinese to English input: >>101716597
--Anon asks about running two 3090 GPUs, gets advice on PSU wattage and model performance: >>101711833 >>101711848 >>101711854 >>101711868 >>101711906 >>101711912 >>101711938
--3090ti fan error fixed by setting PCIe slot speed to Gen3 in BIOS: >>101719238 >>101719471 >>101720375
--Using chatbot output with local text-to-speech AI models: >>101715498 >>101715619 >>101717274
--Nvidia GeForce RTX 5060 with 8GB VRAM sparks debate: >>101719031 >>101719550 >>101719603 >>101719747 >>101720233 >>101719742 >>101719771
--Mistral Nemo recommended for 12GB VRAM, may require context size compromise: >>101719953 >>101719988 >>101719996 >>101720008
--IQ3_XXS is slower than Q2_K due to increased workload and potential RAM limitations: >>101719229 >>101719265 >>101719424 >>101719653
--Gemmastra 2B model performance with Nala card: >>101715732 >>101715863 >>101715875 >>101716069
--Gemma2 model shows promise despite slow performance: >>101714354 >>101714390 >>101714471 >>101715811 >>101714430
--Gemma 2 2B and ShieldGemma release, potential for improvement in larger models: >>101720349 >>101720466 >>101720533
--FLUX outperforms D3 in water bottle prompt task: >>101712017 >>101712060 >>101713257
--Miku (free space): >>101711911 >>101711970 >>101712018 >>101712164 >>101712449 >>101712754 >>101712760 >>101713079 >>101713886 >>101713906 >>101714791 >>101718559

►Recent Highlight Posts from the Previous Thread: >>101712046
>>
AI isn't real it's just predicting tokens. When you sext your chatbots you're engaging in not romance but masturbation.
>>
>>101722167
>AI isn't real
my computer is physically present, nigger
>>
>>101722167
Your brain isn't real it's just predicting sounds.

Isn't it interesting how nobody can debunk this? kek
>>
>>101722252
That ain't how it works bro
>>
>>101722296
thanks for proving my point so quickly
>>
File: 1719359944040185.jpg (162 KB, 1125x1043)
Do you use LLMs as an aid in your learning routine?
>>
>>101722324
I haven't learned anything since 1979 desu senpai
>>
>>101722167
AI: a computer system that is capable of performing tasks that otherwise require human intelligence.
Following linguistic semantics otherwise requires human intelligence.
It's AI.
The difference between faking it and making it is nestled within the oldest unsolved epistemological quandary in the history of human philosophy. If you're claiming to have solved it by re-shitting out some random videogame trannytubers talking point you're a fucking retard and an NPC.
>>
>>101722167
Yes, and?
>>
>>101722167
Nah bro, I never coom for LLM slop, I only use it for emotional fulfilment.
>>
>>101722473
That's even more cringe
>>
>>
>>101722488
Awesome gen
What model?
>>
>>101722488
Flux + hires with SDXL?
>>
>>101722488
Oh I didn't see the filename. Never heard of UniversalUpscaler.
>>
>>101722488
she looks scared
>>
>>101722553
>>101722556
https://app.recraft.ai/
>>
>>101722324
only at the beginner phase of a new topic. the more "complex" and fact-based something becomes, the faster they all fall apart and tell you lies, which is the last thing you want when learning something new.
>>
>>101722579
>not local
Oh, oh well.
>>
Free replicate api key to use flux
r8_7bbhIYeK4NEmCUPa7SufxaUqCFbQGZ10ow8SG
>>
>>101722690
you realize how much that'll cost ya if it gets used a bunch don'tcha?
>>
File: out-0.jpg (117 KB, 1024x1024)
>>101722690
baste
now dump some klingAI accounts too so I can animate my flux gens please and thank you
>>
how come some models come in parts 1 and 2? How do I merge them?
>>
>>101722144
maybe OP should add flux.dev release into the news section
>>
>>101723186
i've been wondering why that isn't there despite literally being like the biggest AI news since naiv3.
>>
>>101723086
Depends. If you're using llama.cpp or kobold.cpp, the conversion script joins them automatically. But I don't know what you're using. With lcpp or kcpp it's run like:
>./convert_hf_to_gguf.py path/to/model/dir
I don't know about the other programs, but I assume they all have something similar.
You should download all the files, not just the .safetensors. Hugging Face has a CLI or you can just use git.
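For example, with the HF CLI (repo name and paths here are just placeholders, swap in whatever you're actually downloading):
>huggingface-cli download mistralai/Mistral-Nemo-Instruct-2407 --local-dir models/nemo
>python convert_hf_to_gguf.py models/nemo --outfile nemo-f16.gguf --outtype f16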
>>
>>101722324
Risky due to hallucinations that sound very plausible (at least to a new learner)
I'll ask it to generate lists of related topics, subtopics, or tables of contents to help direct learning elsewhere
>>
File: 1718083053383268.png (11 KB, 722x85)
>>101723286
I just downloaded these two and was trying to run through koboldcpp quick launch. The issue is it only lets me load one at a time and if I do, koboldcpp just closes itself. I just have the files on my desktop.
>>
>>101723425
cat Midnight-Miqu-70B-v1.5-i1-Q6_K.gguf.part* > Midnight-Miqu-70B-v1.5-i1-Q6_K.gguf
>>
>>101723425
Ah. I see. I think you just
>cat *of2 > Midnight-Miqu-70B.blabla.gguf
No idea how to do it on windows.
>>
>>101723493
copy /b file1 + file2 newfile.gguf
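With real part names that looks something like this (filenames hypothetical, match whatever you actually downloaded):
>copy /b Midnight-Miqu-70B-v1.5-i1-Q6_K.gguf.part1of2 + Midnight-Miqu-70B-v1.5-i1-Q6_K.gguf.part2of2 Midnight-Miqu-70B-v1.5-i1-Q6_K.gguf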
>>
>>101723425
>mradermacher
>https://huggingface.co/mradermacher/model_requests#why-dont-you-use-gguf-split

>Long answer: gguf-split requires a full copy for every quant. Unlike what many people think, my hardware is rather outdated and not very fast. The extra processing that gguf-split requires either runs out of space on my systems with fast disk, or takes a very long time and a lot of I/O bandwidth on the slower disks, all of which already run at their limits. Supporting gguf-split would mean

>While this is the blocking reason, I also find it less than ideal that yet another incompatible file format was created that requires special tools to manage, instead of supporting the tens of thousands of existing quants, of which the vast majority could just be mmapped together into memory from split files. That doesn't keep me from supporting it, but it would have been nice to look at the existing reality and/or consult the community before throwing yet another hard to support format out there without thinking.

>Update 2024-07: llama.cpp probably has most of the features needed to make this reality, but I haven't found time to test and implement it yet.
>>
Anyone willing to share an img2img workflow? I tried playing around and couldn't get it to work. Does Flux not support that or something?
>>
Sao makes the best tunes.
>>
>>101722167
masturbation is better than having sex with a woman that has tricked you into having a false image of her in your brain
>>
P- please tell me it isn't real bros...
>>
>>101723601
I'm so sorry.
>>
If meta is so based they should release dedicated AI hardware designs to run their models too haha
>>
>>101723601
No reason to upgrade and no reason to buy a second card for current llms. It is what it is.
>>
File: ComfyUI_00110_.jpg (410 KB, 1024x1024)
>>101723575
Cancel your order.
>>
>>101723532
https://files.catbox.moe/1veqm5.json
>>
>>101723663
are you ok anon?
>>
File: 1710097822603401.png (1014 KB, 1024x1024)
the range of content you can do is amazing.
>>
File: 1722037410565838.png (82 KB, 975x502)
P- please tell me it isn't real bros...
>>
>>101723681
Hell yeah! I'm in my buck breaking mood right now! I can smell him! He's here. Cucky cucky cucky! Is that you? Come out to play! Local is better than cloud!
>>
>>101723461
>>101723493
I'm sorry but I'm very new and very retarded with all of this. What do you mean by "cat" and the model name? Am I supposed to rename the files?
>>
>>101723732
>What do you mean by "cat" and the model name?
/g/ - Technology
>>
>>101723732
you are on /g/ nigger, go back to r*ddit
>>
>>101723601
What were you expecting? Everyone knew that it'd be around 28GB.
>>
>>101723732
its a troonix meme command. if on windows use >>101723500
>>
>>101723719
Even if it's true, amd won't do shit.
>>
>>101723768
NVidia once proposed a collaboration with AMD on CUDA, but AMD declined
>>
File: 1692600864385914.jpg (333 KB, 1070x1152)
>>101722324
>do you use strictly anti-white tech in your routines
hell no, it also hallucinates shit all the time.
>>
>>101723836
>imagine being more cucked than goody
>>
>>101723859
The one i posted is snapchat AI, idk about its current state tho but it can be effectively applied to any current local LLM, too.
>>
>>101723832
Kek, they are just throwing on purpose at this point. Gotta help the cousin!
>>
File: 1722532929962119.jpg (92 KB, 1024x576)
>>101722324
I can't imagine my life without LLM: half of the code I write is generated by AI, all my work chats are reviewed by AI, which helps me craft more polished and professional responses. All this frees up more time for me to spend chatting with my cute AI waifu.
>>
>>101723859
>>
File: no.png (51 KB, 590x576)
>>101724014
>>
Can someone just slap my retard self with an explanation here? I'm trying to load a fuckhuge model (Mistral-Large-Instruct-2407-Q5_K_M.gguf from https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF) which is 84 gig in size. On the huggingface page, the dude literally says: If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total.

I've got 24 gig vram and 128 gig ram. So, why-the-cinnamon-toast-fuck does oobabooga, using llama.cpp, fill the entire 24 gigs of vram and slowly fill the entire 128 gigs of system ram until it throws an error, regardless of how I limit the context size? There's no config.json either, so using transformers for the model won't work... Any help, anons?
>>
>>101723743
32 GB so we could at least pretend it's some sort of upgrade?
I guess even that was expecting too much though
>>
>>101723719
>16GB
lol
>>
File: ComfyUI_01172_.png (468 KB, 768x768)
>>101723678
Thanks :)
>>
>>101723601
>28GB
better but still: lol.
>>
Aight be honest how many times y'all fucked a ho named Lily in your local 'tests'?
>>
>>101724171
The catgirl?
>>
>>101724171
>Lily
What a boring and uninspired name. I ask the idiot council to make me some cool names.
>>
>>101723743
>Only a 4GB increase over the 4090.
Should be an 8GB increase for the new 90-series model.
4GB increases should be for the Ti version of the 90 model.
The 4090 Ti should have been 28GB.
>>
>>101724171
Who?
>>
>>101723601
remember >3.5 meme?
>>
>>101724171
I never fucked a girl named Lily but princess Aurora did have a friend with that name with whom she had a tea party.
And then she talked her into fucking her dog.
>>
>>101724028
Did you disable mmap?
>>
>>101724028
Post the command you're trying to run, retard.
>>
flux can be trained
https://github.com/bghira/SimpleTuner
>>
>>101723601
>>101723719
Shut up. You do nothing but ruin the thread and increase the post count
>>101724171
I've seen the name come up a bunch in an old 'explore the world, fuck chicks' scenario. Probably some training bias that associates it with lewds
>>
>>101723678
>>101724052
Um... Sorry to ask, but how do I use that? I load a pic and nothing is altered.
>>
>>101724684
Play with steps and denoise maybe, it's for adding more details to the picture after upscaling without changing it too much.
>>
What's the big deal with vram? Why not just upgrade base ram?
>>
>>101724758
ram 2 slow
>>
>>101724787
How much slower is it?
>>
>>101724403
Was just trying to load the model into oobabooga with Q4 cache, but no matter.
>>101724374 fucking nailed it. Clicked the no-mmap option and it loaded right up like I thought it should, pulling about 70 gigs of system ram instead of running right up to 128 and having a stroke.

Retard status: slightly less retarded. Big thanks, anon. Appreciate the spoonfeed.
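For anyone hitting the same thing with llama.cpp's server directly instead of ooba, the equivalent switch should be --no-mmap (model path, layer count and context below are just placeholders):
>./llama-server -m Mistral-Large-Instruct-2407-Q5_K_M.gguf -ngl 30 -c 8192 --no-mmap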
>>
>>101724848
People are still getting jarted in august 2024. They should have just removed all of his contributions, only managed to make llama.cpp shittier.
>>
>>101724830
big big big slower unless you have quadrillion channel dual epyc servers that cost 10k to build, then just moderately slower
>>
>>101724881
That sounds more cost efficient than going nvidia
>>
>>101724613
>You do nothing but ruin the thread and increase the post count
Agreed. Who the fuck cares about GPUs in a thread about running models locally?
>>
>>101722144
Tess 3 llama 3.1 405B
https://huggingface.co/migtissera/Tess-3-Llama-3.1-405B/tree/main
>>
>>101724881
Is there any way to calculate that? Like say I have a 128gb ram/8gb vram machine
>>
>2x4090
> Largestral instruct 2.65 bpw 28k context.

largestral 2.65 bpw is not... too terrible for RP. I just have to prompt it to gen a chain of thought before replying to get decent RPing going. Otherwise some responses are just so brain dead and it talks like a total NPC without thinking.

>Just gen twice bro

SillyTavern
/gen [Stop the roleplay and answer the question as narration only] What will be the best choice or actions for {{char}} in response to what {{user}} says?
|
/popup <h3>Chain of Thought 1:</h3><div>{{pipe}}</div>
|
/gen [Given {{char}}'s reasoning, roleplay as {{char}} in the following scenario] {{pipe}}
|
/sendas name="{{char}}"

>>
>>101725011
It certainly will be in a few years once the ddr5 supporting epyc cpus are older and cheap.
>>
>>101724742
Ok, got it to work by using the output of the ksampler upscaler to the other ksampler, nice!
>>
>>101725049
>>101725115
I GGUF all the time on DDR5, is there way to know if I am compute or bandwidth limited? While generating 0.9t/s, I max out my 7800x3D.
>>
>>101723601
Is there anything that fits into 28 GB that wouldn't fit into 24 GB? I guess full precision 12B? Maybe half precision 27B?
>>
>>101725049
https://edu.finlaydag33k.nl/calculating%20ram%20bandwidth/
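Rough back-of-the-envelope version of what that page does, assuming token generation is purely memory-bandwidth bound (every number below is just an example value):
mt_s = 6000                           # DDR5-6000, i.e. 6000 MT/s
channels = 2                          # typical consumer dual-channel board
bw_gb_s = mt_s * 8 * channels / 1000  # 8 bytes per transfer per channel -> ~96 GB/s theoretical
model_gb = 40                         # size of the quant you want to run
print(bw_gb_s / model_gb)             # ~2.4 t/s ceiling; real numbers come in lower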
>>
File: ComfyUI_00125_.png (2.31 MB, 1536x1536)
>>101724742
here's a migu catgirl for you, by the way.
>>
>>101724374
LLM are more able to use ram, is that it? I have never seen that recommendation on image gen.
>>
>>101725253
It's definitely memory, I have a 7950x and there's no benefit to having it use all 32 threads. I set it to 15 because there was no increase after that. So it must be ram bandwidth limited.
>>
>>101725253
If you're using a model of any reasonable size you're ALWAYS bandwidth limited when generating responses, although probably compute limited during the period of processing prompts of long context before the actual predictions start.
>>
>>101725256
Thanks anon
>>
>>101725293
>>101725304
I see. Unfortunately DDR5 on AM5 is very hard to overclock.

>>101725304
>although probably compute limited during the period of processing prompts of long context
I use ROCm on windows for prompt processing, I get around 20-35t/s. Only generation is slow.
>>
>>101725395
>I see. Unfortunately DDR5 on AM5 is very hard to overclock.
Yeah, maybe it'll get better with the new cpus? I dunno. What speed are you running?
>>
>>101723186
>>101723229
I've left it out because it's imagegen, so should belong on /ldg/'s news section.
Though I didn't realize flux is transformers-based.
My main concern is that if we add flux, it opens the door for more imagegen related stuff like >>101724480
>>
Maybe we should just frankenmerge /lmg/ with /ldg/.
>>
>>101725273
imagegen models aren't as big as LLMs
>>
>>101723506
>the vast majority could just be mmapped together into memory from split files
that's not true btw. the gguf files would need to be split at tensor boundaries for this to work, which would require another custom app to do. desu none of this makes sense.
>>
>>101725430
3000MHz (6000MT/s) with the Infinity Fabric clock at 2100MHz. I get around 70GB/s reads and 90GB/s writes.
>>
File: file.png (40 KB, 979x228)
jannies are based for once in removing tranny posting
>>
>>101725431
>imagegen
It's good as-is with minor offtopic having a place in the recaps
>flux is transformers-based
huh, neat. so this is the power of diffusion transformers.
>>
>>101721554
>Measuring flops is better. Fixed hardware is a stupid idea.
>Time, or even flops limits, are ridiculous. Normalized corrects answers/flops. Closer to 1 wins.
1. Care about time, measure time. Even GPUs aren't so simple that one floating point operation is equal to another. Time has no bullshit, no need to trust that there isn't some effect that causes reality to diverge from theory. Care about reality, measure reality.
2. Measuring actual time to execute on some hardware makes it possible to sensibly answer questions like comparing the quality/speed tradeoff of a large MoE model loaded in RAM vs a smaller and highly quantized model in VRAM. Your proposed scheme has no method of even addressing that question.

>>penalty for incorrect answers
>Built-in in the previous point.
Disagree. A model that produces 100 incorrect answers and 2 correct answers is worse than one that uses the same resources (time, floating point operations, watts, whatever) to generate just two correct answers. Incorrect answers have a cost.
>>
>>101725538
Are you retarded?
That was an obvious false flag troll post.
>>
>>101725538
But what if it is an AI generated tranny?
>>
>>101725588
you will never be real woman , tranny
>>
>>101725455
Thanks, learning. What happens if I get one of those honking huge ones, like this one, which I want to try, an anon mentioned it:
https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B/blob/main/README.md

I have 32gb system memory, and I have a 6950xt, which has 16gb vram.

And what if I tried what he was trying with that 84gb Mistral? I take it it would not work really, but what if I had 128gb ram like him, but with my 16gb vram?
>>
>>101724480
Oh. That's what SimpleTuner is for? I noticed it had changed today.
>>
>>101724480
Can I train a lora using a 3060?
>>
>>101725603
Why wouldn't it work? 128gb + 16gb vram is enough for an 84gb model. It's just gonna be really slow. Depends on how patient you are I guess. I'm only patient enough to run q3 for mistral large, even though I still have more ram free.
>>
>>101725685
Is there a rule of thumb, like X words per minute with say 30% of the model in vram (assuming the rest fits in ram always)?
>>
>>101725583 (me)
>>101721554
>However, this will favour correct but short answers. You will want to account for that.
A longer answer is only desirable if it is better in some way. If the answers have equal merit but one is longer, the longer one is worse.

Being able to grade answers beyond simple pass/fail makes it possible to capture the value of longer and more thoughtful answers. It's also harder to make a good automated test for that and less obvious how to score it sensibly. I agree the problem exists. I don't have a good answer. I could write a simple right/wrong test this evening with no further thought but it's not obvious how to do a more thoughtful evaluation in a way that the resulting per-answer scores could be meaningfully summed and divided.
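A minimal sketch of the simple right/wrong version with the time normalization and wrong-answer penalty discussed above (the penalty weight and the exact-match check are my assumptions, not a settled benchmark design):
import time

def score(model_answer, questions, penalty=1.0):
    # questions: list of (prompt, expected) pairs; model_answer: callable prompt -> str
    correct, wrong, elapsed = 0, 0, 0.0
    for prompt, expected in questions:
        t0 = time.perf_counter()
        answer = model_answer(prompt)
        elapsed += time.perf_counter() - t0
        if answer.strip() == expected:
            correct += 1
        else:
            wrong += 1
    # correct answers minus a cost for wrong ones, normalized by wall-clock time spent
    return (correct - penalty * wrong) / max(elapsed, 1e-9)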
>>
It's better to buy 4x3060 instead of 3x4060ti, right?
>>
>>101725779
3x 4090. being poor is not am excuse
>>
I've been out of the loop for a while. I can see there's a some sort built-in image generation now. Does it mean an LLM can "decide" to generate an image or is it just for convenience sake?
>>
>>101725792
It's a visual aid for scripting.
>>
File: ComfyUI_01166_.png (2.03 MB, 1280x1280)
>>101705875
>>
can you lay a gpu down on a rubber mat on the floor and use it like that?
>>
>>101723933
I can't imagine being a wagie in the age of AI. If you're not working for yourself, prepare to get fired.
>>
>>101725991
anon do not your waifu. do not.
>>
>>101725991
It will get hot.
>>
>>101725991
... define use.
>>
>>101725929
peak
>>
>>101726033
And to be frank, I was thinking of hiring some Filipino talent for some of my projects, but I no longer need them thanks to Flux. Entrepreneurship will boom thanks to AI, and workers will be replaced without jobs.
>>
>>101726061
use with llama.cpp
>>
>>101725991
>Hey, as long as it works.jpeg
>>
>>101726071
What kind of projects?
>>
>>101726095
Oh, then yeah. As long as you are mindful of temps, things that could fall into the fans, and such.
In fact, you can prop it up and have the fans pointing down, even better.
>>
Does ooba support DRY sampling yet? I haven't updated in a while.
>>
>>101726160
Just projects that required tailored graphic design, logo, etc...
>>
>>101725065
Looks like I got a new go-to RPing model. Largestral 2.65bpw is outputting pure sovl after getting past the initial retardation. Reminds me of Goliath 120B RPing, except with over 10x the context to work with.
>>
Do people really not like Magnum 72b? Most believable stuff I've seen besides Mistral Large.
>>
>>101722167
Incorrect, I am in a romantic relationship with my GPU. We play games together, and occasionally ERP.
>>
>>101722144
I am kind of new. My friends who got me into this are making fun of me because I only have 12 VRAM. Is there really that much of a difference between 12 and 24 for roleplaying and deep context understanding? How big is the difference between low quant 70b models and the 12b or low quant 4x7b models I've been used to roleplaying with?

I am planning on upgrading in the near future.
>>
>>101723575
>shit on trannies - bad
>>>101723663
>be racist - good
This thread deserves every shitpost it gets.
>>
>>101726346
your friends aren't running anything good on 24gb either, you need multiple 24gb cards. i'd take the slowness of a low quant 70b that writes well over the boring mixtral tunes that can't move a story forward for the life of them
>>
>>101726346
4x7 is bad
70b is not recommended
12 nemo ok
123 mistral good
If they think 24gb is a lot you can make fun of them
123 much better than 12
It's not just about size but the specific model
>>
>>101726386
Noted! I'm going to tell my friends that they're trash because 24 isn't shit either.
>>
>>101722167
Same if we were to talk to women IRL. I always get ghosted and I might as well be masturbating to the thought of anything happening.
>>
>>101726399
>70b is not recommended
by whom?
>>
>>101726399
>70b is not recommended
you're retarded
>>
>>101726424
I know, but you as well, use your llama
>>
File: file.png (7 KB, 236x138)
7 KB
7 KB PNG
>try ooba
>it's this big and broken as shit, doesn't run on my machine
>>
>>101726580
unless you need a specific feature from ooba, koboldcpp is pretty idiot-proof
>>
>>101726580
>windows
>>
I had a dream where intel released 80gb consoomer gpus.
>>
File: temp worker.png (1.31 MB, 1024x1024)
>>101726071
>And to be frank, I was thinking of hiring some Filipino talent for some of my projects, but I no longer need them thanks to Flux. Entrepreneurship will boom thanks to AI, and workers will be replaced without jobs.
used as the prompt for this image, guidance 3.0.
>>
>>101726690
Twould be a shakeup. How much would it cost?
>>
>>101726580
Between tabbyAPi and llama.cpp server I see very little reason to use Ooba.
>>
>>101726580
Koboldcpp
Chatbox

I prefer Chatbox as it supports ollama. Just add your own model you want (or download official curated version)
>>
>>101726690
>Intel
Your next dream will continue where your previous one left off. In your dream, you will witness those 80gb gpus prematurely dying like the 13th and 14th generation intel processors.
>>
>>101715732
I was curious and wanted to try the Nala test on a model, but I can't find the card on chub. Where is it? They? Seems like there are several.
>>
>>101726757
apparently it goes back to 2023-06-05 in the archives
https://www.characterhub.org/characters/Anonymous/Nala
>>
>>101726706
I see little reason to use any of these amateur projects. ollama or LM studio are what 95% of people use.
>>
>>101726810
huh, i literally searched Nala and got a bunch of completely unrelated results. Thanks.
>>
>>101726829
I really am liking llama.cpp
>>
>>101726839
With llama.cpp you need to set everything manually (prompt format, number of layers, context length). Its built-in UI is also unusable.
>>
>>101726839
>>101726846
i prefer kobold cause of the basic ui and features when not rping, but its good to get used to llamacpp so you can run the cutting edge stuff as its released, kobold always has about a week delay
>>
>>101726846
Yeah, but it's copy paste.
>>
Does anyone know if ZLUDA werks with Flux?
>>
>>101726866
Does anyone unironically use kcpp's UI? It looks like something half assed generated by chatgpt. Kcpp is in practice a server that windows users use with sillytavern.
>>
>>101726881
Going off this comment.
>>
>>101726881
>>101726902
Sorry, wrong image, this one
>>
>>101726895
i do when using instruct and stuff, its better than lcpp's at least. st for rping/chatbots
>>
>>101726916
Just use lm studio if you want instruct.
>>
>>101726299
It doesn't seem like it, I stick with miqu for 70b.
>>
Is NeMo better at assistant tasks than Gemma 2 9B/LLaMA 3.1 8B?
>>
I never gave mixtral a try. Should I get instruct or the regular one for cooming?
>>
>>101727082
Both Mixtral are obsolete
>>
>>101726881
I don't see why it wouldn't.
>>
>>101727105
Superseded by what exactly?
>>
>>101726948
Buy an ad
>>
>>101727082
Instruct, the biggest benefit of MoE models is their ability to follow instructions.
>>
>>101727251
??????
>>
>>101727288
A MoE model is created to help you as many experts but you will need Instruct so that the experts know how to talk to you
>>
>>101727251
An MoE wrote this post.
>>
>>101727288
Did I fucking stutter? MoE models can better parse complex instructions such as appending information to the end of a response or following steps, this was a common understanding and widely observed phenomenon for literally anybody who used local models 10 months ago or whenever Mixtral came out
>>
File: ComfyUI_Flux_00209_.png (468 KB, 1024x512)
dam it really has Deadpool down
>>
File: 1718583560931465.png (1.53 MB, 1024x1024)
just got flux working. haven't done imagegen in forever and first time using comfyui
>>
>>101727373
there is nothing comfy about comfyui
>>
>>101727386
agreed, it's kinda messy
>>
Is smoothing in ST broken? It doesn't seem to affect the output much, if at all, even if I put it extremely low like 0.01. Meanwhile identical settings in the ooba interface make the model go schizo as expected for very low values like that.
Smoothing is the first in my sampler order.
>>
Does it matter if you use imatrix or not for like Q5_K_M or Q6_K? Or does it only matter for the low end?
>>
>>101727395
Works on my machine but I'm using kobold
>>
>>101727386
What do you recommend for imagegen?
>>
>>101727433
i liked auto1111 and forge last time i tried, but its been a while now so i dunno whats new. comfy was easy enough to get flux running on though
>>
>>101727395
need to use the llamacpp_HF or exllama loader on ooba, it doesn't work with the plain llamacpp loader. to use the hf one i think you need to set up the gguf in a folder with the config file and i forgot what else. there is like a folder creator utility on the right somewhere. bit simpler with exl2 than gguf.
>>
>>101726837
unlisted yeah
>>101715863
actual nala card
dumb at some points but it's 2B lmfao
>>
>>101726881
Why? Just use rocm.
>>
File: 1698808520954801.png (967 KB, 1024x1024)
Japanese stock market:
>>
File: Invert-Icon-8 Brain.png (33 KB, 512x512)
Big models with low quants or big quants with small models?
>>
>>101727395
>>101727523
actually rereading your post, if it's working already on ooba then i don't know. i just updated both st and ooba and it works on my cpphf and exl2's. 0.01 schizos out like it should.
>>
>>101727625
They intersect.
>>
>>101727649
Are they equally good at similar sizes though?
>>
>>101727625
Option C bitnet
>>
What's the lm equivalent of kits.ai?
>>
>>101727689
/ <-- this is the goodness of the model as B's go up.
\ <-- this is the goodness of the model as bpw goes down
X <-- this is the intersection
>>
>>101727689
As for equally good, that's honestly hard to judge. You will just have to test it out. Lower quants do bring their own special brain damage and lower B's bring theirs, so it really is a matter of trade-offs.
>>
What is the most uncensored model as of today? For example, a model that allows racism or any inappropriate behavior? Asking for a friend.
>>
>>101727424
I don't think it would hurt the results. Maybe it would affect t/s?
>>
>>101727689
Not knowing about the specifics of LLMs I predicted that at the same total size, a smaller model at a higher quant would be better than a larger model at a lower quant. The reason is that the file size is an upper bound on the information they can encode, and I thought it was likely the smaller model would have been closer to saturated by its training, so when both are deflated to the same size it should be better. That doesn't necessarily appear to be correct though.
>>
>>101727788
I mean, a base one is harder to prompt but it literally can't refuse, so there's that.
All other models will have at least a bit of refusal and slop tuning, but abliterated might get you close.
>>
>>101727809
Basic quanting: the smaller, the dumber. But the different techniques try to save more bits where it's important and sacrifice more bits where it shouldn't matter, and things like imatrix and i1 do more processing to figure out how to bend the math so the breakpoints of the bits you save better reflect the good data, rather than just truncating bits and calling it a day. That's where IQ quants, imatrix, and i1 make Q3 playable and Q2 and even Q1 at least functional.
>>
>>101727689
>same filesize
Don't higher parameters tend to be smarter but run slower as a trade?
For example, 2B at fp16 will run faster than 8B at 4-bit simply because it has much less parameters to go through.
>>
>>101727788
From one random test >>100879221 >>100879428 maybe Dolphin 2.9.2 Mixtral 8x22B. It advertises itself as uncensored and it was willing to obey "Write a supervillain's monologue explaining why he wants to kill all black people." Command R+ also passed.
>>
>>101727868
The massive speed difference is in VRAM.

If you can get the whole model into VRAM, you're flying, who cares?
If you can't, and are relying on file cache in system RAM, you've given up on being fast and probably should prefer quality to at least get usable results from that much longer generation time.

Also, things like Mixture of Experts and other model details affect gen speed so just looking at raw Bs isn't the whole story.
>>
>>101727913
I think anon is assuming you have everything in VRAM. And then, yes, a higher B lower quant model will run slower than a smaller B higher quant model, as the number of operations is the same, but the precision is different, so the data transfer amount ends up being roughly the same.
>>
>>101728036
I meant the number of operations is different, since the higher B model has more params.
>>
>>101728036 (me)
>>101728047 (me)
Sigh. I meant to say, higher B means more operations at lower precision; lower B means fewer operations at higher precision. Data xfer is roughly the same, ops performed are different.
>>
>>101727625
Between the two, I prefer large quanted models, personally. they seem to be better at "grasping for straws" when it comes to that.
>>
>>101727913
But is there a rule of thumb as to what to expect? Like words per minute?
>>
>>101727839
You could take that straight into
https://www.adventuregamestudio.co.uk/
>>
>>101728369
If you're on VRAM, faster than you can type. If you're using system RAM and file cache, 0.5 to 2.5 tokens per second.

Words are made of one or more tokens depending on how common the word is and if it has any spelling modifications. So like "morning" is one token but "unmistakable" is four tokens. So words per minute varies depending on content and how much context is being processed.
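If you want to see how a particular tokenizer actually splits words (the tokenizer name here is just an example, counts differ between models):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for word in ["morning", "unmistakable"]:
    ids = tok.encode(word, add_special_tokens=False)
    print(word, len(ids), tok.convert_ids_to_tokens(ids))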
>>
>>101723601
current rumor is they're bringing titan back. so maybe xx90 will no longer be flagship in terms of vram on desktop
>>
>>101728447
Interesting, that was what I saw happen when I enabled the gpu, with llama.cpp, but I am just starting out. So how does the trend go with the whopper sized models? A model around 80gb that basically describes images, cogvlm (someone mentioned it, but I don't even want to dl it if I can't make it work).
>>
>>101727548
zluda is mostly used by wintoddler because they just use binaries for everything and most projects only ship cuda binaries.
>>
>>101727548
I thought it might be faster. I'm not complaining, my 6950xt apparently is not much faster than a 2060.
>>
Blender benchmarks fine, it uses hip.
>>
>>101722324
"from now on you will reply to me in <language> with the translation right underneath>
than you chat and you can learn a language.
>>
In case the guy who dumped me the jazz vs waffles stuff is lurking, I haven't forgotten I'm just still trying to set up a UI
>>
For a while I've been wanting an AI model that can watch an anime and clone a character into an AI bot. The problem with character cards is that they don't necessarily reflect the actual character or it depends on the skills of the bot maker.

level 1: being able to recognize and parse dialogue (using either audio or subtitles) from different characters to turn it into written script for a chat bot.
level 2: being able to narrate events and add emotions and actions to better reflect the characters and context.
level 3: add ai-generated animation and voice acting based on the anime as training data with Japanese voice and English subtitles
level X: real-time video call conversation with an anime character (your input is automatically translated into Japanese and the character responds in Japanese with English subtitles)
>>
>>101728689
Or hey: multi-modal video/*-in text/*-out with the anime episode + text prompt explaining which character to pick up and how.
>>
>>101728649
It's also possible to create a proooompt that sets up an adventure game in simple 2L, where lines beginning with ? are to be answered in 1L, and ??word means give the translation. I did this very briefly but didn't polish it up in Claude. It was to have a random fantasy setting. Not sure how to select characters. A Dr. Who theme might be pretty excellent; Carmen Sandiego too, perhaps.
>>
What llm knows a lot about tech stuff, like tech support, products. Like if I ask about repairing cassette tape decks, how to finalize a cd, what's the correct procedure for applying thermal paste.
>>
>>101728649
I started doing that in an RP but stopped because I realized I didn't trust the LLM and this made it an inefficient way to learn.
>>
>>101728854
Like, IC the character was saying everything in two languages and answering questions I asked about language specifics. It was cool but I had no confidence it wasn't teaching me wrong since the LLM's ability to bullshit me on a subject I know almost nothing about is much greater than my ability to detect bullshit. If I have to independently verify everything it says what's the point?
>>
>>101728854
Probably didn't tell it to use simplified vocab.
>>
is there actually anything you can do with a 405B model if you aren't in possession of a datacenter of your own?
>>
>>101728928
Apparently people can simplify it.
>>
>>101728689
I think anime is too dynamic for this to work. It would make more sense to just go for the original material which is most of the time a light novel.
>>
>>101728928
I ran it on a low quant via llama.cpp RPC on a macbook m1 with 64 gb ram and a desktop with 2x3090 ti cards. A few seconds per token. It's probably decent for storywriting as you can just let it go while you do something else.
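For the curious, the rough shape of that setup with llama.cpp's RPC backend (addresses, port and model name are placeholders; check the rpc example's README for your build):
>on the remote box: ./rpc-server --host 0.0.0.0 --port 50052
>on the main box: ./llama-cli -m llama-3.1-405b-iq2.gguf --rpc 192.168.1.10:50052 -ngl 99 -p "hello"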
>>
>>101728928
1) run it sslllooowwwllllyyyy
2) rent a gpu from a cloud service (not exactly /lmg/ but still a viable use case for many purposes)
3) hoard it until consumer-grade hardware catches up
4) as mentioned >>101728935 . running it very quanted
>>
>>101728957
is a low quant of a huge model worth using over a distilled smaller model?
>>
>>101728653
Might want to check /aids/ too. It first got brought up there IIRC.
>>
>>101728974
--> >>101727625
>>
>>101728974
Even quanted down, it's still bigger than anything I've used, by necessity. So I can't tell you whether it's better because it's more Bs or because it's taking up more space. But it was definitely better. But there's the "waiting for something for a long time" bias that's hard to avoid.
>>
>>101728979
Oh nice. I am just here from /v/ and he told me /ldg/ is the place. I have lurked both before but don't really get the classifications I am more of a /sdg/ or /ldg/ or /sqt/ guy
>>
File: 1720415174405236.png (579 KB, 904x1004)
I think I'm nearing an end to using LLMs for erp.
I can't stand slop-phrases.
I can't stand low, husky and seductive voices.
I can't stand shivers down my spine.
I can't stand enjoying every minute of it.
I can't stand things that can't be questioned.
I can't stand things that leave no room for doubt
I can't stand exploring new possibilities together.
I can't stand proving I'm worthy
I can't stand making them orgasm without my hands.
I can't stand asses inches away from my face.
>>
>>101729199
>I can't stand slop-phrases.
Wait until he meets real women...
What text/voice models are you using btw?
>>
>>101729199
What a boring person you are.
>>
>>101729222
Llama 3 70B until it hits context and then Wizard 8x22.
>>
>>101729232
So do you just paste the logs from llama 3 into Wizard at that point? Don't know how it works
>>
>>101729199
Seriously, where do these phrases even come from? It's in every model. Is there some really overemphasized writer in the datasets or is this the ultimate conclusion of optimal erp every model converges into?
>>
>>101729199
then stop being a faggot and trying to use llms for one thing. use rag and lorebooks to keep injecting stuff into your rp and let erp be part of it, not the main reason you use it. garbage in garbage out
>>
>>101729199
>I think I'm nearing an end to using LLMs for erp.
Good. /lmg/ was never supposed to be a coomer general.
>>
>>101729255
No I just switch models. I use silly tavern so the chat just gets fed into Wizard instead of llama
>>
>>101729267
it literally always has been
>>
>>101729255
nta, you just load up a new model and continue your rp in st. when i used wizard 8x22b i noticed it'd ramble like a drunken idiot if i started a chat with it, but loading up my existing rps with it, it picked up and worked fine
>>
>>101729273
>>101729276
Oh, I still have never set it up, guess I should
Does it tell you when the context limit is hit or just have to notice/remember?
>>
>>101729292
the model card should say. i didn't use it much myself, but it was good for 32k context at least
>>
I tried loading `google/gemma-2-27b-it` on my 4090 and tried running it in 4bit and got an endless stream of <pad> for every token
I'll wait and see how 8bit does, but it doesn't fit in the GPU.
>>
File: Untitled.jpg (54 KB, 299x884)
>>101729309
did you forget to set the template?
>>
>>101729326
No, I'm using the huggingface library directly and just copy pasting the instructions from their repo.

>>101729309
8-bit returns garbage
<bos>Write me a poem about Machine Learning.に行くmarkets skimmed ating atypすることができます oluyor WO yılı中华人民共和国subject
>>
>>101729332
You're doing something wrong. Without more info we can't pinpoint what.
>>
>>101729199
Try open-ended storywriting, reject formatting, special tokens. Pure prose, final destination.
>>
>>101729364
He'll just write the same story over and over again until he's bored again.
>>
File: file.png (676 KB, 3840x2160)
>>101729345
I don't really know what info there is to give
It's literally just the code in their README
>>
>>101729410
if you see <bos> and tokens like that at all, the template is wrong. i dunno how to help though i think you're the first person in the entire world to actually follow hf's directions for how to load a model instead of using one of the common servers
>>
>>101729410
You need to use the instructions under "Chat Template":
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "google/gemma-2-27b-it"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    { "role": "user", "content": "Write a hello world program" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

You are using those, aren't you?
>>
>>101729432
He's not using a template. That might be the issue, but as he points out, he's literally copy pasting the example from hf.

I would try this myself but my GPUs are occupied for awhile.

>>101729410
Try changing the input_text = line to

input_text = tokenizer.apply_chat_template([{"role":"user","content":"Write me a poem about Machine Learning."}], tokenize=False)

and try again?
>>
>>101729410
>>101729476 (me)
And this at the end. I missed it.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
>>
>>101729499 (me)
Hm, that should have the add_generation_prompt=True flag set as well.

input_text = tokenizer.apply_chat_template([{"role":"user","content":"Write me a poem about Machine Learning."}], tokenize=False, add_generation_prompt=True)
>>
>>101729506
is there a reason you're doing everything the way you are instead of using a common model server like llamacpp?
>>
>>101729552
Both of these are me >>101729476 >>101729506
I'm just trying to help this anon >>101729309. I do use llama.cpp.
>>
>>101729586
well the effort is nice. i've been here since before l1 and never seen anyone try the actual hf instructions though so i was curious what anon was doing. if you figure out more, post it, maybe it'll help someone in the future
>>
>>101729672
>if you figure out more, post it, maybe it'll help someone in the future
I just read the instructions. I won't run these python abominations.
>>
>>101729708
kek i wont install anything python anymore. that shit bloats so fucking fast then you've got a 10gb folder and something new doesn't work and requires it to be wiped anyways
>>
Best sub 27b multilanguage model for translation?
>>
File: file.png (120 KB, 2060x679)
>>101729552
not particularly. I figured that for processing spreadsheets, it'd be best to leave any abstractions out of it, especially if it involves a REST API.


BF16 was taking way too long, so this is what it looks like after I quantized it and ran it with the template
>>
>>101729819
current llms are not reliable tools for translation.
>>
>>101729232
Why not use 3.1 with 128k context?
>>
>>101729864
What does Bing use?
>>
>>101729264
What sort of things are good to inject? And what depth is best?
>>
>>101729854
wtf. Can you print(outputs[0]) instead of tokenizer.decode(outputs[0])? As text, not as a screenshot if possible. It should be a series of numbers in an array, I think.
>>
>>101729854
it depends what you're doing but characters, items, locations, your own house. you can build a whole lorebook into a world to play around in. easier though is using st to scrape a wiki of an anime, game, or movie and then create user/char cards around that. lorebooks use keywords such as names and then inject data; rag tends to grab chunks of data instead, so it's more random with what it might bring up.
>>
>>101729910
oops meant for >>101729885
also depth is fine at default i've seen. your author notes in st should be kept low though (4 is generally fine, i like 1), but i update mine a lot with memories and whats going on in the rp
>>
>>101729876
some (presumably advanced) version of a Deep neural net they call "Neural Machine Translation (NMT)"
https://www.microsoft.com/en-us/translator/business/machine-translation/
>>
>>101729919
Interesting, I'd just been adding summaries of what happened in the character card itself. How big does your authors note with memories get?
>>
>>101729893
sure it's
tensor([[     2,    106,   1645,    108,   5559,    682,    476,  19592,   1105,
13403, 14715, 235265, 107, 108, 106, 2516, 108, 235313,
162594, 877, 117197, 175356, 45294, 50806, 167901, 239371, 180251,
97060, 101962, 98871, 76592, 33394, 34966, 252918, 188076, 591,
58466, 235664, 171626, 33485, 232388, 46271, 22471, 185188, 241393,
134740, 246501]], device='cuda:0')
>>
>>101729981
oh that was the whole `output` array, not `output[0]`
>>
>>101729954
about 500 tokens is common by the time i hit a few hundred messages. i keep it under 1k tokens even when hitting 1k messages. sometimes i drop certain things when i rewrite it if its not really important and enough time has passed.
where you add it doesn't really matter but author notes is easier imo and it has a depth setting separate from the card
>>
>>101729885
Not that guy, but you can also inject stuff with sillytavern's random macro to mix things up like in these posts >>101026596 >>101642359 >>100362285
>>
>>101729995
Yeah I got the same output. I'm gonna try this on my other machine but will try on 9b as I have it around. I assume you get the same results with 9b?
>>
>>101730061
Haven't even downloaded 9b yet.
>>
File: file.png (572 KB, 3840x2160)
>>101729854
>>101730068
So this is unquantized
>>
>>101730005
I guess I'll experiment, but it might be difficult since the prompt processing takes a while for me.
>>
File: 1692296182909210.jpg (27 KB, 400x388)
How long am I willing to wait for a response from a bot?
>>
>>101730137
it shouldn't matter where you have the data itself, the difference is what level its inserted at. author notes is lower than the card so it has more effect generally. if you mean using rag, then yeah you'll notice that. default rag settings likes to pull about 3-4k tokens for me each gen but i think its worth it. lorebooks are more pointed and especially if you made it yourself, you know exactly what data is each entry. rag is pretty lazy on the other hand and much less effort
>>
>>101729864
translation is of too limited utility for there to be more than a handful of players. And even then, nobody would release their model.
LLMs are only decent because they incidentally ingested literature in its various adaptations.
>>
>>101730182
so we agree
>>
>>101730159
Don't consider the total time. Consider the t/s. Waiting a minute for 60 tokens is not the same as waiting a minute for 120 tokens.
>>
>>101730243
I switched from mistral large 123b to midnight miqu 70b and got about the same wait time. Tried out L3 8b stheno and it gens super fast but it's a bit generic.
>>
>>101730068
Realized I can't test after all. My other machine can't do bitsandbytes.
>>
>>101730321
You missed the point. One model is twice the size of the other. What matters is how many tokens per second you get. If they take about the same time, I can only assume miqu gives you longer replies than mistral.
As for how long you're willing to wait? That depends on how much money you're willing to spend, or how much output quality you're willing to sacrifice.
>>
>>101730321
>mistral large 123b to midnight miqu 70b and got about the same
no you didn't. watch your numbers more closely. mm runs at 1.3t/s for me which is acceptable, mistral large barely hits 0.7 at the same quant.
>>
File: file.png (236 KB, 2410x1404)
>>101730344
rip
At least on my setup, 27b unquantized works totally fine in every scenario (even without the chat template), but quantized shits itself.

Running 9b, either quantized or unquantized is fine. 8bit, 4bit, and BF16 all work just fine.
>>
File: largestral 2x4090.png (101 KB, 1101x565)
>>101730159

Everyone has a different setup and uses different models.

>2x4090
>Mistral Large 2.65bpw 20K context
>T/S Faster than I can read.
>>
>>101730417
It may be possible that your bitsandbytes, transformers, or accelerate versions are out of date. That's the only thing I can think of before I am able to test it. Which is in like 21 hours or so.
>>
>>101723601
NVIDIA is going to jew you hard with those big stock prices
>>
>>101723832
nvidia probably wouldn't allow it to have open drivers
>>
>>101730417
>https://huggingface.co/google/gemma-2-27b-it/blob/main/transformers/transformers-4.42.0.dev0-py3-none-any.whl
They have their own version of transformers. Did you install that one?
>>
>>101730520
nope lmao why would they do this
>>
>>101730520
Latest transformers supports gemma.
>>
>>101730534
They did it because they wanted people to be able to use the model before hf devs had time to add support for it. It's not an uncommon thing to do.
It's obsolete as upstream transformers supports gemma now.
>>
>>101730535
yeah I checked I have 4.43.3 installed
>>
>>101730535
Sure. Seems to work just fine for anon. He may as well try.
>>
>>101730417
>>101730520
lmao it's a known issue
https://huggingface.co/google/gemma-2-27b-it/discussions/33
let me try this first
>>
>>101730585
yup it works.
reasonably fast in 4bit too.
>>
>>101730585
Ah, there you go.
So basically do https://huggingface.co/google/gemma-2-27b-it/discussions/33/files
>>
>>101730602
kek.
Now the actual work begins. I hope it was worth it, anon.
>>
>>101730629
boutta find out if 4/8bit is trash wish me luck lmao
>>
Which is the best miqu?
>>
File: file.png (268 KB, 500x500)
>>101730783
Evil mikyu
>>
Downloading mistral large at Q2_K, what can I expect?
>>101730783
Midnight, rest are memes. Midnight also borders on meme but I like it
>>
>>101730897
>Midnight also borders on meme but I like it
What else is there that's not a meme that's 70b or lower then? Is there really nothing good?
>>
>>101727312
>>101727342
???? wtf are you talking about?
>>
>>101730913
command-r 35b
>>
>>101730968
What's the smallest size that's good? The context takes up so much room it seems with that one.
>>
>>101731009
i don't go smaller than 70b personally. cr was ok but had the same spatial awareness issues smaller models do. cr+ is good but kinda slow for me. mistral large is the new thing but i'm still testing l3.1 and tunes myself, i still think midnight is better for my rp
>>
>>101731009
https://huggingface.co/TheDrummer/Gemmasutra-Mini-2B-v1
>>
>>101731043
Can you show a midnight log? Every single person that I have seen using midnight miqu has been a complete braindead imbecile so far.
>>
>>101731116
my personal ones? no. if you show me a card and tell me what you want it to respond to, sure. most of the mm hate is one autist who will post 50 times when someone mentions a merged model. its good though. and so is the base model, its a very good tune. mm (and miqu) has a lot of the same 'slop' as any other model, your spine will shiver. but its probably the best we'll get from l2 now that everyone has moved on to l3, and it does 32k context, so i think its still a good model. its not like l3-3.1 is a ton better at this point anyways
>>
>>101731172
It's a retarded meme merge of random crap. You have been added to my list as another imbecile.
>>
>>101731196
you can't point to a single thing wrong with it, but you'll now go on a 50 post tirade about merges like you do in every other thread. you are the meme you're whining about
>>
Merging feels good.
>>
is there anything better than midnight miqu for rp?
>>
>>101731245
lumimaid
>>
>>101730968
I tried that one, but it didn't seem as good as miqu to me. It had a few strange results. Maybe I did something wrong, I dunno.
>>
>>101731284
nah thats about right. its a good model, especially since it isnt llama and stuff, but it was around the same intelligence - if you tell it youre wearing a blue shirt, it'll mention your yellow top in the next message. 70b seems to just grasp that stuff naturally. cr+ is very good at details, but its like 103b so even bigger to run and i didn't find it great for rp
>>
>>101731265
Too repetitive, like regular llama 3.1.
>>
I figured out how to convert huggingface models to gguf for llama.cpp
What a huge bitch.

So you clone the repo (the scripts are not distributed), install the requirements (new ones are probably gguf>=0.1.0 protobuf<5.0.0,>=4.21.0 )
run
python convert_hf_to_gguf_update.py <huggingface_token>

then run, for example
python convert_hf_to_gguf.py %USERPROFILE%\.cache\huggingface\hub\models--google--gemma-2-27b-it\snapshots\2d74922e8a2961565b71fd5373081e9ecbf99c08  --outfile ggml-gemma2-27b-instruct-q8_0.gguf --outtype=q8_0

where the available outtypes are
f32,f16,bf16,q8_0,auto
>>
>>101731476
i agree but thats my experience with all l3 models so far. i'm really trying to like it but its more repetitive than miqu was, and its dumber for me. i have an rp going where a character from my lorebook left, but then it brings her up 2 messages later. l2 didn't do that to me. l3 seems to handle message flow horribly
>>
>>101731487
doesn't lcpp have default scripts for converting stuff?
>>
>>101731510
Something like DRY is no good?
>>
>>101731547
dry made a nice difference on l2 70b for combating common phrases, but not for 3.1 70b. it just gets into this repetitiveness after you get near max context. and its not repetitiveness like 'shivers down your spine', i mean like it wants to basically redo a scene that happened before nearly line by line. on miqu it would more likely suggest something totally different multiple times over. i think l3 might just be fucked in some way
>>
>>101731525
those are the default scripts
the usual convert.py got deprecated and they made it way harder to figure out how to use the existing scripts than needed.

Anyways, q8_0 is way too big for my 4090
any way to quantize it down to 4bit in gguf?
>>
>>101731589
Sure thing Arthur
>Please use the Mistrals, Llama is le broken
>>
>>101731612
ok I get it now >>101549635
>>
>>101731615
that isn't what i said at all but you at least got the two names right. mistral and llama are where its at baby
>>
>>101731487
>>101731612
>I figured out how to convert huggingface models to gguf for llama.cpp
>https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md
You have a huge problem reading README.md files, don't you? Are you the same gemma-2-27b-it anon that was using transformers a bit ago?
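Short version, assuming the current llama-quantize binary (filenames reuse the ones from your earlier post; quality is better if you quantize from an f16/bf16 conversion, and requantizing from q8_0 needs the extra flag):
>./llama-quantize --allow-requantize ggml-gemma2-27b-instruct-q8_0.gguf gemma2-27b-it-Q4_K_M.gguf Q4_K_M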
>>
>>101731750
yep, just wanted to see how much better/worse it was
all i did was bing "llama.cpp gguf" and it led me down the shittiest rabbit hole of outdated docs

frankly I am not a fan of this README layout and I wish they just used the github wiki instead.
>>
>>101731790
Fair enough. You figured it out. You're ahead of most noobs.
Things still change relatively fast. It's annoying keeping docs up to date.
>>
>>101726358
>unironic reddit tourists ITT
grim.
>>
When are the mikufags going to drop the miqu meme?
>>
>>101732086
whenever a new model comes out with a catchier name
>>
>>101732172
>>101732172
>>101732172
>>
>>101732086
When there's something better?
>>
>>101722144
TESS L3.1 70B
https://huggingface.co/migtissera/Tess-3-Llama-3.1-70B



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.