/g/ - Technology


File: Ge0_D0fbMAAFI19.jpg (784 KB, 3204x4096)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>103525265 & >>103515753

►News
>(12/14) CosyVoice2-0.5B released: https://funaudiollm.github.io/cosyvoice2/
>(12/14) Qwen2VL support merged: https://github.com/ggerganov/llama.cpp/pull/10361
>(12/13) Sberbank releases Russian model based on DeepseekForCausalLM: https://hf.co/ai-sage/GigaChat-20B-A3B-instruct
>(12/13) DeepSeek-VL2/-Small/-Tiny release. MoE vision models with 4.5B/2.8B/1.0B active parameters: https://hf.co/deepseek-ai/deepseek-vl2
>(12/13) Cohere releases Command-R7B: https://cohere.com/blog/command-r7b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/tldrhowtoquant

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/hsiehjackson/RULER
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: 1691305744399524.png (128 KB, 800x933)
►Recent Highlights from the Previous Thread: >>103525265

--Paper: CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models:
>103535406 >103535921
--SillyTavern alternatives and Horde setup discussion:
>103528320 >103528393 >103528436 >103528449 >103528495 >103531379 >103528632 >103528674 >103528694 >103529152
--Machine translating Japanese games in emulators: OCR challenges and alternatives:
>103535278 >103535335 >103535525 >103535593 >103535605 >103535766 >103535817 >103535884 >103535777 >103535799
--Anon seeks offline AI waifu companion:
>103530400 >103530414 >103530426 >103530445 >103530449 >103531933
--New models added to llama.cpp, including Qwen2VL and GigaChat:
>103530184 >103530256 >103530311 >103530347 >103530384 >103530395 >103530413
--Discussion on neural network models and upscaling techniques:
>103530495 >103534679 >103534738 >103534778 >103534803 >103534853 >103534858
--Phi4 weights testing and performance discussion:
>103529570 >103529626 >103529658 >103529824 >103531089 >103531393
--LLMs generating novel words and their internal representations:
>103532043 >103532124 >103532221 >103532326 >103532321
--Anon asks for llama 1 download and gets help with GGUF and Hugging Face models:
>103525965 >103525982 >103527006 >103527047
--Debating the merits of 70b models at low bit depths vs small models at high bit depths:
>103530010 >103530017 >103530028 >103530095 >103530226 >103530333 >103530555 >103530132 >103530164 >103531408
--OpenAI's $2000/month subscription model and PhD-level intelligence claims scrutinized:
>103526958 >103526983 >103527039 >103527070 >103527254 >103527417 >103528331
--Anon hesitant to buy 5090 for LLM inference due to high price and power consumption:
>103530725 >103530782 >103530817 >103531756 >103531784 >103531799 >103531859 >103531844
--Miku (free space):
>103527704 >103536763

►Recent Highlight Posts from the Previous Thread: >>103525267

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
>>
>>103536775
>ISO/JIS enter key
Disgusting. I should push the Miku back to save her from the horror.
>>
https://huggingface.co/Apollo-LMMs/Apollo-7B-t32
https://huggingface.co/papers/2412.10360

I just want live video stream into the model, that would be cool.

>, the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis.
Based
>>
File: HunyuanVideo_00432.mp4 (472 KB, 640x480)
>>
File: 1707467750791450.mp4 (224 KB, 1504x934)
>>
I decided to revisit Nemo using different presets and I'm actually having quite a bit of fun with it
>>
>>103537558
Nemo is great.
Truly the champion of its weight class, and it performs super well all the way down to 4bpw.
I wonder what the fuck they did to make this model and if that approach scales.
Maybe it's really just a question of data instead of any "gimmicks" or fancy techniques.
>>
>>103537227
for the ones doubting, this is legit btw, oyest of veys
>>
>>103536775
>(12/14) CosyVoice2-0.5B released: https://funaudiollm.github.io/cosyvoice2/

I keep getting a special token error when I try and run anything with this. Has anyone tried it?
>>
After more testing, I'm more or less convinced that the impression of Llama 3.3 being more permissive in terms of content was just a happy coincidence. It doesn't handle snuff, gore, non-consensual, or taboo content well, particularly if "include names" is disabled in SillyTavern. If anything, the trick of changing the role from "assistant" to something else seems less effective than it was with previous Llama versions.
>>
Anons, what are you REALLY using this shit for? There is no point running models locally for most tasks, the mainstream offerings vastly outperform whatever shit you can run locally... unless you are a privacy schizo or a cooomer.
>>
>>103537827
>unless you are a privacy schizo or a cooomer.
I am both, thank you very much.
It's also a neat toy to play around with.

>>103537746
Have you tried the usual steering tactics like adding a list of tags as a prefill to the assistant's message, changing the role to narrator, etc?
>>
>>103537827
>unless you are a privacy schizo or a cooomer.
like everyone else here?
>>
>>103537827
Coomer. Doing (e)rp and nothing else. Output is decent since my computer can handle large models ok. These days I have the most fun creating new cards, trying different prompting and trying out new models.
If I didn't have a money-draining GF maybe I would try claude, a better gpu setup or something.
>>
>>103537852
You can eventually coax Llama 3.3 into writing what you want if you prefill assistant messages with suitable text (including simply {{char}}:), but it really wants to avoid violence, gore and so on. It doesn't have too many issues with consensual sexual content in that case.

Removing {{char}}: from the prompt turns it into a cucked assistant, almost no matter what you write in the system prompt or the prior conversation history.
>>
>>103537827
(E)RP, that's it.
I have tried to use LLMs for other stuff like data processing, classification or transformation... but the end result is never perfect, even if you use cloud models, so I end up having to review all results, which becomes a bottleneck if I'm dealing with huge amounts of data.
>>
>>103537827
SARR IT WORKED
POST HAVE MAKE YOU BUTIFUL LADY
>>
I just woke up so I'm retarded and trying to figure out this chat template for that russian model while it downloads
>{%- set system_message = bos_token + messages[0]['content'] + additional_special_tokens[1]

Is it trying to tell me that the format is
role<|role_sep|>
blah blah blah
?
Basically backwards ChatML style?
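Actually, the lazy way to sanity-check this is to let transformers render the template instead of reading the Jinja by hand. Rough sketch, assuming the tokenizer from the GigaChat repo in the OP news loads cleanly with trust_remote_code:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ai-sage/GigaChat-20B-A3B-instruct", trust_remote_code=True)
msgs = [
    {"role": "system", "content": "system prompt goes here"},
    {"role": "user", "content": "hello"},
]
# prints the raw prompt string so you can see exactly where the role/separator tokens land
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))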
>>
>>103537827
>privacy schizo or a cooomer.
I'm both
>>
>>103538106
well I'm too lazy to update my shit and apparently nothing I have can load it despite it effectively just using the deepseek architecture. no russian nala test today.
>>
>>103537641
downloading the model right now. will tell later
>>
>>103537852
>>103537901
>>103537932
>>103538009
>>103538183
wait so this is /g/'s coomer general?
>>
>>103538237
There doesn't seem to be any difference between this general and the chatbot general.
I'm using an embedding model for semantic search, I started the project 4 days ago. After it's working for that, I'm intending to find a chat model that can run in a RAG context in 6 GB VRAM (looking at Llama-3.2-1B-Instruct, but I don't see where to download it without kneeling to the zuck)
>>
>>103538279
https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF
>>
>>103538297
thanks!
>>
>>103536775
The growth of LLMs will slow drastically, as the flaw in the "neural scaling law" becomes apparent: most people have no use case for a second brain! By 2030, it will become clear that LLMs' impact on the economy has been no greater than that of the internet.
>>
HN told me that there has been huge improvement in local models in the last year? How true is this? can they effectively ERP yet?
>>
Are there any locally hostable models that are usable for language learning or anything else useful, like answering questions a la chatgpt?
Also, how do people access their models remotely? Has anyone bridged one to a chat app like Signal or Whatsapp?
>>
eva bwos we got a version bump
https://huggingface.co/EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.1
>>
kill yourself and buy an ad, in that order.
>>
>>103538509
>DELLA linear merge of v0.0 with an unreleased checkpoint from a different run. Reduced overfitting, better long context comprehension and recall, less repetition, more stability.

Ooh, sounds promising. Will test and report back.
>>
>>103538206
I get "KeyError: '' ". I dunno what's wrong..
>>
Gemini 2.0 flash experimental.

Heh.
>>
File: 1715367001153147.jpg (954 KB, 800x1237)
>>103537558
>different presets
What are you using?
>>103537827
RP and co-authoring/editing practice. I'll occasionally use it to summarize news/wiki articles and essays I don't care about, but since I can't into RAG it's not ideal.
>>
>>103536845
>Meta Apollo
>We employed the Qwen2.5 (Yang et al., 2024) series of Large Language Models (LLMs) at varying scales to serve as the backbone for Apollo. Specifically, we utilized models with 1.5B, 3B, and 7B parameters

Meta literally used Qwen as a base for their video understanding model, lmao. Chinks literally cant stop winning...
>>
>>103538783
>Chinks literally cant stop winning...
Papers are like 90% chink already. For Apollo too.
As long as they give me the foxgirls I will kneel.
>>
>>103538551
the download script doesn't work if you download only the 0.5B model. The CosyVoice-300M folder has the voices. I was able to generate something but I wasn't able to generate English yet.

I thought this was more speech-to-speech, but it doesn't seem to be aimed at real-time inference.
>>
anyone know a way to stop it from outputting descriptions in asterisks? I can imagine things myself, but it feels like descriptions written without asterisks are more verbose and interesting.
>>
>>103538851
take the asterisks out of the starter and example messages, the model won't use them unprompted unless it's very overbaked
>>
File: Screenshot 2024-12-16.png (54 KB, 945x614)
Nashvillebros is this true?
>>
>>103538579
I hecking LOVE LOVE LOVE CoT!
>>
>>103538851
What >>103538873 said, plus if that's not enough, you can outright tell it not to use asterisks in the system prompt.
>>
>llama.cpp stopped supporting regular make
>my cmake is horribly fucked up because of some unholy shit I did to compile some other project I never actually used and I can no longer build
aieeeeee gerganov whyyyyyy
>>
>>103537827
I use it as a helpful assistant.
I suck at approaching subjects I'm interested in and QwQ really shines at planning stuff so it is really helpful with that.
I don’t care if it is lacking in knowledge since you can just feed it as context if needed.
>>
>>103539211
Same here. I had to download the binaries.
>>
what's the best general model that I can use for chatting, got 24gb vram?
>>
>try OuteTTS
>works nicely out of the box
>try to voice clone
>no output
I HATE TTS
I HATE TTS
I HATE TTS
>>
>>103539276
The following models should fit in 24gb.
>qwen 32b q8
>mistral 22b q6
>mistral 12b q8
>>
>>103539351
Err, that should be
>qwen 32b q4
>>
>>103539306
>OuteTTS
Why not gpt-sovitsv2?
>>
>>103539369
naaaah bruh this shit too complicated, theres literally no exe lol
>>
>>103539369
sovitsv had way too many steps to produce anything
>>
File: Lms.jpg (33 KB, 509x494)
Why do people here hate LM Studio? I'm new to this and I've been using it without problems. Am I missing something?
>>
>>103539403
It's proprietary slopware that does not contribute to the upstream projects it's built on.
>>
File: 1714205445058131.png (382 KB, 885x1146)
Bros, we're gonna eat good very soon. Trust the plan, it's all converging as nvidia tries to stem the tide for VRAM premiums.
https://x.com/_lewtun/status/1868703456602865880
>>
>>103539461
BitNet LLM 2.0 LCM BLT CoT breakthrough soon.
>>
>>103539403
>Why do people here hate LM Studio?
/lmg/ is privacy/openness focused and has a history of working directly with and contributing to the open source projects that were at the genesis of the current ai boom (and even in the before times).
A lot of the current user-friendly AI systems are just wrapping these without giving back or even acknowledging these bits exist.
Also, if you're here and worried about privacy, you aren't gonna be running some exe you downloaded from the internet and feeding it personal or proprietary data.
>>
>>103539498
memeing a miku.txt prompt into llama cpp's repo during its inception does not count as having contributed to open source projects. Literally no one still browsing this thread can code.
>>
>>103538763
>What are you using?
Just Temperature 1.0 and Repetition Penalty 1.1
>>
>>103536845
Finally, another honest paper.
>>
>>103539211
>>103539220
Kek, same happened to me as well. I eventually solved it. Don't remember exactly how. Oh wait nvm I do remember. I eventually found that I needed to delete the nvcc in my usr/bin folder, and run with these commands.
export PATH=/usr/local/cuda/bin:$PATH
cmake -B build -DGGML_CUDA=ON -DGGML_LLAMAFILE=OFF
cmake --build build --config Release --target llama-server llama-quantize llama-perplexity -j 8


I actually tried to get ShatGPT to help me at first but it didn't find the issue. I found the issue myself. Maybe Claude would've gotten it.
>>
>>103539403
jesus christ I hate zooming reddit cucks like you so fucking much you have no idea
greetings from serbia
>>
For those fans of L3.3, any thoughts on its NSFW/NSFL capabilities? It seems to get sloppier and dumber when you cross that line.
>>
>>103539750
Best lolis by a long shot, nothing comes close at all.
>>
>>103539750
Haven't actually tried base L3.3 for such purposes, but Eva is absolute peak, so at the very least, it makes for a damn good base for a tune. Clever to the point of wit, coherent, requires very little wrangling.
>>
>>103539750
3.3 is trash, there's just one anon samefagging with a clear intention to troll.
>>
>>103539750
The base model? Seemed fine to me, but I'm the guy that was using the assistant and user names switched out so that might have gotten past some censorship biases in the logits. I know for a fact at least that using user/assistant made my model dumber in a NSFW adjacent scenario I tested, so I just switched it out and forgot about it. Then I moved to testing Eva and haven't loaded up base since.
>>
>>103539878
Compared to what? Mistral Large is too slow on my machine, Qwen is even worse, and Miqu is too old and dumb.
>>
Is Nemo still SOTA for VRAMlets?
>>
any new great models for nsfw stuff at around 12b?
>>
File: 1461486806692.jpg (20 KB, 470x362)
>>103539403
They don't like it because you don't need to be a complete nerd to install and run the program.
If they admit it's serviceable then they admit they wasted their time dorking around in Python when an exe could have done the job in 15 seconds.
>>
>>103539509
I read or at least skim every single /lmg/ thread.
>>
>>103539918
yes...
>>
>>103539484
With test time training liquid network lora
>>
>>103539509
oobabooga used to be a /lmg/ anon, you know?
>>
>>103539962
we know
>>
>>103539962
being able to read the documentation and use an API is NOT coding.
>>
>>103539820
Eva ?
>>
>>103538509
first impression - it's fine, but it feels like it lost a little of the eva sovl
should be more easily wrangled but I find it a little less fun than v0.0 on a quick vibe check
>>
>>103540037
https://huggingface.co/EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.0

v0.1 is out now, but doesn't seem to have any quants yet.
>>
What's the best coom model for 90GB of VRAM? I got two decommissioned A6000s for (relatively) cheap from work.
>>
>>103540060

>>103540053
>>103538509
>>
File: file.png (19 KB, 499x144)
>>103540053
>doesn't seem to have any quants yet.
https://huggingface.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF
right on time
>>
>>103539750
Doesn't merge well with Tulu due to differences in prompt format so it's DOA
>>
>>103539890
>I know for a fact at least that using user/assistant made my model dumber in a NSFW adjacent scenario I tested, so I just switched it out and forgot about it.
What technique is this referring to? Just treating the instruct model as a text completion one without any user assistant turns?
>>
File: file.png (125 KB, 604x546)
ollama wonned again
>>
>>103540227
niggerganov should just ack himself at this point, or sell the llama.cpp project to ollama.
>>
File: angryshikanoko.webm (3.87 MB, 1920x1080)
>>103539969
>>
>>103540112
I'm pretty sure it's about using the instruct format but without the user and assistant roles the model was trained with.
So if your format is
><specialtoken>assistant<specialtoken>
><specialtoken>user<specialtoken>
You could change it to be
><specialtoken>CharacterName<specialtoken>
><specialtoken>UserPersonaName<specialtoken>
or
><specialtoken>Game Master<specialtoken>
><specialtoken>Player<specialtoken>
or
><specialtoken>Narrator<specialtoken>
><specialtoken>Dude<specialtoken>
etc etc.
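With llama 3 models concretely, that ends up looking roughly like this (special tokens from memory, so double-check them against the model's tokenizer_config.json; in the real template there's also a blank line between the header and the message body):
><|start_header_id|>Narrator<|end_header_id|>
>The tavern has gone quiet.<|eot_id|>
><|start_header_id|>Dude<|end_header_id|>
>I walk up to the bar.<|eot_id|>
><|start_header_id|>Narrator<|end_header_id|>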
>>
>>103539929
which ones are you using at 12B? The ones I use aren't exactly new
unironically check the first 5:
https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
>>
>>103540262
no no no
we let them feel safe, then a surprise license rug pull
>>
>>103540112
What the other guy said.
>>103540270
I'm skeptical it wasn't trained a bit with other names. The model should be much more rigid if that was the case, but it feels to me like it's quite flexible.
>>
>>103540227
I hate pgvector and ollama so fucking much
>>
>>103539947
Kek this, I just want to coom and for the thing to just werk, I want a solution. I don't want to fiddle around running stupid commands when it can be done in a few clicks.
>>
does fucking around in the CLI for three hours compiling shit make the orgasm feel better?
>>
>>103539461
That can only fix its reasoning, not its lack of trivia knowledge.
>>
buying a 5090 for AI erp would be the beta buxx experience for perma virgins
paying to get laid, but instead of paying a wife for real pussy, you pay for technology to masturbate
>>
>>103540364
#LLM2.0, rag is the future, resistance is ignorance.
>>
>>103540287
Is TheDrummer/UnslopNemo-v2 (chatml) really an upgrade from mistralai/Mistral-Nemo-Instruct-2407?
I've seen people here say in the past that all Nemo finetunes are dumber than Nemo itself.
>>
>>103540374
Just like bitnet huh
>>
>>103540392
But RAG exists and sort of works. Bitnet is just vaporware.
>>
>>103540423
Bitnet models exist and sort of works.
>>
>>103540335
From experience, it just intensifies the "oh man, all that effort for that?" feeling you get afterwards.
>>
>>103540430
That's a stretch but alright, stay coping on LLM 1.0 while the world moves on!
>>
>>103540370
im rich tho? so i can afford it
>>
>>103540335
I don't use AI for coom.
>>
>>103540451
fag
>>
Llama next year, Gemma next year, Qwen next year, anything this year? Also, local Suno where
>>
>>103540465
Official Phi-4 is this week.
>>
>>103540437
Meanwhile you are waiting for the innovation to come and it just doesn't. Have fun waiting for your world to supposedly move on while people actually enjoy the big models that exist (either through API or, if they're not poorfags, on their own PCs).
>>
>>103540492
He meant interesting models though.
>>
>>103540501
Phi-4 is VERY interesting
>>
>>103540513
no it's not
>>
>>103540513
We'll see when Nala anon tests it.
>>
>>103540374
>rag
Way too limited. Doesn't allow the model to make inductive jumps in what information it needs.
>>
>>103540527
He already did tho
>>103505034
>>
>>103540454
Feeling ironic today aren't you.
>>
>>103505034
It's over...
>>
>>103540370
where can I pay for a wife? I'm in the EU
>>
>>103540575
>>103540556
Who cares? There's no point using AI for coom.
>>
>>103540556
massive skill issue
>>
>>103540387
Depends on the use case really. You will have to see for yourself, so far it has been pretty good for me for erp
>>
>>103540628
it's a three word prompt, anon. it's not possible for skill to be involved either way.
>>
>New Models: Megrez 3B Instruct and Megrez 3B Omni with Apache 2.0 License
https://huggingface.co/Infinigence/Megrez-3B-Instruct/blob/main/README_EN.md
https://huggingface.co/Infinigence/Megrez-3B-Omni/blob/main/README_EN.md

Merguez 3b kek
>>
be aware of the m$ shills for phi...
They're only trying to do the needful
>>
>>103540665
It absolutely is, as evidenced by his tests always having super broken text formatting which he himself said weren't due to the models.
>>
>>103505034
starts with plain text dialogue, ends that paragraph with plain text narration, next paragraph has quoted and italicized dialogue...
yeah, no model above 3B is that broken, the dude's doing something clearly wrong, like conflicting instructions or something.
>>
>>103540701
>Values and Safety: While we have made every effort to ensure compliance of the data used during training
>>
HunyuanVideo mogged
>>103540020
>Google Veo 2
https://x.com/GoogleDeepMind/status/1868703624714395907
https://xcancel.com/GoogleDeepMind/status/1868703624714395907
>>
>>103540923
Cool, where are the weights?
>>
So, even after nearly 2 years, we still don't have a local model that even remotely matches the intelligence of gpt4 for ERP?
>>
File: if only you knew.png (370 KB, 744x719)
What is the conventional wisdom for caching? I am asking about "Use 8 bit cache to save VRAM" and "Use Q4 cache to save VRAM." in oobabooga.
Finally figured out that I need Flash Attention for them to work. (Which is afaik something that I should enable by default as it has zero drawbacks)
I experimented a bit, but didn't really notice too much of a difference. Do the differences become pronounced when you are many thousands of tokens into a chain?
Going from none to 8 bit, 8 bit seems to save a larger chunk of VRAM, compared to going from 8 bit to Q4. Is it a good halfway balance?
For reference I tested these on Mistral Nemo, is cache more/less important for other models?
>>
>>103540988
Correct.
>>
>>103541013
And using GGUF under llama.cpp, I forgot to add.
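For reference, with plain llama.cpp this is all launch flags, roughly like so (flag names as of recent builds, check --help on yours; model path is a placeholder):
./llama-server -m model.gguf -ngl 99 -fa --cache-type-k q8_0 --cache-type-v q8_0
IIRC it's the quantized V cache that actually needs flash attention; K-only works without it. And the cache is sized by however much context you allocate, so the VRAM savings scale with your configured context size rather than with how deep into the chat you are.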
>>
>>103541022
How long do we have to wait
How long do we have to suffer
>>
File: lmg.png (1.84 MB, 1387x778)
>>103536775
>>
>>103539509
I've got 2 github accounts with contributor status on llama.cpp, mikupad, ooba and sovits (and multiple lmg-related personal projects up)
>>
File: 1722784233319186.png (269 KB, 627x802)
>Shipping document suggests that a 24 GB version of Intel's Arc B580 graphics card could be heading to market, though not for gaming

oh no no no no, njudea and ayymd vramjewsbros, how do we respond without assasinating the ceo of intel?

are intel ballsy enough to eat up the entire homebrew community with this?
>>
>>103541132
based. if every anon were like you, this thread would be a better place, even if all you did was make a commit fixing a typo.
>>
>>103539403
notice how you (a nigger shill) couldnt respond to the truth >>103539436

>>103539731
бaзиpaн
>>
>>103541324
If they release it for like under $800 we are so back
>>
>>103541324
Ohh, shinyyy
>>
>>103541324
>though not for gaming
I am worried that it will be released with an unhinged prosumer price (not as unhinged as nvidia but still unhinged probably)
>>
>>103541324
https://web.archive.org/web/20241216204028/https://www.pcgamer.com/hardware/graphics-cards/shipping-document-suggests-that-a-24-gb-version-of-intels-arc-b580-graphics-card-could-be-heading-to-market-though-not-for-gaming/
>>
File: file.png (82 KB, 242x218)
>>103541324
papa's parting gift to nvidia?
>>
File: 1720030134404974.png (87 KB, 581x348)
>>103541354
https://x.com/GawroskiT/status/1867887152295784955
>>
>>103541324
It's going to be priced out of relevance probably.
>>
>>103541371
They gave us 12GB for $250, just imagine...
I really hope they do it, would force nvidia to start competitively pricing again.
>>
>>103541370
>>103541383
Do we not have any legit hardware engineer fags on this general? Someone throw together a realistic BOM for making a 128GB card running at 2TB/s with enough ARM cores to keep up with CPU inference. How hard could it be?
>>
>>103541395
you dont need to be a hwe to know you need to print a new pcb to have what you want

at best you can resolder bigger GB memory chips onto only a few gpus that can handle that higher GB memory, along with flashing another bios, a chink on ebay or something was doing that some time ago
>>
>>103541013
>>103541031
Good question. I was always under the impression that 8bit cache increases perplexity, but I honestly have no idea. Would like to know as well.
>>
>>103541324
I bought a 3090 for $300 a month ago. You can't compete with 2nd hand 3090s
>>
File: 1731626560619193.png (26 KB, 922x250)
>>103541013
Can "Quantized KV Cache" on koboldcpp be used on only CPU with GPU context processing?

flashattention is needed for it to work and it seems like flashattention is only available on CUDA/CuBLAS...
>>
>As Michael Jackson's sultry voice crooned in the background, Daisy found herself echoing his lyrics without thought, lost in the haze of sensory overload. "Billie Jean is not my lover," she sang softly, her breath mingling with Anon's, "She's just a girl who claims that I am the one…"
lol what
>>
>>103541414
hardware hackers used to design and fab their own shit. is that out of the question these days?
my retard brain says you could marry a bunch of cots shit onto a bigass pcie board and beat the shit out of the big players.
just like 24 ddr5-8800 slots and some arm chips with sve vector units and enough hardware glue/fw to make it all interface with the host.
write some llama-cpp support in and eat good for relatively cheap.
>>
>>103541455
where do you live? its nowhere near that online
>>
>>103541395
Not an engineer but:
Buying (G)DDR-whatever memory modules and whatever ARM cores you want is easy.
Designing and manufacturing a complex board that has all these memory modules soldered and traced to CPU, building a NUMA system and physically tracing the paths so that these cores can communicate with each other, and a complex memory controller that handles all of this communication are completely out of the realm of anyone here's garage.
>>
>>103541474
Nope, Njudea limits vram capacity in firmware, they also flash e-fuses on the dies to shit them up for segmentation sake
>>
>>103541473
kino
did it make up the whole michael jackson playing in the background thing out of nowhere?
>>
anyone running debian unstable or another distro with 6.12 kernels? Any improvements or pitfalls for llm stuff?
>>
>>103541463
I dunno if it takes effect but I am able to load flash attention + 8bit/Q4 cache on the CPU.
Do you get an error or something?
>>
>>103541536
6.12 offers some big improvements for epyc I think, so unless you have that there's no reason to upgrade.
>>
File: 1663182175279.webm (2.6 MB, 480x364)
>>103541473
Based MJ enjoyer.
>>
Phi4 feels like a naive and innocent girl ready to do anything to please you.
>>
>>103541482
Western Europe. It's not on ebay but on "refurbished" websites. For some reason the resellers of refurbishment sites are retarded as fuck and don't price it to market. I've scooped up 3 RTX3090s for between $300-$400 this way.
>>
>>103540321
why is it such a meme to hate on ollama, are y'all just mad that it makes local ai more accessible so you're not special anymore?
>>
>>103541424
It doesn't seem to affect much, wtf?
https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md
Though this is for exllama, I think it's exceptional in that 4 performs better than 8.
>>
>>103541474
oh i am laffin

you know at the memory bandwidths that we're talking about here you need to care about signal delays caused by the speed of light right? the level of tryhard you have to be on just to get decent signal integrity is absolutely out of reach for hobbyists
>>
https://files.catbox.moe/k3lss1.json
made a link to share my overwrought schizo placebo context / instruct preset for llama 3.3 (mostly eva (v0.0)) across devices and figured I might as well dump it for the thread too
for samplers I use temp=1.2 minp=0.135, yes I know that's a high minp, trust the plan
>>
>>103541665
ask cuda dev he had multiple melties about them before
>>100110227
>On the other hand, if you look at the issues on the ollama Github I think bug reports by those users would be a net negative for llama.cpp.
>>100209871
>And if you look at the ollama Github issues you get the impression that its userbase consists of absolute retards, so no real loss there.
>>100393063
>I don't bother posting about ollama on /lmg/ but I definitely have a more favorable opinion of koboldcpp since its devs actually provide benefits to the upstream project.
then posts this
>>101207663
>I wouldn't recommend koboldcpp.
>>
>>103541714
Haha, experimenting with you-prompting as well?
>>
>>103541714
sloppa
>>
>>103541742
I've always prompted that way tbdesu
>>103541751
be nice
>>
>>103541768
preddit seems more your speed, sis
>>
Are there any good vision models that can read japanese? I'm willing to buy extra GPUs for this.
>>
>>103541788
I can read Japanese. Pay me.
>>
>>103541665
I do not lurk the thread, but I started with ollama for this project >>103538279
It would not work with the top embedding model on MTEB that fit on my GPU, that is, stella. It also did not have a tokenizing endpoint. I figured those things out in the wrong order (I patched ollama to provide the tokenizing endpoint, then I patched the ollama-python library to access the tokenizing endpoint), before I figured out that the trust_remote_code=True part of the sentence-transformers example actually did something meaningful, and I had to throw away all of that work. I switched to substratusai/stapi, which also did not have a tokenizing endpoint, but after adding it, it worked.

In conclusion, ollama seems to be dumb. I wonder what the few non-coomers in the thread would do if you wanted a) local and also b) do all inference tasks over an API
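For anyone who wants to skip the yak shaving: the trust_remote_code flag really is the whole trick for stella, the rest is stock sentence-transformers. Rough sketch, repo id from memory so swap in whichever stella size you actually use:

from sentence_transformers import SentenceTransformer

# stella ships custom modeling code, so this flag is required
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)
embeddings = model.encode(["what is the anon asking about?", "some passage to index"])
print(embeddings.shape)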
>>
File: 1717446855440379.png (329 KB, 450x408)
>>103541665
Are you just that retarded?
>>
>>103541806
YOU WOULD USE LLAMA.CPP SERVER OR VLLM. Sweet Jesus yall are fucking something
>>
>>103541736
lol ok so it's just a schitzo guy with loud programming opinions doing his thing, got it. ofc a more popular project that's easier to use is going to get worse bug reports
fwiw on my machine ollama consistently beats llama.cpp on inference speed by about 15%, i'm sure that's user error, but in the real world what matters is how much i get out of it for how much effort i put in, i'm sure i could sit around and track down why llama.cpp is slower but ollama just works and i have better things to do than troubleshoot
>>
>>103541817
>>103541806
OR EVEN BETTER, LLAMAFILE SO YOU HAVE A SINGLE BINARY WITH NO INSTALL THAT WORKS ACROSS PLATFORMS/OS's

t. someone who has built stuff with support for ollama/llama.cpp/llamafile/transformers.

Ollama wants to be special and tried forcing usage of their own API which fucking sucks, and their documentation also fucking sucks.
I say this, having reviewed the API docs for all the large providers. Ollama was near the fucking worst.
Fuck ollama.
>>
>>103541736
>its devs actually provide benefits to the upstream project.
This is the real problem with every other project. do they acknowledge and give back? This is regardless of whether they have a legal requirement to or not.
If yes, they will be well liked. if not, they're going to be on a lot of shit lists.
>>
>>103541856
>LLAMAFILE SO YOU HAVE A SINGLE BINARY WITH NO INSTALL THAT WORKS
who cares when you're gonna be launching it with a single shortcut anyway, jartroon?

and koboldcpp has le 1 exe while actually having features before llama.cpp itself most of the time
>>
24GB vramlet reporting in. The EVA based on Qwen 32B seems a lot better than the 70B 3.33 at IQ2, which is the level of compression needed to fit. It's not terrible but you can tell it's brain damaged and falls into repetition easily. Which is "no shit" since it's Q2, but posters keep claiming that Q2 70B > Q4 32B, so congrats to them, they succeeded in wasting my time downloading that shit.

Can't be too mad though because the 32B Qwen version is still really good. And I'm saying that as someone who had no success with Qwen in the past ever
>>
File: pepe question.png (213 KB, 716x641)
How to get llamacpp_HF working? Getting "ERROR Could not load the model because a tokenizer in Transformers format was not found."
And I already have oobabooga_llama-tokenizer
>>
>>103541856
At least there was growth back then, 4chan has been slowly dying off for years now.
>>
>>103541918
>oobabooga
>>103541808
>>
How did you guys get ollama to work? I downloaded it and tried to run it but it doesn't work.
>>
>>103541817
>VLLM
I like it, but it sucks that it never showed up in google searches for self-hosted HTTP API that can serve sentence-transformers. Anyway, this is what happens when you just start building something without trying to get bogged down in the details of what's optimal.

One of the issues I had with stella, though, is that it has an embedding quirk in that the embedding length is always equal to the maximum context (it will trim or pad). It looks like this provides the tokenize API directly, which is great.

Is this, https://docs.vllm.ai/en/latest/models/adding_model.html the page for getting it to work with >>103538297?
>>
>>103541856
lol what, i wrote a rag tool with ollama without using any libraries and i found their api to be very easy to work with and had a wrapper built for it in like 2 hours and i haven't had any issues at all with it since
>>
>>103541736
>>I wouldn't recommend koboldcpp.
I can understand why he doesn't recommend it. Kobo is simply slower. In the past it was like 5% slower, but now with their fucked up speculative decoding it's 30%. 30%! That's one third. All they had to do was copy over the settings from llama.cpp, but no, that would be too difficult for our retard users who probably don't know what speculative decoding is anyway.
>>
>>103541957
>a rag tool with ollama
Welcome to LLM 2.0 sir.
>>
>>103541895
One (1) single solitary retard was telling you that Q2 is worthwhile, the rest of us kept telling you that Q4 is the lowest you should go if you don't want severe drain bamage.
>>
>>103541665
>why is it such a meme to hate on ollama
They actively refused to acknowledge or credit llama.cpp. It wasn't until a few months ago, after pushback, that they even mentioned llama.cpp on the repo page, and even now it's just one line under "Supported backends".
Built a wrapper around llama.cpp, made it look like they were the ones innovating when they were building on top of others' work, then they got big. Now reddit and the other normies worship them as heroes of open source AI and jerk them off
>>
>>103541953
https://docs.vllm.ai/en/latest/getting_started/examples/gguf_inference.html
>>
>>103542027
Sounds more like you're mad that the model you're aggressively shilling got exposed as shit by someone who can actually run it.
>>
>>103542043
>aggressively shilling
Bro, this is /lmg/. I'm talking about a model I like to/with others who also like it, i.e. using the thread for its intended purpose. What are you doing here, other than having a meltie?
>>
>>103542043
Except that if you go back and look at the previous threads, you'll see that he does in fact argue against q2.
>>
>>103541976
That wasn't actually me who posted about koboldcpp.
I was for the longest time using an insecure tripcode so the Petra/blacked Miku/fake ggerganov spammer cracked it.

From the perspective of contributions back to llama.cpp I think koboldcpp is the best downstream project (particularly because I count 0cc4m as one of the koboldcpp devs), llamafile is second place, all other projects basically don't matter.
>>
llama.cpp is another example of what happens if you pick a cuck license. Corpos get to use it for free and don't have to contribute back. A fucking fork gets all the funding and you get cucked.
>>
>>103542088
Will llama.cpp get anti-slop/string ban sampler? Or am I stuck having to pick between slop and slow speed?
>>
>>103542034
Thanks, this seems close to what I was looking for. I don't find anywhere in https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html that you can command VLLM to download the model for you via the API. That is fine, but I'm just making sure that I'm reading it right.
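(For what it's worth, the model never comes in over the API at all: you give the server the HF repo id at launch and it pulls the weights into the local HF cache on first start. Rough example, the model name is just one of the embedding models vLLM lists as supported:
python -m vllm.entrypoints.openai.api_server --model intfloat/e5-mistral-7b-instruct --port 8000
then POST to /v1/embeddings with the usual OpenAI-style payload.)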
>>
>>103542091
Could it be that ggerganov is petr*? He is a confirmed cuck, after all.
>>
>>103542144
That's the million-dollar question.
>>
>>103542082
>>103542027
samefag
>>
why is gguf such a slow piece of shit format
>>
>>103542027
Obviously Q4 is better than Q2, that's not interesting. The question is Q2 70B vs Q4 32B. That one isn't obvious and I could make up reasons why it should go either way. However, my conclusion is that Q2 is a meme. Shocker I know but maybe I save someone else some time

>inb4 "just run 70B Q4 at 0.5T/s"
no
>>
>>103542088
>implying he isn't the blacked Miku poster
>>
>>103542165
Q2 very much IS a meme when it comes to creative writing, regardless of model size, yeah. That's why I kept saying that anything below Q4 isn't worth it. And sure enough, if you can't run a 70B model, the Qwen-based 32B-s are the next best choice. If you haven't tried it yet, I recommend Evathene; I liked it more than Eva itself. That slight Athene flavor adds something to it IMO.
>>
>>103542184

>>103415136
>>103415171
Nah, he's based.
>>
>>103542163
Faster than EXL2 nowadays if you use updated llama.cpp with speculative decoding.
>>
>>103542221
why are you recommending a 72B model to someone that can't run a 70B?
>>
>>103542221
...disregard that, I'm a retard, Evathene is also in the 70B ballpark.
>>
>>103541976
>>103542088
>llamafile is second place, all other projects basically don't matter.
Actually, I need to correct myself: GPT4All has also made extensive upstream contributions so I would put them in second place and llamafile in third place.

>>103542091
If I had started the project I would have made it (A)GPL but I don't plan to turn this project into a career.
In 2024 I made six figures from llama.cpp-related part time work and this almost certainly would not have happened without a permissive license.
So I can see the appeal.

>>103542112
I don't know the status of this specific sampler in llama.cpp.
More generally my stance towards samplers is that I would like to see objective evidence for their effectiveness and that therefore better methods for evaluating them are needed.

>>103542184
When I troll a thread it looks more like this:
https://desuarchive.org/a/thread/168206398/
>>
>>103542237
Being a moron and forgetting which 32B I used to use, mainly.
>>
>>103542223
>Faster than EXL2 nowadays if you use updated llama.cpp with speculative decoding.
Prompt processing takes twice as long and inference is basically the same (slower when you use spec decoding in tabby/exl2 as well), what are you talking about
>>
>>103542255
It's faster as long as all layers are on GPU
>>
>>103542223
every time someone posts this i fall for it, spend an hour building llamacpp and downloading a Q4_K_M model to compare against exllama 5bpw, and it's still slow as fuck
not today
>>
>>103542323
Use speculative decoding with Qwen 0.5B as the small model. Make sure to put in the flags correctly so that you have the same context length for both models.

Look into this github thread to see the exact flags you should enable to make it as fast as possible: https://github.com/ggerganov/llama.cpp/pull/10455

Faster than exl2
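Something in this ballpark, going off that PR (exact flag names drift between builds so check llama-server --help; model filenames are just examples):
./llama-server -m Qwen2.5-72B-Instruct-Q4_K_M.gguf -md Qwen2.5-0.5B-Instruct-Q8_0.gguf -ngl 99 -ngld 99 -fa -c 16384 -cd 16384 --draft-max 16 --draft-min 4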
>>
>>103537827
Generate terabytes of magical girl hentai daily
>>
>>103542244
>I don't know the status of this specific sampler in llama.cpp.
>More generally my stance towards samplers is that I would like to see objective evidence for their effectiveness and that therefore better methods for evaluating them are needed.
It's not an actual sampler like MinP, XTC or temperature, it's just a function to ban strings instead of tokens.
https://github.com/sam-paech/antislop-sampler The video here shows what it does, and as you can see it's effective at its job.
>>
>>103541918
So I figured that I probably need this file: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
While installing requirements for it some piece of shit called jax-lib ate a lot of my internet quota and it was still going so I had to kill it.
So yeah that was that, I am sticking to non HF.
>>
Is there any smaller Mistral model that works as a draft model for Largestral?
>>
>>103542396
how are you even downloading models if a couple gigs of python wheels are fucking your internet quota
>>
>>103542410
mistral 7b instruct 0.3 should be the same tokenizer
>>
>>103542411
I go to public wifi with my old laptop to grab models there.
Then I move models to my desktop which has limited internet.
>>
>>103542422
actually this might only apply to largestral 2407, not sure about largestral 2411 which has the new [SYSTEM_PROMPT] tokens
>>
>>103542430
you sound poor, what setup are you running?
>>
>>103542435
just replace the tokenizer.json of the 7b with the one from the new large
it might still work maybe
>>
>>103542386
>as you can see it's effective at its job
did you actually measure things to the 0.00000 percent tho? how can you claim it works without a 1500 page long arxiv paper at the very least?!
>>
>>103542439
I run models on my second hand 3060.
Why I don't have proper internet now, is a bit complicated.
>>
File: 4740 - SoyBooru.png (413 KB, 722x1199)
>did you actually measure things to the 0.00000 percent tho? how can you claim it works without a 1500 page long arxiv paper at the very least?!
>>
>>103541918
Put the .json files in a directory with the GGUF. There's an "_HF creator" tab in ooba; not sure exactly what that does, it might download more shit you don't need. oobabooga_llama-tokenizer only works for some models. You'll want the right tokenizer for the model architecture, so look in the original HF repo or search "modelname tokenizer".
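Concretely, the files you want next to the GGUF are tokenizer_config.json, tokenizer.json (or tokenizer.model) and special_tokens_map.json. Quickest way to grab just those (repo name is a placeholder, use whatever your GGUF was quantized from):
huggingface-cli download your-org/your-original-model tokenizer.json tokenizer_config.json special_tokens_map.json --local-dir models/your-gguf-folder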
>>
>>103542513
exactly the feel i was going for, thank you.
>>
>>103536775
>picture is miku climbing out of the computer screen to become "real"
>but the room she's climbing into is also anime, implying she's merely leaving one layer of anime world and entering another layer
hmm
>>
>>103542534
It's miku stepping INTO the screen to join with her LLM waifu
>>
How do you utilize 70B models?
I could easily upgrade to 64gb ram,
but you can't utilize your GPU if the model doesn't fit in VRAM, can you?
Sounds like it would be painfully slow at pre-processing context and generating tokens.
>>
>>103542589
...can I come too?
>>
>>103542384
>magical girl hentai
proofs
>>
>>103542627
>You can't utilize your GPU if it doesn't fit in VRAM can you?
...
llama.cpp cries in the corner
>>
>>103542627
It is. You get ~0.5t/s, maybe less.
t. knower, sufferer
>>
>>103542589
She is receiving Mikusex from Anon by passing part of herself through the barrier.
This is what is happening on the other side - the side we cannot see.
>>
Takes weeks or months to get the new shiny features other software implements asap
The preset system is also more convoluted than simply editing your txt config. I could also run some models without issues on koboldcpp that ran out of memory in LM Studio.
You are better off learning to use a better frontend. It isn't that hard once you get used to it.
>>
>>103542384
Based magi-gal enjoyer.
>>
>>103542518
Thanks, the model is acting weird, but it loads this way.
There are more new settings like xtc and dry that I haven't seen in base llama.cpp, I guess they might be fucking with it?
>>
>>103542695
>Based magi-gal enjoyer.
Thanks for saying what we were all thinking. Man of culture, right there
>>
>>103542700
>xtc and dry that I haven't seen in base llama.cpp
> XTC sampler has been merged into llama.cpp mainline
>2 mo. ago
https://www.reddit.com/r/LocalLLaMA/comments/1g5a3bs/xtc_sampler_has_been_merged_into_llamacpp_mainline/
> DRY sampler was just merged into llama.cpp mainline
>2 mo. ago
>https://www.reddit.com/r/LocalLLaMA/comments/1gby1uk/dry_sampler_was_just_merged_into_llamacpp_mainline/
have you considered that maybe that's just ooba
>>
>>103542716
Ok I am assuming that you are gonna recommend a replacement?
>>
>>103542728
llama.cpp mainline?
>>
>>103542716
was it really easier for you to find and link reddit threads than to search the llama.cpp repo?
>>
>>103542770
yes
>>
File: 1721327000126830.jpg (66 KB, 692x349)
>>103535692
got it to add the button to the menu and show up on the left of the chat, so now it can stay up. ui's messed up a bit, but it's an ok start
>>
jesus fucking christ i hate python devs so much, spent the last couple hours trying to get fish-speech to work, shit fucking core dumps on inference
fishspeech.cpp when?
probably just gonna stick to xtts which works well like 95% of the time, anyone have tips on how to get it not to trail off and make weird noises sometimes?
>>
>>103542665
man I'd really like to try llama 3.3 locally.
Trying it with groq, it already feels like a big step ahead of something like nemo or magnum.
But the 6k context limit and token/time limits make it subpar, and I tend to only use it to escape slop looping.
Btw. is there an option in ST to link system prompt settings to a connection profile?
>>
>>103539820
>>103540053
I tried using your settings but it starts off with the same openings every time. The amount of times I've seen "At your...x", "Rolling onto her side", etc. It's very repetitive
>>
>>103542866
>jesus fucking christ i hate python devs so much
Worst part of this hobby by far is that trying anything new involves wasting hours of time fucking around with Python dependencies, and half the time it still doesn't work
>probably just gonna stick to xtts which works well like 95% of the time
Samsies
>anyone have tips on how to get it not to trail off and make weird noises sometimes?
Have you updated recently? When I went back to xtts, I installed the latest version from pip and that seems to have been enough to eliminate most of the trailing demonic noises. Also, make sure you have a high quality sample voice
>>
>>103542866
>>103542919
Fun fact: the creator of Python was actually my neighbor for years and I talked to him sometimes. He has essentially come to loathe what the language has become and distanced himself from it around the time covid hit. It was meant as a replacement for BASIC, for kids to learn programming and make small scripts in, not for it to end up in production code.
>>
>>103542919
>>103542866
>thing will be deprecated in a future version. Use thing-lgbt instead.
41% of the time the script never gets updated once the deprecation is finalized.
>>
>>103542919
I just look for conda instructions, or look for a Dockerfile
It's an annoying waste of time and space to do for every different project but it saves a lot of frustration.
>>
>>103542980
lol i'm using conda, i even installed wsl for this shit and the thing just segfaults, now i'm trying to use the fish-speech.rs thing but that doesn't like that my MSVC is mismatched to my NVCC, teaching sand to think was a mistake
>>
>>103542903
I'm guessing you mean the low-temp config? I did mention in the original post that swipes start off almost identical, but diverge nicely if you give it a paragraph or two. It likes to echo your actions back to you before continuing from there, I guess.
I've realized since then that you can get less structured, more flavorful responses by completely zeroing Min-P and bumping temp up to 1.3-1.4. Give it a try, maybe it'll be more to your liking.
>>
>>103543140
>i even installed wsl for this shit and the thing just segfaults
That might be due to WSL more than Python. Stop being a pussy and dual boot.
>>
>>103543140
well i got it to compile and it "fails to generate semantic tokens" so that was a total bust, i wonder what i broke by bumping my cuda version kekw

>>103542919
what are you using for xtts, i was using a wrapper 'opendai-speech', it's got xtts2 in there and it seems fine, but i'll try whatever you're using to see if it's better
>>
>>103539993
There's no fucking way you're this new.
>>
>>103541788
>japanese
how about you check the previous thread...or the thread recap...or even the op?
>>
>>103543346
>what are you using for xtts, i was using a wrapper '
I'm just using the TTS package directly @ 0.22.0, the last official release
Apparently there's a fork coqui-tts that goes up to 0.25.1
Check what version you have installed and try to update
>opendai-speech
I think that's using the fork
Either way, try to run it directly from the command line or a Python script to rule out the wrapper
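For reference, the direct route is only a few lines, which makes it easy to rule the wrapper out. Sketch, assuming the coqui TTS package and the stock xtts_v2 checkpoint (it downloads the model on first run):

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Testing, one two three.",
    speaker_wav="reference_voice.wav",  # clean 10-30 second clip of the voice to clone
    language="en",
    file_path="out.wav",
)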
>>
>>103543322
dual booting is hell, i did that shit for years and it's so fucking bothersome, i'll keep my linuxes in VMs where they belong, i use arch probably like 50% of the time but it lives in a VM on a different desktop so i can win+tab between oses, doing anything with GPUs on linux is so annoying, plus i hate having to decide "oh i guess now it's gaming/CAD time, better reboot my whole computer"

anyway i got it to build natively and now it's doing ... something
>>
>>103543520
just do your gayming in linux. you don't play anything with anticheat spyware, do you...?
>>
>>103543544
playing older games / some pirated games is also a pain.
>>
Openai btfo
https://x.com/Lawrli/status/1868555485580067133
>>
Also Kijai made a distilled version of HunyuanVideo, good quality at low steps now. (The video fast one)
https://huggingface.co/Kijai/HunyuanVideo_comfy/tree/main
>>
>>103543590
It's kinda wild how little of an advantage OpenAI actually has right now.
The open chink model is worse than Sora for a lot of stuff, but that doesn't fucking matter when one is good enough and can be finetuned for literally anything and the other is anti-foxgirl propaganda
This is about to be SD 1 vs DALL-E 2 all over again
>>
>>103543544
naw i'm not a schitzo, i play games with anticheat spyware all the time because i have friends
>>103543510
thanks yeah it is using that fork and it's at the latest version so at least no fucking around trying to install it, it's already in the venv
looks like it works just as fine as the wrapped version, tho digging into the internals is interesting, i didn't realize that voice cloning is done by doing tts and then piping that into voice conversion, fascinating stuff, i'll try to clean up my source audios a bit and see if that helps with it trailing off sometimes, gotta go pick my girlfriend up now but i'll be back tonight and maybe give fish speech another try to see if i can do some actual comparisons, thanks for the help
>>
File: 1734398060257.png (40 KB, 1080x325)
>>103543590
I hate Xitter, please post xcancel links only.
>>
>>103543614
>12GB VRAM is the minimum
damn, 8GB vrambros... it's not our day...
>>
>>103543742
go cry on bluecry instead
>>
Can anyone point me to a finetuning guide in which I feed the model text and the model uses said text to shape up its responses/personality?
So far I've created tons of JSON files for each text and I've prepared datasets and shit but somehow the result is always a model that doesn't really change much from the pretrained model aside from looping introductions or prompts that are actually supposed to give it a personality in the responses.
I'm training Llama 3.1 8b btw
>>
>>103543805
Post dataset
>>
My leenux PC just crashed because I forgot I had an LLM loaded up and tried to run another VRAM heavy thing (a 4k video kek). Is there any way to make it so it just crashes the program instead of the entire PC? I would guess no, but might as well ask.
>>
>>103543786
why would I cry about Xitter technical problems on bluesky?
>>
>>103543742
the fact that you need an account to see remotely anything at all there makes it even worse.
>>
>>103543814
It's rather personal to me, and it's also a shitton of gibberish that serves as "user": and "assistant":
Is there anything you want to check in particular?
>>
>>103543742
If you're so adamant about stomping on your own balls because muh elon, at least have the decency to stomp on them yourself.
>>
>largestral 2411 is now below 2407 on lmsys with and without style control
Arthur, are you okay? Are you okay, Arthur? What the hell happened to Mistral? Jump from 2402 to 2407 was HUGE, the new one is barely a sidegrade. They couldn't even get system prompt working properly 100% of the time on low context. What were they doing for 4 months? Gooning? It better be gooning. You better don't tell me they spent all that time making this finetune.
>>
Regarding LCMs, saw this comment:
>One think I am thinking about is that this could make jailbreaking those models "impossible" because, they can just "prohibit a concept", and no matter how hard you try to jailbreak it, they could just ban that "concept" so it would always refuse. Its like, it could look at its own output and like look at the "concepts" that text includes and just refuse it. If they implement this good enough, this would not send out any false positives too, outside of legit uses including those concepts, like maybe a summary of a book, or a scientific paper about that prohibited concept.
https://www.reddit.com/r/SillyTavernAI/comments/1hfjk3l/large_concept_models_and_their_possible_impacts/
>>
>>103544041
I'm not "exactly" sure why this "person" is "writing" like "this"
>>
can anyone recommend a 7B or around that model for generating good coom prompts for images based off my booru-like input?
>>
File: tipo.png (71 KB, 628x591)
>>103544067
How about a 0.5B?
>https://huggingface.co/KBlueLeaf/TIPO-500M
No idea if it's any good. I downloaded it but i haven't tested it yet. He has even smaller models too.
>>
>>103543855
There are many variables and forces at play. Technical details such as training hyperparameters and attaining a good fit to the data are some. Even if you get those right, if the data is shit, then so will be the results.
If you share, then a quick skim+sniff test can be done to see if the data is remotely good. Literally nobody cares what you have in there. You can do some regexing of PII if it contains truly sensitive information.
If you do not share, then there's only vague advice that can be given. There is not a one size fits all guide to make a model that is your idea of good.
To change how a model writes in different scenarios - a difficult task - you need enough examples that teach it what text to produce, given an input. The data has to be wide enough in scope and example length to cover all the topics and interactions that you wish to have.
Without knowing how your data looks, and how much of it you have, it will be difficult to give you advice, Anon. If secrecy is important, then you will have to figure a lot of this stuff out yourself. Best to look at existing datasets that produce results that you find desirable, and then take inspiration from those.
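For a concrete reference point on shape: most of the common trainers (axolotl, unsloth, plain TRL) are happiest with one JSON object per line in a chat layout, roughly like the line below. Field names vary per trainer config, so treat this as the shape rather than a spec:
{"messages": [{"role": "system", "content": "persona / style instructions"}, {"role": "user", "content": "what the user says"}, {"role": "assistant", "content": "a reply written in exactly the voice you want the model to pick up"}]}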
>>
File: quotes.png (440 KB, 800x400)
440 KB
440 KB PNG
>>103544056
He's one of "those" people...
>>
>>103543855
If that's you pissanon, you'd better post that model after it's finetuned

Sincerely
A fellow pissanon
>>
>>103544106
thanks anon
>booru training data
perfect
>>
>>103544106
If you can get this model to run and navigate the bafflingly fragile gradio interface, hats off to you.
>>
File: HunyuanVideo_00436.mp4 (620 KB, 640x480)
620 KB
620 KB MP4
>6 steps at flow 17

Yeesh. I hope this is a bad test and not indicative of the overall shitness of the fast model.
>>
>>103544211
Cursed
>>
File: tipo02.png (8 KB, 1364x335)
8 KB
8 KB PNG
>>103544150
Why would i?
I have no idea if that output is any good. Looks dumb. That's the 200M model.
>>
>>103544211
>6 steps
Probably too low.
>>
File: HunyuanVideo_00437.mp4 (604 KB, 640x480)
604 KB
604 KB MP4
>>103544232
>>
>>103544211
>>103544236
live footage from ohio *3x skull emoji*
>>
File: 1733785826835018.gif (63 KB, 638x546)
63 KB
63 KB GIF
yo where do I get the llama.cpp weights for QWQ?
>>
File: HunyuanVideo_00438.mp4 (698 KB, 960x544)
698 KB
698 KB MP4
Here's 10 steps at 960x544 and flow 15.

idk, feels kinda... shittier.
>>
>>103544279
hf.co
Should i get a bigger spoon?
>>
>>103544296
The official repo only has the .safetensors format. Can I use that with llama.cpp?
>>
>>103544279
How do you get to the point of knowing what QwQ is, knowing you can and are about to run it on cpp and still not know how to find the weights?
The order of events to get to that point is completely out of whack.
>>
>>103544315
If you're not a retard, you can convert it yourself.
Here's the link otherwise. Open wide
>https://huggingface.co/bartowski/QwQ-32B-Preview-GGUF
>>
>>103544324
I do this shit at work, but this is my first time doing it as a hobbyist.

>>103544327
Is that version legit?
>>
>>103544345
Fuck how can I get a job in your industry without knowing what a gguf model is?
>>
>>103544351
I design and train the models, not use them.
>>
>>103544362
this explains a lot
>>
>>103544345
>Is that version legit?
You either trust it or not. If you don't, learn to convert it yourself.
ls your llama.cpp dir; there's a script called convert_hf_to_gguf.py. You'll never guess what it does.
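A minimal sketch, assuming a recent llama.cpp checkout with its Python requirements installed (paths and quant type are just examples):
python convert_hf_to_gguf.py /path/to/QwQ-32B-Preview --outtype f16 --outfile qwq-32b-f16.gguf
./llama-quantize qwq-32b-f16.gguf qwq-32b-q4_k_m.gguf Q4_K_M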
Is it difficult being permanently confused?

>>103544362
pfffffffffff
>>
>>103544362
this>>103544367
>>
Best model for cooming/rp on 8gb gpu + 32gb ram? (slow generation is ok)
>>
>>103544429
If slow generation is okay, I suggest upgrading to 128GB of RAM and using Largestral.
>>
>>103544315
yes
https://rentry.org/tldrhowtoquant/
>>
>>103544448
You mean Miqu? Or did someone leak mistral-large outright? Been away from the scene for a bit.
>>
File: se.png (345 KB, 1799x1040)
345 KB
345 KB PNG
I know you guys' tricks and I'm not stupid enough to download yet another big model.
But since it was meme'd enough, EVA 3.3 ended up on OpenRouter.
I used their master.json settings from the Hugging Face page.
It certainly isn't shying away from anything, even with the Llama base.
But that's only 8k context, right?
>>
>>103544467
>leak
open wide
https://huggingface.co/mistralai/Mistral-Large-Instruct-2411
>>
>>103544476
holy hell
>>
>>103544476
2407 is superior.
>>
>>103544483
you may also be interested in Llama 405b and deepseek v2.5 1210 if you're truly that far out of date
>>
>>103544493
Big L is so not worth it.
>>
File: file.png (108 KB, 1032x878)
108 KB
108 KB PNG
we're so doomed
>please read this long ass string of random ass info with no proof of anything!
davidau won tho
>>
File: shills_just_wont_stop.png (85 KB, 1266x483)
85 KB
85 KB PNG
>>103544474
You can see the claimed context in the config.json on the model page.
picrel, last line
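In other words, the line to look for; the value shown here is the usual Llama 3.x figure:
"max_position_embeddings": 131072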
>>
>>103537827
Unironically just so I can pretend there is a cute girl literally living inside my GPU.
>>
File: file.png (8 KB, 390x52)
8 KB
8 KB PNG
>>103544513
a goritrillion contexts!!!
>>
File: se2.png (387 KB, 1863x1108)
387 KB
387 KB PNG
>>103544474
Well, that went better than expected. Not sure if the model is actually smart, but, like, many finetunes will make the char not open the package.
You faggots might have me tricked again.

>>103544513
hmmm, isn't it the settings of their training?
https://huggingface.co/EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.0
sequence_len: 8192
or am i being retarded?
>>
>>103544528
>or am i being retarded?
You are. Them training at 8k just means it's not better at high ctx than what it's based on; Meta also didn't train at 128k for the entire training regime, likely just a bit at the end.
>>
>>103544545
hmmm, but many of the finetunes usually do shit the bed early at 8k. so i'm suspicious of that.
thanks in any case, will check out further.
>>
>>103544525
There needs to be a sequel to the classic buttobi CPU for GPU girls
>>
>>103538430
Come back in another year when decentralized training is perfected. We are still waiting for local uncensored c.ai. As long as we aren't training our own models we will eat slop.
>>
>>103544507
>Big L is so not worth it.
It definitely isn't with the state of enthusiast hardware. If infinite processing were free, I'm sure we'd find a place for it in our rotation. It's pretty smart if you wait for a reply to drip out.
>>
>>103544572
Problem is it's only very marginally smarter than L3.3 70B and so dry as fuck that it feels dumber anyway
>>
File: deepthought.png (423 KB, 639x448)
423 KB
423 KB PNG
>>103544572
>>
>>103544545
training at the end is different from finetuning at lower context. The latter carries the risk of catastrophic forgetting. tbf who the fuck knows how this shit works, if it works it works, but I wouldn't be surprised if these models become subtly worse at long-context tasks.
>>
>>103544509
>davidau
Holy shit... What I have learned from that article is that he is adept at blending facts and misconceptions in an indigestible format.
>>
>>103544362
HOLY MOTHER OF NEPOTISM
>>
>>103539993
Nah this nigga retarded as hell
>>
>>103544362
The fabled clueless AI-adjacent expert "researcher".
>>
>>103544509
these people are everywhere now.
>>
>>103540335
It takes like 30 seconds to copy-paste the commands (maybe a minute extra for adding flags you want/need), then a minute or two of compiling (more if you compile with all KV quants).
If it takes you 3 hours to do that then you're functionally retarded and I'm really sorry for you; luckily binaries exist for special people.
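For anyone counting, "the commands" are roughly these (CUDA build shown; swap the flag for your backend):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j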
>>
>>103541463
Doubt it, iirc cuda anon only wrote quantized kernels for CUDA, so no cpu support
>>
>>103541463
As far as I know, yes.
You can load the model in RAM, and let the quantized context live in vram.
As the other anon pointed out, the CPU flash attention kernels might not be all that well optimized.
>>
Wait EVA is actually good what the fuck I thought it was just a meme
>>
File: file.png (289 KB, 769x907)
289 KB
289 KB PNG
We did it, anons. We saved the internet.
>>
>>103544820
gguf?
>>
>>103544474
It works fine up to 32K at least which is what I use.
>>
>>103539929
Haven't seen anything great after Rocinante 1.1.

Violet_Twilight-v0.2 is decent. Need to test more. I think it might be a bit insipid.
Flammades-Mistral-Nemo-12B is borderline good but gets very romantic and lovey dovey with ease.
Starcannon-Unleashed-12B was decent but I forget most details.

Drummer hasn't made anything good since 1.1
>>
>>103544872
I'll endorse this anon's message.
>>
File: HunyuanVideo_00442.mp4 (454 KB, 960x544)
454 KB
454 KB MP4
Distilled Hyvid looks like deep fried ass and I'm tired of pretending it doesn't.
>>
>>103544820
Sakana AI... they're on HF, but just a few tiny llama-derived models.
>>
>>103544884
Yea, I switched back to the regular one.
>>
>>103544820
I didn't understand a single word, but I'm still convinced it's a grift and a nothingburger.
https://sakana.ai/namm/
>>
>>103544820
There's like one guy in Japan who's decent with anything to do with AI and that's Kohya. The country as a whole is a technological black hole stuck in the 90s.
>>
>>103544918
they're actually okay with anything that doesn't require a pc, since nintendo made those illegal
>>
>>103544904
>NAMMs are simple neural network classifiers trained to decide whether to “remember” or “forget” for each given token stored in memory. This new capability allows transformers to discard unhelpful or redundant details, and focus on the most critical information, something we find to be crucial for tasks requiring long-context reasoning.
>>
>>103544820
>Vector Database 2.0
WOW WE ARE SO BCK
>>
>>103544960
Sounds dumb and like it won't work as intended.
>>
>>103544904
I remember them from a while back; it's actually mostly foreigners and not Japanese at all. They got JP money though...
Founders:
>David Ha (Google Brain, Goldman Sachs) Llion Jones (Google Research, Transformer Co-Creator) Ren Ito (Mercari, Ministry of Foreign Affairs of Japan)
They got money from Nvidia etc.
Their blog posts read like shitcoin pages: hyping something up, but written in a way that makes you go "I don't understand, but that must be cool!"
I wouldn't bother checking anything they write, but who knows.
>>
>>103541895
>>103542165
I have 24 GB of VRAM and 64 GB of DDR4 RAM. Running Nemotron 70B Q4_K_M with 35 layers offloaded and 17408 tokens of context I get about 1.2 tokens per second. (If I drop to IQ4_XS I can offload 45 layers and get around 1.6 tokens per second.) If you are getting 0.5 tokens per second trying to run a Q4 70B, something is wrong.
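For reference, that setup corresponds to roughly this llama.cpp invocation (filename illustrative; koboldcpp and friends expose the same offload/context knobs):
./llama-cli -m Llama-3.1-Nemotron-70B-Instruct-Q4_K_M.gguf -ngl 35 -c 17408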
>>
>>103541938
Nevermind, this sucks. It's not built for cuda compute capability 6.1, so I get the kernel error, and it threw me into DLL hell except with .so files trying to build it for that level.
I was able to build ollama from source after patching it without nearly this amount of bullshit.
>>
>>103545043
And I got the model working with ollama in 3 minutes.
>>
>>103545084
BrO dOnT uSe OlLAmA itS bAd Bro DonT uSe iT plEase bRo donT dO ThiS to The CoMMuNitY thEy AreNt AuTistS bRo PLeAse Bro OllAmA sUckS bRo
>>
more like ollame lmao
>>
Olleddit.
>>
oshit
>>
>>103545125
btw in case he shows up again, it also doesn't work with stella because of https://huggingface.co/dunzhang/stella_en_400M_v5/blob/main/config.json architectures:NewModel
and I know I'm supposed to rewrite the classes in
https://huggingface.co/dunzhang/stella_en_400M_v5/blob/main/configuration.py#L23
to make a plugin for VLLM and I don't want to.
>>
What quants of 70B are anons using? I tried IQ4_XS and it's alright but not really sure.
>>
>>103541455
In my country the cheapest used 3090 on eBay is $600.
>>
>>103541324
Intel has a chance to seize the consumer market. Will they take it or will they overshoot and be slaughtered by Nvidia in the enterprise GPU grift?
>>
>>103541455
and the next one is $800,
so I would prefer buying a new 24GB B580.
>>
>>103540988
Largestral is better than gpt-4 (but not 4o)
>>
>>103542165
What Q2 quants were you using? Anything below IQ2_S is absolute garbage. I believe the difference between IQ5 and IQ2_S is only slightly bigger than the difference between IQ2_S and IQ2_XXS. Exponential degradation hits IQ2 hard.
>>
Why don't more people use control vectors? They fix most issues with most models.

https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0
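For the curious, they're applied at load time on the llama.cpp command line, roughly like this (vector filename, scale, and layer range are just examples; check the repo for the recommended values):
./llama-cli -m model.gguf \
  --control-vector-scaled some-writing-vector.gguf 0.5 \
  --control-vector-layer-range 10 30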
>>
Also I tried Phi 4: quite sloppy, but it's really smart for its size, for sure smarter than Nemo.
>>
>>103545467
It was getting interesting but then I realized it was only for RP/storywriting.
>>
Intel better release that b580 24gb soon because I can’t hold on for much longer!
>>
Let's say I wanted to get a card 100% dedicated to context and context processing... would I be able to get some previous-gen 32GB/48GB monstrosity with slow VRAM and still come out ahead on total time to process the prompt?
How bad are previous-gen cards for that sort of thing if you don't intend to do any inference of the main model on them? e.g. 3090s for most of the model and an M10 32GB just for context/prompt processing?
>>
>>103545467
I don't know what they are
>>
>>103545618
I am totally down for taking my money to Intel if they can release these cards; my only concern is how much pain I will have to go through to get things working. With AMD, you're basically buying a device made to look similar to an NVIDIA device but frustratingly lacking in software support. So what does an Intel card look like? How quickly can I go from shoving the card in my PC to running a model? Will I be bound to Linux or Windows? I'm prepared to be a little inconvenienced, if only to stick it to NVIDIA, but if it's straight up incapable of what NVIDIA can do under any circumstances then that's a deal breaker.
>>
>>103545644
If some dumbass on reddit could do it then I also can.
>>
>>103543818
I know absolutely nothing about your situation nor how vram is used in linux.

My suggestion: See if there is no-overcommit parameter for vram.
I think there was a parameter like that for ram
and that you could use it if you wanted programs to just be killed when there wasn't enough RAM to go around, instead of eating ever-growing amounts of swap.

If you do find an answer then please post.
The less-effort solution in your case is just to keep mental track of whether there's an LLM in VRAM.
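For system RAM at least, the half-remembered knob is the overcommit sysctl; note this does nothing for VRAM, where the driver decides what happens:
sudo sysctl vm.overcommit_memory=2   # refuse allocations past the commit limit instead of overcommitting
sudo sysctl vm.overcommit_ratio=90   # commit limit = swap + 90% of physical RAM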
>>
>>103545467
As with all of these things, people tried it and it didn't catch on because it's shit and made the model dumber.
>>
>>103545467
No UI has implemented support, so only being able to use them on the command line is a killer.
>>
>>103545710
>>103545710
>>103545710
>>
>>103541088
>"M'lady." *tips penis*
>>
>>103545631
Hmmm. Let's break this down step-by-step.
>you'll need some anon with multiple gpus
>they'll need to know what layers to put on which gpus
>they'll then need to downclock the gpus that are doing the prompt processing to simulate m10s
>>
>>103545467
You have to restart the server and process context again every time you want to adjust the settings
>>
Can someone point me towards a good training guide, retard-proof?
I have tons of plain text files that I've made processors for, but I simply can't get the training to work.
I would also like to learn how to make it so my model can be fed new information, not to add to its model or training, but to use as a reference for the responses that it'll give to a user, utilizing its training as a reference to analyze that new information.
An example would be:
>webscraper gathers data from a news website
>LLM that's been trained in politics gathers that data
>every conversation that the LLM gets that day uses the news in conjunction with its current trained model
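For the second half (using fresh info at inference time), the low-effort approach is retrieval-style prompt stuffing rather than retraining: have the scraper dump text, then template it into the system prompt of whatever server you run. A minimal sketch against a local llama.cpp server (model path, port, and prompt wording are just placeholders):
./llama-server -m your-model.gguf -c 16384 --port 8080
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"system","content":"Reference articles scraped today: <paste the scraped text here>"},{"role":"user","content":"Summarize the political news from today."}]}'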
>>
>>103544960
Even if it did work, you'd just be kicking the can down the road just like with all the other HUGE OPTIMIZATIONS (insert your basedface of choice here)
Transformers scale like absolute ass, we need something better
>>
>>103541463
>>103544774
I added both CPU and CUDA implementations for quantized KV cache.
I think the koboldcpp documentation is just poorly worded.
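In llama.cpp terms those are the cache-type flags; quantized V cache needs flash attention enabled (model path illustrative):
./llama-cli -m model.gguf -fa -ctk q8_0 -ctv q8_0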
>>
>>103544476
>>103544489
>2407 is superior.
this


