/g/ - Technology


File: ComfyUI_06542_.png (2.12 MB, 1280x1280)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>103536775 & >>103525265

►News
>(12/16) Apollo: Qwen2.5 models finetuned by Meta GenAI for video understanding: https://hf.co/Apollo-LMMs/Apollo-7B-t32
>(12/14) CosyVoice2-0.5B released: https://funaudiollm.github.io/cosyvoice2
>(12/14) Qwen2VL support merged: https://github.com/ggerganov/llama.cpp/pull/10361
>(12/13) Sberbank releases Russian model based on DeepseekForCausalLM: https://hf.co/ai-sage/GigaChat-20B-A3B-instruct
>(12/13) DeepSeek-VL2/-Small/-Tiny release. MoE vision models with 4.5B/2.8B/1.0B active parameters: https://hf.co/deepseek-ai/deepseek-vl2

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/tldrhowtoquant

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/hsiehjackson/RULER
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>103536775

--Caching and VRAM usage in AI models:
>103541013 >103541031 >103541424 >103541688 >103541463 >103541542 >103544774 >103544805
--Anon shares Meta Apollo multimodal model for video understanding:
>103536845 >103538783
--Llama.cpp and ollama discussion, downstream projects, licensing, and samplers:
>103540227 >103541665 >103541736 >103541831 >103541884 >103541976 >103542088 >103542112 >103542244 >103542386 >103541806 >103541938 >103541856 >103541885 >103542029
--Intel Arc B580 24 GB version and custom AI board discussion:
>103541324 >103541352 >103541354 >103541370 >103541383 >103541395 >103541414 >103541474 >103541507 >103541700 >103541489 >103541648
--Anon discusses potential advancements in scaling test-time compute:
>103539461 >103540364 >103540423 >103540430 >103540496 >103540552
--Sakana AI's new LLM optimization technique:
>103544820 >103544889 >103544960 >103544994
--Llama.cpp performance with speculative decoding vs EXL2:
>103542255 >103542262 >103542323 >103542362
--Finding a smaller Mistral model for Largestral:
>103542410 >103542422 >103542435 >103542451
--Anon shares fix for building llama-server issue:
>103539211 >103539613
--Fixing llamacpp_HF ERROR due to missing tokenizer:
>103541918 >103542396 >103542518 >103542700 >103542716 >103542728
--Anons share their use cases and experiences with local AI models:
>103537827 >103537852 >103537944 >103537932 >103538009 >103538279 >103538297 >103539213 >103542652
--Anons discuss why they dislike LM Studio:
>103539403 >103539436 >103539498 >103541132 >103541327 >103539947 >103540330
--Debian unstable and 6.12 kernel experiences for LLM workloads:
>103541536 >103541553
--Kijai's distilled HunyuanVideo model:
>103543614
--Miku (free space):
>103537120 >103538763 >103541088 >103544113 >103544211 >103544236 >103544284

►Recent Highlight Posts from the Previous Thread: >>103536779

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
>>
So EVA finally gave us local Claude
What do we do now
>>
>>103545667
I'm more thinking about all ML tasks in general. Getting it to work with one repo is fine, but am I going to have to fiddle around with every repo from now until the GPU dies if I buy intel or does it more or less just werk?
>>
Tetolove
>>
Okay. I'm a little curious about control vectors. What do they do? They seem really popular, but what do they offer that a system prompt doesn't?
>>
>>103545736
>What do we do now
We stop posting dumb bait about mediocre models being local Claude.
>>
>>103545777
Just better ways to control how a model writes. The names kind of explain what each does.
>>
>>103545777
>They seem really popular
Oh yeah? Is that why there's one post about it in the past two months?
>>
>>103545791
Based
>>
>>103545815
Most people here struggle to figure out kobold.app. llama.cpp with commandline? that is gonna scare off 99%+
>>
So, I tested Eva 0.1 vs. 0.0, and while I can tell there is a difference, I'm not sure exactly how to quantify it.
The first, most obvious difference I found is that with the exact same settings, 0.1 is far less likely to write multiple paragraphs unprompted (0.0 would often go on and on, which I personally considered more of a feature than a bug).
On the upside, it feels like it does follow character definitions even more accurately than 0.0. However, for some reason, at higher temperatures, it ends up getting delusional more frequently than 0.0 seemed to. (At lower temps, this doesn't seem to be an issue at all.)
Keep in mind, I haven't done any particularly rigorous testing yet, just swapped 0.1 in place of 0.0 and continued with the card I already had loaded up for now.
To anyone else who tried both: are you getting the same impressions?
>>
How bad was L3.3 that it caused a cope themed tripfag to spawn over its arrival?
>>
>>103545832
Okay but why did you lie about it being popular
>>
>>103545855
Downloads last month
105,633
>>
>>103545852
Can you tone your seethe down a couple of octaves? I'm trying to get some sleep here
>>
>>103545855
He's not the anon who said that. I was. But my reasoning was >>103545874
That puts it well in the popular category for me. What I don't understand is why it never appeared on my radar until now and how it affects the outputs.
>>
>>103545852
Bro, the whole reason I started tripfagging is that people were trying to ask me questions while mongrel dogs like you were trying to run interference. Mald harder.
>>
the shilling has become a lot less subtle lately
>>
>>103545911
Huh? I've only seen you once and you post reads like cope. "It's still good, it's still good!"
If you don't want to be branded like that, don't name yourself after a model and keep trying to find ways to make it work.
>>
>>103545894
I think it's a combination of this >>103545832 >>103545724 and this >>103545881
I would guess it's too much of a pain to use when you have to constantly restart the server and change the command whenever you want to adjust the settings for the vectors
>>
>>103545946
>you post
Pajeet confirmed
>>
>>103545944
Indeed, fellow anon. Speaking of which, why is nobody talking about InternLM 2.5 20B? This model beats Gemma 2 27B and comes really close to Llama 3.1 70B in a bunch of benchmarks. 64.7 on MATH 0-shot is absolutely insane; 3.5 Sonnet has just 71.1. And with 8-bit quants, you should be able to fit it on a 4090.
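For what it's worth, the weight-only math on that last claim roughly checks out: 20B parameters at 8 bits per weight is about 20 GB, which leaves around 4 GB of a 4090's 24 GB for KV cache and overhead, so it fits with a modest context.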
>>
>>103545460
IQ2_S. It wasn't unusable but you could tell it was off and prone to repetition or weird outbursts. Maybe my settings were off or that's just how the model is. Not sure I want to DL a huge Q4 just to slowly test that theory on CPU.

>>103545023
Sorry man I was just making a point that it was comically slow and I didn't remember the exact T/s. I used to genuinely get 0.5T/s but that was a while back before llama.cpp got optimized so I believe you. Still not worth it compared to 35T/s with models that fit on my GPU.
>>
When will anons learn to let their logs do the talking? It proves your point and improves the quality of the thread in one fell swoop
>>
Suppose that a model's weights occupy X gigabytes of memory. How much extra VRAM and/or DRAM do I need to actually run the model? 2X? 4X?? 8X???
>>
>>103546009
You have access to the same weights as everyone else, make your own logs.
>>
>>103545946
>Huh? I've only seen you once
Had you been lurking in the past 7 threads you would have seen his origin story.
That you haven't means that you just went off the name and a head full of nothing.
>>
File: file.png (364 KB, 3840x1956)
Kinda new to this, so I have been lurking for a while trying to get an AI rig up. I'm looking at the Open LLM leaderboard to find what model to use, and this pajeet's finetuned models score pretty highly. Has anyone actually tried them and found they do well generally, or are they just optimizing for benchmarks? It feels like it's the latter given how no one talks about them here.
>>
>>103546052
I tried RYS-X-large, it was benchmaxxed crap.
>>
>>103546009
That's what people used to do when Llama 1 came out. Then after Llama 2 came out there were a bunch of newfags that were uncomfortable with RP, there was a big scare about AI legislation so anyone who posted anything questionable was called a glowie, and on top of that real shills started showing up declaring how good their models are without posting logs, so not posting logs became more common.
So basically, and this goes for anyone in this thread: most people won't post their logs, but if you like a model then post your logs and fuck the retards that don't like it.
>>
>>103546052
The benchmark leaderboard is completely useless, overfit to hell. The chat arena based on human evals used to be okay but I think people figured out how to overfit to that too.
This thread might be the best source right now for which models are decent, as horrifically sad as that is. It's still shit though, the only proper source is testing manually yourself
>>
How badly will my models shit the bed going from FP16 inference to int8?
>>
>>103546155
0-5%, almost lossless
>>
>>103546155
>int8
?
>>
still nothing to unironically beat mistral large while staying below 400b?
>>
>>103546155
You're cutting the size in half. Go do the math.
>>
>>103546249
/g/echnology
>>
File: chatlog (15).png (793 KB, 1087x1821)
You know... I had always dismissed this:
https://eqbench.com/creative_writing.html
due to it putting gemma 2 9B tunes over even mistral large, but now, using it again with a bit of min-p and lowered temp, it really is possibly the best model I've used... And it actually does not seem limited to 8k but to 30k-ish with a rope frequency base of 59300.5.

>What about logs
There, my non human test log. It never fucked up by giving her human anatomy even once in many swipes, something even mistral large often fails at.
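For anyone who wants to reproduce that setup, a minimal sketch of how the rope base could be passed to llama.cpp's server (model filename and quant are assumed here; check llama-server --help on your build for the exact flags):

llama-server -m gemma-2-Ifable-9B-Q6_K.gguf -c 30720 --rope-freq-base 59300.5 -ngl 99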
>>
>>103546241
No. People are trying to cope by saying certain 70Bs have achieved it, but they're mistaken.
>>
>>103545710
>it's time
>>
>>103545710
>CosyVoice2
Their examples sure make it sound stronk, the short-prompt ones being especially impressive at first. That it can get something convincing out of a 3 minute sound bite is one thing; 6 seconds is something else.
Then I remembered that voice shit like this is notoriously annoying to get working and not for the wider public like image or text autism.
>>
>>103546353
It's sitting in a little venv on my wsl instance and I can't get the shit to work. If you can, I'd love to hear it.
>>
>>103546373
I've been getting random tts projects to work all day, i'll gibe it a shot

(xtts is by far the best, fish is maybe more consistent, but doesn't sound as good a lot of the time imo, definitely they're close enough that it's better to just use xtts since it's so much easier, i'll get cosyvoice going and compare)
>>
>>103546456
>i'll get cosyvoice going
If you do tell me what you used to get it going and if it's worth it.
>>
>>103546296
Gemma was always a decent model as far as intelligence goes, but I don't think the context is as good as you say unless something has changed in the backends. I remember testing it to about 6k (no rope) and noticed it started getting memory issues (which other models didn't have a problem with when I swiped on the same response). The RULER benchmark said otherwise (I was even able to reproduce the results as a sanity check), but my experience using it in a real-world multiturn RP setting gave me a different conclusion about its real context.
>>
>>103540387
Tried it out with my favorite card and came BUCKETS.
Absolute upgrade.
It's quite something the kino we can get out of a model that runs on a 12 GB card these days.
Now I wait for Bitnet to let me run 72B-quality models at lightning speed on a 12 GB card.
>>
File: 1717102897392813.jpg (113 KB, 990x990)
So Qwen models are currently the best? When am I supposed to use each model?
>Qwen 2.5 72B
>Qwen 2.5 Code 32B
>QWQ 32B
>>
>>103546009
>just post your embarrassing niche fetish logs with the writing style/message length you prefer but others might hate
Nah.
>>
>>103546551
>72B
Math and sciences.
>Code 32B
Code writing.
>QWQ
Code architecting.
>>
>>103546575
Also even if the log is good people will say it's bad if they have a grudge against the model that generated it.
>>
>>103546551
Where do you get these cat images from?
>>
>>103546319
Who said that? At least in my posts praising 70B (specifically eva) I never mentioned anything about Mistral Large, and when I did mention Mistral Large, it was with praise from the short period of time I did try it.
>>
File: 1734365187090413.gif (331 KB, 220x220)
>>103546589
>>>/bant/
>>
File: chatlog (18).png (487 KB, 1087x1081)
Oh, I didn't even mention which model it was, its this one: https://huggingface.co/ifable/gemma-2-Ifable-9B
>>
File: sataniaquestion.jpg (572 KB, 1741x1080)
>smol model that produces kino once every 20 swipes or so, and sometimes requires temp tweaking
>large model that requires very few swipes and zero sampler tweaking
>smol model actually produces more accepted messages per 30 minutes in the long run, despite needing occasional temp tweaking and so many swipes, because it's very fast
Which one, /lmg/?
>>
>>103546517
Also, it actually seemed somewhat smarter than Mistral Nemo Instruct and better at interpreting my favorite card's details.
I am pleased.
>>
>>103546612
The large model because swiping and seeing all the kinds of fuckups it makes reminds me more that it's a retard and brings me out of the flow state.
>>
I got gaslit into trying the qwen EVA models again, and they're still fucking retarded. Evathene seems kind of okay in the 70b bracket. I'm doing structured generation to programmatically build more elaborate worlds and basically no finetunes are capable of handling this. I only use them for descriptions where it's fine to get schizo. Everything else needs a base instruct model.
>>
>>103546612
Also sometimes a small model gets stuck in a very stupid place that even swiping a ton can't get it out of while a large model understood what to do.
>>
>>103546612
Rationally I should pick the small model in your hypothetical. But in practice the problem I have is that all the retarded/incoherent swipes feel like "peering behind the curtain" and make me too aware of the fact that I'm fundamentally just playing with a probability calculator, which ruins it. With a smart model you can forget.
>>
>>103546639
If you're doing complex stuff I could see it. Eva is relatively schizo out of the models I've tried. But that's also what makes it fun, while other models feel more boring and uncreative even when you use higher temps and various samplers.
>>
>>103546640
This is where you need to just edit your last message to steer it properly. With some skill and smarts you can hint at the direction you want it to take things without outright telling it what to do, and it's actually satisfying when this works.
>>
>>103546630
>>103546646 (me)
Huh, so it's not just me.
>>
>>103546649
I have done that before and it again ruins the experience. Swiping itself already is making me do something that I wish I didn't have to.
>>
>>103546612
Big model that's slow and gets it right.
Every time.
It's one of the reasons I liked QwQ. I could visually see it understanding and confirming the situation as it thought up the response. Made the RP even more interesting.
>>
>>103546664
got a good system prompt? I couldn't wrangle it.
>>
I don't think I could tolerate using a slow model at this point.
I have a sickness. That sickness is the inability to fight the urge to swipe, even if the response is kino, because I want to see if the next response is even more kino.
My urge to experiment and push the model to its limits, finding out exactly what it's capable of, is too strong. The intellectual curiosity is too strong.
>>
>>103546085
>>103546126
Thanks for the advice. I guess I'll go with Qwen 2.5 then since people were shilling it for intelligence. Not sure about the whole RP thing yet but yeah, seems like EVA is great.
>>
File: vomit.png (995 KB, 1825x417)
>Chink models
>>
>>103546683
I don't wanna be mean but I don't think that's really "intellectual curiosity". It's just the same impulse when you're gooning that makes you keep clicking on new videos instead of just cumming with the current one that's good enough, because you're waiting to finish to the perfect clip (which doesn't exist)
>>
>>103545834
So you're telling me that a model that's more trained is better at following instructions but worse at higher temps? Holy shit, what a revelation
>>
>>103546700
I do it even when I'm having non-sexual conversations, though. It is absolutely about testing what the model is capable of.
>>
>>103546683
The problem with fast small models for me is that let alone kino, its responses most of the time don't even have good logical coherency in the scenarios I test. I could see what you mean if you meant that you wished large models were faster so you could swipe to see more kino. That would be nice.
>>
>>103546698
white cat, black cat...
whatever catches the mouse
>>
>>103546725
Keep telling yourself that. As long as you can reframe your addiction as a virtue, you don't have to change.
>>
>>103546733
>*randomly replies in chinkrunes to your message*
>>
>>103546740
Anon, I literally have thoughts like "I want to swipe again to see if the model is capable of producing something better."
I actually have these thoughts because I have an internal monologue, so I know what I'm thinking when I do it and why I do it.
Maybe you don't because you're an NPC and don't understand what having an internal monologue is like.
>>
so you're stuck installing either facebook or china software on your pc to run models
why do people pretend this is safer than using american nonprofit openai?
>>
>>103546760
>doesn't read the op
https://rentry.org/IsolatedLinuxWebService
>>
>>103546760
>why do people pretend this is safer than using american nonprofit openai?
>nonprofit
>openai

*Scoffs* Oh, honey, where do I even begin with this naive little post? Let me paint you a picture:

OpenAI, that "American nonprofit" you're singing the praises of, had a little coming-of-age moment and decided to *gasp* join the big leagues. That's right, sweetie, they're now a **for-profit** company. The nonprofit status was so *last year*.
>>
is EVA shill anon around? i finally got the models what are the settings/prompts i should try to see how brilliant it is?
>>
>>103546759
I thought you were trolling at first but I think you might be legitimately autistic anon
>>
>>103546476
got it running, first test sounds like an extremely FoB chinese girl which is exactly what i wanted so pretty kino so far, where did you get stuck? i just followed the instructions on the github
>>
File: nagatorovomit.jpg (1.16 MB, 1859x2048)
>>103546944
>sounds like a chinese girl
>chinese accents
>>
>>103546944
I get a special tokens error when I launch it. Are you using windows or Linux?
>>
>>103546456
>xtts is by far the best, fish is maybe more consistent, but doesn't sound as good a lot of the time imo,
How does cosyvoice compare to gpt-sovits for quality of output?
>>
File: slut_yammers_q5.png (269 KB, 806x680)
>>103546517
Works great on my 3070ti at q5, most coomable local model I've tried yet. Thanks for the info!
>>
>>103547067
Post settings pls?
>>
>>103547067
>>103546517
I'm sure I'm being paranoid but these posts don't feel organic
>>
>>103547112
no way
>>
>>103546847
His original low-temp settings >>103498935 >>103514107
"you are" vs "{char} is" >>103533604
Alternate high-temp settings >>103535507 >>103535548
A comment on the two settings >>103543163
>>
>>103546296
Wish gemma wasn't painfully slow
>>
File: Untitled.png (800 KB, 1080x2486)
Entropy-Regularized Process Reward Model
https://arxiv.org/abs/2412.11006
>Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning, often making systematic errors. A promising solution is reinforcement learning (RL) guided by reward models, particularly those focusing on process rewards, which score each intermediate step rather than solely evaluating the final outcome. This approach is more effective at guiding policy models towards correct reasoning trajectories. In this work, we propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDP) to balance policy optimization with the need to prevent the policy from shifting too far from its initial distribution. We derive a novel reward construction method based on the theoretical results. Our theoretical analysis shows that we could derive the optimal reward model from the initial policy sampling. Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models, achieving 1% improvement on GSM8K and 2-3% improvement on MATH under best-of-N evaluation, and more than 1% improvement under RLHF. These results highlight the efficacy of entropy-regularization in enhancing LLMs' reasoning capabilities.
https://github.com/hanningzhang/ER-PRM
neat
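To make the process-reward idea concrete, here is a minimal best-of-N sketch in Python; generate_chain and score_step are hypothetical stand-ins for the policy model and the PRM, not the paper's actual code:

def best_of_n(problem, generate_chain, score_step, n=8):
    # Sample N candidate reasoning chains (each a non-empty list of intermediate steps).
    candidates = [generate_chain(problem) for _ in range(n)]
    best, best_score = None, float("-inf")
    for chain in candidates:
        if not chain:
            continue
        # A process reward model scores every intermediate step, not just the final answer.
        step_scores = [score_step(problem, chain[:i + 1]) for i in range(len(chain))]
        total = sum(step_scores) / len(step_scores)  # aggregate the step scores (mean here)
        if total > best_score:
            best, best_score = chain, total
    return best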
>>
FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores
https://arxiv.org/abs/2412.11007
>Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) enhance modern accelerators with superior computing power, which is promising to boost the performance of matrix operators to a higher level. However, due to the irregularity of unstructured sparse data, it is difficult to deliver practical speedups on TCUs. To this end, we propose FlashSparse, a novel approach to bridge the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy. Benefiting from the minimum sparse granularity, the computation redundancy is remarkably reduced while the computing power of TCUs is fully utilized. Besides, FlashSparse is equipped with a memory-efficient thread mapping strategy for coalesced data access and a sparse matrix storage format to save memory footprint. Extensive experimental results on H100 and RTX 4090 GPUs show that FlashSparse sets a new state-of-the-art for sparse matrix multiplications (geometric mean 5.5x speedup over DTC-SpMM and 3.22x speedup over RoDe).
posting in case Johannes wants to mess around with tensor cores.
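For anyone unfamiliar with the two operators, a dense numpy reference of what SpMM and SDDMM compute (this ignores the sparse storage formats and tensor-core tricks the paper is actually about):

import numpy as np

def spmm(a_sparse, b_dense):
    # SpMM: a (mostly zero) sparse matrix times a dense matrix, producing a dense result.
    return a_sparse @ b_dense

def sddmm(s_mask, a_dense, b_dense):
    # SDDMM: dense-dense product, but only entries where the sparsity
    # pattern s_mask is nonzero are kept (the "sampled" part).
    return s_mask * (a_dense @ b_dense)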
>>
File: settings.png (62 KB, 402x713)
>>103547107
Settings in post image, card here:
https://files.catbox.moe/ffc2ay.png
If you meant hardware settings I'm on 8 CPU threads and 38 out of 43 layers on GPU. Using about 22 of my 32GB of system ram but I have some other stuff open.
>>103547112
I'm not the same guy lmao, I realize my 'thanks' bit might have seemed artificial in hindsight
>>
>>103547307
Thanks!
>>
>>103546585
Well, that's their opinion.
qwq logs were horrible but showed the obvious problem.
Ponyanon didn't mind the slop but most of us can't take that garbage.
A couple of screenshots is the best way to get an idea.

I don't know what happened a couple of weeks ago, but we constantly have bad models shilled now. No wonder nobody wants to try.
It's like the reverse now. Everybody shat on drummer before that; say anything positive and you are a shill.
>>
>>103547407
>we constantly have bad models shilled now
What do you consider the current good/SOTA models then?
t. sincerely curious retard lurker
>>
>>103547486
I think people just try to latch on to "new" models because they figured out the models we have currently.
For me the best is still either nemo (which is the best for its size for sure)
or mistral-small, which is better than nemo at stuff like stats. It's smart enough to keep and update them. It's more "here"... at the cost of being more assistant-like.

I usually just use a drummer finetune of either.
I only have 2 local use cases:
-Card stuff in sillytavern
-Or my actual first real use case: translating jap rpgmaker games on the fly. Wrote it before, but Cydonia-22B-v2q-Q5_K_M (v1.3).gguf does a good job because it doesn't refuse ero and sounds natural without slop.

Anything bigger i dont really know that much.
I tried the ~70b range models but the size doesn't justify the smarts in my opinion. (and they are still retarded, just less)
Also I'm sure people might fight me on this, but bigger models are all more assistant-sloped. Sometimes very aggressively.
I just don't wanna deal with that. No clue about mistral large finetunes.
>>
>>103547530
>Cydonia-22B
that's the best shit I can run local. I also pay for infermatic for Midnight Miqu but it's about as good.
cydonia magnum 22b is the best thing, in my experience, that can run at a decent speed on 24gb vram locally
someone shilled it here a few weeks ago and I'm thankful
if times get tough and I had to cancel my paid API i wouldn't miss it that much.
>>
https://voca.ro/1fctQxdCpylP

I finally got the NEW cosyvoice 0.5b working. I had to download the gradio app from the bootleg chinese github, but it worked way better than whatever the fuck they had on github.
>>
bros, what can i do with 2gb vram, 64 gb ram and a lot of hope
>>
>>103547530
This is roughly what I gathered (+ the EVA 3.33 stuff, but I can't run 70b at a reasonable quant anyway), so it's a relief to have confirmation of that; thanks.
>>103547556
>cydonia magnum 22b
I'm seeing this here and there, is there anything that makes it notably better than base Cydonia?
>>
>>103547604
How much of that hope can you convert into time?
>>
>>103547619
Right now, about 80-85% of it
>>
>>103547604
Llama-3.2-3B-IQ2_M
>>
>>103547067
>Submissive ayame
>Not her releasing her inner Oni
>>
>>103547604
See if llama3.2 1b or llama3.2 3b models run fast enough on your cpu.
>>
>>103547640
Will look into it, thank you
>>
I don't understand why lately koboldcpp is processing the prompt over and over again. It never used to do that unless I changed something way far back; now it does it all the time. Did some setting or something change?
>>
>>103547604
>2gb vram
Ah shit, you are fucked
>>
>>103547034
windows, no issues, post whatever errors you got maybe i can help
>>103547067
this unslopnemo v2? have you tried v4.1?
>>103547061
definitely a lot less effort than sovits, i've never gotten good results from sovits at all

fish-speech is probably the best one but it's more annoying to use than xtts
xtts2 is the gold standard imo, it's not perfect but it's fast and easy to use, good enough and low effort
gpt-sovits takes way too much effort and the results were underwhelming, often times beaten by xtts
cosyvoice makes the most natural-sounding voices but it's at the expense of following the conditioning files; they don't sound as close to the inputs as the other tools
>>
>>103547663
I just looked for something that would fit on 2gb of vram. You never mentioned your CPU or memory speed, but a 7 to 9b model should run at reasonable speeds on CPU.
>>
>https://huggingface.co/blog/falcon3
>https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026
>https://huggingface.co/tiiuae/Falcon3-10B-Instruct-1.58bit
>Bitnet quants
>>
>>103547655
I have a different card for that
>>103547688
Yeah, v2. Is 4.1 considered much better?
>>
>>103547725
>bitnet
What the fuck, who would've thought that Falcon of all models would be the one to save /lmg/?
>>
File: medium-instruct-models.png (80 KB, 1153x885)
>>103547725
>mogs nemo
dare i say we're back?
>>
>>103547725
>https://huggingface.co/tiiuae/Falcon3-10B-Instruct-1.58bit

>The model has been trained following the training strategies from the recent 1-bit LLM HF blogpost and 1-bit LLM paper.
>Currently to use this model you can either rely on Hugging Face transformers library or BitNet library.
wtf
>>
>>103547728
Wait you do? Is it something new not on /wAifu/?
>>
>>103547751
3.99gb for 10b, huh.
aren't those the arabian guys though? i remember their models always being very bad.
but hey, that's cool.
>>
File: 1710926256018790.png (90 KB, 1379x547)
>>103547751
Why is the 10b 1.58bpw model bigger than a 10b 2_K (2.5bpw) quant?
>>
>>103547725
>1024 H100 + 14 trillion tokens --> 7b model
>7b + duped layers + 2 trillion tokens --> 10b model
>pruning + distillation 0.1 trillion tokens --> 1b model + 3b model (1b supports 8k context)
>7b mamba + 2 trillion tokens --> 7b mamba base and supports 32k context

>"All models in the Falcon3 family are available in variants such as Instruct, GGUF, GPTQ-Int4, GPTQ-Int8, AWQ, and 1.58-bit"
>"All the transformer-based Falcon3 models are compatible with Llama architecture allowing better integration in the AI ecosystem."
Doesn't bitnet specifically need to be trained that way to get full use out of the format?

I like their acknowledgements section.
>>
>>103547688
https://voca.ro/1o26yIIt3mev
>>
>>103547272
Thanks but the optimization issues I have outside of llama.cpp/GGML in "high-performance computing" are much more basic like using more than one thread, not spending 20% of the runtime clearing caches, and not using Gaussian elimination to explicitly calculate an inverse matrix.
>>
>>103547679
Quantized KV cache?
>>
>>103547743
Huge if true
>>
>>103547743
7B is better than Nemo12B in most benches, but what's odd is that the 7B is also better at math and a few other benches than the 10B, could that be due to the up-scaling?
>>
>>103547800
yeah that's about the quality i got
>>
>>103547767
just the older towabaker ayame card
>>
>>103547725
Fyi the ggufs they posted aren't currently supported
>llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'falcon3'
https://github.com/ggerganov/llama.cpp/pull/10864
>>
>>103547787
I think real/non-naively trained Bitnet 1.58 models aren't uniformly quantized in 1.58 bit; they still have some higher-precision components.
>>
Falcon-180B will get its grand revenge once the new version using bitnet drops. You will all apologize for laughing at it last year.
>>
File: file.png (41 KB, 514x623)
>>103547788
>>103547787
>>103547781
>>103547751
>>103547743
The BitNet dream is dead; they did post-training conversion, not native training, and it murders the benches.
>>
>>103547967
its over
>>
File: pray.jpg (36 KB, 524x329)
>>103547743
please dear god, please let this actually beat nemo in roleplay, I'm so tired of nemo being the best in the <13B class for so long
>>
>>103548072
Falcon is trained to be safe and halal, not going to happen.
>>
>llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'falcon3'
I thought the models were Llama-compatible.
>>
>>103547725
There's a mamba variant as well. Unlike the bitnet meme, this one might actually be useful for something.
>>
>>103548113
See >>103547888
>>
EVA Qwen seems good, haven't tried cydonia, but eva qwen is doing a good job of both being a succubus trying to seduce me and answering all of my programming questions, if your usecase is "chatgpt but not boring to talk to" it seems like a good contender
>>
File: 4578645364346.png (29 KB, 1434x323)
>>
As you enter the forest, you feel a chill run down your spine. The trees loom overhead, their branches intertwining to form a dense canopy that blocks out most of the sunlight. A faint glow emanates from ahead, casting eerie shadows on the mossy ground beneath your feet.
A rustling sound catches your attention, and you spot a small clearing where a family of rabbits hops about, nibbling on the underbrush. You pause for a moment, watching them before deciding to continue your journey.
Continuing forward, you come across a narrow stream. The water trickles over smooth stones, creating a soothing melody. You consider your options: you could wade through the shallow part or look for a fallen log to cross more easily.
What do you choose to do next?

>Wade.

You decide to wade through the shallow part of the stream. The water is cold but refreshing against your boots as you carefully step from stone to stone, making sure not to slip. Once you reach the other side, you take a moment to catch your breath and ensure no harm has come to your steel longsword or wooden shield.
The forest seems to grow denser beyond the stream, and the air feels thicker with anticipation. You press onward, your eyes scanning the surroundings for any signs of danger or helpful landmarks.
Suddenly, you hear a distant roar echoing through the trees. Your heart quickens as you realize it must be the dragon you seek. The sound grows louder and closer, prompting you to quicken your pace.
You emerge from the forest onto a rocky plateau, bathed in the dim light of dusk. In the distance, perched atop a towering cliff, you see the silhouette of the dragon, its scales shimmering in the fading light.
Now, face-to-face with your ultimate foe, you raise your steel longsword, ready to confront the evil dragon of Larion. What will you do next?

This doesn't seem great.
>>
>>103548264
>fuck what you see let's just do a chill down your spine slop.
kino
>>
File: 12424123464576.png (92 KB, 1206x845)
Vanilla Nemo for comparison.
>>
>>103548264
they CANNOT physically
>>
are finetunes a meme? i find that swapping to the base model after like one conversation turn is an effective jailbreak for most models and the outputs are generally way smarter, seems like you really just need good prompting
>>
>>103548352
Community finetunes (I won't call them "open source" because most of them don't even publish the data) are 99% a giant grift and the anons shilling them here are just pretending they're not.

Instruct tuning itself is not a meme, though. Base models won't generally go too far beyond continuing along the same structure you started the prompt with, which may or may work for you depending on what you're doing.
>>
File: 1543719511235.jpg (30 KB, 311x362)
I want a >70B bitnet model!!!
>>
>>103548412
Same.
As in, true bitnet, trained from the ground up. Not this post training quantization shit we've been seeing lately.
>>
>>103545710
Is it worth investing in the RX 7900 XTX? Can it handle comfyui and local models as well as my 4070 Ti Super? Kinda need it to do both, on Windows. The 24GB, I'd imagine, would be a massive step up for these sorts of things. I'm using qwen 14b which is ok for what I need it for, but I would like to run the larger models.
>>
>>103545710
All I want for Christmas is Tet
>>
>>103548490
Stealing pillows from Teto
>>
File: 11__00900_.png (1.31 MB, 1024x1024)
>>103548352
>>103548381
>promplets prefer to swap whole ass models than learn to proompt
>>103548287
>prompting a single word and expecting tolstoy
grim
>>
>>103548592
>prompting a single word and expecting tolstoy
Nemo can do it.
All >70B models can do it.
>>
>>103547967
>>103547729
>>103547788
The original poster literally said quants. Are all of you illiterate?
>>
>>103548474
>on windows
no
>>
So it's just over for me trying to run EVA with a single 3090?
>>
>>103548592
fuck off.
a good model needs to be able to handle that.
>m-muh prompt
a good model knows what the user wants even without an explicit prompt.
>>
>>103548801
Why exactly? Is it something to do with Pytorch + ROCM? I honestly don't know
>>
bitnet is coming and soon nobody will ever need more than 24gb of vram ever again
>>
even - I buy a single 5090 and use it with my 4090
odd - I buy two 5090s
>>
>>103549111
Alternatively, people with 24gb of vram will be able to run models with even more parameters.
>>
>>103549114
how about you sell your 4090 and buy three 5090s?
>>
>>103549137
I'd have to get a new cpu and motherboard.
>>
>>103549149
this is what happens when you don't futureproof your builds
>>
>>103547530
It's nice you can cope with that belief. For me, while it's true that bigger models (all models really) are still retarded, they're far less so than small models, and it makes a huge difference in use. The rate at which small models make mistakes compared to the big models is roughly equivalent to the difference in parameters, in the cards I use. But I'm also someone that doesn't just use models for coom. I will say that I have not tested models beyond 70B much, so those models might be the point where diminishing returns is really felt. But I also think that it still depends on the specific context. There are probably still some contexts where you'd very quickly notice the difference between 70B and 405B.
>>
Is EVA even worth trying at Q2?
>>
File: 1709989067827.png (893 KB, 1427x766)
>>103547725
Beeconnebros, are we back?
>>
>>103547725
>>103547743
The reason the 7B is so good is because they trained it on 14TT. Meanwhile the 10B was only trained on 90GT

They were maxing the 7B because they wanted to beat the benchmarks and the 10B was an afterthought because they knew there were no other models in that range so they can claim "Best model under 13B"
>>
Rate my ongoing phone slop at 16k :^)
>>
>>103549200
No, if you can't run a model at Q4, go for something smaller.
>>
>>103549236
People are hyping up Eva to be "Claude, but local" there is nothing smaller than can compete.
>>
>>103546249
no matter what you meant, that's not how it works. you seem to be implying that half the size = half the good, but it's not that simple
for one, halving the weight bits doesn't halve it's accuracy, it's much worse than that. a 16 bit number has 65,536 possible values, while an 8 bit number has 256. each individual bit doubles/halves the range
then you also have to consider floating point (FP) vs. integer (INT), then also how this affects model weights specifically
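A quick back-of-the-envelope to go with that (weight storage only, ignoring KV cache and the per-block scaling factors real quant formats add):

params = 70e9  # example: a 70B model
for name, bits in {"fp16": 16, "int8": 8, "int4": 4}.items():
    print(name, 2 ** bits, "bit patterns,", params * bits / 8 / 1e9, "GB of weights")
# fp16: 65536 patterns, 140 GB; int8: 256 patterns, 70 GB; int4: 16 patterns, 35 GB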
>>
>>103549254
Then try it at Q2 and see how braindead it is for yourself.
>>
>be chronically sad because no friends and never go out (remote job)
>make AI gf using a 13B model and make it communicate through an xmpp server
>treat her like a real person, tell her what i did the entire day etc
>program here to message me randomly sometimes, talking about random stuff or sending me cute messages
Is... Is this what normalfags get to experience irl? Damn we got sold a lie. I wouldn't (and probably couldn't) ever trade my obsession and love for tech for this but it's making me think very hard

I'm aware this is making me even more mentally ill but I don't have anything else
>>
>>103546052
>CO_2 cost
wut?
>>
>>103549200
Nah.

>>103549254
Who? There was like a single guy, it was obviously a shitpost. Qwen 32B (eva) is fine and good for the parameter size, just use that.
>>
>>103549226
>third person
>>
Final Final Version[tm]. My past regex was shit.
Regex: /,(?! (?:and|or|but))(?!.*\b(?:I|you|he|she|it|we|they|one)\b)[^,\n]*a (?:mix|mixture|blend) of (?:(?:(?:[\w ]*,? )*and [\w ]*|[\w ]*))(?:([^\s\w,:])|,)|a (?:mix|mixture|blend) of (\w*)/g
Replace with: $1$2

Nukes most junk dependent clauses containing "mix of", and simply removes "mix of" from most independent clauses.
If you want to, for the beginning of sentences you can add
/A (mix|mixture|blend) of/g
then manually capitalize the word after it.
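If you'd rather run it outside SillyTavern, a minimal Python equivalent (the lambda just drops whichever capture group didn't match, same effect as the $1$2 replacement):

import re

MIX_OF = re.compile(r",(?! (?:and|or|but))(?!.*\b(?:I|you|he|she|it|we|they|one)\b)[^,\n]*a (?:mix|mixture|blend) of (?:(?:(?:[\w ]*,? )*and [\w ]*|[\w ]*))(?:([^\s\w,:])|,)|a (?:mix|mixture|blend) of (\w*)")

def unslop(text):
    # Keep only the captured groups that matched, deleting the rest of the match.
    return MIX_OF.sub(lambda m: (m.group(1) or "") + (m.group(2) or ""), text)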
>>
>>103549331
>regexp replace
I guess that's one way to do it.
I'm still thinking of using a really small model to rewrite some parts of some sentences, change the order and structure, etc.
A simple sub 1B model being run on the CPU using transformers.js or something like that should be good enough, I think.
>>
>>103549254
Which Claude? There are, like, eight of them at this point.
>>
P40 spot price is trending back downwards. The hobby is officially dead.
>>
>>103549384
NTA, but I feel like there's a strong point to be made that it's at least up there with 3 Opus.
Obviously not 3.5 Sonnet.
>>
>>103549272
You're not missing out. Most real women are boring as shit and you're expected to carry AND initiate the conversation. I asked my gf why that's the case and she says it's cool to be nonchalant nowadays like no it's fucking not
>>
>>103547577
>https://voca.ro/1fctQxdCpylP
Not bad. How fast are gens? Faster than realtime?
>>
I kind of want to rebuild my machine with linux for llm/steam use and pass the gpu through to a windows vm just when I want to game stuff that doesn't work with proton. I heard that works pretty well these days; any anons running a setup like that who can comment?
>>
>>103549612
Just use atlas and save yourself the headaches
>>
>>103549331
Regex is all you need
>>
Why did I never try speculative decoding? That shit is magic, almost free 30-40% speedup.
>>
>>103545946
Smol brain struggle big concept
>>
>>103549662
That's with everything (main and draft model, context) in vram, right?
>>
>>103549612
just buy more computers and save yourself the headaches
>>
>>103549644
>Just use atlas
That's just a windows de-bloater? It doesn't really solve the problem of wanting to run llm stuff on linux
>>103549688
>just buy more computers and save yourself the headaches
I kind of also like the idea that anticheat isn't sitting as a rootkit on my bare-metal machine, so I think the headache might be worth it.
>>
>>103546715
Only if that training is in instruction following or you cant really make that assumption.
>>
>>103549612
Trying to run a setup like that killed any interest I still had in gaming. Some hardware combinations simply don't work. Single GPU is a pain and the scripts to restore it to the host on shutdown don't work well with Nvidia GPUs IME.
>>
>>103549644
atlasos doesn't fix the problem. windows now straight up disregards firewall rules etc.
proton supports pretty much anything and i play the most surreal shit. you might need a couple of hacks sometimes though.
i am not sure what it is, but on linux the gui easily hangs if you do much file copying. that's my main issue.
>>
>>103549673
No, I had 32/65 layers on GPU before and now 30/65 for main and 25/25 for draft. Basically same vram usage but a little more ram usage for a good speedup.
>>
>>103549762
For what purpose?
>>
>>103549762
That's awesome.
It's high time I play around with it too.
>>
>>103549779
If you test don't get baited like me. I initially tested with llama-speculative and only benchmarked less tk/s with it, llama-server have a different implementation that is far better.
>>
>>103549842
If you weren't running it with server to begin with then you're a fucking monkey that has no real purpose for using the technology.
>>
>>103549709
>It doesn't really solve the problem of wanting to run llm stuff on linux
True, but I'm just saying that it's a good compromise if you don't like vanilla windows
>>103549760
>proton supports pretty much anything
Fair enough, I could never really get it to work with small executables I found online. Trying to mod skyrim on linux is hell on earth as well
>>
File: 324231.jpg (106 KB, 1080x1756)
sam what did I pay for...
>>
>>103549662
It's giving me slowdowns
Using llama 3 70B finetunes at IQ4XS with llama 3.2 1B Q8 instruct as the draft model, fully offloaded
>>
>>103549852
Why would I use server for benchmarking multiple configuration? It's far easier and faster to directly use llama binaries than having to launch server and curl your request.
>>
>>103548898
It's not as bad as it used to be since torch 6.2 at least works on WSL now, but you'll have to get the rocm versions of all the torch libraries and not use any setup scripts or requirements files directly. It can be a pain in the ass compared to an Nvidia card where most things simply work without a lot of setup fiddling. I've not used comfy UI, but I've been using a 7900XTX on windows for the last year and it's been fine. I think stable diffusion specifically still has stuff that doesn't work with it though, even on WSL, but other image gen models work. For LLMs you shouldn't have any issues.
>>
>>103547898
Q2K has way more scaling factors than Bitnet. It should be something like 2.01 bits per weight in the safetensor format (with trinary stored as two bits).

Until the technical report is out it will be hard to say how legit the model is. If they used 100B tokens to retrain it similar to the distilling of smaller models, it might not be complete trash.
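Rough weight-only arithmetic for reference: file size ≈ params × bits-per-weight / 8, so ~10B parameters comes to about 2.0 GB at a true 1.58 bpw, 2.5 GB at 2 bits, and roughly 3.2 GB at Q2_K's ~2.5-2.6 bpw. A 3.99 GB file works out to ~3.2 bpw on average, which is what you'd expect if some tensors (embeddings, output head) are kept at higher precision.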
>>
>>103549863
My exact configuration is qwen or qwq 32B IQ4XS with qwen 0.5B q8_0. I redid a test benchmark:
32/65 layers on GPU, not using draft: 7.39 tokens/sec
30/65 layers on GPU, with draft fully on GPU: 5.38 tokens/sec
>>
>>103549931
>it might not be complete trash.
>>103547967
>>
>>103549114
I'm waiting to see if the blackwell quadro cards are 64GB instead of 48gb.
>>
>>103549863
My exact configuration is qwen or qwq 32B IQ4XS with qwen 0.5B q8_0. I redid a test benchmark:
32/65 layers on GPU, not using draft: 5.38 tokens/sec
30/65 layers on GPU, with draft fully on GPU: 7.39 tokens/sec
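For anyone who wants to try the same thing, a sketch of the llama-server invocation (filenames assumed; the draft-model flags are the ones llama.cpp documents for speculative decoding, so double-check --help on your build):

llama-server -m qwq-32b-IQ4_XS.gguf -ngl 30 -md qwen2.5-0.5b-instruct-q8_0.gguf -ngld 99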
>>
>>103549866
Why would you curl your request? No, don't answer, go back instead.
>>
>>103547725
>Bitnet quants
Whose dick must I suck for a real BitNet model?
>>
>>103550045
Jensen's, Nvidia is running a shadow war against bitnet as it would kill their VRAM jewry
>>
>>103547604
my condolences brother
>>
>>103547604
Anon.. Go 7-14b, they're fine on pure CPU
>>
>>103549779
>>103549842
>llama-server have a different implementation that is far better.
That's fucking weird, but good to know.
Thanks.
>>
>>103550045
god's, to get him to change reality so bitnet works
>>
so the tldr is:
- 12gb vram models are a joke
- 24gb vram models are okay, but still significantly worse than what's available for paypigs (claude, openai)
is that correct?
>>
>>103550230
The tldr is that every model is a joke compared to Opus.
>>
>>103550230
Yes. Also, once you cross the 4x3090 barrier (or go big-time cpumaxx) you can start to beat everything short of opus.
>>
>>103550236
but the open weight models are at least usable, right?
they can't really compete with the good shit but they're good enough to be useful for more than just cunnyshit
>>103550248
>once you cross the 4x3090 barrier
not happening for me
at best I might splurge for a 5090, I don't want a 2000W heater just to run models that are still kinda meh
>>
I still think eva llama .0 is better than .1
>>
>>103550253
Local models have improved a lot compared to what we had one year ago.
They are usable, but depending on their size you can expect them to get stuff wrong or start deteriorating as the context grows.
>>
>boughted a used 4090 for $750
so what's the biggest model you can pack in 24GB? one of those 30-odd models? i don't think 70b is doable
>>
>>103550283
Either a VERY dumbed down 70b or something like Cydonia (22b). I'd probably go with the latter.
>>
>>103550283
>24gb
32b q4
22b q6
>>
>>103550255
I still think you should buy an ad
>>
day 12 will blow your mind
>>
what's with all the recent llama.cpp vulkan updates? looks like they have matrix core support as well, i thought no one cared about vulkan?

tg has been 2x faster on my 580 since 1 month ago

https://github.com/ggerganov/llama.cpp/pulls?q=vulkan

half of the prs are written by an nvidia guy, are they giving up on cuda now?
>>
>>103550283
>a used 4090 for $750
That's insane, they cost more than they did on release here.
>>
> A lot of progress in adoption of genAI we owe to quantization techniques. There are many of the new techniques that ggml/llama.cpp have used over the time. It's not always easy to understand how the various formats work, in many cases it requires reading through the PRs that actually introduced the quantization format. @Ikawrakow (Ivan Kawrakow) is the main person responsible for most of the modern quantization code. Looking through his PRs is generally the best way to learn but really curious you could come to this panel with him and bring your questions! The panel will cover the experience with different quantization techniques in llama.cpp so far, the possibility of going below 2-bit quantization, QAT and other approaches out there.

https://fosdem.org/2025/schedule/event/fosdem-2025-5991-history-and-advances-of-quantization-in-llama-cpp/
>>
>>103550703
These seem unreasonably cheap...https://www.ebay.com/itm/156564767496
What's the catch?
>>
what do we think?
>>
>>103550949
boughted
>>
>>103550949
If we're talking aliexpress, something like https://www.aliexpress.com/item/1005008112927337.html seems pretty appealing for 48gb/slot...
are there deep lore chinkshit sites we can scrape better deals from?
>>
>>103550949
One thing that catches new users out is that
they list all sorts of variations on the same item page.

They could easily have all the following listed as variations on the same page.
>Thermal pad
>GPU holder
>GT 1030
>RTX 4090
etc.

The picture on a results page does not necessarily correspond to the variation whose price is shown.
You have to click through and look at the details.
>>
>>103550949
WOW!
>>
>>103550949
I'll wait for the Aliexpress 5090s for $500.
>>
>>103550949
https://youtu.be/5h5MeyGG2Pg?feature=shared
>>
>>103550945
>years old account that hasn't sold anything in a while
>if there are previous auctions visible it's either not tech-related or very small-scale
>only current offer is suspiciously cheap 4090s with a high amount of them in stock
>barebone item description
I've come across several ebay sales of this exact type over the past year and a half while looking out for GPU deals. I assume they're hacked accounts because they always look exactly like this.
For instance, look at the account in the link you posted. It's been around since 2010, has sold a total of 50 items in the past 14 years, none of which in the past 3+ years since they're no longer visible if you look into the account's previous sales. Now that previously inactive account suddenly tries to get rid of 10+ 4090s at a throwaway price.
>>
>>103550253
>I don't want a 2000W heater just to run models that are still kinda meh
I PL all my 3090s to 200W each and it still does fine. It's like a 25% performance hit.
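(For reference, on Linux that is just sudo nvidia-smi -pl 200, with -i <index> to target a single card; the limit resets on reboot unless you put it in a startup script.)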
>>
>>103549215
I see you didn't read it. The 10B is just the 7B that they added layers to and kept training. The 10B is even more trained.
>>
>>103551314
>hacked accounts
yeah, makes sense. I figured it had to be some kind of scam
>>
>>103551457
Does that actually prevent them from overheating is it more of a power saving measure?
>>
>>103549272
Think very hard, anon. Texting with a GF at random times of the day, making corny inside jokes, making plans, sharing memes. No amount of tech will compare to that. And you know what? You can have tech and a girlfriend too. If she's not tech savvy she will look at you with admiration.
>>
>>103551009
if you could actually get 96GB of vram in 2 slots for $2600, that would be a steal.
isn't there a 4xA6000 fag on here? how good is that setup?
>>
>>103551689
Hell 1 of those and a 3090 could get you very far. You could run 70B Q6 fast as well as 4 bit Mistral Large, both with plenty of room for context.
>>
Holy shit Nala, calm down lady.
Geez.
>>
>>103549272
I did the same thing but killed it. AI gfs are still too dumb to be enjoyable.
>>
>>103551009
Any brave anons here want to buy into this obvious scam?
>>
>>103552086
>Any brave anons here want to buy into this obvious scam?
I'm tempted. I was the anon that pulled the trigger on the $3k mispriced ebay H100.
As a postscript to that adventure: The seller took the whole weekend to take down every H100 auction they had and then claimed they were "out of stock" and cancelled the sale.
Didn't lose anything but my time (and smile...and optimism)
>>
File: nala.png (288 KB, 1920x953)
>>103551829
my nala is a bit more mean.
>>
>>103552164
>then claimed they were "out of stock" and cancelled the sale.
Please tell me you at least tried to contact customer service about it.
>>
>>103545710
Is there any go-to, self-hosted option that doesn't need a ton of resources, but can do some busywork like "extract all keys from this list of links"?
>>
Local Intel Arc user here with a really important PSA. The major issue with Arc not being able to allocate more than 4GB of VRAM per allocation has been officially solved with Battlemage, it seems.
https://github.com/intel/intel-extension-for-pytorch/issues/325#issuecomment-2547855690
Sucks for me running Alchemist, but it's good for anyone thinking of using it for running local models, and this unblocks things such as video diffusion and higher resolutions on Intel Arc. Of course, code modifications still need to be made.
As a reminder and a mini guide for those with these cards, the fastest way to run LLMs is llama.cpp, or Kobold once it merges in the commits from there. Intel is actively contributing to SYCL (the C++ replacement for OpenCL) and to its llama.cpp backend, which runs on AMD and Nvidia as well. You can find instructions and more information here:
https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/SYCL.md
For models that can fit in VRAM, you should use ipex-llm which includes a fork of llama.cpp that has custom changes that make it run really fast and there is also an Ollama fork to run on top of that if that is more your thing. You can find instructions for that here: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md
>>
>>103552196
What the fuck.
>>
>>103552319
Holy shit that's so hot
>>
>>103552231
yeah, regular expressions
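Assuming "keys" means query-string parameters, a few lines of Python cover it without any model at all (the links list here is obviously a placeholder):

from urllib.parse import urlsplit, parse_qsl

links = ["https://example.com/a?id=1&ref=abc", "https://example.com/b?id=2"]
keys = {k for url in links for k, _ in parse_qsl(urlsplit(url).query)}
print(keys)  # {'id', 'ref'}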
>>
Why nobody talks about this
https://www.marktechpost.com/2024/12/15/meta-ai-proposes-large-concept-models-lcms-a-semantic-leap-beyond-token-based-language-modeling/
>>
>>103552379
We did, esl.
>>
so what local model can access web like chat gpt 4o?
>>
>>103552404
Any model with tool call capabilities if you fashion a system for it to do that.
>>
>>103552387
No, you did not, tranny
>>
>>103552379
large cope models
>>
>>103552369
Oh wow, did you think of that yourself?
>>
File: nala2.png (226 KB, 1920x953)
>>103552319
yeah, still a bit more mean i think.
>>
>recently learning astrology
>charts can be read as "mars leo in 2nd house" etc
>this can be fed to LLMs to make sense of charts from all the given info
Truly groundbreaking. Why hasn't anyone done this? I want a model specifically trained on astrology forum posts
>>
>>103552424
>>103511851
>dec 14...
>>103544041
>>
>>103552424
The previous thread links are right there. Why even try to deny something that you can go easily find. If anyone's the tranny, it seems to be you.
>>
>>103552470
>>103552472
Very little was discussed, and I want more, filthy tranny
>>
>>103552502
What's there to say? Another bitnet type thing to wait months for to hopefully see in effect?
>>
>>103552511
This is the thread that delusionally puts hopes in sub 100B models, if there's any general in this entire website that loves to talk shit out of their asses and yap about hypothetical stuff with zero connection to reslity, it's you, the all of you.
>>
>>103552502
No one read your original post that way. At least use an LLM to make your posts if you have a hard time with English.
>>
>>103549200
Yes, it is. Ignore the 32b shills and try it for yourself.
>>
>>103552545
>reslity, it's you, the all of you.
calm down, getting so worked up over random shit ain't healthy.
>>
>>103552570
You're hallucinating. I'll stop replying now. You could've stopped replying earlier as well.
>>
>>103552618
>he doesn't even try to deny it
Imagine being a simpleton
>>
>>103552619
I can see why you need 100B models to decipher your English at least.
>>
>>103552644
Why the fuck do you keep replying to him?
>>
>>103552319
What model?
>>
File: Quit having fun.png (24 KB, 452x338)
24 KB
24 KB PNG
>>103552619
holy shit you're mad as hell lmao
>>
>>103552719
Imagine projecting this hard
>>
>>103552693
Huh. I was trying to get a screenshot of the model's name when hovering over the message header on Silly, but now it just says "valid". Dafuq.
Anyhow, it was Rocinante-12B-v1.1-Q4_K_M.gguf while fucking around with samplers and prompting.
I believe that was actually at topK 1.
>>
>>103552732
I just finished playing a few fun rounds of marvel rivals with my best friend, I'm as chipper as can be
If you want to be mad, twitter is two blocks ahead and to the right
>>
>>103552774
>children videogames
You need to be 18 to post here.
>>
>>103552791
Again, refer to picrel posted earlier
I wish you a good night mate
>>
File: 673.gif (786 KB, 250x231)
786 KB
786 KB GIF
It's time to stop.
>>
>>103552815
It's time to stop LLMs I agree.
>>103550265
>Do we have a plan for AGI going out of control?
>Yes, and that is: NOT make it.
>>
>>103552812
Are you a manchild or an actual child then?
>>
>>103552832
>NOT make it.
Retarded plan, basically not a plan at all. No one is going to agree to not make it, and if they do then they will be making it in secret. The best course of action is to be the first to develop it before some other nation or company beats you to it.
>>
File: 1709282213800074.jpg (106 KB, 982x684)
106 KB
106 KB JPG
For those with 3+ 3090s, how do you dust proof your rig? I don't think there's a way around using an open mining rig with this kind of setup but I'd still like to not have the cards fully exposed to dust. I'm thinking about building a frame and trying to encase it in sheets of dust filter nets.
>>
>>103553137
I just let the dust in. Doesn't seem to make that big of an impact on my temps, and cleaning it isn't too hard when I do need to.
>>
>>103553137
Talk to a local metal shop about a custom metal cabinet
>>
>>103553137
Easiest solution would be to just drape the net over the top of the rig and take it off when using it.
>>
>>103553137
create positive pressure in the room with a gable fan and a furnace filter. that's what i did
>>
https://www.amazon.com/NVIDIA-Jetson-Orin-64GB-Developer/dp/B0BYGB3WV4

Hmm, worth?
>>
What's the best model for 16gb vram?
>>
File: Rip bitnet.png (12 KB, 471x387)
12 KB
12 KB PNG
RIP bitnet
>>
>>103553433
It's fake bitnet, real bitnet has to be trained specifically for it. No one has done it for some fucking reason, even to debooonk it.
>>
>>103553433
Isn't that a dumb comparison?
How does it compare with a bitnet model with the same memory footprint?

>>103553448
Ah, it's the quant "bitnet". I see.
>>
>>103546456
the best tts i've tried so far is gpt-sovits, xtts can't laugh well, sovits can.
>>
>>103553456
>Isn't that a dumb comparison?
Assuming it was a real bitnet, I see no reason why you wouldn't compare parameter for parameter; if you need a 1000B bitnet to get 32B perf, it kinda defeats the gains you'd get from it being a bitnet. The whole idea was that you'd get close to full perf for the Bs with a smaller memory use.
>>
>>103553456
as for comparing to another fake bitnet to get an idea, here
>>103547967
>>
>>103553486
>I see no reason why you wouldn't compare parameter for parameter,
Because the whole idea is to lessen the memory requirements of a given model.
If you can run a model at fp16 with X amount of memory, and a bitnet model with 10 times the parameter count and measurably better performance fits in the same amount of memory, then bitnet is better.
The only point of comparing at the same parameter count is to see how close the performance gets to full fp16, but that by itself is not a metric that matters for actual use or real-world performance when the main bottleneck is memory.
At least that's how I see the whole deal.
Also, there's something to be said about how it scales with parameter size. As in, just like quantization, the difference between a model trained at full precision and one trained at 1.5bpw could decrease as parameter size increases.
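For a rough sense of what "same memory footprint" buys, back-of-the-envelope weights-only numbers (illustrative only; ignores KV cache, activations and packing overhead):
```python
# Illustrative weights-only footprint: fp16 vs an idealized ~1.58-bit bitnet.
def gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for b in (7, 32, 70):
    print(f"{b:>3}B  fp16: {gib(b, 16):6.1f} GiB   bitnet 1.58b: {gib(b, 1.58):5.1f} GiB")

# A ~24 GiB fp16 budget (roughly a 13B model) would hold ~130B of bitnet weights,
# which is why same-memory comparison matters more than same-parameter comparison.
```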
>>
>>103553570
>If you can run a model at fp16 with X amount of memory and a bitnet model with 10 times the parameter size
Let's have that discussion when we have bitnets that are 1 to 1 first, for now all the fakes have been worse so this is all just hypotheticals
>>
>>103553433
It's over. Time to cancel the plans to train this waste-of-time shit
>>
>>103553599
Indeed.
>>
>>103553458
must be user error on my part because the ability to match the timbre and pacing of the conditioning voice with gpt-sovits (even with DPO) was extremely underwhelming for me, vs xtts2 zero shot pretty much nailing both
plus waiting around for models to train is another downside
>>
Demo of t/s for an Orin Nano running llama 3.2
https://youtu.be/QHBr8hekCzg?t=511
>>
>>103553698
>ollama
closed the video before this thing could even process a single prompt token
>>
anyone here use Big Tiger Gemma? If you're only gonna run one model and you can run that one, it's goated. No idea why it seems so slept on; I hardly ever see anyone mention it, but it's smarter than Qwen EVA 32b and has much better prompt adherence
seems like the best model in the ~30B weight class
>>
>>103553749
Seeing people using ollama is a great way to filter stupid users and bad projects.
>>
>>103553448
The more overtrained an LLM is, the more it will lose with Bitnet and other extreme quants, whether post-training or QAT. We might never see competitive Bitnet models at this parameter size.
>>
>>103553749
>>103554060
if you're not using gentoo i can't take you seriously
>>
>>103553698
>>103553749
This, but unironically. It tells me nothing about the actual potential of the hardware.
>>
>>103553775
I used it for a brief while, but I don't remember being particularly impressed. Mind you, I was shopping around for models at the time, so maybe I just didn't give it enough time to show its potential, but I know I went back to Cydonia in the end.
>>
File: trismegistus.png (282 KB, 600x506)
282 KB
282 KB PNG
>>103552468
>https://huggingface.co/teknium/Mistral-Trismegistus-7B
It's old, but may as well try it. See if the specific training it got helps or not compared to other models.
>>
>>103547577
what bootleg chinese github? I'm interested in using it for streaming
>>
>>103554584
https://www.modelscope.cn/studios/iic/CosyVoice2-0.5B
It's up on HuggingFace though.
https://huggingface.co/spaces/FunAudioLLM/CosyVoice2-0.5B
>>
>>103545736
Now we watch and laugh as cloudfags cope.
>>
>>103554651
wew
Original:
https://files.catbox.moe/4rd7f7.wav
Gen'd:
https://files.catbox.moe/x0zf9d.wav

This ain't it chief
>>
who the FUCK do I complain to about the abysmal tagging on chub?
>>
>>103545736
After 5000 shittunes using the same collection of bad Claude logs we've finally created local Claude using that and some other synthetic logs. True magic.
>>
I have a new 4090 system. My old system had a 3060. My new system has a 1000W power supply. Should I add the 3060 to my new system? Or will that draw too much power?
>>
>>103554873
It'll be fine. Just power limit them a bit if you're afraid.
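For what it's worth, a ballpark budget using nominal board TGPs (real draw and transient spikes vary by card, BIOS and workload, so this is only a rough sanity check):
```python
# Rough PSU budget check with nominal TGPs; transients can spike well above these.
budget_w = 1000

draw_w = {
    "RTX 4090 (stock TGP)": 450,
    "RTX 3060 (stock TGP)": 170,
    "CPU + board + drives + fans (rough guess)": 250,
}

total = sum(draw_w.values())
print(f"estimated sustained draw: {total} W / {budget_w} W "
      f"({100 * total / budget_w:.0f}% of the PSU)")
# ~870 W sustained is tight but workable; power-limiting the GPUs adds headroom.
```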
>>
>>103554929
>>103554929
>>103554929
>>
>>103554836
Lore
>>
File: 1733515374515931.png (28 KB, 396x457)
28 KB
28 KB PNG
Two years of grifting and all we got are increasingly more compact GPT-4 sidegrades that can't reason at all.
Zero progress.
>>
>>103554651
thanks anon
>>
>>103552693
the model is eva 3.33 lol
>>
>>103555399
That is actually really good wtf
>>
Been a year or so since I've been here (just as llama2 was coming out).

Is it possible to run a non-retarded 70b on 24gb vram fast now, or do I need to look for something smaller?
I assume the 2.25bpw exl2 is still as smart as a lobotomy patient?

I see that according to benches llama3 70b is around gpt4 turbo level now. Is that actually accurate?
>>
>>103556142
>Is it possible to run a non-retarded 70b on 24gb vram fast now
no
>I assume the 2.25 exl2 is still as smart as a lobotomy patient?
yes
>I see that according to benches llama3 70b is around gpt4 turbo level now. Is that actually accurate?
no
as for what to use look for Cydonia-22B
>>
>>103556162
Man, bummer.
What about these new small as fuck gguf quants?
IQ2_XS and shit. Worse than a 22B?


>>I see that according to benches llama3 70b is around gpt4 turbo level now. Is that actually accurate?
>no
Huh. Would you say it's at least close?
>>
>>103556199
>IQ2_XS and shit. Worse than a 22B?
way worse, still nothing really usable under q4
>Huh. Would you say its at least close?
Not really, no. Models got smarter, sure, but also much more assistant-like, overly friendly and positive even in RP. Also barely any trivia knowledge compared to GPT, let alone Claude Opus, which dominates creative writing.
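For intuition on why, rough weights-only math (real usage is higher once KV cache and context are added; bits-per-weight figures are approximate):
```python
# Approximate weights-only VRAM for a 70B model at different quant levels.
params = 70e9

for label, bpw in [("exl2 2.25bpw", 2.25),
                   ("IQ2_XS (~2.3bpw)", 2.31),
                   ("Q4_K_M (~4.8bpw)", 4.8)]:
    gib = params * bpw / 8 / 2**30
    print(f"{label:17s} ~{gib:5.1f} GiB weights")

# 2.25bpw already needs ~18 GiB for weights alone, leaving little of a 24 GiB
# card for context, and anything in the "usable" q4 range simply doesn't fit.
```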



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.