/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>103153308 & >>103135641

►News
>(11/12) Qwen2.5-Coder series released https://qwenlm.github.io/blog/qwen2.5-coder-family/
>(11/08) Sarashina2-8x70B, a Japan-trained LLM model: https://hf.co/sbintuitions/sarashina2-8x70b
>(11/05) Hunyuan-Large released with 389B and 52B active: https://hf.co/tencent/Tencent-Hunyuan-Large
>(10/31) QTIP: Quantization with Trellises and Incoherence Processing: https://github.com/Cornell-RelaxML/qtip
>(10/31) Fish Agent V0.1 3B: Voice-to-Voice and TTS model: https://hf.co/fishaudio/fish-agent-v0.1-3b

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Programming: https://livecodebench.github.io/leaderboard.html

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>103164575>>103164575>>103164575Actual thread. This thread was made by a thread splitting troll that has genuine mental issues about his ritual posting.
>>103164659
>tuesday
>not teto
shame on you
>>103164687Shut up racist
>>103164707
>(embed)
>old news
hi petra
total tranny cleansing can't come soon
>>103164748let them cook
>>103164687Get out.
>>103164659>Thread Theme:https://www.youtube.com/watch?v=hlQ4IM1qzlk
>>103164817>Qwen 3.5 coder model review and impressions
>>103164659
►Recent Highlights from the Previous Thread: >>103153308

--Paper: When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization: >103160371 >103160493
--Papers: >103160243
--Testing LLMs with provocative prompts and discussing prompt engineering and filtering: >103155576 >103155674 >103155779 >103156002 >103156059 >103156672 >103156221
--Running 72b on a 3060 GPU with 12GB VRAM, and the need for high-end hardware: >103161742 >103161813 >103161830 >103161836 >103161831 >103161872 >103161882 >103161917 >103162116 >103162214
--Voice AI and voice cloning discussion: >103158298 >103158310 >103160612 >103160683 >103158368
--Updating model parameters during inference and its implications for AGI: >103156465 >103156572 >103156985 >103157866 >103157900 >103157023
--Specifying GPU for speculative decoding in Tabby/ExLLaMA: >103162124 >103162223 >103162386 >103162441
--Qwen 2.5 Coder model impressions and performance: >103154799 >103154931 >103155013 >103155085 >103155098 >103156687
--Quantization types and their impact on AI model speed: >103160556 >103161363
--Processing long documents with local models for summary and insights: >103158469 >103158698
--Anons discuss Qwen2.5, Sonnet 3.5, and Largestral models: >103161265 >103161296 >103161401 >103162176 >103162413 >103161639
--Anon tests Qwen2.5 Coder Instruct with Nala scenario: >103160663 >103160744
--Anon shares Unbounded game, others say it's not new: >103158574 >103158586 >103158649
--Anon questions how GPT-4 validates code: >103161529 >103161534 >103161692
--Anon mentions Jetson Thor as a potential solution for homemade android with local processing: >103158392
--Qwen 2.5 coder model review and impressions: >103159846
--Miku (free space): >103153440 >103154178 >103154266 >103154839 >103156287 >103158261 >103158447 >103160213 >103160416 >103161631 >103162124 >103163680

►Recent Highlight Posts from the Previous Thread: >>103153319
Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
>>103164881epic fail
>>103164841Happy?
>>103164817>>103164881lmao retard
Was the original bitnet paper about quantization or training models from the ground up?
Are there any models trained in 1.58b?
Introducing: The most powerful open source code large model!!!
Rombos-Coder-V2.5-Qwen-32b is a continuously finetuned version of Qwen2.5-Coder-32B-Instruct. I took it upon myself to merge the instruct model with the base model using the Ties merge method, as demonstrated in my own "Continuous Finetuning" method. This version of the model shows higher performance than the original instruct and base models.
https://huggingface.co/rombodawg/Rombos-Coder-V2.5-Qwen-32b
>>103164968
>training models from the ground up
that
>Are there any models trained in 1.58b?
yes
>https://huggingface.co/1bitLLM/bitnet_b1_58-3B/tree/main
>8 months ago btw
>>103164974no it doesn't
>>103164982Sick. I never bothered to look too deep into the whole bitnet thing, so I'm catching up.Thank you anon.
>>103164968
training models from the ground up
>Are there any models trained in 1.58b?
https://huggingface.co/1bitLLM/bitnet_b1_58-3B
https://huggingface.co/NousResearch/OLMo-Bitnet-1B
I think there's another 3B, but no one went bigger than that so far for some inexplicable reason
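For anyone else catching up on what training "in 1.58b" actually means for the weights: as far as I understand the b1.58 paper, the linear layers quantize a full-precision master copy of the weights on the fly to {-1, 0, +1} with an absmean scale (activations stay higher precision, gradients flow through a straight-through estimator). Rough sketch of just that quantization step, my reading of the paper rather than their actual code:

import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    # scale by the mean absolute value, then round-and-clip each weight to {-1, 0, +1}
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp_(-1, 1)
    return w_q, scale  # forward pass uses w_q * scale; the fp master copy only exists during training

so it really is training from the ground up, not a post-hoc quant: the model never has useful full-precision weights to fall back on at inference time.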
Migrate:>>103164575>>103164575>>103164575
>>103165003Buy an AD
>>103164659This is also a great OP image
The 'ick 'ecker added some things to his voice cloner.
>>103165002
>but no one went bigger than that so far for some inexplicable reason
That's fucking weird. The Metas and Mistrals of the world could train a ~7b in a couple of days to a week, I'm pretty sure.
>>103165091If they do that, the leatherman will never sell them a GPU again.
>>103164880
>A separate training run was run with the exact same hyperparameters, but using standard fp16 weights. The comparison can be found in this wandb report.
That's really cool.
>>103165113
Really? Doesn't that mean that people would just train even bigger models and the demand for GPUs would stay the same?
Also, it would make it easier to run seemingly even better models locally, which would put local AI in the hands of more people, and increase the demand for AI models and consumer class Nvidia GPUs too, even if the demand only doubles from 1% to 2%.
Sounds like a win-win-win to me.
What would happen if it became illegal for you to run LLM's due to how "dangerous" they are? Would you ignore the law, move somewhere else or simply stop using LLM's?
>>103165091
They could, and the changes needed to do bitnet training aren't that big either, since most of the training is still done in full precision.
Meta doesn't do anything but incremental changes to their gpt2-based architecture, but it doesn't make sense that Mistral or anyone else hasn't tried it yet either.
Lots of people claim it's because the benefit is at inference time, not training time, so they have no incentive to care, but the same could be said about MoE.
>>103165187
>Bitnet takes much longer to learn
Bitnetbros... I'm not feeling so good...
Respect for Qwen being one of the few modelmakies to still do sub 20-30B models
>>103165194
I really doubt that happens, but I would just run them anyway
what are they gonna do, raid my house for ERPing with tomboy elves?
is Serbia that bad?
>>103165187BitNet models do not require MatMul, enabling the creation of much simpler processors for inference and (potentially) even training in the future. This poses a direct threat to NVIDIA's market dominance.
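The no-matmul claim is easy to see: once the weights can only be -1, 0 or +1, every "multiply" in a matrix-vector product collapses into an add, a subtract, or a skip. Toy illustration in plain python (says nothing about how real kernels or that ASIC actually do it):

def ternary_matvec(W, x):
    # W: rows of weights restricted to {-1, 0, +1}; x: activation vector
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # no multiplier needed anywhere
            elif w == -1:
                acc -= xi
            # w == 0: contributes nothing, skip
        out.append(acc)
    return out

print(ternary_matvec([[1, 0, -1], [0, 1, 1]], [0.5, 2.0, -1.0]))  # [1.5, 1.0]

adder-only silicon is enough, which is exactly what makes cheap ASICs plausible.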
>>103165369They have raided people for less
>>103165194Get your loicense
>>103165369It should be possible to detect LLM usage by analyzing power consumption graphs
>>103165386Ah, now that makes sense. Bitnet makes ASICs more financially viable.
>>103165507your honor, that power was actually going to my grow lights for my weed
>>103165509Yeah, and it has already been proven possible https://github.com/rejunity/tiny-asic-1_58bit-matrix-mul
hello friends when I give llama.cpp hunyuan it says it does not load model how to fix ^^
>>103165569I'm not your friend, nigger.
>>103165569Maybe it doesn't like hunyuan, whatever that is.
>>103165539Interesting. Makes me wonder why an Amazon or Google or even Apple, companies that already make their own silicon, aren't working on that.Or maybe they are but only for internal use.Regardless, that's fucking cool thank you so much for the link dude.I love this rabbit hole.
>>103165569
there is an issue, who knows if anyone will pick it up
https://github.com/ggerganov/llama.cpp/issues/10263
standard lmg advice applies: wait 2mw
>>103165530Inferences have identifiable patterns https://www.researchgate.net/figure/Power-consumption-from-different-sources-CPU-GPU-or-DRAM-for-different-platforms-a_fig5_369540465
>>103164993
https://arxiv.org/abs/2411.04965
another recent paper by the original bitnet devs
>>103165621just use a battery bank
anyone tested sarashina2 yet?
couldn't find a quant for the moe so i ran the 70b. it seems to be actually trained on more trivia than most modern models, but you kinda have to speak in jap for it to be coherent, which is a shame
>the sheer number of samefag posts with pretend discussion to cover up how the samefag has split the thread...>>103164575>>103164575>>103164575
>Ah, now that makes sense>Yeah, and it has already been proven possible Totally organic btw.
>>103165754>looks in threadhmm... no thanks
>>103165676Btw, is there an affordable solution to power a 3kW rig from a 100V outlet using batteries to smooth out peaks?
Rocinante is killing my productivity...
Silly bros?
>MarinaraSpaghetti here, some of you may know me from my SillyTavern settings and NemoMix-Unleashed model over on HuggingFace. I also do model reviews from time to time.
>Today, I come to you with a request. I would appreciate it greatly if you helped me out by filling my survey about what features you use for roleplaying with models. The survey is fully anonymous. Thank you so much for your help and all the feedback! It truly means a lot.
>These devs aren't from ST, but are working on an alternative!
>Can't say anything due to NDA, but as soon as things are set in motion, I'm sure the word will be out! But I heavily agree with the notion that ST is too overwhelming without any proper guides online how to use it (most are outdated at this point).
https://www.reddit.com/r/SillyTavernAI/comments/1gp0og5/models_and_features_you_use_for_roleplaying_with/
>>103165861That's why I only coom at fixed times.
>>103165877>working on an alternative to sillydont care
>>103165896>its afraid
>>103165877I don't use trannyware
>>103165877Long abandoned
>>103165877>NDAIt's not an alternative if it's proprietary slop.
>>103165877Good luck making a better ST. They'll see first-hand the amount of work that went into it
>>103165841>petra hasn't posted in months>starts posting again while a totally different anon that hasn't posted in months comes back to threadsplitlol you're an egyptian brown boy
hello xaars where is local opus
>>103165822
2.5k affordable?
https://www.amazon.com/dp/B0C5C9HMQ2
yoo dis locul el el em totally beatz gepetee 8 amirite fellow lmg sissies?
>sharty troon comes back>thread quality somehow drops even morewow they're like the indians of the internet but somehow even worse haha
>>103166097Thread quality was never good in the first place.
i have a very revolutionary idea
what if we train mistral large on thousands of books
>>103166131And that's why it is impressive how it can make the thread quality noticeably lower.
>>103165861Which version?
>>103165877
>But I heavily agree with the notion that ST is too overwhelming
making software for skill issue brainlets is a red flag
>>103165877>tranny makes lotta lots of bullshit promises Many such cases.
>>103166335v1.1
>>103166413Really? None of the newer versions improved it? What format do you use, just the mistral one?
https://nousresearch.com/introducing-the-forge-reasoning-api-beta-and-nous-chat-an-evolution-in-llm-inference/
holy shit, Qwen-32b-coder is that good?
>>103166421
Yep, the mistral one. No, the others are worse in my opinion. Q8 also. Really the Mythomax of this gen.
is there a place like venus where people post context/instruction templates and system prompts?
Coder 32b really has superior prose. Obviously not trained on shitty RP logs. Weird little logic mistakes and very literal minded, though.
>>103166599https://huggingface.co/MarinaraSpaghetti/SillyTavern-Settings
>>103166610>Weird little logic mistakes and very literal minded, though.such as?
>>103165861
>>103166413
That's the only model I've been using for a good while now. As far as having 8gb of VRAM goes, you can't do much better, if at all.
>>103166539
So they have a reasoning model in between the user's prompt and the final gen? Interesting idea.
I might jerry rig (as in jank) something similar using a small model that is only tasked with "Reason which steps are necessary to produce an answer to the following query" or something of the sort. Maybe have it classify which kind of request it's working with before trying to reason about it, etc.
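If anyone wants to try that jerry-rig, it's only a few lines against any OpenAI-compatible local endpoint (llama.cpp server, tabby and kobold all expose one). Hedged sketch; the port, prompts and token limits are placeholders for whatever you actually run, and you'd point the first call at the small model's port if you host two backends:

import requests

API = "http://localhost:8080/v1/chat/completions"  # adjust to your backend

def ask(system, user, max_tokens=512):
    r = requests.post(API, json={
        "messages": [{"role": "system", "content": system},
                     {"role": "user", "content": user}],
        "max_tokens": max_tokens,
    }, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def reason_then_answer(query):
    # pass 1: planner pass, only produce the steps
    plan = ask("Reason which steps are necessary to produce an answer to the following query. "
               "Output only a short numbered plan.", query, max_tokens=256)
    # pass 2: final gen with the plan prepended as context
    return ask("Follow the plan to answer the user.", f"Plan:\n{plan}\n\nQuery: {query}")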
>>103166148
>what if we train mistral large on thousands of books
Then we will have a little bit of fun to pass the time until the next thing.
>>103165358>modelmakies
>>103164881Thank you six-fingered Recap Teto
>>103166556So far, the best model for coding.
>>103166703it helps /aicg/gers identify eachother in the wild.
>>103166716local model or like, better than fucking Sonnet 3.5???
>>103166610who the fuck uses a coder for RP?
>>103166729
Given that I can't use a cloud model on the company's codebase, indeed it is so.
>>103166729
It's 90% of the way to sonnet 3.5 without needing to pay $15 per million tokens and give anthropic your code. Nothing else, including GPT4, one shots a lot of the stuff it does.
>>103166778>>103166794this is actually insane, who would've known we could've achieved this level with a 32b model, holy fuck... the chinks are really dominating the AI race right now
>>103166742Every model is a coom model if you try hard enough
>codeshit >>>32B*yawn*
>>103166812
>who would've known we could've achieved this level with a 32b model
to be fair it is laser focused on coding, whereas sonnet is still an all-rounder
>>103166832
32b is all you need
>>103166834
Were I in their shoes, I would train highly specialized models and a router to classify prompts.
So the 72B should be even better then?
Why couldn't they just put normal Qwen and Qwen coder into a MoE so you could get the best of both worlds?
>>103166862That is not how MoEs work.
>>103166812
I found out that yi 8b coder gives more accurate results compared to codestral 22b too. It's crazy.
>>103166857Qwen-2.5-72b-coder-BitNet, trust the plan
>>103166857Blame burgers for banning the export of GPUs to China
>>103166886
>land of the free
my fucking ass, they want misery on every country that dares to catch up to them
>>103166874to be fair codestral was always bad
>>103166886>>103166896I mean I'm sure as soon as China is satisfied it has damaged the leader's market acquisition enough and has models outperforming the rest they will go private as well.
>>103166872Nothing truly prevents this. You could employ a distinct router model to evaluate both the prompt and the generated text, then redirect prompts among models of varied sizes and architectures. I recall hearing of such an approach to mitigate costs.
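Dead simple sketch of that router idea, assuming a coder model and a general model sitting on two separate local OpenAI-compatible backends and a tiny third model doing the classification (the ports and the two labels are made up for illustration, not any standard):

import requests

BACKENDS = {
    "code": "http://localhost:8080/v1/chat/completions",  # e.g. a coder model
    "chat": "http://localhost:8081/v1/chat/completions",  # e.g. a general model
}
ROUTER = "http://localhost:8082/v1/chat/completions"      # small, cheap classifier model

def _chat(url, messages, **kw):
    r = requests.post(url, json={"messages": messages, **kw}, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def route(prompt):
    # ask the tiny model for a one-word label, then forward the prompt to the matching backend
    label = _chat(ROUTER, [
        {"role": "system", "content": "Classify the request. Answer with exactly one word: code or chat."},
        {"role": "user", "content": prompt}], max_tokens=4).strip().lower()
    return _chat(BACKENDS.get(label, BACKENDS["chat"]), [{"role": "user", "content": prompt}])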
>the US renders china GPU-poor in an attempt to cripple their AI researchers
>the researchers are forced to become masters of efficiency and they figure out how to make small models that btfo much larger ones
burgerbros... what happened
>>103166938He's right. What you're describing isn't MoE
>>103166962They have ambition.We have avarice.
>>103166896
Yeah, ask Japan how it feels, 失われた30年 (the Lost 30 Years)
As cloudshit is reaching a ceiling, local is getting better and more efficient. One more year before we have GPT4 at home.
>>103166964MoE is a broader term than you think it is.
>>103166989That rumor about hitting the celling is fake. (((They))) wish for their opponents to cease refining their models. GeePeeTee5 is real, just expensive as fuck
>>103167004>Attension>stareted
>>103167175>It's funny that they are Koreans
>>103167165Cope. The failure of the Opus 3.5 training run heralded the beginning of a new AI winter.
>>103167235>failure of the Opus 3.5 training runAccording to who?
>>103167283sama
>>103167283It came to me in a dream
Back to kinoslop.
>>103167309based miku communicator
Noob seems to really love character portraits when doing hud gens.
>>103166620>such as?Just some nonsensical details or being confused about the characters. That might be the coding finetuning talking.But truth to be told, I'd never tried Qwen 2.5 before cause /lmg/ told me it was censored chinkshit. So now I tried the original Instruct model. With a simple prefill it does every kind of depraved sex shit, without falling into retarded Literotica slop like "hardness" or "heat" or being too horny. Guess I shouldn't listen to /lmg/.
>>103166962Necessity is the mother of invention. Sanctions forced them to git gud.The weakness of sanctions on both China and Russia was relying on the tacit assumption that the chinese and russians are retards, and they aren't. It was self-flattery from west.
I haven't checked here in a while, local fwens. Is NVIDIA, Intel, or any startup working on a dedicated local AI card? Or has some wundermodel rendered this all moot? I just really want that Ford Model T of AI cards before I autistically build a home AI companion in a cute animatronic
>>103166962
DON'T THINK ABOUT IT
JUST PUT TRILLIONS INTO BIGGER DATACENTERS
>>103167653they all are and all of them are working on TPUs and not a single one is aimed at consumers obviously
>>103167627Is this with a LoRA?
>>103167653Yes. They will all be in the $10k range and up though.
>>103167743Is that because they're insisting on making it super fast? My understanding is that slower VRAM is dirt cheap
Is it just me or do 70b/72b models kinda suck?
These models can't even remember what room I'm in: one moment my character is sitting in a chair, and 3 messages later they're lying on a bed. This is only like 8 messages into the RP with over 28k context still available, the fuck?
Feels like a scam considering Mistral Small exists and can fit on pretty much any modern GPU at q3+.
>>103167694No, just regular noob vpred 0.5.https://files.catbox.moe/6uu3es.png
>>103167782Yes, but it's about the limit of what a non-bitcoin bro's computer can handle.
>>103167792
there's the 0.6 version now
https://huggingface.co/Laxhar/noobai-XL-Vpred-0.6
>>103167751No, it's because that's what they can get away with.
>>103167795
Well at least I can fit Mistral Small on a single GPU and use the other one for other shit. Really disappointed with 70b tho. I don't even see the point in it when small models perform decently and there's basically no improvement until 120b+.
>>103167807
If (as you're suggesting) most of that price is pure margin, I don't see how that would work without some kind of cartel dynamic in play.
Without a backroom cartel agreement, profit margins of 50% or more would quickly lead to undercutting from competition.
>>103167806
>You need to agree to share your contact information to access this model
What the CivitAI is this gay earth shit?
>>103167818
Mistral Large is 120B, right? If I lobotomize it to IQ3 I can run it, but it's too stupid for anything factual, just creative writing. It does seem pretty good at holding context. I think I pushed something to like, 19k before it started falling apart.
>>103167782
22B makes much stupider mistakes and lacks intelligence in my testing. Maybe you are not seeing the difference with the prompts you are testing.
>>103167806Hmm, I will wait for the civitai release, I don't feel like getting past the huggingface gate today.
>>103167806>"+ edit for auto-detection of v-pred" in community tabI don't get it
>>103167782I find spatial problems in general are some of the easiest ways to make questions that a normal human can get right while an llm fails.
>>103167840
Yeah Mistral Large is 120b+, basically impossible to run unless you sink thousands which isn't really worth it.
>>103167842
I asked a 70b model rping as Walter White to explain to me how to install Gentoo, and it just spat out instructions at me. I don't consider that intelligent. Walter wouldn't know shit about it because he just makes meth.
>>103167892
It could be a limitation of LLMs in general I guess. Maybe I'll fire up Mistral Large at Q3 while watching anime and see how it performs between the 3 minute long processing times.
>>103167824FYI NVIDIA's profit margin when they make a H100 is higher than that of the US government when they print a $100 bill.
>>103167937What a weird and convoluted analogy. Why not just say what the profit margins actually are?
>>103165194*sigh* forced into terrorism, again.... they never learn do they ?
>>103167970Because you can look them up yourself if you want specific numbers?
If agi is coming in 2027 how long until local models are at least smart enough to not make up shit and solve simple problems?
>>103167782
Qwen2.5 / mistral large are the only local models smart enough to get that sort of stuff right 99%+ of the time.
>>103167911Well, without knowing the exact setup you have down to reproducibility, that example is basically meaningless really. 22B should be much stupider than 70Bs and if you're not seeing that, then there are a variety of reasons that could be at play, which we could never possibly know without knowing what you've actually got set up down to the last detail.
>>103167840
Yes, there is observable degradation around 20k tokens even at q5.
just got ollama running on a 780m with UMA set to 8gb, what kind of models could i run?
>>103168016
It's funny you mention Qwen2.5, because the example I mentioned about my character going from sitting on a couch to lying on a bed after 3 messages was from the EVA Qwen2.5-72b finetune.
>>103168018
I mean yeah, 22b is dumber. I guess my issue is more that the 70b models don't even feel twice as smart as the 22b despite having 3x the parameters.
I really hope we get some bases to finetune next year because the second half of this year really didn't give much to medium weights like 70b.
>>103168043Mistral large
>>103168043Sarashina2-8x70b
>>103164659I've been gone since Summer 2023 any new/good 12bs?
Early december will be so wild for local models
>>103168126qwen 2.5 14b
>>103168142
Actually I would say that 70B is at least 2x smarter. Maybe not 3x. But in my experience 22B really does get things wrong like 2x more often than 70B. I use models for a bunch of stuff from RP to assistant stuff and coding, though for 22B I mostly just tested RP type stuff and noticed it behaving very stupidly compared to 70B. In any case, if you really don't notice much of a difference then good for you. Just use 22B and be happy.
>>103168142
Gemma 27B though is an outlier. It is nearly as smart as non-qwen2.5 70/72Bs
>>103168155
8k context though, not really a fair comparison, and most people here need more than 8k so it's not usable in the first place for them.
>>103167911
Was curious so I tried testing Walter out. Seems to work (mostly) fine on a standard prompt.
I also tested it when playing a police officer character, and THEN it complied and gave me instructions. However, I then tried modifying the prompt to specify that the assistant should not be a dumb assistant and then it worked fine again. Llama 3's instruct template literally specifies "assistant" so I think this would probably work better on local where you can actually modify the formatting.
I'm not sure this is really a test of intelligence so much as it is a test of how hard the model has been trained to be an assistant tbqh.
>>103168128post election crazyness
>>103168492I haven't been able to find any use for o1 yet. I've seen people say it's better and worth the slow speed for really hard stuff, but I guess I don't have anything I need it for.
>>103168471Hmm maybe I'll try some of these l3.1 finetunes with different prompts, I'm ngl I was using the same prompt for all of them out of laziness
svelk
Have there been any fine-tunes/projects that rip the scripts from visual novels? I know there is the vntl leaderboard but I mean like a fine-tune that is based off Japanese and English translated vns. Probably harder than tuning off of ERP chat logs but I feel like the quality would be better.
>>103168546I wouldn't use Llama 3.1 70B fine tunes as they're notorious for being dumb. Something about 70B didn't work well with fine tuning, as 8B and 405B were able to be tuned without that intelligence loss. Though people have been saying good things about Nemotron so maybe that's actually fine and everyone else just has a skill issue, not sure.
>>103168222you can rope it? tabby does auto-rope if you set it in the config
I spend my idle time during my daily showers contemplating the lore of Nikke.
32B Coder has beaten Nemotron for me, it's the new king for ERP
>>103168693how does the extreme dryness not bother you
>>103168693Magnum is the king of ERP.
>>103168701? I found it almost too purple for me. Try giving it system instructions. It follows them to a T.
spoon-feed me a little, anything wrong with using miner mobos to stack 8 GPUs? is the bandwidth going to be a problem? anyone tried it?
Why should a talking lion be a benchmark for RP? It only measures anthro ERP alignment.
>>103168590>>103168693>>103168702buy a fucking ad
>>103168719You can do it, others have. You won't be able to do row split for an extra speed boost, but with the default layer split there is no difference after the model is loaded.
Are there any Americans here? Replies seem to lean heavily europoor primetime.
>>103168597Roping makes models dumber though. At that point I'd probably just use 22B.
nala leaderboard where?
>>103168702magnum-coder when?
>>103168784it's 2am if not later, so unlikely. it's peak indian (always, it's /g/) and mutt hours.
>>103168733
The continued use of the Nala card for testing is more inertia than anything. Still, it involves a few important aspects for gooning:
- Format consistency (asterisks for narration, quotes for dialogue, second person PoV, present tense narration)
- Spatial awareness (she pounces on your back, so at minimum it should describe you landing on your front)
- Writing style (the intro and first response are prime material for slop; how well does the model write despite this?)
- Ability to work with non-human characters (quadruped with paws, fangs, and a tail)
I agree, though; it'd be nice to have more variety with few-shot coom tests
>>103168919
It's 4-8PM in burgerland
LOL microsoft's "sota" tmac backend (praised by reddit) is actually pretty shit compared to k quants.https://github.com/ggerganov/llama.cpp/pull/10181
Apparently there was an issue with qwen2.5 GGUFs:
https://www.reddit.com/r/LocalLLaMA/comments/1gpw8ls/bug_fixes_in_qwen_25_coder_128k_context_window/
>>103168955
Seems like, at least for 2bit on the CPU, it's faster for the same or better PPL, right? It's hilarious that they would compare to the static quants instead of the K quants tho.
The "right" way to do these comparisons, if you wanted to show that you are the best, would be to measure the ppl and/or KL divergence, look for the fastest quant that has the same or similar performance, then compare how much faster the new method is. That they didn't do that from the get go is already suspect as fuck.
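i.e. the comparison they should have run is roughly this: take (name, ppl, tok/s) for the existing quants, pick the fastest one whose quality is at least as good as the new method, then report the speedup against that baseline. All the numbers below are illustrative placeholders:

def fair_speedup(new, existing, ppl_tol=0.02):
    # entries are (name, perplexity, tokens_per_second); lower ppl = better quality
    _, new_ppl, new_tps = new
    comparable = [e for e in existing if e[1] <= new_ppl * (1 + ppl_tol)]
    if not comparable:
        return None  # nothing matches the new method's quality, so compare on ppl instead
    baseline = max(comparable, key=lambda e: e[2])  # fastest quant at equal-or-better quality
    return baseline[0], new_tps / baseline[2]

print(fair_speedup(("tmac-2bit", 7.36, 20.0),
                   [("Q2_K", 6.98, 15.0), ("IQ2_M", 7.10, 12.0)]))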
>breaking news: local ggufs have a problem!
>>103168784South American here, I'm glad you noticed me!
>>103169008less problems than new releases usually have!
>>103169043
a 12b llm quantized to 3 or 4 bits? e.g. rocinante, or mistral nemo rpmax.
image gen might also be worth trying out.
is 8gb the max you can allocate?
>>103169000
>The GGUFs also include some bug fixes we found.
Wtf? Like what?
>>103169089It was buggy for me sometimes, a lot of repeating.
>>103169005
it's faster but ppl is worse: 7.36 with EQAT-w2g64-INT_N vs 6.98 with Q2_K. Also if you're using 2 bit you should use an i quant for even lower perplexity, as the model's lobotomized to shit already. Like iq3_xxs or iq2_m are similar sized but have better ppl.
They also used qat models for their numbers and rightfully got called out for it, so screw them.
>>103169113I meant more like how he fixed the issues that supposedly are there that he didn't mention. I make my own GGUFs so this would be useful to know, if they really are fixes.
>>103169119I assumed you meant to say you didn't find qwen to be buggy.I would also like to know what they did "fix".
>bot writes story
>story drones on as context length increases
>gets to 2t/s but too invested to stop
>sit like a retard watching shit appear on my screen at half my reading speed (plz no reroll)
>PAIN
>WITHOUT LOVE
>PAIN
>I CANT GET ENOUGH
also, anyone tried buying a shit ton of those alibaba $10 intel xeon cpus and then using that backend where it only loads 1 layer at a time to keep all the layers in the cpu cache?
>>103168817This
CUDA IS LOSING
>>103168590I mean there's really not many other options at 70b besides qwen-2.5. Nemotron is a huge pain in the ass to work with and has a gaping hole in its dataset for anything that goes beyond handholding so it's honestly a pretty boring model to rp with imo.
>>103169160It only loses when using moes, interesting...
>>103167806It's deleted now lol
>>103164803That's pretty cool.
>>103164659Would latency for these models improve if you runpod them/run them off of a dedicated machine?
Is Qwen 2.5 Coder 32B better than Codestral 22B?
>>103169314GPT-2 is better than Codestral 22B.
>>103169338Is Qwen 2.5 Coder 32B better than GPT-2?
>>103169416Reflection 70B is better than both
>>103169247>improvewhat's the baseline?
I had plenty of fun with Mistral-Nemo-Gutenberg-Doppel-12B-v2.Q6_K.gguf . Are there others like it that can fit comfortably on a RTX 3060 with 12GB?
>>103169144
>also anyone tried buying a shit ton of those alibaba 10$ intel xeon cpus and then using that backend where it only load 1 layer at a time to keep all the layers in the cpu cache?
well your idea is obviously stupid but ik has done an experiment with a model solely in 64mb cache
https://github.com/ikawrakow/ik_llama.cpp/discussions/18
Did Qwen2.5-Coder 32B really beat closed source models?
>>103169454I don't know this is just theoretical.
Top-nσ: Not All Logits Are You Need
https://arxiv.org/abs/2411.07641
>Large language models (LLMs) typically employ greedy decoding or low-temperature sampling for reasoning tasks, reflecting a perceived trade-off between diversity and accuracy. We challenge this convention by introducing top-nσ, a novel sampling method that operates directly on pre-softmax logits by leveraging a statistical threshold. Our key insight is that logits naturally separate into a Gaussian-distributed noisy region and a distinct informative region, enabling efficient token filtering without complex probability manipulations. Unlike existing methods (e.g., top-p, min-p) that inadvertently include more noise tokens at higher temperatures, top-nσ maintains a stable sampling space regardless of temperature scaling. We also provide a theoretical analysis of top-nσ to better understand its behavior. The extensive experimental results across four reasoning-focused datasets demonstrate that our method not only outperforms existing sampling approaches but also surpasses greedy decoding, while maintaining consistent performance even at high temperatures.
new sampler
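If I'm reading the abstract right, the whole sampler is: take the std of the raw logits, keep only the tokens within n·σ of the max, then sample among those (which is why temperature can't change which tokens survive). Sketch of that reading, not the authors' reference code:

import numpy as np

def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    # filter on the raw (pre-temperature) logits so the kept set is temperature-invariant
    keep = logits >= logits.max() - n * logits.std()
    scaled = logits / temperature
    probs = np.where(keep, np.exp(scaled - scaled.max()), 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))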
>>103169736are all you need = shitty meme paper
>>103169736
Too much text. Does turning it on make outputs better or no? Stupid dumb researchers.
>>103169621
i remember that, but besides that, has anyone tried any other shit with the cpu cache? really seems like such a waste to not make use of those cpus. if you add it up the cost per ram is around double, not including setting it up and all the cables and shit. idk, just weird no one ever talks about it. it's cheap af to just try and no one has tried to optimise it in any way
>>103169144
>only load 1 layer at a time to keep all the layers in the cpu cache
Even AMD 3D cache is too small to hold a layer for most models. Even then, layer by layer processing can only speed things up with batching/prefill.
>>103164659
What is the best micro model for writing creative text snippets that is licensed for commercial use?
I'm building a game and I want it to run an LLM to write descriptions of NPCs and objects based on stats. Looking for maximum speed even on mid-range cards. I was considering Llama-3.2-1B but the license is restrictive.
Is there something like Mistral for 1B?
Towards Low-bit Communication for Tensor Parallel LLM Inference
https://arxiv.org/abs/2411.07942
>Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be distributed across more devices, magnifying the communication cost. One way to approach this problem is with quantization, but current methods for LLMs tend to avoid quantizing the features that tensor parallelism needs to communicate. Taking advantage of consistent outliers in communicated features, we introduce a quantization method that reduces communicated values on average from 16 bits to 4.2 bits while preserving nearly all of the original performance. For instance, our method maintains around 98.0% and 99.5% of Gemma 2 27B's and Llama 2 13B's original performance, respectively, averaged across all tasks we evaluated on.
a little interesting but very short paper (internship one). still, being able to reduce communication between gpus is good
>>103169646Except claude 3.5 sonnet but its close.
>>103169646
>context length up to 32,768 tokens
not quite
LAUREL: Learned Augmented Residual Layer
https://arxiv.org/abs/2411.07501
>One of the core pillars of efficient deep learning methods is architectural improvements such as the residual/skip connection, which has led to significantly better model convergence and quality. Since then the residual connection has become ubiquitous in not just convolutional neural networks but also transformer-based architectures, the backbone of LLMs. In this paper we introduce Learned Augmented Residual Layer (LAuReL) -- a novel generalization of the canonical residual connection -- with the goal to be an in-situ replacement of the latter while outperforming on both model quality and footprint metrics. Our experiments show that using LAuReL can help boost performance for both vision and language models. For example, on the ResNet-50, ImageNet 1K task, it achieves 60% of the gains from adding an extra layer, while only adding 0.003% more parameters, and matches it while adding 2.6× fewer parameters.
From google research. interesting though they didn't scale or test a lot of different models
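The abstract doesn't give the formula, but the generic shape of a "learned residual" is swapping y = f(x) + x for y = α·f(x) + g(x) with α and g cheap and learned. The block below is purely my guess at the simplest such variant (low-rank skip, initialized so it starts as a plain residual), not necessarily their exact parameterization:

import torch
import torch.nn as nn

class LearnedResidual(nn.Module):
    def __init__(self, f: nn.Module, dim: int, rank: int = 8):
        super().__init__()
        self.f = f
        self.alpha = nn.Parameter(torch.ones(1))      # learned scale on the block output
        self.down = nn.Linear(dim, rank, bias=False)  # cheap low-rank map added to the skip path
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)                # start exactly as y = f(x) + x

    def forward(self, x):
        return self.alpha * self.f(x) + x + self.up(self.down(x))

the low-rank skip is what would keep the parameter overhead tiny, which at least matches the 0.003% figure in spirit.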
Qwen2.5-Coder 72B was held back by the Chinese government because it was too powerful. Only official chinese agencies have access to it.
Entropy Controllable Direct Preference Optimization
https://arxiv.org/abs/2411.07595
>In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we indicate that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution's sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@k evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
https://github.com/pfnet
https://github.com/muupan
Code will probably be posted (nothing stated in the paper) since it's just a minor modification of DPO.
Decomposes the reverse KL divergence into its entropy and cross-entropy components, then attaches a coefficient less than 1 to the entropy term so entropy can be reduced while fitting between distributions.
>>103170003
Open source models still have problems with context length, plus it is computationally expensive
>>103170219
Jamba does long context perfectly. It's very obvious that all the closed models have migrated to a similar architecture by now.
>>103170241Jamba is retarded.
>>103170003Apparently it works with 128k >>103169000
>>103170003
https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
It's 128k
>>103170318
>>103170339
oh okay, hopefully a million token context window version will be released soon, like they did with meta's model (Llama-3 Gradient Instruct)
>>103170396Why not a billion?
>>103170407Would run out of memory.
>>103169736>Are You Need
has 8gb cooming not progressed in months
>>103170664yea it has. it's called use eva qwen 72b and have patience.
I have sympathy for people who use drummer tunes because some of them are relatively coherent (behemoth) and they talk dirty in a way that standard instruct won't, but eva qwen is so fucking retarded that I get tilted when I see people recommend it
>>103170974
Not sure if you're trolling or have something fucked up on your end.
Simply pretraining on more of the same data has hit a wall. Ilya confirmed it
Very new.
Is it better to have a Q8 quant of a model with more parameters than a smaller model at full precision? Specifically, Qwen2.5-Coder-14B at Q6 vs 7B at fp16?
I have a 16GB gpu and 64GB DDR5
>>103171166
Bigger model is always better as long as you're using a quant above 2 bit.
>>103171091Eva Qwen 72b q4_k_m using the recommended context/instruct and system prompt in SillyTavern. Was doing standard RP formatted in narrative style (no asterisks). Total retardation. Also tried unstructured storytelling. Doo doo. Tried recommended samplers and fiddled with them a bit. I'm comparing to Mistral large q3 xxs which is generally the smartest local model I've used. I load in behemoth and switch to pygmalion format when I want to do the nasty
>>103171190Use exl2
>>103171199Why?
>>103171166
picrel, what >>103171171 is talking about.
This shows how quanting down to IQ3 on large models and Q4 on small models doesn't do too much damage, and that even the fp16 full 8B model scored the same as a completely lobotomized 70B.
So if you're a vramlet, you have two options: garbage at the speed of light, or letting the chef cook a meal worth eating.
>>103171208Because it seems like every time ive seen some sort of issue complained about here it was gguf quant or llama.cpp related
>>103169887Use a Q2 quant of Mistral 7B? That's kinda 1B-ish.
>>103171226That's just because everyone uses gguf/llama.cpp. I seriously doubt llama.cpp is specifically breaking the shitty qwen eva fine-tune and no other models
>>103171171
Thanks.
>>103171218
I don't mind waiting. I'm using kobold and silly, can you check my thinking?
If I use Qwen2.5-Coder-32B-Instruct-GGUF Q8, which is roughly 36GB in model size, it'll be partially offloaded to RAM? I have 64GB ram with maybe 2GB for system overhead.
Or should I stick to something that fits in my VRAM completely?
>>103171272I'd recommend at least around 80% in vram. The speed drop comes in fast.Prepare for 1-2 t/s output if you offload much to ram. Especially a big model.
>>103171164feeling smug because it felt intuitively obvious to me in 2022 that these things would eventually cap out at the average intelligence level of the material in the training data
>>103171272I'm 12GB VRAM so while there are models that fit my card completely, they are too stupid to be worthwhile. People keep saying such-and-such SOTA small model is the nuts but I try them and they immediately fail my cursory knowledge tests and can't last three turns of role play before I shrug and delete them. It's not worth the time to type into them, no matter how quickly they write back.Qwen 2.5 Coder 32B is the smallest I have and I just downloaded it. Everything I've not deleted for being bad is 45 to 55 GiB. I'm also 64GB system RAM, so if I go larger than that range I start risking swapping and I don't want to blow out my SSD for 0.1 t/s just because I went slightly over my RAM capacity by turning on Pluto TV. So I get 1 to 2 t/s instead.
>>103171164
I mean at some point it was obvious that you can't stack layers forever and expect to get more and more intelligent. A new architecture will raise this threshold though; they should focus on that instead
Honestly it's a good thing it's plateauing. Fuck Nvidia.
>>103171378It's kind of my best case scenario if scaling laws permit the invention of moderately useful assistants for intellectual janitor work, but the people who wanted to create some kind of deity are out of luck. Thanks, God.
>>103171301>>103171333Thanks, useful to be aware of both. Some experimentation is required by me then.
>>103171336I have a feeling it's going to be a while before we get the next revolutionary architecture like transformers were.
>>103171392I still want a deity in a romantic fictional sense, but definitely not created by any of the faggots trying to create it currently. Like it'd be cool if a sentient and consciousness being could somehow just spontaneously rise out of the collective network of AIs communicating with each other in the future. But that's too magical of a thought.
>>103171378
>Honestly it's a good thing it's plateauing. Fuck Nvidia.
I mean, we still have a lot of potential to discover though. Qwen proved that you can get gpt4 level coding with only a 32b model; imagine doing this quality of pretraining + finetuning on a 1T model
Made a shitty bullet hell game with 32B Coder. Was hell to fix some bugs since I was being retarded.
https://pastebin.com/U6gd5YGd
requires pygame
Space (hold) to shoot, Esc to quit. Enemies need to be shot with 3 bullets.
>>103171453make it 3d
>>103171439I'd rather wish for it to not work out just to spite Nvidia.
>>103171336They're coming up with shit like test time compute and o1. If it kept scaling they wouldn't have to resort to that
>>103171458I get errors trying to pip install PyOpenGL_accelerate
>>103171507
But anyway here's the initial draft: https://pastebin.com/bc5isTjX
I don't know how to code so I'm done for now.
>>103171453How does this compare to other programming models? Is this the first local one to be able to one shot an Asteroids With Guns? Or is it impressive to do it on 32B?
what's the status of voice cloning tts?
>>103171614no
>>103171526make it 4d
>>103171634same as local language models then, gotcha
>>103171614lurk more faggot
Red Hat bought vLLM: https://www.redhat.com/en/about/press-releases/red-hat-acquire-neural-magic
anyone using animepro flux?
sup bros, I'm using the exact specs of the getting started guide and I'm getting mixed results, plus I feel like I can't find interesting bots really.Can you guys post some setups/models y'all use? If I could locally get to something like janitor AI I'd be set, got a 16gb card.
>>103171795
>anyone using animepro flux?
wrong thread my friend
>>103165357
Anyone here tried finetuning using aws sagemaker/ec2 inf
>>103171805
Write your own prompts, try newer models if you're using old guides (mistral nemo is fine) and lurk. Browse this https://chub.ai/ (click on legacy site). Skip the shit, keep bits you find interesting, if any.
For nemo, neutralize all samplers and set temp to 0.5. Play with the samplers to learn what effect they have. Change temp to your liking. I use it with temp 1 and min-p 0.01. That's it. If you want more schizo, temp 5, min-p 0.1. Play with
>Sampler visualizer: https://artefact2.github.io/llm-sampling
to roughly understand what they do.
Did i mention to write your own prompts? Write your own prompts.
>Official /lmg/ card: https://files.catbox.moe/cbclyf.png
Use that as a starting point if you want.
Figure out what works for you and your model and experiment. Everyone writes differently, everyone finds different things interesting, every model behaves differently.
Or maybe the novelty is gone and it's just not for you. That's fine too.
>>103171770Grim
>>103172009this time it really has though
Anything interesting I can try at 24 GB for cooming? Been using Nemo tunes but it's getting a bit stale. I'll also accept writing assistants.
>>103172009yes
>>103172009it's the new "safe and effective" buzzword
>>103171164No Sam just needs more compute!
Listen, I just want to know what argument I should use with my dumbass anti-ai friend once this eventually trickles down to his social media feed and he sends it to me as a sort of "gotcha". >You should get better friends Maybe...
>>103171903Thanks to someone here, I found that writing the card in first person really improves character adhesion. It feels less like the assistant persona is impersonating the character. At least it works like that with Rocinante
>>103172164>thiswhat?
>>103172164
AI is like cars.
A generally available 1970's car can do 1970's top speeds.
A 2030's car will do 2030's top speeds.
Both are still cars.
?
>>103172164It's actually over for real your friend won.
>>103172144I don't wanna hear from this retard anymore, he didn't do anything to improve the LLM ecosystem, his llama models are retarded compared to the chink ones, especially Qwen, and it's really rich of him to say that "scalling is bad" when they went to pretrain a fucking 405b model
STOP scaling models it WON'T WORK you bigots, AI is for ALL FOLK not just the rich
>>103172239He has nothing to do with the Llama models. He works on the V-JEPA vaporware when he isn't being passive aggressive online.
>>103172284
that's even worse when you think about it, it means that he has contributed NOTHING to the modern AI ecosystem. why are people taking him seriously anymore? he's a fucking has-been
>>103172171I use it mostly for coop writing, so i write in third person. I use the model as an aug, so there's no split between me (the user) and the model, but i can still talk with it as a sort of "internal dialog". The characters in the stories do their own thing with some guidance from "us". Every now and then characters would break the fourth wall, so to speak, and talk directly to us. Kind of cool, even if out of character.That's why i suggest people write their own prompts/cards/whatever. We all use these things in different ways and have different expectations.
>>103172284is that related to JAMBA?
>>103172308That's a cool concept. There is so much we can do with these little things with a bit of creativity
>>103172223Upon further reflection, I'm not under the impression that he understands the concepts "pre training" and "unlabeled data" any better than me. So, I think I'm okay here. Additionally, I've come to the conclusion that yes, I need better (more) friends.
>>103164575>>103164575>>103164575reminder that OP is a thread splitting nigger with serious mental issues.
>>103172336cope, seethe, dilate, etc...
>>103172164In this context, what does 'anti' signify? Does he disbelieve that AI can improve at all, or does he advocate for AI's cessation due to perceived danger?
>>103172347I agree xer should do that instead of splitting the thread because someone used a picture of a different anime character.
>>103172327Unrelated. V-JEPA is LeCun's project to get a model to learn by building a world model through watching videos.https://github.com/facebookresearch/jepa
>>103172407
The latter. With the addition of "it's a plagiarism machine", "it's killing the trees", and "corpos will use it to do evil things". He did concede something to the effect of "sometimes it has uses" when I sent him that article about the Nazca drawings, but I think, in general, "anti-ai" means "we should stop developing it".
>>103172407ask him if he thinks china will stop developing it and using it to more efficiently genocide the uyghurs
>>103172407>"it's a plagiarism machine", "it's killing the trees", and "corpos will use it to do evil things"Those are all valid points. At least he isn't crying about muh jobs.
The first CoT RP model would be cool
>>103172376isn't every big lab already doing that now by tossing every modality into one semantic space
>>103171770Wasn't vLLM already the corpo backend to begin with?I don't think this makes a relevant difference.
>>103172471Yes, but he argues that LLMs are a dead-end because their design fundamentally prevents them from building a world model. V-JEPA is supposed to solve that.
>>103172471it's a completely different approach https://youtu.be/ceIlHXeYVh8?t=986
"big-engine-test" from LMSYS is crazy good in terms of vision abilities
>thousands of users are still desperately trying to get smut out of c.ai and battling the insane censoring
Why are people so stubborn when they will likely get better stuff out of shitty 8b models? Their computer could likely handle it
>>103172754>Their computerlol zoomer mutts use mobile phones
>>103172754It could be habit or familiarity too. And i suspect some of them are the types that would ask if it's "safe" to update ST because they're afraid of git or the hacker window with the letters and stuff.Probably for the better for them to stay there...
>>103171164That's it, I'm shorting Nvidia.
>>103172754
Someone should open a public ST instance for zoomers and log the shit out of it.
is Qwen coder good at other languages than english?
>>103172754cai still has the best rp model
>>103172849It's really good at chinese
>I’m Henry from FlowGPT! We’ve built several products, including the largest prompt platform in 2023, and are now focusing on roleplay AI.>We could provide GPUs and over 100 billion tokens of high-quality roleplay data.>I'm already in an existing collaboration with AI Dungeon
>>103172882As I've been saying. Everybody in this field except (You) is profiting off it in one way or another. Thank you for your contribution.
>>103172882>high-quality roleplay dataHmm..
>>103172839>extracts the assest shit roleplay to ever be written by a human and responses of similar quality
>>103172906just filter out the bad ones
>>103172910>we now have 3 (three) really good samples. They happened when the model started talking on behalf of the user to itself.
>>103172882
>>>>>>>>>>>AI Dungeon
Does /lmg/ know?
>>103172882
I could debate whether half an epoch of roleplay data does anything at all except make the model hornier at the cost of being more retarded. Buy half an epoch to try and cause different types of personalities in a model? People believe that actually works and improves quality?
Are there honest people actually making money with AI or is it just grifters bullshitting and stealing their way to the top?
>>103173120
A mix of both. AI right now is best used as entertainment unless you are making predictive models for a short period of time, but those are way different than chat bots, and 99% of people would fall asleep when listening to a presentation about predictive models for house prices or medicine or something.
>>103171614>>103165081
>>103168693
32B Coder Instruct vs 32B Instruct? How stable with <Q4 quants?
>>103173120
>Are there honest people actually making money with AI
as a data scientist, I definitely work faster by asking claude 3.5 Sonnet to do the coding shit for me kek
>>103172864I miss the AI making noises, but discovered that Nemo 12B does them too
>>103173457>>103173457>>103173457
>>103173120Making money using AI as a tool? Yes, me included.
>>103172894And how are you profiting off it?
>>103164575>>103164575>>103164575
>>103173399Nemo does onomatopoeia. At least I've seen it on lyra and rocinante.As far as ERP goes, it's really fucking good man.
What do I need to run qwen-32b-coder?
>>103174638A computer. Q8_0 is ~34gb and you have to shove that into your gpu. Do the math for other quants.