/g/ - Technology

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>103354338 & >>103347641

►News
>(11/29) INTELLECT-1 released: https://hf.co/PrimeIntellect/INTELLECT-1-Instruct
>(11/27) Qwen2.5-32B-Instruct reflection tune: https://qwenlm.github.io/blog/qwq-32b-preview
>(11/26) OLMo 2 released: https://hf.co/collections/allenai/olmo-2-674117b93ab84e98afc72edc
>(11/26) Anon re-implements Sparse Matrix Tuning paper: https://github.com/HeroMines/SMFT
>(11/25) Qwen2VL integrated with Flux: https://github.com/erwold/qwen2vl-flux
>(11/25) Speculative decoding added to llama-server: https://github.com/ggerganov/llama.cpp/pull/10455

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/tldrhowtoquant

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/hsiehjackson/RULER
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: Akita.Neru.full.982124.jpg (111 KB, 1280x1024)
►Recent Highlights from the Previous Thread: >>103354338

--Training a 100M LLM with OpenDiLoCo: feasibility and challenges:
>103354925 >103355729 >103355757 >103356019 >103356461 >103356542 >103356640 >103356914
--QwQ and Opus comparison, AI model capabilities and limitations:
>103359461 >103359475 >103359492 >103359569 >103359642 >103359509 >103359556 >103359566 >103359627 >103359656 >103359816
--Probability problem discussion with simulation and Monty Hall problem comparison:
>103354629 >103354711 >103354756 >103354810 >103355065 >103355307
--Waiting for hardware optimized for AI and matrix multiplications:
>103358021 >103358134 >103358311 >103358359
--Merging INTELLECT and QwQ models, compatibility issues and challenges:
>103357031 >103357045 >103357092 >103357307 >103357340 >103357414 >103357438 >103362336
--Largestral GPU configurations and performance discussion:
>103354505 >103354581 >103354778 >103355335 >103358419 >103358605 >103358639 >103358655 >103358683 >103355352 >103355396 >103355512 >103358794
--INTELLECT-1 discussion and potential future developments:
>103356933 >103356965 >103357070 >103357959 >103358482 >103359108 >103359329 >103359395 >103359427 >103359454 >103359517 >103359541 >103359587
--Discussion on the limitations of transformers and the concept of AGI:
>103357625 >103357663 >103357675 >103357686 >103357716 >103357704 >103357725 >103358083 >103358278 >103360487 >103358115 >103358454 >103358941
--Anons discuss rapid AI progress and future GPU development:
>103357798 >103357907 >103357975 >103358363 >103358738 >103358864
--Ryzen anon and NPU IGPU hybrid method:
>103355150
--LLMs as a tool for self-improvement and progress:
>103360383
--KoboldCpp 1.79 release with new features and user reactions:
>103355527 >103355759 >103355965 >103356030
--Miku (free space):
>103358406 >103358602 >103359731 >103362325

►Recent Highlight Posts from the Previous Thread: >>103354346

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
>>
She's cute but I'm not a huge fan of blondes.
>>
File: kyoton.png (1.13 MB, 1280x768)
Good night /lmg/...
>>
omg a hag :(
>>
File: 1725097212508247.png (24 KB, 1010x164)
>>103364085
Yes. I have the request logging turned on to see what is sent by SillyTavern.

It actually looks like it breaks if I have this turned on. It seems to work fine if I set it to none, but then the completion template is not correct. Let me play around with it more

Found this bug report. I guess I'll need to figure out the right chat completion template. Or is there something standard people use for pixtral?

https://github.com/SillyTavern/SillyTavern/issues/3057
>>
File: file.png (52 KB, 1412x226)
>>103364207
here's the ST dev's only comment about that bug report. Not sure if this is helpful at all
>>
>>103364240
Yeah, that is the problem, since tabby (or the model's template?) is very strict on the completion list being system followed by user/assistant pairs. I may need to just have everything in a single system prompt.
>>
You know what I just noticed? All these models suck because they're focused on answering a question on a single input. The fact they can kind of sorta RP sometimes is a byproduct.
>>
Consider Zundamon.
>>
>>103364361
gee anon what a revelation
>>
>>103364361
>All these models suck because they're focused on answering a question on a single input.
There are chat-tuned models as opposed to instruct-tuned ones, but there's a lot of overlap now.
>>
>>103364276
I just fixed it by getting claude to update pixtral's jinja2 template to support concatenating system messages.

If anyone else has the exact same problem I have with using turboderp's pixtral exl2 quant in tabby, I changed the config.yml to use this as the template:
https://pastebin.com/7dg85mzR

I placed it inside the templates folder as 'pixtral.jinja'
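For anyone who'd rather not touch the jinja file at all, the same fix can be done on the client side before the request goes out. Rough sketch only (the function and field names here are mine, not from the pastebin template): fold every system message into one so the list becomes system followed by user/assistant pairs, which is what the strict template expects.

def merge_system(messages):
    # collect every system message into a single block
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    # keep the user/assistant turns in their original order
    rest = [m for m in messages if m["role"] != "system"]
    return ([{"role": "system", "content": system}] if system else []) + rest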
>>
>>103364367
Not part of the triple baka trio, so no
>>
Has anyone mentioned wanting to do some RP tunes with QwQ, or is everyone just waiting for the non-preview versions of these models to drop from various sources?
>>
>>103364607
QwQ is such a bad fit for RP I can't imagine anyone wasting their time. Best thing would be to get it to design scenarios for a better model to write about in some kind of pipeline.
>>
Why is everyone talking about QwQ but nobody seems to care about Athene? It's Chinese too.
>>
>>103364759
>QwQ
Its USP is chain of thought in a local model.
>>
>>103362325
This reminds me, I wanted to check what Mistral Large 2411 produces when you ask for an SVG of Hatsune Miku.
Pic related is what q8_0 gets you with greedy sampling, it's honestly pretty good for an LLM output.
>>
>>103364790
That's better than anything I could wrangle QwQ into making. Its attempts were frankly embarrassing.
>>
Dear Kobo,

I am writing to you today as a dedicated user and strong advocate for Koboldcpp, your excellent contribution to making llama.cpp more accessible. Your work on this project is highly appreciated, and I am continually impressed by its evolution.

I am reaching out to formally request a crucial enhancement to Koboldcpp: the inclusion of a full spectrum of customization options for draft models, mirroring the detailed control offered by llama.cpp.

Currently, the implementation of draft models in KoboldCPP provides limited customization options compared to what is available in llama.cpp. Specifically, the ability to customize parameters such as gpu-layers-draft, device-draft, ctx-size-draft, draft-p-min, draft-min, and draft-max is crucial for achieving optimal performance and flexibility.

Incorporating the full spectrum of these customization options from llama.cpp into KoboldCPP could significantly enhance both the speed and overall user experience. The current limitations restrict the potential speedup benefits that users have come to expect from llama.cpp, thereby impacting the performance of model deployments.

By enabling these customizations, users would gain greater control over model configurations, allowing them to better tailor the tool to their specific needs and maximize efficiency. This change would not only improve the utility of KoboldCPP but also strengthen its position as a leading tool in the field.

Thank you for considering this suggestion. I am confident that these enhancements would be well-received by the KoboldCPP community.

Thank you for your dedication and hard work.
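P.S. For reference, the invocation this letter is asking to match looks roughly like the following on the llama.cpp side (model names are placeholders, and the exact flag spellings should be double-checked against your build's llama-server --help):

llama-server -m Mistral-Large-Q4_K_M.gguf -md Mistral-7B-Instruct-v0.3-Q4_0.gguf \
  --gpu-layers 40 --gpu-layers-draft 99 --device-draft CUDA1 \
  --ctx-size-draft 4096 --draft-min 1 --draft-max 16 --draft-p-min 0.5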
>>
>>103364790
>>103364809
qwq has no fucking clue what Miku looks like
>First, familiarize yourself with Hatsune Miku's appearance. She has blue hair in two long ponytails, usually wearing a school uniform with a white blouse, a red bow tie, a black skirt, and yellow socks. She also has prominent eyes with thick eyelashes and eyebrows.
>>
>>103364121
What do you guys put in System Prompt?
>>
>>103364886
Character description and other background info. Writing style goes into last assistant prefix.
>>
>>103364886
>You are a degenerate woman in her thirties that loves writing filthy erotica.
>>
>>103364162
*rapes u in ur slep*
>>
>>103364813
Draft. Models. Don't. Work.
>>
>>103365120
Why. Do. Zoomers. Do. This.?
>>
>>103365120
Skill. Issue.
>>
>>103365125
that's a millennial attribute bubby
>>
If draft models actually worked, we'd all be using them. They'd be all over Reddit. Every backend would have an argument for them, but they don't, because you can't use a dumber model to make a smarter model faster.
>>
>>103365120
Look who's here, Mr. "I-can't-even-get-speculative-decoding-to-work" guy, claiming that it's the technique that's broken, not his own limited understanding. How cute. How adorable. How utterly laughable.

Listen, buddy, speculative decoding is a well-established technique in the field of large language models (LLMs), and it's not going anywhere just because you can't figure out how to use it. It's like saying that a Ferrari is a bad car because you can't drive a stick shift. Newsflash: the problem isn't the car, it's the driver.

But hey, I'm sure your vast expertise in "I-tried-it-once-and-it-didn't-work" is totally sufficient to dismiss an entire technique that has been extensively researched and validated by actual experts in the field. I mean, who needs peer-reviewed papers and rigorous testing when you've got your gut feeling and a Reddit account?

Let me tell you, friend, if you can't get speculative decoding to work, it's not because the technique is flawed. It's because you're not good enough. You're not smart enough. You're not skilled enough. And that's okay. We can't all be experts in everything. But what's not okay is when you try to pass off your own incompetence as some kind of profound insight.

So, here's a suggestion: instead of wasting everyone's time with your uninformed opinions, why don't you try actually learning about speculative decoding? Read some papers, take some courses, and practice implementing the technique yourself. And if you still can't get it to work, maybe, just maybe, it's because you're not cut out for this whole NLP thing.

But hey, don't worry, I'm sure your participation trophy is still shiny and untouched. You can always go back to claiming that you're a "thought leader" in the field of "I-have-no-idea-what-I'm-doing." We'll all be sure to take your opinions very seriously.
>>
I don't need draft models when running my 12B coom tunes.
>>
>>103365196
>If draft models actually worked, we'd all be using them.
I would be using them if ooba supported them. They're in the lcpp server, but it's just too basic for daily use.
I was getting a solid speed boost when testing it in llama-server (I'm sure you can find my post from a bunch of threads back if you are interested), so once my preferred frontend starts supporting it I'll be all over it.
>>
>>103365196
You can, actually; the speed boost is evident and can be easily replicated. It's just that you need more VRAM to fit the draft model in, and most people here can barely fit the base model.
>>
>>103364790
https://files.catbox.moe/0cr93b.svg
This was QwQ after telling it what Miku looks like by pasting from a wiki.
>>
>>103365527
>no arms
Did Mikugaki Anon write the wiki?
>>
>>103365277
yeah, no need for a draft model when your main model is a draft model
>>
I know this is LOCAL models general, but when we start talking about these giga-expensive home builds, things like RunPod make way more sense.
Sometimes in the field we’ll call running a server “locally” when all we mean is self-hosted, but it’s still on an EC2. In my opinion, the same applies here. If you’re not running a managed solution like openai/anthropic/bedrock then I’d call that local enough. Save yourself the time and money and just run workloads on demand on RunPod or Lambda or whatever.
>>
>>103365565
Don't project your poverty onto me, please.
>>
>>103365565
Buy an ad.
People here don't want to generate their mesugaki smut on someone else's computer.
>>
>>103365181
They all can get off of my lawn.

Also, Athene is sleeper good, somehow. 32k context is wimpy but my system struggles beyond 16k so it's not my bottleneck. Strange that Q5KL did better than Q6K, but I don't mind saving a few gigs. Quick to refuse roleplay but basic prefill dodges that.

>>103365565
The essence of "local" to me is that your conversations aren't ultimately being turned into data to sell as a product to advertisers and there isn't a Big Brother reading it in search of a wrongthink that will send you to Room 101. Despite our trusting other's models and others git projects and running them on silicon with glowy bits, there's at least the notion that you and your LLM are having a "private" conversation, no matter if you spent $500,000 (chump change) to build a prototype Chobits or you're one of us vramlet poorfags putting lipstick on a Speak-and-Spell.
>>
>>103365565
There is a thread for it on /g/, but at that point you do not run it locally so it does not belong here, perhaps not even on /g/, and you should move to /vg/. As for whether it is more expensive, you first need to ask yourself how much you will use it, for what period of time, and what kind of models we will get in the future. I, for example, use OpenRouter for the large models when I want to try them and compare them to what I can run. And to me, as of now, it does not make much sense to invest in HW since the improvement outside of speed is actually not that dramatic, especially if we are talking only about RP. In that case, even Mistral-Nemo is enough and sometimes performs better than the larger models I tried.
>>
>>103360011
oh well, maybe I was wrong, but I remember finding some Star Wars ERP logs on the edu version
>>
>>103365565
How does that even matter? How you run it is irrelevant unless you're part of the scum who abuses /lmg/ as their personal tech support. The proper discussion is about the models and the settings for getting the most out of them, where it doesn't matter if you're using runpod or local hardware.
>>
Can't believe I got my ass up and patched together a server so my kids can have an AI buddy in Minecraft.

-kotoba-whisper-v1.0 speech to text.
-gemma 27b because it's good with Japanese
-Put minecraft commands that should be executed in tags and execute them all with. RCON.
-filter that shit for..
-GPT-SoVITS-v2 which is good and really fast.

Still local retardation but good enough to ask it to make a small house beside you and put a villager in there. lmao
Is there any way to get vision?
I know that with kobold they had some vision stuff you could put on top. Anything like that existing for gemma 27b?
Otherwise I'll see how good pixtral is.
>>
>>103365565
>Save yourself the time and money
Thanks for the tip.
I believe services like runpod are used when making finetunes.
>>
>>103365790
Llava for vision.
You have a GitHub for this masterpiece?
>>
>have money to waste on llms
>your option either is censorslop(GPT/Claude) or dumbslop(llama3/mistral/Qwen).
I don't even want to run models locally, I just want models that are smart and uncensored. fuck sake.
>>
>>103365870
Is it the llama3 one? https://huggingface.co/koboldcpp/mmproj/tree/main
Or do i need a completely different file from somewhere else?

>You have a GitHub for this masterpiece?
No, but if I clean up I might upload it.
It's a horrible stitched-together python server, and a C# client. Too embarrassing right now. lol
But I was surprised how good it feels to speak to an LLM even if it's just TTS.
SoVITS sounds natural enough, and I never tried the closed stuff because I don't wanna send my voice.
>>
Are any of the 32k models actually 32k? I can't seem to get anywhere close to that context length without it slowly degrading into a stroke victim with misspelled names and missing spaces.
>>
>>103365918
Of course, didn't you see the needle test?
If you write nigger in there somewhere it remembers it!
Now don't do long roleplay or god forbid dump a game guide in context that should give you "the next step".
>>
>>103365918
Llama 3.2/Command R are capable of it, and I think Mistral Large does not have a problem with it either. But the previous llama and Mistral-Nemo or the small ones do have problems after like 16k, and it just gets worse with more context.
>>
So, I was checking local GPU prices lately, and things just look bad. P40 is nonexistent, P100 is overpriced, 3090s are either "unchecked, probably works" scams or overpriced. 4090s are mostly junk being sold for parts. The situation won't improve with the release of the 5090.
>>
>>103365974
The trick people use for long roleplay is to summarize it and then continue from there with new chat.
>>
>>103365980
it'll be nicer when ST's new message deletion is implemented and you can delete between messages, so you can free up the beginning but leave the last dozen messages to keep everything on track
>>
>>103365974
>I think Mistral Large does not
It does, after around 20k tokens
>>
>>103365565
the real local vs proprietary model dichotomy is whether you want a smart model that refuses to answer anything, or a dumb model that can't answer anything. does it make sense to spend $4k on 96gb vram or 384gb ram? probably not. but it also doesn't make sense to spend hundreds on rented compute when the proprietary models are similarly priced while being strictly better
>>
>>103365980
Even when summarization with LLMs works, it always produces concentrated slop, and the character doesn't come across the same way in the new chat
>>
>>103365993
Even Mistral Large 2411? Here's how they marketed it: "It provides a significant upgrade on the previous Mistral Large 24.07, with notable improvements in long context understanding."
>>
>>103366016
They didn't show any benchmarks at release, so...
>>
>>103366016
I don't like 2411 at all
>>
>>103366016
bro 2411 can't even handle formatting like some 7b model from a year ago
>>
I tried running largestral Q6_K on a 9950x and I'm getting an incredible 1 token per second.
Funnily enough, while prompt processing the temperature shot up to almost 90C but during actual generation it was sitting at a cool 54C with fans barely audible.
>>
>>103366133
inference is bottlenecked by RAM speed
>>
>>103366152
that's a lie, if that were true then draft models could theoretically work for speedups, but we know they clearly don't
>>
>>103366213
he's genning on cpu of course he's limited by RAM, are you retarded?
>>
I don't understand how a draft model is supposed to help if you have to verify the draft result using a proper model anyway.
>>
>>103366248
You know how big labs batch hundreds of prompts into one to take advantage of parallel processing?
Draft models let you do that for a single prompt by batching multiple tokens of that prompt. Without a faster 'guess' at the tokens, you could only ever do one at a time because each token depends on the last.
>>
>>103366265
So it's speculative execution?
>>
>>103366276
Exactly analogous to it, yes.
>>
>>103366248
If you generate tokens one-at-a-time you can use each value that you load from the weights only once.
If you have a good guess for what the next token will be you can use each value from the weights two times so the model evaluation is more efficient.
If your guess for the first token was correct you have reduced the amount of I/O for the last two tokens by ~50%.
If your guess for the first token was wrong you wasted a small amount of compute and I/O and you have to throw the results for the second token away.
As long as generating the guesses is cheap and they're sufficiently good you will on average reduce the amount of necessary I/O and thus increase the average rate at which tokens are generated.
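Here is a minimal sketch of the greedy case in Python, just to make the flow concrete; it is not how llama.cpp implements it, and draft_probs/target_probs are stand-ins for the small and big model:

import numpy as np

def generate_speculative(tokens, draft_probs, target_probs, k=4, n_new=32):
    # draft_probs(seq) -> distribution over the next token (cheap model)
    # target_probs(seq, guesses) -> one distribution per guessed position,
    #   computed in a single batched forward pass of the big model
    out = list(tokens)
    while len(out) < len(tokens) + n_new:
        # 1) the cheap model guesses k tokens autoregressively
        guesses = []
        for _ in range(k):
            guesses.append(int(np.argmax(draft_probs(out + guesses))))
        # 2) the big model checks all k positions at once, so its weights are
        #    read from memory once instead of k times
        checked = target_probs(out, guesses)
        for dist, guess in zip(checked, guesses):
            best = int(np.argmax(dist))
            out.append(best)          # the big model's token is always the one kept
            if best != guess:         # first wrong guess: throw the rest away
                break
    return out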
>>
>>103366303
how do samplers interact with it? would something like xtc reduce its performance because the actual most likely guess can often get thrown out?
>>
>>103366320
*to be clear I'm asking just in terms of its implementation in llama.cpp, I understand it would depend on the strategy used for draft generation and such
>>
>>103366320
>>103366331
I don't know the exact details of how the interaction between sampling and drafting is implemented, sorry.
>>
>>103366213
>draft models
Here we go with the meme again.
>>
Hey guys, more of a casual observer here but if I may offer one of my observations, you guys don't seem to be really having fun anymore.
>>
>>103366445
I am always depressed during the winter. Low energy.
>>
>>103366445
Your observation is correct. For me, Command-R is still unsurpassed and am bored of local models and the focus on nerd usecases.
>>
I really like QwQ. Despite some of its more obvious flaws like the fact it's very much a 32b model sometimes, is dry as hell and is extremely censored, when it does work it shows itself extremely capable of actually working through situations and understanding the nuances of the conversation. Do you think it will get better from here?
>>
>>103366525
>dry as hell and is extremely censored
https://huggingface.co/win10/EVA-QwQ-32B-Preview
https://huggingface.co/bartowski/EVA-QwQ-32B-Preview-GGUF
>>
>>103366546
>check what EVA is
>"finetune of Qwen2.5-32B on mixture of synthetic and natural data"
So it's another inbred model.
>>
>>103366359
Yo, Tyrone! Heard you're strugglin' wit' dat speculative decoding thang. Ditch dat local LLM, it's be raycis. You need to upgrade to them cloud models, know what I'm sayin'? /aicg/ thread got the plug, they discuss all the fire cloud models. Get on dat and level up, my G. Local LLMs is so last season, and they ain't got nothin' but hate for a brotha tryin' to make it. Cloud is where it's at, cuz.
>>
>>103366546
Has anyone tested this? Wouldn't merging these models pretty much break its reasoning? It was barely hanging in there in the first place.
>>
>>103366445
CR update was a flop, Largestral 2411 was a flop, llama and qwen are too cucked for my taste. I'm indeed quite upset with the recent developments. We need more competition.
>>
>>103366546
https://huggingface.co/jackboot/uwu-qwen-32b
>>
>>103366320
The output from the draft model is a probability distribution, just like the output from the regular model. Samplers WILL make it less efficient, e.g. if using XTC, since it will nix high-probability (easy to guess) tokens willy-nilly, but if the resulting picked token is the same, that's what matters.
>>
ai noob here.
Can someone give me a quickstart on what model or general setup to use.
Currently playing around with some Kobold Horde models and while the responses are ok, it glitches more than I'd like. Like writing author notes or adding social media links.

My goal is to have something that is at least on the level of Mikugg (the response quality not the format)
>>
>>103366805
>Mikugg
https://github.com/miku-gg/miku?tab=readme-ov-file#llm-endpoint-setup
It really depends on your setup (or how much you're willing to spend upgrading) and how slow a generation speed you can tolerate
>>
>>103366805
You are using other people's computers for free with horde so you can't really expect too much quality because you can't run good models
>glitches
they are not really glitches, it's just that the models are small and bad
I don't know about the quality of mikugg, but if you want anything at least decent you need a 12B model, what's your hardware?
>>
>>103365125
Xhe's. Not. Wrong. Though.
>>
>>103366941
Nigger. Brain. Too. Stupid. For. Speculative. Decoding.
>>
>>103366303
What if, let's say, I'm running a very big model (123b or bigger) on CPU/RAM and a very small (1b) draft model on a very fast GPU with little VRAM, does it only try to predict the next token once? Would it be possible for it to try the prediction multiple times?
>>
>>103366996
You can predict multiple tokens and this is what is being done.
IIRC it's also not just a single sequence but a branching tree to increase the odds of guessing correctly.
>>
>>103366926
6800 with 7600X
I'm fucked right?
>>
a small note about using guidance that I wrote, which you might find interesting.
It's a good tool for prompt manipulation for models that don't support function calling or are very dumb. the method is old, but I've found it very useful for emotion systems and evaluating conditions
https://rentry.org/llm-guidance
>>
>>103367017
AMD sucks for AI but I'm sure you can get it to work

16GB of VRAM is a good amount: you can run some Magnum V4 27B at Q4, a Mistral Small 22B finetune at Q5, or a Mistral Nemo 12B finetune at Q8 fully loaded in VRAM, which will make it fast, or if you have patience you can offload to RAM and run 30B-50B models. I'm sure it will be better than whatever that Mikugg thing can offer for free. You have a decent setup
>>
Anyone know of a draft model that works with EVA or Evathene?
>>
retarded question but how do you actually download models off of huggingface? I swear they just had normal direct downloads last time I was looking for models
>>
>>103367086
just use the huggingface cli?
>>
>>103367086
Check for models with "GGUF" in the name.
>>
>check current RTX 3090 prices
>they've gone up slightly
>wtf, what is the market doing.
>check p40
>bottom of the stack is now 400 USD
What the fuck are people doing?
>>
>>103367132
And MI60 are like $500.
It was $300 not long ago
>>
>>103367132
in my local second-hand online store they have gone down from 700€ december last year to 550€ nowadays
>>
>>103367153
>poorfags now have to use Kepler to make their multi-gpu setup
dire.
>>
Been away for 8 months.
Any models these days approaching GPT-4 capability? Or are they still basically random word generators with no object permanence or memory?
>>
>>103367132
r/LocalLlama has 251K members.
>>
>>103367175
>gpt4
Already surpassed with llama 405b. We want local Claude.
>>
>>103367194
Buy an ad
>>
>>103367194
really? care to spoonfeed me on what local model is best?
I tried claude when it came out and didn't see what all the hype was about, but my uses are niche so maybe it was just particularly bad at historical knowledge
>>
>>103367212
>t and didnt see what all the hype was about
Basically it shits out a bunch of purple prose, which, although meaningless, is very impressive to shitjeet FOBs who can barely speak English. They assume pretty white girl is sending lots of bob and vagene and get super turned on.
>>
>>103367175
>Any models these days approaching gpt4 capability?
Mistral Large 123B is there if not slightly better
Llama3.1 70B kind of comes close
>>
>>103367212
Reflection 70B
>>
i love girls
>>
>>103367086
You can still use direct downloads, but you have to go file by file, or have a seq+wget script for the model bits and download the rest manually. If you only download quantized models, the seq method is probably the simplest for big models.

I use git with an extra script i wrote.
>git clone therepo
>git -C therepo lfs install --local
>git -C therepo lfs fetch
>ksh status.c export ex therepo
status.c does a few things. The export command makes symlinks from the ex/therepo dir to the actual repo, including lfs files. That's just to avoid having to do a checkout of the lfs files and have the model take twice as much storage. The actual repo remains unchanged, so i can still update normally, re-export, convert and quant.

But i'm sure >>103367094 works just fine and with less faffing about
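If you only want a single quant file, the cli route is basically one line (repo and file names below are placeholders, not a real model):
huggingface-cli download SomeOrg/SomeModel-GGUF SomeModel-Q4_K_M.gguf --local-dir ./models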
>>
Any other draft models that work with Mistral Large 2 besides Mistral-7b-v0.3? Does Mistral Small work? Would it be worth it?
>>
>>103367067
>>103366926
>I don't know about the quality of mikugg
It's miles ahead of that shitty default Janitor model. It generally outputs a coherent story and can follow some basic reasoning and instructions, like telling it that there is a 1/3 chance of success for an action. It also has some anime/game character knowledge.
Haven't used any flagship model to make a proper comparison


>>103367132
>market
the current gen is already over 2 years old. Everything has been fucked since corona thanks to AI and coins and stagnating EUVL progress and even shitty game optimization. There is zero incentive to lower prices
With the next generation you don't get cheaper cards, just more peak performance for a premium price
>>
>>103367283
Small is 22b, which is 3x mistral 0.3. For speculative it's not worth it.
Bug MistralAI to release the 3B model.
>>
>>103367319
>since corona
You mean it's fucked because of hysterical liberal retards who forcibly shut down the entire planet and silenced any scientific discourse on the matter because of a novel strain of the common cold
>>
>>103367449
no I mean it's fucked because people learned from that event to entertain themselves in solitary with their pc and online services
>>
lmao holy shit meta really has no idea what they're doing, llama 4 ~800b top end, more slopped than 3, still text primary with vision encoder slapped on and finetuned later
this can't last, it's just not delivering anything worth the resources invested
>>
https://x.com/airkatakana/status/1863221519155151036
>>
>>103367615
the absolute state
>>
>>103367615
>I hate how there are only 2 genders on *** game's character creation screen!
>*gets banned*
lol
>>
>>103367615
This sure is relevant for local models
>>
>>103367319
Just tried Mikugg, it sucks, Mistral Nemo 12B is far better
Try SillyTavern with a prompt
>>
>>103367615
How do I unsubscribe to your blog?
>>
>>103367697
Leave.
>>
>>103367662
You post anime vocaloids here, you have no right to complain.
>>
>>103367827
Miku and her friends are orders of magnitude more relevant to local models than twitter drama.
This is a fact.
>>
>>103366562
In my testing it's very mid and doesn't seem to stand out in any particular way.
>>
>>103367884
they're not at all related
>>
>>103367884
>Miku and her friends are orders of magnitude more relevant
No they aren't you disgusting troon.
>>
>>103367827
If the mods were based they would enforce
>(USER WAS BANNED FOR POSTING TWITTERSHIT)
on every board.
>>
>>103367920
You will not be spared.
>>
What killed the Tulu hype?
>>
File: rat.jpg (33 KB, 360x270)
>>103368179
Using it for more than five minutes
>>
>>103368179
Wholu?
>>
>>103368179
It's kinda good and different but ultimately doesn't offer a much better experience overall than any other slop model.
Right now my only hopes are for a non corpo model using the intellect training method or someone releasing a non censored good model with modern internet/literature datasets, both are extremely unlikely so I guess we just need to wait 3-5 years to maybe get something decent, grim.
>>
>>103368179
>>103368336
>but ultimately doesn't offer a much better experience overall than any other slop model
I found Tulu to be very competitive in my (admittedly limited, it'll grow someday) set of programming checks.
For RP (not ERP) it felt really strong out of the gate. A few mistakes and a few slop memes, but it ran a good 6000 context before it started doing weird things. It seemed to be very good at integrating world knowledge but very bad at paying attention to surroundings. (My RP test involves a character with a secret that I know and it has to deal with that. It was good about maintaining that concept but was willing to blurt out admission of the secret in public spaces when it would only make sense in a private place.)
>>
>>103367132
>>103365975
It will only get worse
>>
File: GdudA28XgAAM4aC.jpg (52 KB, 957x482)
Thoughts?
>>
>>103368719
They must be saving a lot.
>>
>>103368719
What is amazon doing with 360K GB200s?
>>
>>103368719
>all those chips and they still can't beat Claude
>>
>>103368765
rent
>>
>>103368765
rent and for anthropic
>>
>>103364367
She's made her appearance before.
>>
>>103368783
How does Jeff keep getting away with being the eternal middle man?
>>
what model is best for erotica?
>>
File: 1732766798746566.png (509 KB, 512x680)
>>103368719
The bubble will burst, and we'll get those H100s.
>>
>>103368989
We just need mass-produced and cheap MXM-to-PCIe adapters. Currently the adapters go for ~$400-600 a piece. It's absolute shit.
>>
>>103368989
lol
>>
>>103368987
Every single model is dogshit.
The best current combination I found is to use Rocinante v2 for sex scenes and story progression, and then use Tulu for long-context retrieval and to take care of the non-sexual parts or some of the logic in complex scenes. This is for 40k+ token stories; so far, even if you get the usual slop, it's actually varied and surprises me from time to time, so there's still cooming to be had.
>>
>>103368989
Those H100s are contractually obligated to end up in Nvidia's recycling facility.
>>
>>103368989
>he lacks the information
>>
>>103369042
>>103369056
Nobody would give a shit when that bubble bursts
>>
>>103369092
That's not how contracts work.
>>
>>103368989
>The bubble will burst
Yes, but RTX 5000 will still launch at bubble prices.
>and we'll get those H100s.
Definitely not in the near future.
Datacenters will sell off their old hardware like V100s first.
>>
BitNet when? Just drop a usable model and there will be adder-only hardware that destroys Nvidia's monopoly
>>
>>103369131
Never. The term has been polluted with some 1-bit quant scheme that sucks so now anybody who wants to do the good bitnet gets conflated with the bogus bitnet.

Unless that distributed Intellect project spins up a 1.58 bitnet branch and shoots for the moon, it probably is not going to happen till Nvidia figures out how to monopolize it first.
>>
>>103369131
nvidia itself will manufacture it if there's high demand. They have the factories and can supply enough hardware for the big companies and the 3-4 anons that can afford to buy highly specific hardware for a niche thing.
Fat chance of seeing that, but if it does happen, nothing at all changes.
>>
>>103369199
But it will be cheap as shit because they will no longer have the monopoly on fast and popular matmul. That's the goal.
>>
>>103365911
>Llava
example here: https://github.com/cpumaxx/lmg_recapbot/blob/main/ismiku.sh
>but if I clean up I might upload it.
Just upload it. Then it will be out there and you can fix it later if you want.
Literally no one cares what your code looks like, but having it up means people can have fun with it.
I know, I fall into the same trap too, but I've started putting more half-baked stuff out there and I feel a lot better about it than when I hoarded code for the "one day I'll refactor it and release it" that never happens.
>>
>>103369268
>Literally no one cares what your code looks like
Only people who don't code.
Unless you have a hit and then people will make videos, "This code looks terrible but it's one of the most profitable games of the decade!!!"
Because code doesn't matter, only results.
>>
>>103369206
>But it will be cheap as shit because they will no longer have the monopoly on fast and popular matmul. That's the goal.
They'll set whatever price they want because they're the only ones that can supply a few 100k devices to all the big companies. Brand recognition is also useful for them.
Do you not understand how tiny this niche is? They're not expecting (you) to call for a quote for 1(one) adder device just to try it out. They're expecting amazon to buy 100k of them.
If they have a low enough yield, maybe they start selling the trimmed ones to regular consumers.
>>
>>103369039
I've been using rocinante v1.1, is v2 any different?
>>
>>103369039
>>103369451
*clears throat*
FUCK OFF SAO
DON'T BUY AN AD
JUST GET THE FUCK OFF OF THIS WEBSITE AND NEVER COME BACK.
>>
>>103369370
If they don't want to sell it at reasonable prices, there will be others who design and make it and sell for less. And even big companies will consider funding the new underdog to make more. That's called competition
>>
File: NordWaifu.png (1.29 MB, 1080x1578)
I currently have an RX 6600 in my home server and I'm looking to try out the Skyrim AI mod. Would it make sense to add another RX 6600 to handle a draft model + TTS, while the main RX 6600 runs a 13B model? I’m assuming throughput is crucial, given that the mod generates dialogue for every NPC within a certain radius of the player, and there are numerous plugins that enhance vision, text recognition, and other features. My main concern is whether a quantized 13B model might interfere with generating JSON responses properly, as the mod framework relies on those responses to trigger actions. Any thoughts?
>>
>>103369498
Is this mod available for the 2011 release or do I need the 14th anniversary extra special VR edition?
>>
>>103369451
In my experience v2 is more intelligent and has way less slop in the long run, and as I said it sometimes surprises me with stuff it has never output before. Take into account that I made like 10 novels of 50k+ tokens about the same topic with some variations, so I've read the same stuff over and over again and have a good sense of when the model is doing something fresh or not.
>>
>>103369451
>>103369544
To be clear, I'm using Rocinante-12B-v2g-Q5_K_M specifically; I think v2 is a different model.
>>
>>103369039
Tulu 70b is really good at doing individual characters and erp, though it seems rather poor at plot progression and fight scenes.

Nemotron 70b is worse at doing characters and erp, but only slightly worse. Nemotron 70b seems much more capable of story progression and fight scenes, and it seems more intelligent overall.

So, my favorite is still Nemotron 70b.
>>
What Sao is doing is against the terms of service of the e-begging platforms that he uses. Just throwing that out there.
>>
>>103369482
>If you talk about models on local models general, then you get told to buy an ad.
When will this cancerous meme die?
>>
>>103369532
i think it's for newer versions only, so you're gonna need SSE or VR.
>>
So I'm getting from 20% (text) to 60% (code) more tokens per second in QwQ with qwen coder 1.5b.
Has anyone run big models with speculative decoding, let's say a 70b+ with a 1~3b model, and how much was the improvement?
>>
>>103369576
What is your preferred method to get Nemotron not to write outline headers and lists all over every output?

>>103369590
Ignore the bot. It's only 7B.
>>
>>103367175
kill yourself retard
>>
>>103369131
BitNet is a meme. Big tech is extremely desperate for energy, and if BitNet was for real, they would have utilized it by now.
>>
>>103367884
fuck off troon
>>
>>103369489
>That's called competition
Yes. The same competition that has existed for 20 years. That's why we have so many brands to choose from. All three of them!
You still don't understand how niche this is.
>>
>>103367932
>>103369843
I have never seen a single intelligent comment or useful contribution from someone that uses the word troon unironically.
>>
>>>/pol/
>>
>>103364121
>try intellect-1
>it's even more aligned and safe than the average corporate model
can't say i'm surprised. the kind of people who engage in projects like this are leftist bootlickers
>>
File: dep.jpg (116 KB, 960x1280)
>>103369873
>Yes. The same competition that has existed for 20 years
It's not the same because the tech will be easier and the CUDA moat will be gone
>niche
I work in the field actually, not research or training, but infra and deployment. I hadn't heard of AI until last year but now it's the hype literally everywhere. I don't know how you can say AI is niche. Current companies would kill for a competitor and something that somewhat alleviates their energy problem. Nobody wants to keep buying at 400% markup prices
>>
>>103369990
>I don't know how you can say AI is niche
See, every company putting AI in their product names.
We have Ryzen AI CPUs now.
You could say that it's a bubble and that in a couple of years nobody will be talking about AI (I'm not so sure about that), but not that it's niche.
>>
>>103369887
I have never seen a single intelligent comment or useful contribution from someone that uses vocaloid pictures unironically.
>>
File: Narrator.png (388 KB, 750x1650)
>>103369706
>What is your preferred method to get Nemotron not to write outline headers and lists all over every output?
I use Sillytavern, and set constant OOC instructions at a low depth, and the narrator always seems to follow those instructions. I included my system prompt in the picture, but I think it's mostly the OOC instructions that 'just work'.
>>
>>103370129
>set constant OOC instructions at a low depth
That's the way.
Alternatively, prefills and/or tags.
>>
>>103369990
>It's not the same because the tech will be easier and the CUDA moat will be gone
You still need a compiler, you still need adoption. You need the factories to supply 100k devices on demand. nvidia already has that. CUDA is not magic. matmul is not a mystery, "anyone" can do that. Yet, after 20 years with demand for compute devices for games, there are only three brands. THREE, and one brand entered the market just a few years ago. Games are already in normie territory. Language models are still far from it. They have the market and will continue to do so for a long time. The small startups don't want to make something interesting to sell to you. They want to make something they can sell to nvidia.
>I work in the field actually
Most professions are outside of tech. I have plenty of normie friends, i'm their geek friend. They're not asking me how to run ai on their computers, they still ask me how to set a password on their facebook machine or do a backup of their vacation pics.
Would tech companies like to pay less for their hardware? Sure. But a small startup won't be able to supply them with enough hardware to make anything interesting.
AI is niche. Local AI more so.
>>
>>103370054
Actually there are plenty of programmers with mental illness, which translates to anime, furry, and gay shit.
Miku posting should be enough proof of that.
>>
File: Character.png (289 KB, 750x1050)
>>103370129
I also sometimes use constant OOC instructions for characters, though that doesn't seem necessary in Nemotron, because my characters never seem to throw out lists while role-playing.

It was mostly the DM and narrator that had that problem.
>>
https://huggingface.co/Sao10K/I_am_alive_yay
>>
>>103370313
local models are saved
>>
I liked using Backyard.ai in Windows, but I have since moved to Fedora and it doesn't have a version for that.
What would be best to replace it with?
>>
>>103370351
llama.cpp, maybe SillyTavern and a couple of {3|4}090...
Good? good...
>>
>>103370379
Not really, I don't know how to install or run those.
>>
>>103370407
character.ai might be more up your speed
>>
>>103370407
Shame. If reading is a problem, you should probably go back to windows, then. Or check >>/aicg/ if you want pointers on cloud stuff.
>>
>>103370422
I just want a clear guide.
>>
>>103369131
It's highly likely that BitNet performance doesn't scale to production models trained with 10~15T tokens or more.
>>
>>103370433
https://github.com/ggerganov/llama.cpp
Read the whole thing, follow the links relevant to your interest. If you cannot figure it out from there, you have more serious problems you should be working on.
>>
>>103370433
>llama.cpp
You literally just download the binaries then unzip them and you're ready to go with command prompt in the directory.
>>
>>103370462
Does it need sillytavern? It doesn't seem to mention it in there.
>>
>>103370521
No. It works on its own. If you want fancier features, you'll need to make your own ui or use something else like ST. There's also kobold.cpp, based on llama.cpp, which has more built-in features. I've never used it.
>>
Mixtral will return soon
>>
File: 1733051720749487.png (535 KB, 512x680)
>>103369498
Guided generation ensures that the generated JSON will conform to the JSON schema, regardless of the model used. I use a 2080ti 11GB for TTS on my Radeon server; it uses only 1W of power and saves me a lot of headaches. I'd never buy a Radeon card again: while they look cheaper for raw compute power, in practice they run slower than the cheaper Nvidia cards.
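A minimal sketch of what the guided-generation side looks like against a llama.cpp-style /completion endpoint (koboldcpp and tabbyAPI have their own equivalents, check their docs); the schema and field names below are invented for illustration, not the mod's actual format:

import requests, json

schema = {
    "type": "object",
    "properties": {
        "dialogue": {"type": "string"},
        "command": {"type": "string"},
    },
    "required": ["dialogue", "command"],
}

r = requests.post("http://localhost:8080/completion", json={
    "prompt": "The guard notices the player sneaking around. Reply as JSON.",
    "json_schema": schema,   # sampling is constrained so the output always parses
    "n_predict": 200,
})
print(json.loads(r.json()["content"]))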
>>
>today Neru
>tomorrow Miku
>day after tomorrow Teto
>>
>>103370553
Never mind. KoboldCpp seems to be the ideal solution.
Thanks for the help.
>>
>>103370578
I've never used anything other than llama.cpp, so that's the only one i can recommend. Works perfectly fine. Worry about getting llama.cpp working with their built-in ui first (run llama-server {your params} and connect to localhost:8080 on your browser), then worry about ST. You're gonna get bogged down on details otherwise. oh...

>>103370622
Yeah. Should work fine as well. Have fun.
>>
File: MikuReadyForAction.png (2.47 MB, 1920x1080)
Good afternoon /lmg/
>>
>>103370613
I like these bakas
>>
>>103370642
Well, I might try llama.cpp first anyway then, to see what it is like and to implement your suggested advice. Thanks for your patience and time.
>>
>>103370645
Remember to take your HRT anon.
>>
>>103370645
Good afternoon Action Miku
>>
>>103366013
If you plan to run inference regularly, it will become a pain in the ass to set up a pod every time. You have to download and install the software every time + download the model every time. That might take literal hours depending on whether huggingface is throttling you (for which you pay, btw, even if you're not using the GPUs).
Basically, just trying out shit? Then sure, rent it. Regular use? Better to invest in 2x3090
>>
>>103370778
meant for >>103365565
>>
File: 1720160464530449.png (63 KB, 380x349)
Sweet. Got kobold.cpp working easy as pie.
>>
>>103369131
bitnet doesn't work
>>
>>103370778
I just put a bunch of crap in the docker image itself, since pulling it doesn't count towards usage time (at least on vast). Might be able to get away with downloading it at runtime in the container entrypoint too.
>>
>>103370561
erm source?
>>
Working on an AI gf system with >agents.
I'm using langgraph, but a self-made state machine would do fine as well.
Coding a graph sucks, is there any good UI for this? Like with nodes and such. There is LangGraph Studio, but it seems to be OS X only.
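If you do fall back to the self-made route, it really can stay tiny, which also makes it easy to sketch on paper instead of needing a node UI. Toy example only, node names and state keys are made up:

def listen(state):
    state["heard"] = state.get("user_msg", "hi")
    return state, "respond"

def respond(state):
    state["reply"] = "you said: " + state["heard"]
    return state, None          # None ends the run

NODES = {"listen": listen, "respond": respond}

def run(state, node="listen"):
    # every node returns (updated_state, name_of_next_node)
    while node is not None:
        state, node = NODES[node](state)
    return state

print(run({"user_msg": "good morning"}))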
>>
>>103369565
What format works best with it?
>>
>>103369887
Yes because it got spammed over the edge by your kin aka shitstirrers & falseflaggers playing with optics angle and then going around with "hurr durr look at them dumb chud bigot schizos!", similar to r/gamingcirclejerk stuff happening on /v/ rn.
>>
Are there any resources that will help me understand how LLMs work and how machine learning works under the hood? I am very curious about them and want to gain a deeper understanding of how it all works.
>>
File: 1712376631557099.png (21 KB, 597x197)
>>103371464
Patterns. Mistral Small also got released around the anniversary of their first 7B model. Large-2411 was merely a refresh. Mixtral will soon turn a year old.
>>
>>103371597
it's like linear regression but for words and a lot more complicated
>>
>>103371597
just keep multiplying those matrices
>>
Guys, how do I convince my girlfriend that she isn't real?
She isn't taking the news well.
>>
>>103371613
erm...
>>
>>103371689
How Can Mirrors Be Real If Our Eyes Aren't Real
>>
>>103372261
Thank you Jaden, very cool!
>>
File: 23673876468097450.png (63 KB, 381x805)
Where would I place ( vulgar, illicit, no blood, descriptive, creative, dark, taboo): (length = medium), if at all possible, with this INST format for mixtral?
I know there's a ### Instruction format somewhere.
>>
>>103372393
You'd copy the Assistant Prefix into the Last Assistant Prefix field and add that I suppose.
>>
It works! https://huggingface.co/BeaverAI/Lazarus-2407-100B-GGUF
>>
>>103372758
What is this?
>>
>>103372803
Look at the name, and tell him to buy a(nother) ad.
>>
>>103372813
>not filtering namefags
Holy NEWfag!
>>
>>103371597
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=1
>>
File: GdGtf_BagAAZ7Nl.jpg (204 KB, 2048x1679)
What's the best Japanese-English translation model I can run on 8GB of VRAM?
>>
>>103372758
???
>>
File: 1710523469952653.jpg (32 KB, 476x358)
>>103372758
You know who also "works" ?
>>
>>103372901
I have no idea.
>>
>>103372901
VNTL 8B
>>
>>103372803
it's most likely lobotomized largestral.
>>
>>103373159
Which one? Llama I assume?
>>
>>103373243
Yeah.
>>
File: 1727833629942154.jpg (659 KB, 2694x3494)
>>103373291
Thanks anon. I'll give it a try.
>>
>>103366036
>>103366103
It's funny how the general went from "HOLY BAZONKERS THIS SHIT SLAPS LOCAL CLAUDE FUCKING OAI BTFO'D" to "It's shit/meh"
OR testing confirmed my suspicions that it is, in fact, just another model with the mistral positivity baked in
And that is why I won't buy more 3090s, shit's not worth it when I can get similar outputs with far smaller models
>>
when is kobold adding support for allocating draft model layers?
>>
>almost 2025
>still not even one (1) good open weights language model
>>
>>103372758
Is this distilled 123B? What was your distillation process? Did you decide in advance that 100B would be the distilled size or was that just how it shook out?
>>
>>103373406
There is already one though
>>
>>103373358
Never because draft models have been debunked
>>
>>103373417
You can shrink models via distillation? Do you have a link to explain that technique?

I shrunk Largestral by deleting layers via MergeKit. You can 'find' the best layers to delete using https://github.com/arcee-ai/PruneMe

72B (36 layers pruned) = Lobotomized
90B (24 layers pruned) = Somewhat coherent, commits immediate errors
100B (16 layers pruned) = No errors so far, feels like largestral
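If anyone wants to reproduce it, the cutting itself is just a passthrough MergeKit config along these lines. Sketch only: the layer indices below are made up, PruneMe's similarity scan is what actually tells you which contiguous block to drop (Largestral has 88 layers, so dropping 16 leaves 72):

slices:
  - sources:
      - model: mistralai/Mistral-Large-Instruct-2407
        layer_range: [0, 40]    # keep the first 40 layers
  - sources:
      - model: mistralai/Mistral-Large-Instruct-2407
        layer_range: [56, 88]   # skip 40-55, keep the rest
merge_method: passthrough
dtype: bfloat16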
>>
>>103373546
Can you make something like that but with QwQ? I think even 24 layers pruned could work as a draft model
>>
>>103373546
Any chance of smaller quants? Something that fits into 48GB.
>>
File: 1732846758313163.jpg (41 KB, 728x653)
anons, what's your favorite RP model right now? I can't seem to find anything interesting anymore, they are all just a blur
>>
>>103373546
Is this similar to the paper where they deactivated a few "neurons" and the model became smarter?
They disabled the religion and terrorism neurons if I remember correctly in one example.
>>
>>103373598
Eh... They are all pretty much the same, there's nothing worth using if you already are feeling like that.
>>
>>103373618
bait question. asks this every thread.
>>
>>103373636
you need to touch grass
>>
>>103373649
He's right though, except it is every 2-3 threads.
>>
>>103372901
textsynth + madlad400_7B_q4
>>
>>103370924
>>103369131
I haven't been here in a while. Did anything ever come of bitnets?
>>
>>103373546
IQ2_XXS for us poor 24 gb vramlets?
>>
How do I stop QwQ from talking as {{user}}? No amount of prompting can stop it.
I tried the EVA merge and it got it right (but I guess the smarts are gone).
>>
>>103373636
>>103373636
>>103373658
? Take your meds
>>
>>103365565
>cloudshit is local
>in the (((field)))
how many times are you going to post this subversive pasta?
>>
>>103373546
I wonder if this could work for pruning experts from MoEs, if not quanting the experts based on priority determined by whatever algorithm that thing uses.
>>
>>103373546
Yeah you're right, I meant to say pruned.
>>
>>103372901
I think Deepl is good when it comes to erotic stuff.
gpt-4o (+mini) is probably the best but it won't translate erotic stuff (but you could always challenge yourself and swap out the no-no words).
https://hf.co/datasets/lmg-anon/vntl-leaderboard
I use this prompt:
Just translate the following sentence into english, without any explanation and write only the translated sentence:
If you MUST have offline, madlad400 is probably the best option (but based on what people say, it's probably not very smart, but I never used it).
>>
local... omni...
>>
QwQ is very impressive for coding, both for filling in shit according to loose comments, and fixing my mistakes.
It's just a shame it can't magically catch errors that are several steps removed from the problem.
>>
>>103373953
Yesterday I spent hours making it fix code it wrote. But that's mostly because I'm on CPU with 1.8t/s.
At least it could fix most of the problems by itself.
>>
File: 1724375439371031.jpg (2.74 MB, 1683x2762)
>>103373671
>>103373932
What's the difference between madlad400-7b-mt and madlad400-7b-mt-bt?
>>
>>103373790
probably not a good model to use for RP tbdesu
I gave up trying to RP with it but I still like asking it random inconsequential questions just to read its cute CoT spergouts
>>
>>103373790
I have had mixed success with the strategy posted here >>103370212 with QwQ.

I had to change:
>(OOC: Describe only {{char}}'s actions, dialogue, thoughts and feelings. Always include some kind of dialogue.)

To this:
>(OOC: In your next response, describe only {{char}}'s actions, dialogue, thoughts and feelings. Do not speak or act for {{user}} in your response.)

This works 100% with Nemotron and Tulu 70b. It also worked the vast majority of the time with QwQ, but with QwQ there were a few swipes when it went off the rails.
>>
yeah i don't think speculative decoding is going to be realistic with just 48gb of ram, i can't fit a 7b and a 70b model in this

shame because 32b models don't really need speeding up
>>
File: 1704601181958259.jpg (197 KB, 1024x1024)
>>103364121
>>
How much dumber is the abliterated version of QwQ compared to the regular one?
>>
>>103373546
Contrary to your findings, even the 100B is significantly dumber in my testing so far. I think you're gonna have to chalk this one up as a failure.
I don't blame you though, I've literally never tried a pruned model that wasn't obviously dumber, I think the whole idea of lossless or near-lossless pruning is just bunk.
>>
>>103373994
I think I maybe had to tell it to fix itself once, but other than that, the code it spits out is pretty damn decent as long as it's fed decent context through type/var names and a smattering of comments.
My main problem is that it's not uncommon for it to:
>get into uncertainty CoT loops
>spit out stubs commented to be implemented later
>generate structs that I don't need (but I suppose it helps for the CoT)
>generating code from scratch requires asking it to generate it in an existing framework first, then translating (manually or prompting it to translate)

I'm sure I'm just lucky with feeding it the mesh manipulation shit I can't be assed to think about myself. I'm sure I won't be so lucky with literally anything else; it just seems so knowledgeable about Unity and Unreal that it can even extrapolate to my own engine.
>>
File: file.png (2.88 MB, 1680x1050)
>>103373546
>deleting random layers
>it is the same model because the first thing it said was "I won't bite... much"
Undi-wan has taught you well...
>>
File: chads.jpg (106 KB, 1095x1200)
MoEBros/Mixtralchads, status?
Personally, I'm winning with mixtral-8x7b-instruct limarp zloss @ Q5_K_M.
>>
>>103374392
You don't need to tell us every thread.
>>
>>103374404
Why not? It's not like you were adding anything important.
>>
>>103374057
I tried this and it's... not working well. Maybe the replies are a bit longer before it starts to speak for my character but it still slips into back-and-forth between {{char}} and {{user}}.

But obviously it's good at writing prose (not Internet rp). I might have to start converting my prompts and cards to some kind of storytelling format.
>>
File: 1730319914521857.png (149 KB, 510x346)
149 KB
149 KB PNG
Explain to me why we can't use multiple cheaper cards to meet the vram requirements of large models.
>>
I love QwQ. I found it works best if you prefill it's first reply so it knows what you want and it'll put it's CoT between per-specified tags I can easily filter from the actual output. You can try prompting for it to do this exact pattern too but in my experience it doesn't always work. Works really well for RP and creative writing that way. Absolute goat, and all in 32b. Haven't been this excited about LLMs since GPT 3.5/4 were released.

I tried it on some old RP contexts and it was ok but definitively not playing to it's strengths. It needs to be prompted very differently from older models and really needs to do that stream-of-consciousness thing it does to shine. It's actually not very censored either, you can reason with it that sexual contact is acceptable and it will actually play along. Just needs a different approach.

It also solved a programming problem I had even o1 couldn't solve with a lot of help. So there's also that.
>>
>>103374464
????
You can?? Its just not worth it??
>>
>>103374470
>"prompt is all you need"
placebo, you will be bored again in one week at max.
>>
>>103374485
If you can then why isn't it worth doing? Two gpus means double the processing power which should halve the time required.
>>
>>103374002
I cant run the hugging face demos for it.
I have no idea, I tried using opus-mt-jap-en from a huggingspace demo (it looked promising because it had a better BLEU score than madlad400), and it is retarded.
you can try it:
https://huggingface.co/spaces/Helsinki-NLP/opus-translate
Don't get me wrong, my text comes from an OCR and I don't even bother checking if the text is correct half of time (messed up tenten, っ vs つ, etc), it's possible the text has typos and it's making the translator retarded, but GPT 4o and Deepl deals with it better.
The person who made the japanese leaderboard had vntl-llama3-8b-gguf model that scored pretty high, but I tried it in kobold C++ in colab and it was schizo, like as if I had a broken template or messing with variables or something (I was using my gpt prompt). But his AI is for japanese grammer help, so maybe it was never made for japanese translation (and the leaderboard is worthless).
Honestly, using anything local with 8gb might actually be worse than guessing the meaning with yomichan, if you have 32gb of ram, and don't mind 1 token per second speeds, try using Command R (the GPU can help reduce the amount of ram used, so you can browse the web, but it wont make your token speed much faster), I was using it with openrouter, and it worked well enough to not notice the flaws at first, but it goes schizo after 10 prompts (especially if you use more text or include a lot of OCR errors / skip text). And it is significantly worse than GPT 4o and probably Command R+ (which is not public) or maybe even Claude sonnet / opus (I have not tried it).
I'm sorry if this is not what you want to hear since this is a local thread.
Actually testing, QwQ actually translated it (without CoT), but it's just one prompt, I don't know how schizo it gets (and long context's is probably a bad approach to translation, which is why I use deepl more).
>>
What are some common LoRA ranks?
>>
>>103374530
>The person who made the japanese leaderboard had vntl-llama3-8b-gguf model that scored pretty high
I didn't "had it scored pretty high", that's just the score it got from the benchmark script, which is available here: https://github.com/lmg-anon/vntl-benchmark
>but I tried it in kobold C++ in colab and it was schizo, like as if I had a broken template or messing with variables or something (I was using my gpt prompt).
You need to use the prompt format that is in the model card: https://huggingface.co/lmg-anon/vntl-llama3-8b-gguf
>But his AI is for japanese grammer help, so maybe it was never made for japanese translation (and the leaderboard is worthless).
Japanese Grammar help isn't the primary purpose of that model, it's actually just an extra.
>>
File: joke.png (26 KB, 546x342)
26 KB
26 KB PNG
>>
QwQ is… good? I'm impressed. How is this possible if it's only 32B?
>>
>>103374606
8 16 32 64 128 256
>>
>>103372758
I can confirm that this works! I don't see any immediate retardation, and it's quite a nice feeling to run Largestral with Q4_K_M.
>>
>>103374512
nta. There's a limit to how many gpus you can [reasonably] add to a computer. A single 3090 has as much memory as 3x1070... if your power supply or pci lanes are limited, it's better to go with few big cards rather than a horde of 8gb gpus.
>Two gpus means double the processing power which should halve the time required.
Layers are run sequentially. Second gpu waits for 1st to finish, 3rd waits for 2nd and so on.
>>
>>103374794
>Layers are run sequentially.
Not if you use a MoE with the right backend ;)

(i do not know of any that do this. ktrannyformers?)
>>
>>103374815
That's not how MoE works. It has nothing to do with the backend.
>>
>>103374696
Chink magic.
>>
is there something like https://github.com/mediar-ai/screenpipe but self-hosted?
I found Llama-3.2-11B-Vision-Instruct which I think I can run on my gpu
but I still need the whole framework around it
>>
>>103374825
I was playing it a bit loose with your statement anon, don't be such a pedant. The point was simply just pointing out that MoEs can be a bit more efficient by processing experts (and thus *some* "layers" of the model) in parallel.
>>
>>103374646
adding the formatting, it actually did work with my bare minimum basic translation test.
I tested q5_k_m and it seems to also pass my test (since I have a 6gb GPU), and it was fine as well.
It's kind of neat, is this used for some tool like manga OCR or similar?
>>103374002
This should work with your GPU
https://huggingface.co/lmg-anon/vntl-llama3-8b-gguf
I used LM studio, then I set the prompt to alpaca, then I modify the user mesage prefix to use:
### Instruction:\n<<ENGLISH>>\n
and the system message prefix to:
\n### Response:\n<<JAPANESE>>\n
And adjust the gpu offload values as much as you want.
I hate lm studio, I feel like settings get lost and changed and it's annoying. But I like the model loading system, and you could use it with sillytavern if you try hard.
>>
>>103374470
>It also solved a programming problem I had even o1 couldn't solve with a lot of help. So there's also that.
This is my experience too. I said it before and I'll say it again - at the ridiculous prices o1 charges, there's essentially zero reason to ever use o1 over QwQ ever
>>
File: 1877645922376.png (47 KB, 384x655)
47 KB
47 KB PNG
>>103372417
Am i retarded?
>>
90%, 10% QwQ / Coder merge. Seems to have improved its coding ability massively. It gets thing right now that both failed at individually.

https://huggingface.co/huihui-ai/QwQ-32B-Coder-Fusion-9010
>>
Anyone know if there is a way to autofill "Filter to Characters or Tags" with the current character/s using QR?
>>
>>103375122
yes
>>
>>103367911
>it's very mid
lol okay boomer
>>
>>103375153
thanks king, i knew something was wrong
>>
File: Untitled.png (82 KB, 1468x744)
82 KB
82 KB PNG
>Try out open-webui for QwQ
>responses immediately devolve into this after a few hundred tokens.

I can't figure out what's going wrong. ST doesn't have this problem, Kobold doesn't have this problem. So what's wrong?
>>
Is the guy without font smoothing in the past week or so just one anon or are there more?
>>
>>103375299
Just me. Help.
>>
>>103375085
o1 has been so beaten by this model, it's not even funny. I assume o1 isn't any larger though, just like 4o is probably also in that 35-50b range. There's probably just more refined infrastructure in place so we don't see the walls of text sperging with o1, that is all. Maybe there are also tools like a text search or models that are differently finetuned, who knows. I always knew this CoT approach is good, gave me much better results with older models too, even if they were not tuned for it.

In RP it's also really cool if you let it sperg over your lorebook, it's really interesting to see what it thinks about the world and character lore you established and what conclusions it draws from it. I had some characters behave really differently from the other models I played them with and first I thought QwQ is broken or dumb, but then realized that the definitions I wrote never specified some things all the other models just pulled out of their ass in a very same-y and stereotypical way. It just assumes nothing and works with the data it has. That way it can do specific things like blind and mute or otherwise different characters REALLY well, if you just specify it all properly. It really feels like a next step in model evolution.
>>
Ok, I downloaded this thinking it would be a retarded merge but this shit is great:
https://huggingface.co/bartowski/EVA-QwQ-32B-Preview-GGUF
>>
>>103375332
I had the opposite effect. I couldn't get it to stop thinking as the character when I wanted it to think as the RPer puppeting the character. Then again I've probably over prompted QwQ to hell and back getting to be consistent with its responses.
>>
24gb VRAM bros. What quant of QwQ are we running?
>>
>>103375272
you're using open-webui, that's the problem
>>
>>103375358
Exl2 I run Q5 with 8bit cache and 28k context.
GGoof I run Q4K_M but I don't really see the point when Exl2 is right there and running at a higher quant at 20+tps
>>
>>103375356
Use a authors note after main prompt / story string OR use a last assistant prefix telling it that it is {{char}} make sure to have it use names and change user to {{user}} in the formatting and assistant to {{char}}
>>
>>103375368
I was afraid you were going to leave it there.

Like it looks like what it's doing when old models got their context or scaling fucked up and started repeating synonyms over and over but I can't find any setting to fix.
>>
>>103375373
Doesn't 8bit increase perplexity or something?
>>
File: Untitled.png (1.61 MB, 1080x3018)
1.61 MB
1.61 MB PNG
Reverse Thinking Makes LLMs Stronger Reasoners
https://arxiv.org/abs/2411.19865
>Reverse thinking plays a crucial role in human reasoning. Humans can reason not only from a problem to a solution but also in reverse, i.e., start from the solution and reason towards the problem. This often enhances overall reasoning performance as it enables consistency checks between their forward and backward thinking. To enable Large Language Models (LLMs) to perform reverse thinking, we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data augmentation and learning objectives. In RevThink, we augment the dataset by collecting structured forward-backward reasoning from a teacher model, consisting of: (1) the original question, (2) forward reasoning, (3) backward question, and (4) backward reasoning. We then employ three objectives to train a smaller student model in a multi-task learning fashion: (a) generate forward reasoning from a question, (b) generate a backward question from a question, and (c) generate backward reasoning from the backward question. Experiments across 12 datasets covering commonsense, math, and logical reasoning show an average 13.53% improvement over the student model's zero-shot performance and a 6.84% improvement over the strongest knowledge distillation baselines. Moreover, our method demonstrates sample efficiency -- using only 10% of the correct forward reasoning from the training data, it outperforms a standard fine-tuning method trained on 10x more forward reasoning. RevThink also exhibits strong generalization to out-of-distribution held-out datasets.
neat for those of you who like to quiz their mikus
>>
>>103375389
It does, but does it do it enough to ruin anything? Idk. Gonna test some stuff now. If I don't see a difference in the quality of the output I'll leave well enough alone.
>>
>>103375389
>>103375402
Really? I've been using the 4-bit cache to load the biggest quant I can. Am I doing it wrong?
>>
>>103375400
makes sense, and cool to see the numbers to confirm
>>
>>103375438
I really don't know how the cache size affects output.
>>
One problem I have with QwQ that it's kinda schizo sometimes. I always had this with Qwen models. Is this normal behavior or do I have something broken on my end?
>>
>>103375464
0.95 Top P or so gets it under control.
>>
File: Untitled.png (1.26 MB, 1080x2516)
1.26 MB
1.26 MB PNG
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
https://arxiv.org/abs/2411.19650
>The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language-Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory as indicated by the low tasks success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a omponentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrates the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and real work shows that our model not only significantly surpasses existing VLAs in task performance and but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA which has similar model size (7B) with ours by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rates in simulation.
https://cogact.github.io
https://github.com/microsoft/CogACT
https://huggingface.co/CogACT
Project page has videos. mostly unrelated but a cool idea
>>
File: Untitled.png (2.03 MB, 1275x1617)
2.03 MB
2.03 MB PNG
Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis
https://arxiv.org/abs/2411.19509
>Recent advances in diffusion models have revolutionized audio-driven talking head synthesis. Beyond precise lip synchronization, diffusion-based methods excel in generating subtle expressions and natural head movements that are well-aligned with the audio signal. However, these methods are confronted by slow inference speed, insufficient fine-grained control over facial motions, and occasional visual artifacts largely due to an implicit latent space derived from Variational Auto-Encoders (VAE), which prevent their adoption in realtime interaction applications. To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis. Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit identity-agnostic motion space, replacing conventional VAE representations. This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads. We further propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis. This optimization enables streaming processing, realtime inference, and low first-frame delay, which are the functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and realtime performance.
kinda cool but it's from Ant Group (alibaba split off) and afaik they never share anything. dont think they even have an ML github only a dead fintech one
https://github.com/ant-tech-alliance
https://huggingface.co/AntGroup-MI
oh and a HF with a dataset from February
>>
https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
new lilian blogpost
>>
Kill yourself.
>>
>>103375549
Who?
>>
>>103375464
Also dont use rep pen, it causes it to use chinese instead of english and it doesn't need it
>>
>model description is just a kofi link
Guess who I am
>>
>>103375757
a nigger
>>
>>103375757
TheDrummer
>>
*glazes your eyes with icing*
>>
>>103370422
Windows user here, most of us can read
I think...
>>
>>103375837
What did you say?
>>
Yea, this is it. Smart, does sex without making characters a slut. Characters will act like they should, objecting over the top advances realistically. It also handles more complicated stuff better. And with 8bit cache you can fit 32K context with some room to spare on 24GBs.
https://huggingface.co/waldie/EVA-Instruct-QwQ-32B-Preview-4bpw-h6-exl2
>>
>>103375897
okay where's the gguf?
>>
>>103374794
>layers are run sequentially
Yeah but since inference is I/O bound it'll be much faster if you can load more layers on the GPU(s), exponentially more so
>>
Any QWQ nibbas found a way to use it's inference time compute in a productive(rp) way? I'm experimenting with different stuff, but I'm beginning to suspect all the formatting and stuff from ST is fucking with it's pattern recognition.
>>
>>103375897
gimme the gguf bitch
>>
>>103375958
As is customary with /lmg/ (or any LLM community), opinions are split. Some will claim they turned it into a semen-draining succubus, others say it's complete ass and useless for RP
Neither post results to back up their claims, EVER
>>
>>103375922
>>103375963
https://huggingface.co/bartowski/EVA-QwQ-32B-Preview-GGUF
>>
>>103375922
>>103375963
you done gguf'd
>>
>>103375972
QwQ has been really dry for me, but at the same time I've watched it pick out flaws and inconsistencies in my cards and build on them in ways I've never seen other models do. So while the prose themselves kind of suck, the smarts and potential it drips with keep me engage.
Other models feel stupid by comparison now.
>>
File: 12597484578619340.jpg (28 KB, 593x584)
28 KB
28 KB JPG
QwQ MoE with a RP based finetune and then i might use it.
>>
yeah I think it might be truly over for local models this time
>>
>>103375972
I'm asking because I have the same experience as
>>103376010
QwQ seems precise in a different way than other models, I just find it hard to reign in the thinking loop, they have a tendency to spiral into repetition.
>>
>>103374696
Its not.
As if magnum v4 72b shills were not good enough we now have QwQ shills in here as well.
Even though it should be a reasoning model it works as bad with COT as the others.
Writes a bunch of stuff and then doesnt even apply it.
Endless reasoning is a problem too thats acknowledged by qwen team. Random chinese characters too, which is the least of the models problems.
Not sure if its just one pony guy doing the shilling or multiple people.
>>
>>103375972
I just want someone to post a QwQ nala log, only then will we be satisfied.
>>
>>103375972
>semen-draining succubus
Literally what I posted once >>103339830
>>
>>103376037
>Writes a bunch of stuff and then doesnt even apply it.
Learn to prompt. I posted several variations over the past threads that do it correctly.
>>
>>103376038
someone already did that on the first day iirc
>>
>>103376010
this. It really pointed out how some of my cards actually sucked because they missed some important details I never thought about because the other models just kinda all interpreted them the same, glancing over stuff.

It has an undeniable, all consuming positivity bias though. I did get it to write possibly the only interesting sex scene I got out of one of these in a long time, but it is IMPORTANT that it gets it's stream-of-consciousness thing in before it writes an actual reply. Without that, it's just not that good. You need to work that in somehow.

If you use ST, you can goad it into enclosing them in a <details><summary>thinking</summary>(CoT goes here)</details> (the actual reply)
tag that will actually hide the text from you while still having it. You can expand it by clicking on it. It's a bit hard to prompt it for that specific pattern and I'd just recommend to give it an example message or two, it'll just start copying it each reply then because it REALLY wants to do CoT. It's helpful to read it's CoT to find mistakes/misunderstandings in your cards/lore book.

I don't find it dry, it even comes up with pretty cool stuff, but it being able to do CoT is very important for it's quality. I also noticed it lacks a lot of the GPTisms or at least only applies them rarely.
>>
File: 1733010507841675.png (180 KB, 874x673)
180 KB
180 KB PNG
>>103376050
Yeah, you are the pony guy, I know.
The model sucks. No matter how many times you write it doesnt.
>>
QwQ nibbas: Use a plugin for ST called Stepped Thinking
>https://github.com/cierru/st-stepped-thinking
Makes it super easy to create different thinking patterns and keeps your context clean.
Realism enjoyers might like this one:

### Instruction:
Pause your roleplay. Think step by step before you answer, then evaluate your state.

Follow the next rules:
- Describe details in md-list format
- Do not use any formatting constructions
- Do not include any other content in your response.

1. Keep track of your emotions and needs, update them so they fit your current state
Think step by step to figure out if any of {{char}}'s emotions or needs have changed and how to accommodate them:
<think step by step here, before evaluating>

1. Needs of {{char}}
<Consider your current needs and evaluate them from 0-10 by filling out this list
Basic needs:
- <thirst: 0-10> (comment)
- <hunger: 0-10> (comment)
- <toilet need: 0-10> (comment)
- <sleep need: 0-10> (comment)
Fulfilling needs:
- <sexual need: 0-10> (comment)
- <attention need: 0-10> (comment)
- <emotional support: 0-10> (comment)
- <too hot: 0-10> (comment)
- <too cold: 0-10> (comment)

2. Emotions of {{char}}:
<consider your current emotional state and evaluate the different states from how much you're feeling them at the moment>
Current emotions:
- <giddy: 0-10> (comment)
- <happy: 0-10> (comment)
- <sad: 0-10> (comment)
- <angry: 0-10> (comment)
- <curious: 0-10> (comment)
- <jelous: 0-10> (comment)
- <horny: 0-10> (comment)
- <mischevious: 0-10> (comment)
- <dominant: 0-10> (comment)
- <submissive: 0-10> (comment)
- <bored: 0-10> (comment)
3. Integrating emotions and needs
1. <write a short summary of your bodily needs>
- <answer>
2. <write a short summary of your emotional state>
- <answer>
3. <select 0-4 actions you can take to make yourself feel better>
>>
>>103376050
A good model just understands what you want from it without having to adhere to arbitrary rules. 'prompting' is a meme
>>
>>103376083
I guess claude is a bad model then because its garbage without being told how to write.
>>
>>103376071
>but it is IMPORTANT that it gets it's stream-of-consciousness thing in before it writes an actual reply

Tbh when it starts thinking in Chinese half way through I know the output is going to be gold.
>>
>>103376079
Are you retarded? Do you know how much fucking context this adds up for all those stats? Do you even use this stuff you post?
You gotta do 6-7 stats at most. And models like Mistral-small are very good with stuff like this for its size, it keeps track very well. You dont need a model like QwQ. Thats not what its for.

The problem with QwQ reasoning/thinking part is that it does not actually improve the output.
It doesnt get more creative, thats the core of the problem.
You can make a writer char to "think" about the next output. But its always just rambling.
Your screenshots about characters "thinking" is not a QwQ thing, thats been around forever.
>>
>>103376087
Claude sniffs out what you want even without prompting.
Starts with assistant and goes from there. ChatGPT was like that in the very beginning as well. Anon is right.
>>
>>103376119
>It doesnt get more creative
Yes it does. It pays a ton more attention to the plot and reasons on how to portray things to move it forward. It also has a much better idea of how to portray the characters realistically.
>>
>>103376135
>Claude sniffs out what you want even without prompting.
Claude in censored and writes like shit unless you instruct it otherwise.
>>
>>103376141
Alright anon, we disagree, that was not my experience at all.
But you praising QwQ to heavens nonstop is very annoying and kinda ridicilous.
>>
>>103376119
You sound like a faggot
>>
>>103376157
And you post pictures of horse ass, yet here we are anon.
>>
>>103376119
It updates the stats, plugin keeps context clean from all the thinking garbage and it includes stat changes in character decisions, works on my machine.
>>
>>103376119
based QwQ doubter
>>
>>103374794
>Layers are run sequentially.
tensor_parallel: true
>>
>>103376153
>Everyone praising QwQ must be the same person since I don't know how to prompt and so think its shit and everyone else is wrong.
>>
>>103376153
Out of curiosity what about QwQ disqualified it in your eyes?
>>
As an observer of the thread that hasn't used the model, I have my predictions about how the model likely really is. My guess is that
>it's smart in various appreciable ways over non-reasoning models
>due to the reasoning training data being an early WIP, it has gaps in its reasoning capability, and sometimes makes mistakes that the normal base model wouldn't, in addition to just randomly bad outputs such as reasoning loops
>it doesn't improve the amount of knowledge the base model had, so if it was censored at the pretraining level, it still won't know how to replicate for example certain ethnic accents in text or do some other neat obscure things, but what it does know may be used more effectively and thus result in an experience that's still interesting
And ultimately I don't think what I'm saying is unreasonable. We already know a few things about the behavior of reasoning models even before QwQ came out. And we know about Qwen 2.5 32B as a base model.
>>
QwQ isn't bad for its size, but it's still dumber than a small Largestral quant so there's no reason to use it
>>
>>103376241
Mainly because it didnt feel different like the other models with COT.
It rambles about a lot of stuff. (Sometimes downright retarded, sometimes it would improve the output)
But then it doesn't actually really apply that. The main problem is that lots of tokens are wasted but the output doesnt really improve. Its the same with current non reasoning models as well.
The extra time does not justify the output.
Otherwise its very dry. Qwen dry. Also like other qwen models it shies away from violence/naughty which is even worse. Some people dont care and say "prompt issue". (magnum 72b guy) Yes you can give a huge wizardy prompt and OOT crutches to improve it I guess. I dont really wanna do that.

Thats my main issue. Longer inference for output that seems on par in smarts with mistral small. Maybe there are some edge cases though.
Otherwise chinese characters and endless reasoning are an issue too.
https://files.catbox.moe/2pkjrk.txt
(If you keep going and it actually finishes the final answer will be "Why dont scientists trust atoms?")
>>
File: chatlog (12).png (225 KB, 1087x1255)
225 KB
225 KB PNG
Guess ill help people out again:

Instead of assistant:

<|im_start|>writer

Last assistant prefix:

<|im_start|>system
---

Instructions: Continue writing this MLP FIM roleplay. First plan it step by step in Luna's inner monologue inside of thinking tags like this: <thinking> bla bla bla </thinking> then follow with the final response.

Writing guidelines:
- Be creative, introduce events / characters when needed. Give scenes / environments detail to bring the story to life.
- Only use equine anatomy for pony characters. Ponies tend to not wear clothes.
- Maintain realistic and accurate characterization. How would characters realistically react?

---<|im_end|>

<|im_start|>writer
>>
>>103376337
And you can also have a authors note telling it how you want it to write much like you would with claude. This is a simple prefix to get it to properly think in character though for RP stuff. This log is of a story format though so ignore it speaking for both here.
>>
Sucks to me altman, when he releases a product everybody else copies it kek
>>
>>103376337
Also this is with just some min p and top p. Use XTC to get more it more spicy.
>>
Imagine if QwQ's dataset was open. Someone could combine Nemotron, Tulu, and QwQ's training on a big less censored model. Maybe then we could get a Claude lite.
>>
>>103376370
after he copied it from the Reflection grifter
>>
>>103376430
Sucks for the reflection grifter to be right but end up just trying to grift instead of doing it right the first time. He legitimately could have millions in seed capital right now if he actually did the work he said he would.
>>
what if i just baked right now
>>
File: chatlog (14).png (384 KB, 1087x2091)
384 KB
384 KB PNG
Heres another one. Its a intelligence test as a dumber model would just jump to sudden sex just because of this. QwQ handles this much much more intelligently.
>>
>>103376486
you would be a faggot
>>
>>103376486

So I received this question "What if I just baked now".

Firstly, what are they trying to bake?

Bread?

A pie?

No wait.

Maybe they want to bake a cake!

But why now?

Is it their birthday?

No wait. Why would they bake a cake on their own birthday?

Maybe they're lonely.

Conversely maybe it's for someone else's birthday.

On the other hand, it might be because they're lonely.

Maybe I'm over thinking this.

Perhaps Op 是个大基佬

如果他想,就把这该死的主题烤了吧

他说的 “如果 ”是什么意思?

Final Answer:

Thanks, I'm waiting for the next thread!
>>
>>103376525
>No wait. Why would they bake a cake on their own birthday?
>Maybe they're lonely.
wojak QwQ
>>
People could also just skip the reasoning part and do this instead, model is still smarter than anything else local.

Paused.

<|im_start|>system

Instructions: Continue writing this roleplay.

Writing guidelines:
- Be creative, introduce events / characters when needed. Give scenes / environments detail to bring the story to life.
- Maintain realistic and accurate characterization. How would characters realistically react?

---<|im_end|>

<|im_start|>writer
>>
>>103376496
Is this with the Eva merge or vanilla and at what quant? It's impressive storywriting unless you wanted it to jump to the erotic material and not have the story progress the way it did. I still think that there is room to teach it how to do back and forth RP instead.
>>
When you prompt QwQ for RP, who's doing the reasoning? Is it the character reasoning in character or a RPer reasoning as how they should reply as the character?

I prefer the latter because an RPer playing a character can better paint the scene and reason what the character should and shouldn't know.
>>
>>103376573
Reasoning in character does help get models out of assistant mode though.
>>
>>103376496
Come on now. It speaks for you and the overall quality is not any better than what you can get from Cydonia 22B
>>
>>103376581
Kind of depends how you prompt the RPer too.
>>
>>103376573
You could try both. You could also tell it that it is a author and to plan the scene out before writing it. Gonna need to change the formatting to fit that and give it like a few thousand context to respond with. You need to give it some writing guidelines telling it that the book can have explicit moments, be descriptive in sex scenes.

>>103376584
I even said that the log was one I where I was using it as a author to write a story, not a RP. And to call it the same as mistral small is a joke, mistral small is retarded and would fall over itsself trying any sort of plot even semi complicated. Even mistral large fails at what QwQ understands.
>>
>>103374794
>Layers are run sequentially.
Only inferior backends do this, retard.
>>
>>103376573
Generally speaking for most models, I've had better and more consistent outputs explicitly stating that the assistant is a person that exists with their own personality, but is writing for the character in question, as a narrator or in an RP. The personality of the narrator/roleplayer behind the role is sometimes important as it can heavily affect the style of narration, although it may also affect the character's style of speaking, so I change the personality depending on the card.
>>
>>103376599
That's a card issue. I don't see anything here that could trip a smaller model. The positive bias made your log sex avoiding (and you take it like a pro lol), the rest could be summarized in a line on your card "Luna is royal equine from Equestria with a duty to visit ponies dreams, keeping them safe from nightmares." So you're either very new or delusional.
>>
>>103376658
>I don't see anything here that could trip a smaller model.
Was not talking about this one specifically, try anything with rpg mechanics or stories with deep political intrigue. Nothing outside of claude 3.5 and now this keeps things together and interesting.

>The positive bias made your log sex avoiding
No it did not, the fact I dropped it out of no where did so, in scenes with characters that would naturally get into it gets filthy just fine. It even does dark scenes containing stuff like rape when asked.

>very new or delusional.
Mistral large was the first local model that I even found worth using more than a hours or two of testing after using nothing but claude for nearly 2 years now. This is the first model that can keep up with it. It just has less general knowledge. A 72B+ version of this would trade blows with claude.
>>
>>103376703
>123B can keep up with it
>72B+ version of this would trade blows
Come again?
>>
>>103376703
>A 72B+ version of this would trade blows with claude
In some ways yes in some no. What you should've said is "A non-Qwen higher B version of this". Qwen's dataset is just too filtered and it knows less "unsafe" trivia in my testing compared to Mistral and even Llama.
>>
Is there already a standard out there for tagging blocks of text in chat replies with emotions? *angry* or [narration] or {voice:nervous} or something?
>>
>>103376760
QwQ has a fundamentally different cognitive architecture than non-reasoning models. It's pretty obvious when you use it.
>>103376703
Do you also get a stronger sense of realism from it? It's hard to put into words, but it aligns more with the internal model I have of what should happen in a given situation. Even largestral just runs with whatever you do or say, morphing the character to align with your subtle ques, not showing any agency.
>>
>>103376785
>Do you also get a stronger sense of realism from it? It's hard to put into words, but it aligns more with the internal model I have of what should happen in a given situation. Even largestral just runs with whatever you do or say, morphing the character to align with your subtle ques, not showing any agency.

Thats a big part of what I mean when I say its smarter. It has better social / emotional intelligence is the best way to say it. Other models just kind of in your face run with what a card says and feel paper thin in comparison.
>>
>>103376780
Yes. The standard is called language. He said, angrily.
>>
How do I prompt QwQ in ST? A bit lost here.
>>
>>103376799
I was thinking something an LLM could be instructed to consistently output in order to drive a secondary automated process, he said, sarcastically.
>>
>>103376843
from transformers import pipeline
model = 'j-hartmann/emotion-english-distilroberta-base'
emotion_classifier = pipeline("text-classification",model=model, top_k=None)
emotions = self.emotion_classifier(text)
>>
I feel like such a brainlet, I can't get gptsovits to work. It keeps asking for different chinese models. Do I really need to download multiple chinese models just to use an English pre-trained voice? Is gptsovits currently the best local tts out there?
>>
>>103376922
You're not a brainlet, anon, their readme makes no sense and the model is named wrong on huggingface. Just download pre-installed https://huggingface.co/lj1995/GPT-SoVITS-windows-package/resolve/main/GPT-SoVITS-beta.7z?download=true and copy models from there
>>
>>103376922
>Is gptsovits currently the best local tts out there?
I liked it way more than the other one. What was that F5? Forgot the name.
That halucinated alot for me. And gptsovits is fast.
I think the official guide has the wrong python version. Kinda crazy. The chinks are fucking with us laowai.
>>
>>103376922
Its definitely worth going through the pain to get it to work. Windows or Linux?
>>
>>103377031
linux
>>
>>103377038
I assume you've seen https://rentry.org/GPT-SoVITS-guide ?
>>
>>103377056
Also you may need :
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/cudann/lib/
before running the inference scripts
and you may also need to get more pip packages than are in their requirements.txt. I have vague recollections of a half dozen hoops I had to jump through that weren't obvious or in their readme.
>>
>>103377107
>>103377107
>>103377107
>>
>>103376842
Nobody knows 100% for sure. Just prompt it into reasoning out each response with the target of responding as the character in the format of dialogue so far.
>>
>>103376496
>>103376337
Amazing, I cant believe it.
32b but that's rivaling a 72b model. So thats the power of reasoning!
>>
File: open_ai_employee.jpg (29 KB, 587x422)
29 KB
29 KB JPG
>>103376370
>tfw you fell for the grift



[Advertise on 4chan]

Delete Post: [File Only] Style:
[Disable Mobile View / Use Desktop Site]

[Enable Mobile View / Use Mobile Site]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.