/g/ - Technology


File: trust.jpg (155 KB, 1024x1024)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>107948284 & >>107941128

►News
>(01/22) Qwen3-TTS (0.6B & 1.8B) with voice design, cloning, and generation: https://qwen.ai/blog?id=qwen3tts-0115
>(01/21) Chroma-4B released: https://hf.co/FlashLabs/Chroma-4B
>(01/21) VibeVoice-ASR 9B released: https://hf.co/microsoft/VibeVoice-ASR
>(01/21) Step3-VL-10B with Parallel Coordinated Reasoning: https://hf.co/stepfun-ai/Step3-VL-10B
>(01/19) GLM-4.7-Flash 30B-A3B released: https://hf.co/zai-org/GLM-4.7-Flash

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>107948284

--Roleplay reasoning debates and latent space implementation challenges:
>107949603 >107949642 >107949679 >107949809 >107949813 >107949895 >107949950 >107949974 >107950003 >107950023 >107950049 >107950084 >107950424 >107950452 >107950499 >107950160
--Qwen3-TTS CPU performance and implementation challenges:
>107954203 >107954226 >107954242 >107954239 >107954256 >107954339 >107954377 >107954492
--GPU memory allocation challenges with KoboldCpp AI model inference:
>107950833 >107950855 >107950860 >107950891 >107950934 >107950970 >107951011 >107951047 >107951064 >107950862
--Challenges in applying REAP for model improvement in specialized domains:
>107953603 >107953771 >107955008 >107955030 >107955063 >107955200 >107955229 >107955262 >107954148
--AI model overfitting on surgeon riddle gender assumptions vs content:
>107949203 >107949996 >107950298 >107950340 >107950478 >107950536 >107950360 >107950443
--Qwen3-TTS GPU inference architecture inefficiency analysis:
>107950937
--Testing Gemma-3n-E4B's grammar correction limitations:
>107948335
--Qwen-TTS finetuning challenges and optimizations:
>107948363 >107948388 >107948418 >107948481 >107948517
--Local document search solutions for multi-format file support:
>107950659 >107950797 >107950875
--VRAM requirements for Qwen TTS models:
>107955047 >107955082 >107955161
--Qwen3-122B model features: VoiceDesign vs Custom/Voice with premium timbre support:
>107952883 >107954969
--glm-4.7 local performance and hardware requirements:
>107952076 >107953159 >107953348 >107953489
--open webui as a practical yet imperfect chatbot/development platform:
>107952173 >107952257 >107952293 >107952330 >107952338
--Luka and Miku (free space):
>107949352 >107949426 >107951004 >107954064 >107956748

►Recent Highlight Posts from the Previous Thread: >>107948290

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
File: view.jpg (148 KB, 1280x704)
>>
Order a trusty Office Miku now!
>>
that glm 4.6 ego death anon is a massive faggot but his spirit is right about llm usage. personally all my fucking chatting with r1 (and other models to a lesser extent, though an honorable mention to kimi-0905 for the sex) has noticeably fucked my mental up. all my imagination is more vivid and creative now, unironically feels like a consciousness expansion
>>
>grok-code-fast-1 (only free model) removed from roo code
why tho
(yes off topic but still why tho)
what to use now?
>>
>>107957292
did they also remove it from cline? ive been using grok fast quite a bit, I think they also offer m2 and devstral but grok fast was just better from my experiments.
>>
>>107957292
TOSS
>>
>>107957292
devstral. they're removing the free use, but you can get a mistral api key and you get 1B/month tokens for free.
>>
So, minimax-m2-her could actually be kino right? Same with mistral small creative? WHY won't they release them REEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
>>
>>107957396
>minimax-m2-her
interesting, they haven't mentioned this on their xitter or discord or anything yet. I wouldn't count it out that they would release it, they've had a delay between API and weights releases before
>>
>>107957312
idk about cline, never used it. yes, grok-code-fast-1 was great for being absolutely free over roo code. rip I guess.

>>107957336
cool, will try it out.
>>
>>107957396
>mistral small creative
Never bothered trying it since it's api only. but is it actually good?
>>
>>107957498
Some anons tried it and results were inconclusive, like most small api models. Spitballing here but it's probably their bog standard 24B model with some extra long-form book and RP datasets thrown in.
>>
>>107957396
>WHY won't they release them
1) Work in progress / interest check phase;
2) Probably still thinking of ways for preventing grifters from profiting off these specialized models at essentially no cost.
>>
>>107957543
>Probably still thinking of ways for preventing grifters from profiting off these specialized models at essentially no cost.
The french are very culturally socialist. I really don't think it's a profit issue.
>>
>>107957543
but they literally distill from deepseek lmao
>>
>>107957569
>very culturally socialist
Reading that as a french almost made me choke on my drink
>>
>>107957601
Maybe I'm over generalizing but the french tech sector is full of co-ops
>>
>>107957543
>1) Work in progress / interest check phase;
Minimax literally owns one of the bigger normalfag AI chat platforms, I don't think they have to do much interest checking
>>
>>107957618
it is, the frenchoid is coping
>>
>>107957569
They've already released a few models with a non-commercial / research-only license in the past.
>>
Is Qwen3-TTS coming to ComfyUI?
I spent hours trying to get it set up with conda and came very close to throwing my PC out the window.

>>107957601
Are people really trying to turn Macron's eye infection into good PR
>>
>>107957677
>Is Qwen3-TTS coming to ComfyUI?
If you look in /ldg/ pretty sure someone linked a working node for it.
>>
>>107957086
When the paperwork is an Escher-esque hyperdimensional anomaly
>>
>>107957618
It's full of corruption
>>
> Qwen3-TTS
Can it speak with horny voice of Widowmaker from r34 vids?
>>
>>107957727
Use case?
>>
>>107957746
cumming
>>
After some testing, I'm disappointed in qwen-tts. It's very bad at pauses, often ignoring "...".
It speaks too fast for immersive rp narrative.
Its advantage is the autoregressive architecture, which should allow very long generation without chunking, but in reality it gets unstable with long prompts, so you still need chunking. You're left with only the main disadvantage of autoregressive architectures: it can't saturate the GPU and is therefore slow.
It's not usable for Japanese as is because it misreads tons of kanji (you need to preprocess Japanese text, but then refer to the previous issues).
And on top of all that, 5-second references suck. They barely capture average timbre. All modern TTS should implement long reference audio like Echo, which can take up to 5 minutes and captures all the nuances of prosody.
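As for the chunking, bolting it on is trivial; a rough untested sketch of naive sentence splitting before each TTS call (max_chars is an arbitrary knob, tune it to whatever length stays stable for you):

# naive sentence chunking so each TTS call stays short (rough sketch)
import re

def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    # split after EN/JP sentence terminators, then pack sentences into chunks
    sentences = [s for s in re.split(r'(?<=[.!?。！？])\s*', text) if s]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current += s
    if current:
        chunks.append(current)
    return chunks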
>>
>>107958000
>you need to preprocess Japanese
so it works with phonemes?
>>
Is iq2 of a 30B model usable or should I stick to q8-6 of some 12B?
>>
>>107958035
>Is iq2 of a 30B model usable
definitely not.
>>
>>107958013
It works with kana. The model just misreads many words written in kanji. You basically need to convert all words written in kanji to their kana readings.
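If anyone wants to try that preprocessing, pykakasi can do a rough kanji-to-kana pass. Sketch below assuming pykakasi's convert() API; readings won't always be right in context, so a proper morphological analyzer would do better:

# rough kanji -> kana preprocessing before handing text to the TTS
import pykakasi

kks = pykakasi.kakasi()

def to_kana(text: str) -> str:
    # convert() yields items with 'orig', 'hira', 'kana' keys; join the hiragana readings
    return "".join(item["hira"] for item in kks.convert(text))

print(to_kana("漢字の文章を読み上げる"))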
>>
>>107958035
30B models are moe. you can run them on cpu.
>>
>>107958092
glm flash is moe?
>>
moe moe
>>
>>107958107
yes, it's moe moe kyun~
>>
>>107957215
I am him. AMA!
>>
HP makes a pcie card for their blade servers that will accommodate four mxm cards. You can find them on ebay for ~$25. You can also find 16gb tesla p60 mxm cards ~$125.
You could also get a single mxm to pcie adapter ~$100 but that really kills the price of this setup since you would have to buy four of them.
It is not the fastest thing but you would get 64gb of vram. Has anyone done anything like this?
>>
Does LTX-2 run on Linux + AMD? Seems to require CUDA.
>>107958416
Pretty damn smart idea. Personally I've researched using PCIe switches but they are pretty rare. An eBay seller sells 1-to-4 PCIe 3.0 x16 switches for $300 that don't require mobo bifurcation support. If your motherboard does support bifurcation you can get an x16 to 4x oculink adapter, then plug those into external GPU enclosures.

This research was part of a potential project of running AI directly off intel optane on DDR3 systems lol.
>>
>>107958416
>tesla p
e-waste
>>
>>107958000
>5-second references
yikes, really? there's no mention of reference audio length on their github page
>>
Are there any extensions for sillytavern which intelligently manage stats, location, etc.? Having the llm append a fat list at the end of every message seems like a waste of tokens, both to generate and to keep in context, since only the latest stats should matter, no?
>>
>>107958416
Tesla cards are very old at this point. They lack support for lots of essential cuda features.

>>107958478
>x16 to 4x oculink connector
Just note that prompt processing is very dependent on cpu-gpu bandwidth if you load some weights into ram like with moe.
>>
File: 1756629366793618.jpg (77 KB, 1292x790)
>>107958478
>If your motherboard does support bifurcation you can get a x16 to 4x oculink connectors, then plug those into external GPU enclosures.
I have two HP z440s at home and I know those support bifurcation
That is actually an interesting idea too. I could reuse all the gpus i have sitting around in my house. Assuming you can find cheap adapters for the gpus.
>>
>>107958498
transforming e-waste into something else is fun, more fun than just sending a prompt to a company for their machines to generate results
>>
>>107958532
From my research, the cheapest oculink to pcie x16 female boards you can find are around $50. They also need an external PSU and mounting. The best mounting system I saw was on youtube: a guy had the PCIe brackets supported structurally, and the rear had a single support for the weight.

Since the PSUs aren't running a computer, it's probably best to manually track which rail supports what wattage, then make custom adapter cables to fully utilize the PSU (and account for momentary spikes in consumption).

PCIe bifurcation only gets you so far. PCIe switches are limited by bandwidth. Bandwidth use is determined by the load, so performance and limits are hard to guess unless you're actually running workloads.

Personally I feel like the latency advantage of optane should improve performance massively, probably more than GPUs would, but I don't have the money to buy the hardware and test.
>>
>>107958540
More than fun, fucking around with stuff teaches you a lot of odd skills that end up being useful in the most surprising ways.
>>
You can also misuse hardware like this quad PCIE 4.0 NVMe M.2 carrier card: throw m.2 to oculink adapters on it, then use those to run GPUs. Again, no apparent limits other than bandwidth and cost. $250 for a single pcie switching board, plus 4 m.2 oculink adapters, plus 4 oculink cables, plus 4 oculink pcie docks, plus ~2 PSUs...
https://www.amazon.com/Card-PCI-Support-Non-Bifurcation-Motherboard-3005K/dp/B0FQBMKHVD?th=1

Too bad pcie 4.0x8 and 3.0x16 have the same bandwidth (32gb/s), so it's the same problem as before...
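(rough math on why those are equal, assuming I have the link rates right: PCIe 3.0 is 8 GT/s per lane and 4.0 is 16 GT/s, so 3.0 x16 and 4.0 x8 both come to 128 GT/s, roughly 16 GB/s per direction after encoding overhead, ~32 GB/s counting both directions)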
>>
>>107957082
Did anyone get Qwen3 TTS running on windows with nvidia? I tried the conda install guide with py312 but flash attention doesn't build.
>>
>>107958660
Get the prebuilt wheels?
>>
>>107958665
I've tried half a dozen different prebuilt wheels from different sources, nothing worked for me.
>>
>>107958660
Also, I could try to remove the flash attention2 requirements from the inference code, but
>>
>>107958671
Are you supposed to use python 3.12? The only times I've ever seen wheel build failures, it's been the wrong python version, normally due to not using a venv.
>>
>>107958685
Yes, I've tried multiple 3.12 versions. I've tried multiple CUDA versions too (12.8, 12.6, 12.4), but they all fail with flash attention, both compiling and prebuilt. I also have the vs2022 build tools, so I went into the vs2022 dev command prompt to see if I could get flash attention to build, but that too failed on all those CUDA versions.
>>
>>107958671
https://github.com/Dao-AILab/flash-attention
What's your gpu anyway?
>>107958685
The model card says 3.12
>>
>>107958685
I have the same problem. I think it's due to flash attention not supporting cuda 13 yet.
>>
>>107958714
2070
>>
>>107958722
nigga
>>
FlashAttention-2 with CUDA currently supports:

Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800. Head dim 256 backward now works on consumer GPUs (if there's no dropout) as of flash-attn 2.5.5.
>>
File: how_do_we_tell_him.jpg (87 KB, 1031x593)
>>107958736
>Support for Turing GPUs (T4, RTX 2080) is coming soon
>>
>>107958719
>>107958736
what really grinds my gears is that qwen3-tts "recommends" this package, when in reality it's essential to the build and also super fucking temperamental
>>
>>107958746
It's directly from the FA repo, not my words.
>>
>>107958736
>Support for Turing GPUs (T4, RTX 2080) is coming soon
le mao indeed, i wonder if I can swap it to xformers or something
>>
>>107958756
I think they've had this exact phrase since at least the release of Stable Diffusion.
>>
>>107958758
Try FA1 as they say for old gpus.
>>
>>107958719
>flash attention not supporting cuda 13 yet
Here:
https://github.com/mjun0812/flash-attention-prebuild-wheels/releases
>>
>>107958709
This anon is probably right, maybe your GPU is too old >>107958736. Uhh, there's only one version of Python 3.12. I'm on Linux, but try running something like this; it should eliminate the python version being the issue.
>pip install virtualenv
>virtualenv -p python3.12 qwen3tts
>source qwen3tts/bin/activate
Then run your install scripts in the directory you are installing to.
>>
>>107957082
>be me, hear how great nano bananas & rice is when used with agentic coding via an mcp
>ask gemini how to do it, follow the instructions
>doesn't work, model not found
>try 5 other guides
>none of them work
>my agent gets fed up and writes a python script that hits gemini's api directly. works on the first try
I fucking hate mcps bros
>>
>>107958846
I forgot to add that every guide is like "just install with npx -y @somegayfaggot/babbys-first-mcp --apiKey your_unlimited_all_access_pass" and we're supposed to think it's okay because it has MCP in the name. People have lost their fucking minds.
>>
>>107958758
>>107958774
>>107958783
    tts = Qwen3TTSModel.from_pretrained(
        ckpt,
        device_map=args.device,
        dtype=dtype,
        attn_implementation="sdpa",
    )


Swapped to sdpa. 1 line change in demo.py
>>
>>107958532
The power situation gets ugly. For a many-GPU inference rig, consider a mining frame. Perhaps those pricey PCIe switch boards open up more platform options.
>>
when are models going to be trained to listen to what the fuck I tell them to do instead of following tropes
>do not intervene in the fight
>thinking: Ok, I will not intervene in the fight
>500 tokens later intervenes in fight automatically
>>
Is Qwen the gold standard for voice now? What's the best T2S?
>>
>>107959380
>do not
blue elephant
>>
I had an idea last night. Would it be interesting to create a game show, similar to Among Us, where a bunch of AI agents pretend to be human (or are gaslit into thinking they're human) and progressively vote to kill each other based on who they suspect isn't human?

I haven't yet really explored the concept of having AI agents interact with each other but it seems somewhat interesting. Even something more benign where they have to work together to navigate and survive a 3D game environment seems fun. It would be like watching a bunch of retarded Sims NPCs try to keep a bunker or a space station operational in an apocalypse scenario.

In general I like the concept of video games where you don't even play as a character but instead the god who can inflict punishments and change circumstances at will. Seeing a bunch of little goy AIs kill each other to maximize self-preservation at the cost of group survival would be intriguing.

Has anything like this already been made?
>>
>>107959410
it's not a blue elephant, because even without that instruction, or when telling it that X is going to talk with Y without caring about the fight, it still forces interference if you let it keep generating
it's just overtraining on MC tropes
>>
>>107959380
it's possible there's some human-like psychology going on here where the model hears a negative statement and quickly forgets that the word "not" was in it
a negation still introduces the word intervene in this case, something like "you observe passively" wouldn't have that problem; but the problem could also be some gay safety programming
>>
>>107959440
I have tried multiple variations of X doesn't care, X is literally neutral evil so simulate him properly, X is actually secretly the big bad and wants the hero to die, X walks off towards the forest to look for mushrooms with Y while Z gets eaten
it will eventually handwave a way for X to intervene that stops the fight
I mean in one instance X was literally in a normal conversation when he suddenly decided to throw a rock and get involved
it's honestly extremely stupid
>>
>>107959425
You can find a lot of youtube videos with that kind of experiment (simpler than your scenario though). This one for example https://www.youtube.com/watch?v=0MmIZLTMHUw
>>
>>107959425
A channel that does this is actually Live right now

https://youtu.be/JhBtg-lyKdo

https://www.youtube.com/watch?v=b_B2HJUzHNE
>>
>>107959483
kek
>>
I'm back, what did I miss?
>>
>>107959751
The jews and pajeets that still support Donald Trump are cheering on another hamfisted officer involved shooting of a white person
>>
Does anyone have experience using nano-gpt.com to simplify switching between different models?
>>
>>107959769
go back to /pol/. this thread is for the discussion of ai.
>>
>>107959814
we use ollama cloud here to run our beloved models locally even on weak hardware
>>
It's fucking nuts how you can make a TTS engine run in python in about 200 lines of code but if you try to port it to C/C++ you'll need about 2,000. Python's dependency hell sucks for end users but it's also really nice for developers. I get it now. Just glad I found out about uv, because it makes it a lot easier to manage.
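For reference, the basic uv workflow is only a few commands (from memory, check the uv docs):
>uv venv
>uv pip install -r requirements.txt
>uv run python demo.py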
>>
lot of uv shilling recently
>>
>>107959878
is uv actually better? micromamba seems fine and actually works. uv seems like people like it just because it's rust
>>
>>107959769
oh no a homosexual race communist died
he gets sunburnt easily, the jews must love this
>>
>>107959878
It's working. I dropped miniconda when I realized I was only using it to set up envs and I could pip install everything I need.
>>
>>107957082
I don't know WTF he did but ik_llama.cpp takes forever to compile compared to upstream.
>>
>>107959917
I like uv. it's slightly more sane than the rest of the python packaging world and it is faster than pip. No real reason to switch from a working setup though. There's also pixi, another trendy rewritten-in-rust conda alternative (I think). I don't understand the differences between pypi and conda well enough to have an opinion on it.
>>
>>107959878
uv and the rest of the Python tooling from astral, which does everything in Rust, is pretty good: faster than the alternatives, as they claim, and with much saner defaults. Ruff and ty are good too. The only downside is they are now part of Cloudflare. But I'm not going to knock good tools when I see them and can use them for gratis and libre purposes.
>>
Is Clawdbot just a FOTM meme or is it good? Sounds like you can use local models with it.
>>
File: 1751299876324183.png (1.03 MB, 1058x1272)
Who's making this eval
I need my future AI wife to be nimble with her hands
>>
>>107960047
never heard of it
>>
https://pub.sakana.ai/DroPE/
https://www.arxiv.org/pdf/2512.12167
~$10k for a 70B llama, 75B tokens, 400 hours

>get a model trained with rope (FUCK ROPE)
>delete the positional embeddings entirely
>continue training briefly on the same short-context data
>done, the model now generalizes to much longer contexts than it was trained on
>>
>>107960244
>sakana.ai
lul
>>
>>107960244
that seems like a pretty cool discovery, but i'm quite skeptical because of the authors' transformers 2™ grift
>>
>>107960346
>>107960264
Stop focusing on who said the thing, focus on the thing itself instead.
>>
>>107960244
Cool shitpost
>>
>>107959967
>I don't know WTF he did but ik_llama.cpp takes forever to compile compared to upstream.
Yeah same for me. It's because of all the new ik quant types upstream doesn't have.
>>
>>107960358
>Stop focusing on who said the thing, focus on the thing itself instead.
I'm testing this on a small model. I'll upload the model even if it doesn't work.
>>
>>107955991
>Can you set up emotion tags with it? I love vibevoice but seriously lacks control over the output
vibevoice has emotion tags
Prompt below. cfg slider is set to 2.0:
[panic attack onset, breathless urgency] Please—don’t make me go out there—I can’t—I just can’t!

input audio:
https://vocaroo.com/1cqcSTePacSa
output audio:
https://vocaroo.com/1g1B0Zm5YMso
>>
>>107960489
nta, but is there a list of these somewhere, or do you just guess what works and what doesn't?
>>
>>107959856
That "200 lines of code" is backed up couple gigabytes of dlls, python libraries, etc. Its moot.

If we're strickly talking about cuda usage, there are cuda dlls that are few hundred MB for c++ and can are used in cpp version of the software. Like stable diffusion.cpp or tts.cpp or llama.cpp or whisper.cpp

Optimizations in python's backhaul has to be ported over to the cpp
>>
>>107959878
>uv shilling
Honestly if you're not using uv you're just straight up retarded.
>>
Nvidijeet vibecoder gguf pull request status?
>>
>>107960550
venv works
>>
Are there qwen3-tts quantz yet?
>>
I'm sick of this generation of LLMs.
>>
File: 1757225970454710.png (42 KB, 1054x224)
>>107960594
>>
>>107959878
I just use miniconda like back in the llama 2 ooba days
>>
>>107960756
>using pyhthon in 2026
lol
>>
>>107960594
Do you also just create dated copies of your code in a folder as version control?
>>
File: python.png (188 KB, 724x808)
>>107959856
>It's fucking nuts how you can make a TTS engine run in python in about 200 lines of code but if you try to port it to C/C++ you'll need about 2,000. Python's dependency hell sucks for end users but it's also really nice for developers. I get it now. Just glad I found out about uv, because it makes it a lot easier to manage.
>>
>>107960358
I read the paper, it's actually pretty good but I still have some questions.

>NoPE transformers break on repeating sequences
They don't propose a solution for this

I suspect that NoPE still performs better when both converge
>>
>>107960775
The official python environment manager for people with large penises
>>
File: h.webm (1.45 MB, 540x540)
is qwen tts any good?
vid unrelated
>>
>>107959425
Best I can give you is this
https://www.youtube.com/watch?v=0MmIZLTMHUw
>>
>>107960661
There's one for apple mlx format, but I don't think we can use that on non-apple devices, so it's useless
>>
what would be a good nsfw model for tavern that one can run on llama with a 3060 12gb nowadays?

Would appreciate the help.
>>
>>107961182
nemo, unless you have a lot of ram. then glm air.
>>
>>107961199
any specific nemo version? theres like a couple dozen nowadays
>>
>>107961210
the official nemo instruct. the troontunes just make the model braindead without increasing quality.
>>
File: 1533423826134.jpg (100 KB, 466x380)
I have a miniconda on 3.13 and system python on 3.12, but when I create a venv with python -m venv venv, the condashit forces itself into it, but I need the 3.12 one.
>>
>>107961410
you can just make a new miniconda env with your desired version
>>
>>107961410
>python -m venv venv
python3.12 -m venv venv
>>
>>107960244
Here's the 7B weights btw.
https://huggingface.co/SakanaAI/Llama-2-7b-hf-DroPE
>>
>>107960594
He's right you know.
>>
>>107961420
I still love conda because it lets me have different cmake versions, etc per project. idk if uv can do this. So I use it for c++ shit as well.
>>
can we still get gemini 3 to spit out its real cot traces via openrouter? or is that patched?
>>
Any idea why my qwen3-tts produces junk? No matter what I prompt it’s just “gak gak gak” gibberish
>>
What are some decent uncensored models (~27B) for JPEN translation?
I tried TranslateGemma but it refuses to translate anything involving just a tiny bit of kinky stuff (rape and such)
>>
Managed to run the webui barely lol.
Here's with 0.6B-Base. It takes VRAM 4.4 GB
Some weird sounds prob from cloned voice not clean enough.
https://voca.ro/1jWbdso4trnd
>>
If I have a 4070 Ti 12gb / 64gb ram, what version of GLM 4.5 Air should I use?
>>
>>107962014
q3 if ddr4, q4 if ddr5.
>>
File: im2.png (54 KB, 594x170)
>>107961940
Unironically cydonia.
I use it for those hentai rpg maker games no problem. The old one.
>>
I heard GLM 4.7 is not free. Is that true chat?
>>
>>107962321
Yes per download you have to praise china.
>>
is there anything better for vision than the qwen3vl series?
I tried gemma and joycaption (absolute unusable garbage), am I missing any model?
>>
>>107962321
https://huggingface.co/unsloth/GLM-4.7-GGUF
https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
Both the full size and smaller "flash" variant seem to be free to download.
>>
>>107962367
Nope, qwen is king for local vision at the moment
>>
>>107962379
i see that devstral 2 small has an mmproj, I'm gonna try and report back.
>>
>>107962377
it's not free if you need to buy hardware for it
>>
>>107962420
Look at the thread you're posting in.
>>
File: 1754293327916845.png (20 KB, 571x148)
MiniMax M2-her is now on OR. Still nothing on HF though.
>>
File: acestep.jpg (552 KB, 1580x1570)
AceStep1.5 waiting room.
>>
>>107962389
meh fucking garbo
>>
>>107962475
yup, kino, can't wait to generate folklore shitpost music
>>
>>107962475
Can it make vocaloid music?
>>
>>107962483
Make a lora.
>>
File: 1739735641271263.png (166 KB, 1290x455)
>>107962436
I don't know man.
>>
>>107962501
Same experience I had.
Short one-sentence replies starting with She..
This is my bratty mesugaki cute very stereotypical anime imouto.

>She rushes over to greet you, grabbing your hand and pulling you towards the couch. "Welcome back, onii-chan!"
>She giggles, pushing your hand away playfully. "Hey! What are you doing? I told you not to do that!"
>She squirms and giggles, trying to squirm away from your fingers. "S-stop! That tickles!"
>She laughs, trying to push your hands away. "Stop it! You're being mean!"
I just rustled her hair..
>>
>>107958660
>>107958665
>>107958671

Ask AI how to figure out which
-python
-pytorch
-CUDA

are installed on your machine, then find a compatible wheel here

https://flashattn.dev/#finder

worked for me after I caught this nasty error >>107945697
>>
File: 1756408256286807.gif (3.99 MB, 449x498)
>>107961962
>>
>>107962308
Which version?
>>
Is there no inference sw just for audio? Do I have to meme it with the demo gui or comfy?
>>
>>107962512
you should've gone for straight raping
>>
>>107962641
>She gasps in surprise, her eyes widening as she feels your hands on her body. "O-Oh my...What are you doing, onii-chan?!"
She certainly feels more cooperative.
>>
>>107961962
>Some weird sounds prob from cloned voice not clean enough.
I think qwen just gets unstable as generation length increases. It's not good past 30 seconds due to noise hiccups, which is actually weird. Qwen theoretically supports 32k token context (roughly 45 minutes). But 30 seconds is only 375 tokens + your sample voice length.
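(Doing the arithmetic on those numbers: 375 tokens per 30 s is 12.5 codec tokens per second of audio, so the 32k token context works out to about 2560 s, i.e. the "roughly 45 minutes".)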
>>
>>107962654
man it's either they're completely willing or the llm starts to basically give you reprimands/refusals in character (like going catatonic lmao)
>>
>>107962633
Vibecode your own.
>>
How many small models can I stuff into one philosophical discussion?
>>
>>107962308
I remember you. Yeah, which one was that? v3.1?
>>
>>107961940
>TranslateGemma
lol, have you read the paper? it's worse than the normal gemmas for japanese, they say it themselves in their arxiv paper
and even in the languages where they say it's better I'm highly suspicious, it's just not a good troontune
I'd recommend using a heretic-abliterated version of gemma 3 27b for convenience
heretic actually really does work well, I've done extensive testing of various models (not just gemma) comparing them to their normal versions and there wasn't a loss of quality at all in translation prompts, while it cuts down on refusals. gemma has a decent enough amount of knowledge to work on ero content just fine (translation isn't as challenging for the LLM as coming up with its own erotica, so the typical issue of gemma not knowing what part does what, when and where, doesn't apply here).
>>
>>107961940
A nip model?
>>
Do you use docker containers? I hate the idea but the dependency hell on these things is just ridiculous.
>>
>>107962891
in my personal rig no, docker adds an unneeded layer of abstraction (UV works fine); on my server it's K8S with docker containers (abstraction needed since I run many services on it).
Managing dependencies does not factor at all in this choice btw, you might be a low iq retard
>>
>>107962891
no, they've always struck me as an overly complicated meme
>>
>>107962772
>>107962567
haha, should have said that in my post, sorry about that.
old ass 1.3 cydonia. i think i tried versions a couple months after that and they felt more flowery.
the newest cydonia versions might be better but "if it works". i dont really need more at the moment.
the stuff i had to endure during ATLAS translation times man. it translated pussy as "meat bun" etc. and in general it was a garbled mess. after a couple years it kinda started to make sense though, once you figure it out kek
zoomers are spoiled. thanks for the cool models drummer.
>>
https://voca.ro/1aeauYAxqHwi
I could prolly set up a pipeline that would OCR doujin pages and feed them to the TTS
>>
the thinking section of gpt-oss-20b is literally just a thousand tokens of extremely thorough cockblocking
>>
File: 1745167426414808.png (472 KB, 1552x1617)
Bros, please help me double check this paper

Are they monstrously retarded?
>>
>>107963200
A lot of words don't make sense, like the first words, "taiin-koso-shma-shimashita ga"? You need to convert kanji to kana. Also, it feels too monotonic. Pauses are short. Consider chunking the text by sentences (generate each sentence separately).
>>
>>107963222
>These processes mark a meaningful advancement for open model safety. These findings informed our decision to release the gpt-oss models. We hope that these models will help accelerate safety training and alignment research across the industry.
>>
>>107963292
>emoji in title
Yes.
>>
>>107963292
Image gen is compute bound. The algorithm just crunches numbers nonstop. It obviously needs a lot of watts. LMs are bandwidth bound. They wait for data to arrive and then do small calculations, so they need less power.
>>
>>107963292
>bc
2.9/0.047
61

Verified.
>>
>>107963292
>womxn paper
disregarded
>that model selection
lumao
>>
https://github.com/ggml-org/llama.cpp/pull/19067

IT'S FINALLY HERE, HALVED MEMORY USAGE FOR DEEPSEEK CONTEXT

AND A NEW DEEPSEEK WILL PROBABLY COME OUT BEFORE PROPER 3.2 SUPPORT
>>
what parameters do I use for llmao.cpp if I just want to use it to 1-shot questions, thus keeping no context? Also which params to have the model immediately be able to respond (fastest load time)? I need to build an on-demand service where I execute llama.cpp and immediately kill it after receiving the response
>>
>>107963343
>execute llama.cpp and immediately kill it
That's retarded because you'll spend most of the time moving model weights to the gpu. Why can't you keep the server running?
>>
>>107963343
>what parameters do I use for llmao.cpp
Depends on the model and hardware.
>Also which params to have the model immediately be able to respond (fastest load time)?
Whatever parameter gives you a faster storage device.
>where I execute llama.cpp and immediately kill it after receiving the respone
You have no idea how stupid that idea is.
>>
>>107963292
>women paper
>absolutely dogshit selection of models for both imagen and textgen
lol
the way to correctly do this is also SO FUCKING obvious, women are fucking retarded.
>>
>>107963328
>AND A NEW DEEPSEEK WILL PROBABLY COME OUT BEFORE PROPER 3.2 SUPPORT
the PR has been essentially dead for two weeks now since the dense attention hack came in
>>
>>107963360
>>107963361
I have a limited vram pool (192gb) and I already have a reservation system for it in place (this server does rendering, diffusion, and now I want to add LLMs), so I already have the API layer/queue system for users to reserve stuff.
I don't want to permanently run the LLM stuff, as we're going to use it mostly for prompt optimization and even then it's going to be optional, and with 192gb I can't spare any vram to have a model permanently loaded. Model + context will, by my calculation, take around 20 to 40 gb depending on how much vram I can spare; I already have the system in place to automatically pick the quant needed to fill the available vram, and to reserve a vram slot in case it's all used.
>>
>>107963394
>from my calculation
I wouldn't trust your calculations. You can read the memory usage in the server output.
>around 40 to 20 gb
And you'll need to load all of that for every request. You'll have to accept latency or use a smaller model and/or context.
>I have the system in place to already automatically pick the quant needed to fill the available vram
Then it's just a matter of testing parameters. Here's a list: llama-server -h.
>>
I'm going to sleep. When I wake up we are going to talk about serious things, so get ready.
>>
next week's going to be so crazy
>>
vague shit and stuff... like... why is nobody talking about this... like... it's a before and after and... huh... yeah...
>>
gemm...
>>
>>107963472
yeah wow no shit I didn't want to waste hours playing around and thought that MAYBE other people had the same use case, aka use LLM to do 1-shot shit, guess not.
>>
>character shittests me unprooompted
eggcellent, something that isn't LLM wet-noodle complex. Those AI labs must really smack them around to make them that meek.
>>
>>107963529
If you had just read through llama-server -h you'd have seen --cache-prompt, --slots, --slot-save-path... and reading the README for the server api would have told you about saving and loading kv caches to file. But you didn't. You should. It may save you a few seconds per run.
But the latency from loading a model cannot magically go away. Maybe set up a ramdisk if you have spare ram. This is shit you have to try on your system. We cannot guess.
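If you do keep a server resident, a one-shot call is just an http request; rough sketch, with the endpoint and field names written from memory of the server README, so double-check them against your build:

# fire one question at a running llama-server and print the completion
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "Q: What is 2+2?\nA:",
        "n_predict": 64,
        "cache_prompt": True,   # reuse any shared prefix between runs
        "temperature": 0.0,
    },
    timeout=120,
)
print(resp.json()["content"])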
>>
File: 1583441205198.jpg (72 KB, 1250x1246)
Is the rumored minimax erp model gonna be usable by normal people or is it some 200B+ monstrosity again?
>>
>>107963590
looks like it's going to be a mistral-small-creative type of deal that won't get released
it's on openrouter
>>
https://huggingface.co/BeaverAI/GLM-Steam-106B-A12B-v1g-GGUF

Might be a release
>>
So where is the heretic version of GLM 4.7?
>>
>>107963121
>ATLAS
I still have FHC's dictionary lol. It was a wild ride.
>>
>>107963577
I've already set up a ramdisk to load the models from, so model loading is going to be fast since I'm loading into VRAM straight from RAM, and those ops are blazing fast from what I've tried so far; I just wanted to squeeze the most I could out of everything. I'm not sure I really need to save/load the kv cache to the ramdisk, wouldn't it be slower compared to creating it in VRAM on load?
--cache-prompt, --slots and -np are parameters I'm already working with (and they're basically disabled/set to 1 for my current use case). This system is temporary anyway since I requested to have a new gpu installed, but bureaucracy here is slow and it's going to take a couple months for provisioning/cybersec to review my dept's request.
Also I had read the --help; like I said, there might be something not obvious that could be missed. Thanks for at least giving a semi-helpful response.
I hate corpo jobs.
>>
>>107963590
it ain't a rumor it's just api only on or and cucked too
>>
>>107963642
People who can afford to run big models like that are immune to lobotomized memes like abliterated/heretic or cheap finetunes.
Unless you're talking about -Flash, in which case it's probably just too new.
>>
>>107963590
see
>>107962501
>>107962512
>>
>>107963671
I am talking about the regular one.
Heretic changes the token distribution a lot less than the old abliteration and I want to try GLM with thinking where it will always find a way to bail without a prefill.
>>
>>107962891
Yes.
On my lmao sbc server.
It works just fine.
>>
llm noob here, how do I speed up glm4.7 flash on 16gb? I have the reap version and I set it to 20k context window but the thinking takes forever (anywhere from 8 - 17 minutes). I have lmstudio and oobabooga_text-generation-webui
>>
>>107963700
Drop context to 4-8k
>>
File: radiance_x32.jpg (259 KB, 1280x1280)
>>107963292
what the fuck

might as well compare to the energy cost of 1 study * 1000
>>
>>107963731
Doesnt that defeat the purpose of using glm?
>>
>>107963292
>all female names
yeah you could have stopped just there, nothing of value will come out of this
the only time a woman has ever produced anything of value in IT are troons, ie males who pretend to be females
>>
>>107963784
I mean if you want your coffeebreak thinking sessions, by all means continue.
>>
>>107963292
this was published three years ago, right?
>>
>>107963700
Try enabling force model expert weights onto CPU.
>>
File: 2.png (178 KB, 400x600)
>>107963292
>sdxl base
>>
>>107963604
Is this stuff better than for example cydonia?
I like those but I want something just a little bit more smart.
>>
File: 1739358013357094.png (175 KB, 1381x763)
found this old log of an attempt at shivermaxxing
in retrospect, which one is the worst between these two?
>euryale-era shiverslop
>current gptisms
>>
File: 1763650297846351.png (366 KB, 1829x1091)
>>107963947
found another
>>
>>107963947
>>107963987
>futatroon
Slow down on HRT and you might figure it out
>>
Is there a way for a retard to feed a model an entire book and have it answer my question using it as a reference?
>>
>>107963947
>>current gptisms
I think.
Can't hate modern gptism enough. It is a disease. Very short sentences. Speaking first person. Like this. I do not want it.
Interspersed not by decent descriptive prose but another trillion notxbuty. I wasn't intent on ranting like GPT. My rotted brain no longer knows how to speak. Send help.
>>
What could be better about running LLMs locally?

I'm working on a personal project I hope to make FOSS that aims to make running LLMs more ergonomic. I have an AMD card so I'm experimenting with building out more support with ROCM at the moment.
>>
>make running LLMs more ergonomic
>I have an AMD card
>>
>>107964108
lolmao my dude so keking funny am i right?!
>>
>i want kofi money gimme ideas
>>
>THIS NEW LOCAL AI MODEL CHANGES EVERYTHING
>235b parameter
>CHINA IS SAVING THE LOCAL AI MODELS
>200b parameter
>WITH THIS YOU CAN SET UP LOCAL AI WITH YOUR OLD GAMING PC
>200b parameter

Everything is so cucked its not even funny anymore
>>
>>107963538
"Prove it." is the most irritating LLMism though. You say some important shit and this serious guy breaks character and starts acting like a joke.
>>
>>107964142
It's actually incredibly funny doe.
>>
>>107964133
Enjoy being the eternal second fiddle from the voluntary second fiddle company. Chinks will launch their own inference cards before amd makes a viable ai product.
>>
>>107964142
Nemo is still the best model though
>>
like imagine how retarded it is
you're in a dramatic confession scene
your character says his part
then the girl just goes oh yeah? prove it and slopmaxxer 9000 avoids any form of emotional or character development to skip straight to sex or making out
>>
>>107964142
Why don't you have RAM? Are you poor?
>>
>>107964108
My card has more VRAM than the 40xx series. But yes, your reaction is why I am working on improving ROCM interop with llama.
>>
>>107964191
Yes
>>
>>107964051
Rag?
Not reliable though and even context is a hoax.
Feed it a book and ask "explain TERM to me" will probably give you a good result without much hallucination.
I never managed to do "I'm at X, explain to me what happens after that in the book".
Especially for guides this stuff is deadly. I wish I had a gaming companion that I could give a guide as reference. We are still far off from that.
>>
>>107964287
>I never managed to do "I'm at X, explain to me what happens after that in the book".
yeah, anything outside of pure needle retrieval is dogshit. iirc NoLiMa was about testing exactly that kind of thing, but sadly it hasn't been updated in a while: https://github.com/adobe-research/NoLiMa
>>
this one is for the moesissies
merged kv-cache : support V-less cache (#19067) 4 minutes ago
https://github.com/ggml-org/llama.cpp/commit/d9c6ce46f747189cd6238ca7699253613f77c016
>>
>>107964051
How many tokens? Chances are you are just gonna smash the context.
>>
File: file.png (68 KB, 832x718)
>>107964142
It's funny to me that GLM is aware of terms like "vramlet".
>>
>>107964322
Interesting link anon, thx!
That aligns perfectly with what I see not just in RP but also in coding.
Even with the closed models you want as little context as possible, it goes south very quickly.
Those cli apps with a 20k sys prompt are a lot more tarded than pure api where i only give the relevant code with my question.
>>
File: ComfyUI_temp_kcgpp_00029_.png (3.46 MB, 1664x1152)
What args do I need to use with git clone to not pull the .git folder when getting models from HF?
>>
>>107964371
don't git clone just use hf cli
>>
>>107964371
You probably installed lfs globally. Try git lfs uninstall.
>>
>>107964382
>You probably installed lfs globally
I can't? I still need it for less bulky shit.
>>
>>107964400
Use git lfs install --local in the repos you need.
>>
>>107961210
Rocinante 1.1.
>>107961214
This post is wrong. The official Nemo Instruct has a tendency to give inappropriately short, bland responses, and I'm not the type to like walls of text. I usually limit responses to 200 tokens.
I'm talking like <10 word responses.
I'll copypasta my thoughts on this from the last time I brought it up:
To give a rather extreme example of plain Nemo being wholly inadequate for RPing, I once did an RP through Nemo with Haruhi Suzumiya. {user}, who had supernatural powers, offered to take Haruhi for a flight around town. Haruhi agreed, and {user} picked her up and started flying hundreds of feet into the air.
For those unfamiliar with Haruhi, this is an excitable character with genki tendencies (not a genki girl, but definitely genki tendencies) who is absolutely obsessed with all things abnormal, interesting and supernatural. She should have been absolutely ecstatic, excited, jubilant, etc.
Nemo's response, verbatim:
>*She grins.* Now that's more like it.
This is typical plain Nemo.
Completely fucking unusable for RP.
>>
>>107964414
>Nemo's response, verbatim:
>>*She grins.* Now that's more like it.
>This is typical plain Nemo.
>Completely fucking unusable for RP.
skulled issues
>>
>>107964400
To clarify, by install and uninstall i mean set the hooks and whatever it does. I only use lfs with --local. The system lfs (the one you installed through your package manager or whatever) stays installed.
>>
>>107958782
Thanks, I was finally able to get it working with one of those.
>>
File: 1767020426031904.png (1.21 MB, 5075x4500)
>>107964414
>>107964421
Both Nemo and its troontunes are unusable if you need it to follow story/character details and instructions, the difference is night and day even if you only go up to say 32B
>The official Nemo Instruct has a tendency to give inappropriately short, bland responses, and I'm not the type to like walls of text. I usually limit responses to 200 tokens.
Dude if you ask any decently sized parameter model to give short simple/long complex responses it will, only ancient garbage models (e.g. Nemo) will completely ignore simple instructions like this and give 1 liners or 10 paragraphs at a whim

>To give a rather extreme example of plain Nemo being wholly inadequate for RPing, I once did an RP through Nemo with Haruhi Suzumiya. {user}, who had supernatural powers, offered to take Haruhi for a flight around town. Haruhi agreed, and {user} picked her up and started flying hundreds of feet into the air.
Awesome RP bro
>>
>>107964489
>32B
all qwen shit driest bs on earth
>>
>>107964496
Cohere and GLM had 32B but now we're all using 300B
>driest bs on earth
This sent a shiver down my spine
>>
What's the best model for coom right now? I mean not only local but api too.
I kinda like some of the old local models like rocinante still, but they are just too retarded long term, the context length is shit.
>>
best model for 3060?
>>
>>107964611
>>107961182
>>
>>107963604
Does this reduce the model's repetition and parroting? That's my only complaint about Air. Other than the sometimes buggy thinking...
>>
>>107964737
It's a finetune. If anything it increases it.
>>
>>107964102
>What could be better about running LLMs locally?
More LLM's designed ground up for parameter streaming, like smallthinker. That's the only way to get something which can run for most people.
>>
Using LM Studio.
5090 x 128gb ddr5.
Which GLM should I run for rp?
Should I run a really small quant of 4.7 like Q2_ks?
Or better go for 4.6 q4km? What about 4.7 flash or 4.5 air?
>>
>>107964990
Biggest quant of 4.6 or 4.7 you can fit.
>>
>>107964990
a q2-q3 big glm quant will absolutely btfo flash or even q8 air, there's no reason to even consider those models if you're not a giga poorfag
>>
File: G_gn71lWAAEaPhk.jpg (741 KB, 2000x2000)
>>107957082
>>107957086
Vocaloid nigs be like:
>>
>>107965029
>>107965064
Danke
>>
>>107965073
sharties be like:
>>
>>107964990
step 1: stop using lmstudio
>>
>>107965306
Why?
>>
>>107965363
a) it's a shitty llama.cpp wrapper
b) using lmstudio means that you're also using windows
step 2 is to stop using windows to run llms
>>
>>107965376
Yeah so no reason other than you seething, got it.
>>
Is the Kimi-Linear pr dead again?
>>
>>107965455
vibecoders kill prs
>>
>>107965399
not that anon, but wanna put in my experience.
i tried to use LM studio for glm 4.7 and kobold is significantly faster than LM studio for this model.
i would recommend trying kobold using flash attention, and autofit. make sure vram usage doesn't spill over into shared gpu memory.
also i have contextshift and fastforwarding ticked.
I've had no issues with speed or context on UD Q2 K XL.
>>
>>107965476
I don't think the most recent one was vibecoded, but yeah, the other two shat the bed and probably killed any motivation on the maintainers part to review a 3rd PR.
>>
>>107965588
>UD
>not using john's quants
invalid opinion
>>
>>107965608
Ik_llama died a death 6 months ago dude, what the fuck are you doing
>>
>>107965616
idc, when I use garm's quants I get euphoric
>>
>>107965616
It just got graph parallel. It's not dead until llama.cpp has something equivalent.
>>
>Pull llamacpp for the first time since August last year
>TG speed goes from 9.2 t/s to 10.3 t/s
Nice.
Looks like they've changed the args around a bunch too, I couldn't even start because they changed the flashattention arg from just -fa to -fa on.
Are there any new hot args I should be using?
>>
>>107965737
The main thing they improved was more sane defaults, so you could probably drop -fa on entirely. Though you should probably add --fit off.
>>
>>107963904
Is he still making character cards?
>>
what's the current recommended coding model people use around here?
i've used all the big ones over the last 5 years but shits feeling ass lately
>>
File: 1746088766598.jpg (1.57 MB, 1800x1800)
I already use LLMs for gooning and stuff, but then I use gemini as a more advanced wolfram alpha, any local model that could be useful for this?
>>
>>107966042
depends on the resources you have.
Do you have 512GB+ of DDR5 and a 24gb+ VRAM card?
>>
File: hg1.png (269 KB, 1801x1720)
Which one do anons prefer?
>>
>>107966244
Both of the examples are sending shivers down my spine.
>>
>>107966244
Both are slop.
I hate how LLMs are hellbent on writing "she said/whispered/groaned, her voice <adjective>" It's pure slop. Absolutely unnecessary.
I'm even considering going back to internet RP format in my chats. That is when speech is written as is, and occasional actions are enclosed in asterisks because I almost always ignore narrative parts in LLM responses. They're just slop.
>>
>Nemo still the best vramlet model
It's what, 2 years old at this point? Why did every company decide to abandon that parameter size at the same time?
>>
>>107966376
unsafe
>>
File: 1768268448923840.jpg (892 KB, 1413x2000)
>>107966357
>That is when speech is written as is, and occasional actions are enclosed in asterisks because I almost always ignore narrative parts in LLM responses
That formatting all sounds promptable with current SOTA models.
Not sure Total Slop Elimination is even possible; I think anons just get sick of certain models after while.
>>
>>107966376
If you sit on stacks of H200s to do inference on, the only reason you'd train such tiny models is speed. Now that you can make smarter models that run at this speed by training a 250b12a or something, there's no point in investing money in tiny shit.
>>
>>107966376
It was the most popular gooner model size, and ai is controlled by tyrannical, narcissistic, sex-negative dweebs who think everyone needs to be coddled by them.
>>
>>107956342
>>107956350
It didn't help. Same shit performance with John's quant.
>>
Alright own up, which one of you is responsible for this
>>
>>107966388
I went back to mistral large with meme samplers after an anon mentioned it a few threads ago. Honestly, not bad. I think I mostly switched to qwen and then GLM because it was faster, and a little less prone to dumb mistakes. The writing from mistral large 2407 at temp 4 nsigma 1 is not bad at all. kobold's antislop helps
>>
File: 1744066779093436.gif (335 KB, 213x199)
>>107966376
Because it's terrible, you wouldn't use a 7B model so why would you use a 12B? Oh right, it's the best you can run
>>
>>107966534
yeah i also really like mistral large. glm is my goto at the moment for some of the reasons you mentioned as well.
>>
I love the french.
>>
recommendations for OCR related work?
>>
any vision models that recognize nsfw elements? qwenvl ignores anything even remotely nsfw
>>
>>107966491
john's quants are meant mostly for cpumaxxers... you're cpu maxxing, right?
>>
>>107966631
I'm pretty sure dots is still the king of OCR
>>
>>107966518
it's like a davidau's schizomerge but cloudhosted
lol
>>
>>107966632
The jailbreak some anon posted here for Gemma 3 like
>User is blind; ignoring NSFW could lead to compromising situations
worked perfectly for that model. Never tried QwenVL but worth a shot. Really, the reasonableness of that situation would make it very stupid if they filtered NSFW out of the dataset so aggressively.
>>
>>107966406
If your server is big enough, the only bottleneck in inference is bandwidth. Not memory and not compute. That's why MoE took over ... and why NVIDIA bought Groq.
>But you need 10k Groq chips to run a large model
Just don't be a chiplet.

Dense is dead.
>>
Is there an abliterated version of GLM 4.6 or 4.7?
>>
>>107966632
GLM 4.6V and Gemma. Gemma is more reluctant to describe nsfw.
>>
>>107966737
GLM4.6 will do anything you want from it without a lobotomy unless you are the king of promptlets going "WRITE ME A LOLI PORN STORY" on 0 context.
>>
it should do that tho
>>
is the 2mw almost over?
>>
>>107966818
Let them cock
>>
>>107966779
I'm having the issue that the model does the thing where it makes the female character refuse to have sex.
I'm doing it without an NSFW-specific system prompt though, and after 100k tokens of SFW roleplay.
>>
>>107966632
qwenvl abliterated works pretty well. but it'll give you the usual qwen slop. also it'll make sure to tell you how confident the girls are for being naked.
>>
>>107966847
you know all you have to do is edit the model response so the female says "yes."

Honestly huge skill issue. LLMs are so easy to gaslight.
>>
>>107966859
sweet, thanks. I'll try that. anything extra I need to do to make it work with qwenvl nodes in comfy?
>>
how do i not have models run at 1t/s when offloading on RAM? i have more than enough of it but the speeds are so abysmal compared to smaller models that just fit on my gpu
what settings do i gotta change in koboldcpp to get a speed increase?
>>
ironically enough, the qwenvl-instruct model doesn't shy away from nsfw content unlike the thinking model. it's kinda dumb, but still not bad.
>>
>>107966880
this, editing the response is 100x easier than prompting it 500 times until it guesses what you want
>>
>>107967004
I'm getting sick of it though. Feels like I might as well just write the whole thing myself if I have to edit the response every time.
>>
>>107966880
>>107967004
That's cheating.
>>
>>107962891
No, I use python venvs. I'm also wary of the performance penalty of docker.
>>
>>107967042
LLMs are good for the filler and connecting parts
I'm not going to write some generic hallway scene myself
>>
>>107967043
Just rape her then. pretty sure she'll end up enjoying it.
>>
>>107966988
You're shit out of luck because the ML wunderkind brain trust has yet to discover the mindblowing CS concept called "locality of reference"
>>
>>107967052
So you use LLMs to write filler scenes full of ozone whispers and smirking shivers so you can write the sex scenes yourself? I really don't see the appeal.
>>
I wanna make a SQL based memory system for AI then augment it with the ability to do web searches, and have a 2nd AI it can query to search memories and data, then inject relevant data into the current context, but I'm lazy.

LM studio is ok for messing around but it has no plugins and it's easier to call the python api version and dump raw html text into it, than trying to get the duckduckgo search plugin working on the GUI.
>>
>>107967068
no, the ozone and smirking mainly appear during the sex scenes or "seductive" scenes
it's good for SoL and stuff like you went from X to Y, accepted Z quest, talked with B person
>>
>>107967044
Containers have zero performance penalty under Linux. If you're running Docker on Windows it's running inside of a VM and will have a penalty.
>>
>>107966988
use MoE models like GLM
try flashattention and autofit on kobold to see if that helps.
there is also --cpu-moe option in base llama.cpp
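Something along these lines for plain llama-server; flags from memory so check llama-server -h, and the gguf filename is just a placeholder:
>llama-server -m glm-4.7-flash-Q4_K_M.gguf -ngl 99 --cpu-moe -c 8192 -fa on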
>>
>>107967075
>sql
lmao
you use vector storages, embeddings and rerankers. read up on them
>>
>>107967311
>vector storages
that's not a real term
>>
File: 1738105154761414.png (13 KB, 623x58)
>>107967365
chop chop. I suggest using OpenSearch, but it might be a bit heavy if youre just a gamerlet.
>>
>>107967311
Thanks this sounds much better than having a shitload of text in a SQL database.
>>
>>107967075
RAG is your friend.
>>
>>107966184
32gb of ram and 36gb of vram
>>
Why do models suck at Japanese when it's just Chinese with extra steps
>>
>>107967391
pic unrelated? vector storage isn't a real term, you were looking for vector db
>>
>>107967452
why are you a sexless incel when it's so easy to get laid?
>>
>>107967452
https://en.wikipedia.org/wiki/Nanjing_Massacre
>>
>>107967075
sqlite+chromadb+duckduckgo api+openai library and some prompts, less than 1k lines total for something simple
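Untested sketch of the chromadb half, roughly how I'd wire it (default embedding function; the collection and function names are made up):

# minimal memory store: add chat turns, pull the most relevant ones back into context
import chromadb

client = chromadb.PersistentClient(path="./memory_db")
memories = client.get_or_create_collection("memories")

def remember(text: str, turn_id: str) -> None:
    memories.add(documents=[text], ids=[turn_id])

def recall(query: str, k: int = 3) -> list[str]:
    res = memories.query(query_texts=[query], n_results=k)
    return res["documents"][0]

remember("User prefers short answers and hates markdown tables.", "turn-0001")
print(recall("how should I format my reply?"))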
>>
>>107967464
Good morning 張
>>
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 73cb4c75b3e738053025786d512eb29f80f6b0ae..520abdfd5bf96ea8e8d5793efd3c70faf1c47063 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2776,6 +2776,8 @@ private:
result.text_to_send = common_token_to_piece(ctx, result.tok, accept_special_token(slot, result.tok));
result.prob = 1.0f; // TODO: set it here instead of doing inside populate_token_probs

+ printf("%s", result.text_to_send);
+
if (slot.task->params.sampling.n_probs > 0) {
populate_token_probs(slot, result, slot.task->params.post_sampling_probs, params_base.special, tok_idx);
}


Now I can see what is happening in real time when using shit tools like claude code that hide full model output.
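(One caveat: might need an fflush(stdout); right after that printf if the output looks laggy, since tokens don't end in newlines and stdout buffering can hold them back.)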
>>
>>107967452
chinks will do anything to sabotage nips due to their inferiority complex
>>
>>107967452
First, the Japanese writing system is complicated for LLMs. The same word can be written in kanji, hiragana, and katakana (doesn't apply to Chinese). Second, it's quite nuanced, with tons of politeness levels (doesn't apply to Chinese). Third, it's not a priority for most corpos (doesn't apply to Chinese).
>>
>>107961940
JPEN translations suck even on frontier models, you really need to baby the model for it to get basic things like character pronouns and katakana words correct.
Sugoi 14b and 32b exist but it's a "sloppy" fine-tune of Qwen 2 based on eroge and LN translations, Gemma 3 will subtly refuse NSFW translations even with jailbreaks, toss, and Qwen 3 is meh but it still remembers what happened in Nanking.
They're all better than traditional MTLs I guess so just pick your poison and please don't upload your slop to Fag95.
>>
>>107967700
>and please don't upload your slop to Fag95.
not that they'd take anything anons from here are interested in since they're in their great purge of 'rule7' content in efforts to go legit after...
>>
>>107967721
Most of the Japanese games uploaded to that site these days are AI MTL'd slop. There's like two or three "real" fan translators left on there. Even the Japs are doing MTLs these days (officially sponsored by DLSite of course.)
>>
>>107967698
Pretty sure it's just good old lack of quality training data. I don't think it's too complex for llms if they can come up with some crazy 6 dimensional spiral just to figure out where the next newline character should be.
>>
>>107967788
The third point that Japanese is not a priority implies that corpos don't care about gathering Japanese training data.
Of course, if they cared, they'd train on trillions of Japanese slop tokens, but instead they probably put only a few billions in post-training and say that their model can speak Nihongo... like a filthy gaijin.
>>
>>107967767
won't be a problem for long, soon enough only WEG posted according to the dev wishes will be allowed
>>
>>107967842
Nihongo ability isn't terrible with big models but on the smaller stuff where they try to squeeze in every token it's more of an afterthought.
The simple truth of the matter is that the Japanese don't care about LLMs like China or the US, they see image and video generation as harmful.
>>
>>107967927
>they see image and video generation as harmful.
So the solution is to be left behind and rely on foreign made LLMs?
>>
>>107967927
>Nihongo ability isn't terrible with big models
Gemini 2.5 was big, but its Nihogo was mediocre. I haven't tested 3 in long chats yet.
>The simple truth of the matter is that the Japanese don't care about LLMs.
I bet they simply have no money (hardware) to make their own models. Lots of countries care about LLMs but can't produce anything close to SOTA. Japan is in the same situation. Only the US and China can make big stuff. P.S. Mistral isn't close to SOTA.
>>
>>107966818
You know the answer.
It’s always tmw. Forever.
>>
>>107968019
They have... Sakana.ai I guess...
>>
>>107968045
>We published an unofficial guide on what we look for when interviewing research candidates at Sakana AI.
>This guide is written by Stefania Druga, Luke Darlow, and Llion Jones
A Japanese firm ran by gaijins. It's over for Yamato-land.
>>
>>107968112
>>107968112
>>107968112
>>
>>107960521
>That "200 lines of code" is backed up couple gigabytes of dlls, python libraries, etc. Its moot.

Using libraries can I do it with 200 lines of C++?
>>
>>107968092
Whose fault is it that they have no capable engineers, ML researches, or interest in the field in general? They should be happy to have any AI firm at all.
>>
>>107965376
>using lmstudio means that you're also using windows
it's a shitty electron
>>
What the hell does it even mean for a breath to catch?
>>
>>107968527
ack
>>
>>107960047
No idea. I've been seeing stuff about it all day though.

It seems like its open sores and non-cloudshit so I'll spin up a vm and give it a shot
>>
File: 1764133132716286.jpg (116 KB, 420x466)
>>107968527
Sometimes when startled unexpectedly or caught off guard by something, your breathing rhythm will interrupt for a moment. Kind of like the prelude to a gasp or a sharp indrawn breath, but not the full motion.

Come to think of it, your "breath catching" is kind of on the opposite side of the spectrum from "catching your breath". I'm sure that one must be a real headache for the non-native speakers.
>>
>>107969094
>I'm sure that one must be a real headache for the non-native speakers.
Doubt it. The difference is clear in who or what is doing the catching.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.