/g/ - Technology

File: 1766360935764827.jpg (1.3 MB, 4000x2252)
1.3 MB JPG
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108813392 & >>108805584

►News
>(05/07) model: Add Mimo v2.5 model support (#22493) merged: https://github.com/ggml-org/llama.cpp/pull/22493
>(05/06) Zyphra releases ZAYA1-8B, an AMD-trained MoE model: https://zyphra.com/post/zaya1-8b
>(05/05) Gemma 4 MTP drafters released: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4
>(04/29) Mistral Medium 3.5 128B dense released: https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
►Recent Highlights from the Previous Thread: >>108813392

--Paper (old): Activated LoRA: Fine-tuned LLMs for Intrinsics:
>108814192 >108814207 >108814326 >108814898 >108814962
--Debating the validity of PPL and KLD for comparing quants:
>108819872 >108819961 >108819996 >108820054 >108820084 >108820090 >108820124 >108820137
--Criticism of llama-server logs and discussion on token counting endpoints:
>108813998 >108814016 >108814027 >108814169 >108814729 >108814349 >108814458
--llama.cpp added continue generation support for reasoning models:
>108814696 >108814740 >108814801 >108814861
--Using Kokoro voice models with a multi-agent character system:
>108820291 >108820303 >108820316 >108820408 >108820447 >108820467
--llama.cpp development priorities and Firefox WebGPU support:
>108818249 >108818267 >108818284 >108818347 >108818276 >108819396 >108819670 >108820049
--Anon buys 4 Intel Arc Pro B60s for cheap 96GB VRAM:
>108817916 >108817941 >108817950 >108818620 >108819781 >108819795 >108818032
--Anons building minimalist custom frontends to replace SillyTavern:
>108815525 >108816950 >108815617 >108816045 >108815674 >108815699 >108815730 >108815935
--Comparing LibreChat and OpenWebUI with tips for llama.cpp thinking toggle:
>108816798 >108817231 >108817290
--Searching for MCP servers using vision and OCR for web search:
>108820282 >108820321
--nla.cpp for interpreting LLM internal states and Gemma 4 compatibility:
>108817420 >108817600 >108817615 >108817630
--Comparing Zoo Code and Copilot gateways for privacy-focused local coding:
>108814561 >108814606 >108814636 >108814653 >108814694
--llama.cpp added continue generation support for reasoning models:
>108813529 >108813601
--Preventing cloud models from collecting data in opencode:
>108818616 >108818870
--Logs:
>108813780
--Gumi, Miku (free space):
>108813423 >108818032 >108818052 >108819762

►Recent Highlight Posts from the Previous Thread: >>108813394

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
SEX
>>
worth buying one of these for AI? turing is old, but that VRAM buffer opens up a lot of possibilities. It's more VRAM than a 5090.
>>
>>108821104
Not for 2000 usd they're not.
>>
>>108821104
worth it? no, you can get like 4 p100s for around 1/10th of the price, and even a new full pc dedicated only to that would be around 1/4th or 1/3rd of the price
but it depends on your use case, if you have some extra limitations etc.
>>
>>108821104
Does Turing have FA yet?
>>
>>108821104
>2000 spanish real de ocho dólares
lmao
You can buy like at least 3x 3090s with that.
>>
>>108821145
the power draw would be insane. you'd need a 2000 watt power supply.
>>
File: Untitled.png (244 KB, 950x573)
244 KB PNG
>>108821145
I can't even buy two in my country.
>>
>>108821159
You don't, undervolt nigga, you have decently priced 1600w psu now, although they may run with 1200w with a heavy undervolt and a light system.
>>
>>108821187
You'd also need a server CPU since consumer grade CPUs don't have the PCIE bandwidth to support 3 cards.
>>
>>108821198
do you really need the bandwidth?
>>
>>108821209
won't the cards run slow as shit without it?
>>
https://huggingface.co/deepseek-ai/DeepSeek-V4-Lite
>>
since wikitext PPL is meaningless on instruct, would testing the quant on base models work out? trying to see how much i can get out of 4-bit quantization.
i'm downloading qwen 3.5 9b base for now.
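For reference, a sketch of how that base-vs-quant comparison is usually run with llama.cpp's perplexity tool; file names are placeholders and the KLD flags are as I understand the current llama-perplexity interface, so double-check --help on your build.
# 1) dump reference logits from the unquantized base model
./llama-perplexity -m qwen3.5-9b-base-f16.gguf -f wiki.test.raw --kl-divergence-base base-logits.kld
# 2) score the 4-bit quant against those logits (reports PPL and KLD)
./llama-perplexity -m qwen3.5-9b-base-q4_k_m.gguf -f wiki.test.raw --kl-divergence-base base-logits.kld --kl-divergence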
>>
>>108821166
Vram market in aus is absolutely crap. I imported my stuff from hong kong a little while before prices went apeshit, might still be worth taking a squiz on stores from there that ship to oz.
>>
>>108821166
I feel your pain
t. JP
>>
>>108821225
are they actually exchanging information? i thought tensor split didn't do that, but i may be mistaken
>>
>>108821290
>>108821273
>>108821166
xianyu and a forwarder
>>
>>108821225
I've had no troubles with x16+x4 tp2 3090s back in the exl2 days. Don't know how it'd affect llama.cpp. I run all my cards at x16 now, since you can get an h12d-8d mb with 4 gen4 x16 slots and an epyc 7502 combo for 400 usd in china.
>>
>>108821302
They killed my hard drives.
>>
>>108821225
NTA but not if you are running them in series for inferencing rather than in parallel.
>>
>>108821302
idk, never bought new, I just camp forever and sweep up local glitched deals.
>>
>>108821302
imma gook and i use xianyu+alibaba integrated forwarder
idk if that works for westerners
>>
>>108821353
Which variety of gook, like vietcong or muslim ones
>>
>>108821385
worst korea, the opposite of the best one
>>
>>108821166
The Gigabyte Turbo RTX 3090 is overpriced by like 100% and always has been. You literally just looked for that exact model to compare to be a contrarian shitbag.
>>
File: 1778766911316629.jpg (502 KB, 3992x978)
502 KB JPG
>>108821290
Tokyo? I have two 3090s collecting dust on the shelf. Haven't used them since Mistral Large times
>>
has anyone tested to see if gemma 4 is good enough for the minecraft ai companion mod?
>>
>>108821434
isn't minecraft a fundamentally spatial task that
llms can't really comprehend whatever is happening, even with an aggressive harness?
>>
>>108821455
i've seen them doing some stuff but building anything entirely on their own looked horrendous unless it's prompting the building with some sort of minecraft building genai model
>>
I feel sorry for the anon who wasted his money on B60s. I had a B70 for a weekend, what a piece of shit. It technically worked, but it took a ton of compromises. Here's what annoyed me enough to RMA it:
- had to use a custom kernel which supports the card (not terrible, you have to dick around like this for nvidia too)
- sycl was a pain in the ass, only supporting limited distro choices
- llama.cpp mmap was somehow fucked up, insisting on first loading the model to memory, which ruined my testing, since my test box only had 32GB
- couldn't get vllm to work at all

Since sending it back I hear Intel is giving up on consumer discrete GPUs entirely anyway. I hate paying the nvidia tax, but shit at least works. I say if you have the patience to play with a B70/B60 then you might as well just get a V100 32GB instead: you'll probably get similar performance, it'll cost less, and more things will be supported.
>>
>>108821455
Yes but so are the LLM's usually used to run it. The mod says you can locally host it but you need to have a REALLY high end model. (don't you just love vague niggers)I figured if Claudussy and chatgpt can do it then surely gemma might.
>>
>>108821461
Did you try vulkan?
That should work without much work right?
>>
>>108821461
They're giving up on gaming gpus since the money is in workstation cards for ai.
>>
>>108821461
>and more things will be supported.
Not really. It's already out of support, which means you can't use cuda 13 or the latest drivers, and most inferencing software besides llama.cpp has already dropped it as well.
>>
>>108821475
gpclaumini-opus-xhigh-max will fail to build a human grade moderately decorated base at the moment
>>
>>108821531
That's fine, I pretty much already figured this was going to be the case. It's just that with local models at the level gemma 4 is currently at, I don't see why I can't find a single youtube video of something like E4b being used to drive NPC behavior. You can do a lot with an LLM just by handing it some tools and still letting classical systems for visual input take over the hard work. An AI doesn't need to see a literal cheese wheel to pick it up in skyrim. It just needs to see the cheese wheel on some text-formatted list that gives it the option to pick it up.
>>
>>108821482
I thought about it, but I wasted so much time merely getting llama.cpp to work at all I'd lost patience for it and decided to cut my losses. A single B70 really isn't enough to run qwen 3.6 27b at q8 with full context, and I'm trying to save for a maxxed out M5 Studio when they're released - if there's at least a 256GB model.
I used to be a CPU deepseek-haver, back when no one wanted DDR4 memory and 512GB of it was only like $800. It worked, but far too slowly. I don't think I need to go that large, but it would be nice to run 300-400B-tier models at q4, and that's not really doable in 128GB shared memory.
>>
>>108821523
Correct on CUDA13 but I am pretty sure volta continues to get improvements on llama.cpp, no?

The best value I've gotten from nvidia was buying a modded 4090D 48GB. Ada is still relevant, the card has been totally reliable, and 48GB is enough to play with things like ltx-2.3 and 27b-tier models.
>>
File: workspace.jpg (142 KB, 811x1299)
142 KB JPG
If you have multiple GPUs try out split mode tensor.
My Gemma 4 31B Q8 jumped from 25 t/s to 36 t/s, IBM Granite 4.1 30B and Qwen 3.6 27B even go beyond 40 t/s.
The only downside is that it doesn't support context kv quant, but since it's faster than using speculative decoding I can kick the draft model and use the vram for higher quality context.
My server now sounds like Ramiel charging an attack during inference though.
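For anyone who wants to try it, a sketch of what the launch might look like. The "tensor" value for --split-mode is taken from this post and may not exist in mainline builds yet (current llama.cpp ships layer and row), and the model file name is a placeholder.
# assumes two GPUs of equal size; no kv cache quant flags, since the mode described above reportedly doesn't support them
./llama-server -m gemma-4-31b-q8_0.gguf \
  -ngl 99 \
  --split-mode tensor \
  --tensor-split 1,1 \
  -c 32768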
>>
>>108821557
i think you can do that with the openai compatible server on llamacpp
try it and let me know
i do like minecraft
>>
For anyone using ik_llama and GLM 4.X, try MTP. Went from 4.8t/s to 6.1t/s with --draft-max 5 and --draft-p-min 0.75.
Don't pull past this commit though, I'm getting a performance regression from one of the commits after.
https://github.com/ikawrakow/ik_llama.cpp/pull/1784
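A sketch of the invocation for anyone reproducing this; the model name is a placeholder, the binary name may differ in ik_llama builds, and whether MTP is picked up automatically for GLM 4.x or needs the linked PR's build is an assumption, so check the PR discussion.
# --draft-max / --draft-p-min are the values quoted above
./llama-server -m GLM-4.x-q4_k_m.gguf \
  --draft-max 5 \
  --draft-p-min 0.75 \
  -c 32768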
>>
Anon, what's the sota translation model for EN>FR... I want to translate a book on MJ but it's only in English.
>>
>>108821461
>llama.cpp mmap was somehow fucked up, insisting on first loading the model to memory, which ruined my testing, since my test box only had 32GB
i can't believe that's still a thing!
i got fucked by that a year ago with 2 a770s
>>
>>108821592
Probably won't be in this exact thread but I'll definitely set it up later today or tomorrow after some sleep.
>>
>>108821601
> using ik_llama and GLM 4.X, try MTP.
I use 4.6
Give your speeds, are you mostly on CPU?
>>
>>108821627
quote me on /lmg/ with the thread please when you do that
i am interested
>>
>>108821577
llama.cpp yes, ik_llama and vllm no. Image and video stuff was already dodgy due to the lack of flash attention support. There are no problems with drivers or cuda though, which is nice, but in every other way it's not much different from going amd or intel in that you're basically locked into llm inference through llama.cpp only, and they have very little use outside of that.
>>
File: 1774094521777896.png (63 KB, 659x171)
63 KB PNG
>>108821643
>Give your speeds
It fluctuates. I'm getting +1t/s basically, but it's free and it's a 20% boost.
>are you on mostly on CPU
Yes
>>
>>108821610
google/translategemma
>>
>>108821416
Price has gone up almost 2 fold, you should sell
>>
Wait, wasn't MTP speculative decoding merged into llama? Not seeing the option available.
>>
>>108821581
>mode tensor.
which gpu? that is for richfags with NVLink
>>
>>108821745
gerganiggers decided it will not be merged until all the poorfag platforms are also supported
has been dragging on for a week now
https://github.com/ggml-org/llama.cpp/pull/22673
>>
>>108821743
Too bothersome. I'm happy to sell them for 100k each if it's a hand-to-hand exchange
>>
price in japanese yen, obviously
>>
I know some industry fags browse here, can one of you Xtards remind elon that he's supposed to open source old grok versions? or just go ahead and publish it without asking him?
>>
>>108821752
Two asus 3090s. I don't have NVLink; both communicate over PCIe (x8/x8 mode) through the CPU. The topology is PHB to be precise.
I was looking into direct PCIe communication and P2P, but there is no support for 3090 for that.
But yeah, tensor split also works over PCIe.
>>
>>
>>108821899
Unless stuff changed recently, won't the pcie be totally overwhelmed by row/tensor mode chatter?
Was it really faster in tensor mode?
>>
File: wally feat.jpg (280 KB, 1041x1600)
280 KB JPG
What model size can an rtx 5090 with 32gb vram run all on its own? I am satisfied with like 7-10 tokens per second. I need to ascend from the online chatbot slop services that charge like 20 bucks for a model with 16k context tokens on top of LIMITED TOTAL MESSAGES (im not even joking, half the websites do that shit)

If I blow like 3k-4k on that graphics card, what type of model can I run at around those speeds and 32k context?
>>
>>108821899
>there is no support for 3090 for that
There is on some. Have you tried https://github.com/tinygrad/open-gpu-kernel-modules ?
>>
>>108821923
I block images in these threads, but if you're programming, qwen 27b at q6 can give you over 220k tokens of context at kv cache q8_0.
For everything else you can run gemma 31b at 50k context with full-precision kv, because it doesn't handle kv quants well. Both will do 90% of tasks; if you're doing rp and not precise work you can quant the kv for gemma and it will be fine.
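A sketch of the qwen setup described above; the model file name and exact context figure are placeholders, and note that quantizing the V cache generally requires flash attention to be enabled in llama.cpp.
# ~220k context with q8_0 kv cache; for gemma, drop the cache-type flags and lower -c to ~50000
./llama-server -m qwen-27b-q6_k.gguf \
  -ngl 99 \
  -c 220000 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0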
>>
>>108821949
Thanks, may I ask why you block images in these threads?

Also aren't those models usually really strict with censorship and break character often?
>>
>>108821949
>at full kv because it doesn't handle quants well,
Meanwhile, I'm having 0 issues with 3.0 bpw exl3 gemma 31 with q8 context
>>
>>108821915
Kek, been going well for you so far, I take it?

t. spent more energy than I care to admit telling gemma and qwen they were both retarded when they broke things.
>>
>>108822003
I get so fucking angry with the dumb shit it does
>>108822000
No idea what that is but if it's in llama I'll look into it
>>108821988
Like most people in this hobby we have jobs and families, I refuse to let them walk in on some degenerate shit while I'm talking about a passion
>>
Anyone know a VLM OCR model that gives near-instant results and is still really accurate? Gemma 31B is great, but slow, especially on large images at 1120 tokens. I tried Gemini 2.5 flash lite through the API and funnily enough it's not so flashy either, although it definitely is faster than my local gemma (I also tried Gemini 3 models but they all give me a rate limit error).
>>
>>108822038
You should use a MoE model for that in all seriousness unless you're willing to walk away for a bit
>>
>>108822038
Have you tried e2b/e4b gemmas? They are not so bad for small tasks
>>
>>108822046
But then I'd have to give up 31B as I can't load two big models... For my use case privacy isn't an issue so I'd be willing to use an API, but the free ones seem like a well that's dried up.

>>108822053
Oh true. I forgot those existed kek. Will try later.
>>
>>108822066
God, I need sleep.
>>
>>108822038
Why not try one of the OCR specific models? llama.cpp had a wave of adding support for a bunch of OCR models last month and most of them are tiny like 3B. HunyuanOCR is only 1B.
>>
>>108821922
Nope. I tried row before and it was obviously slower than layer because of the constant transfer.
Layer used GPU0 first and then switched to GPU1. With tensor both GPUs are now used at 100% at all times, giving me a 50-60% boost. I was careful to get a motherboard with two electrical x8/mechanical x16 slots and an amd 9950x with 24 lanes, and I monitored around 1GB/s transfer over PCIe during inference.
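For reference, nvidia-smi's dmon mode is an easy way to watch that yourself; -s t selects the PCIe throughput counters (rxpci/txpci, in MB/s) on recent drivers.
# per-GPU PCIe receive/transmit throughput, sampled every second
nvidia-smi dmon -s t -d 1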

>>108821944
I've read about that in
https://morgangiraud.medium.com/multi-gpu-nvidia-p2p-capabilities-and-debugging-tips-fb7597b4e2b5
https://morgangiraud.medium.com/multi-gpu-tinygrad-patch-4904a75f8e16
but I got somewhat deterred by the old driver.
Maybe I'll try it, though I don't want to destroy my working arch system with the custom kernel-module right now.
>>
>>108822100
Funny, I feel the same about new drivers. I have a dedicated machine for LLMs, though
>>
>Qwen 0.6B on Llama.cpp (CPU only, default args, 1024 ctx)
58 t/s
>Qwen 0.6B on Llamafile (CPU only, default args, 1024 ctx)
32 t/s

How is llamafile so friggin terrible? I was thinking of using it as part of a cross-platform app, but with a performance hit that bad I'm half considering just having it download all the major OS releases of llamacpp and routing to the binary for the detected OS.
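A rough sketch of that routing idea; the directory layout and binary names are made up for illustration.
#!/bin/sh
# pick a bundled llama.cpp binary based on the host OS, then hand over all arguments to it
case "$(uname -s)" in
  Linux)                BIN=./bin/linux-x64/llama-server ;;
  Darwin)               BIN=./bin/macos-arm64/llama-server ;;
  MINGW*|MSYS*|CYGWIN*) BIN=./bin/windows-x64/llama-server.exe ;;
  *) echo "unsupported platform" >&2; exit 1 ;;
esac
exec "$BIN" "$@"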
>>
>>108822100
>but I got somewhat deterred by the old driver.
Newer updates are at
https://github.com/aikitoria/open-gpu-kernel-modules
>>
>>108822131
It's a meme fork, there is basically zero reason for its existence.
>>
>>108822131
Jack of all trades, etc.
>>
>>108822131
Ask llm how to compile llama.cpp into APE with Cosmopolitan Libc
>>
>>108822168
Is it really that simple? I figured llamafile was just what happened when you try to compile it with Cosmopolitan Libc, did mozilla do something retarded in their build process?
>>
>>108821922
>won't the pcie be totally overwhelm by row, tensor mode chatter?
nvlink is faster, but pcie4.0 x8 and pcie4.0 x16 are fine
it's slightly slower at prompt processing with x8
makes a bigger difference if you run command-r or mistral-large
i've got 6 running at x8 now. with gemma-4-31b there's barely any difference if i use the nvlink bridged pair or random cards.
>>
>>108822188
no idea. ask llm
>>
File: 1773462512518368.jpg (123 KB, 892x1024)
123 KB JPG
My gf said "vibecode yourself into existence" and then broke up with me.
What did she mean by this?
>>
>>108822748
she's not real, get help
>>
>>108822748
In her view you were too busy vibecoding instead of "existing", i.e. being out in the world, socialising, connecting with her, etc.
Seems pretty obvious but I guess you are autistic
>>
>>108822748
do real things instead of playing with llms
touch grass anon
>>
Did the guy who wanted to make a space port actually make it?
Just wondering if it's actually possible.
>>
>>108822808
Funding has been secured.
>>
>>108822761
he vibecoded his gf, the breakup is simply a clue as to the quality of his vibecode
>>
File: file.png (85 KB, 1735x332)
85 KB PNG
>ggml.ai:80
>accidentally approved the request
It's over... They know...
>>
>>108823008
No context on what made the connection? Just disinfo?
>>
>>108823043
./ci/run.sh
>>
>>108823065
grep -R 'ggml\.ai' *
...
common/arg.cpp: "[(card)](https://ggml.ai/f0.png)", params.n_cache_reuse
gguf-py/pyproject.toml:authors = ["GGML <ggml@ggml.ai>"]
gguf-py/pyproject.toml:homepage = "https://ggml.ai"
pyproject.toml:authors = ["GGML <ggml@ggml.ai>"]
pyproject.toml:homepage = "https://ggml.ai"
tests/test-arg-parser.cpp: const char * GOOD_URL = "http://ggml.ai/";
tests/test-arg-parser.cpp: const char * BAD_URL = "http://ggml.ai/404";
tools/server/README.md:| `--cache-reuse N` | min chunk size to attempt reusing from the cache via KV shifting, requires prompt caching to be enabled (default: 0)<br/>[(card)](https://ggml.ai/f0.png)<br/>(env: LLAMA_ARG_CACHE_REUSE) |
tools/server/tests/unit/test_vision_api.py: ("What is this:\n", "https://ggml.ai", False, None), # non-image data

Gee, why is your computer running what you told it to run? It's crazy.
>>
>>108823132
I didn't expect it to phone home... Every person that ran the tests has been logged...
>>
>>108823183
So it was disinfo. Cool. Stop running shit you don't understand.
>>
>>108823183
who fucking cares
>>
>>108823232
Why do you want to be logged each time you run the tests? It's also in plain text, everybody along the way knows what you're doing.
>>
>>108823286
clearly at least one person
did you not get taught binary in school?
>>
>>108823286
why defend pointless telemetry
>>
File: smart.jpg (36 KB, 432x418)
36 KB JPG
I'm an absolute retard running gemma-4 26b on ooba's textgen. How do I stop the vision model from squishing the images down to 512px? It's supposed to go higher than that, isn't it?
>>
>>108823183
Every person that runs a test that downloads 20 files from huggingface gets logged. Crazy.
If you're that worried why haven't you been using a VPN this entire time? The ggml.ai ping is to test if it's connected to the internet before trying to download files directly from huggingface.
>>
>>108823338
"it's to test if you're connected to the internet" is the most retarded excuse for telemetry i've ever heard
>>
>>108823334
No idea on ooba. On llama-server you have --image-min-tokens and --image-max-tokens to control that. See if you have an equivalent or add it to the parameters if you're running the llama.cpp backend.
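A sketch of a launch that uses them; model and projector file names are placeholders and the sensible token limits vary by model.
# --mmproj points at the vision projector gguf that ships alongside the model
./llama-server -m gemma-4-26b-q6_k.gguf \
  --mmproj mmproj-gemma-4-26b-f16.gguf \
  --image-min-tokens 1024 \
  --image-max-tokens 4096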
>>
>computer. connect to the internet
>done
>why are you like this!?
>>
>>108823389
I just want to be able to connect to other machines on the information superhighway without those other machines knowing that I connected to them. Is that so much to ask?
>>
>>108823338
>The ggml.ai ping is to test if it's connected to the internet before trying to download files
That would be a bullshit excuse.
It's actually to test a function to download remote content, but you can test that against a local server. Or a random one not directly controlled by the repo owners.
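For example, a throwaway local endpoint is enough to exercise the download path, assuming you point the hardcoded test URL in tests/test-arg-parser.cpp at it before building.
# serve an empty directory locally and check it's reachable
python3 -m http.server 8080 --directory /tmp &
sleep 1
curl -sf http://127.0.0.1:8080/ >/dev/null && echo "local test endpoint reachable"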
>>
>>108823400
Just do it quietly.
>>
>>108823389
>>108823400
would you old decrepit faggots just die of old age already
>>
>>108823412
>That would be a bullshit excuse.
>It's actually to test a function to download remote content
He said the same thing. Are you ok?
>but you can test that against a local server.
Which one exactly? Do you want it to scan your network as well?
>Or a random one not directly controlled by the repo owners.
Sure. Let's make it cia.gov. It'd be funny.
>>
File: kld.png (197 KB, 1007x1400)
197 KB PNG
there's not much of a difference, is there.
>>
>>108823511
both insectoid-tier intelligence
>>
>>108823380
Thank you. There are no settings in the GUI yet, apparently, but those variables are exactly what I needed.
>>
>>108821610
Deepl
>>
>>108823574
i meant between left and right for each model, but yeah
>>
I started noticing my gemma4-powered companion chatbot ending a lot of messages with "I am curious [about followup detail]." What's weird is I didn't ban it from asking followup questions, just the "X? or Y?" formulation. Specifically:

It's ok to ask follow-up questions if they are natural (not solely for the sake of keeping the conversation going), however, never use any "X? Or Y?" form, e.g. "Do you want to X? Or Y?", "Are you X, or are you Y?", "Question X? Do you Y, or are you Z?" etc.

Maybe it's so X-or-Y brained that banning that is, to its mind, the same as banning questions entirely. Anyways I'm chalking this one up in the "sneaky little fucker" column.
>>
>https://huggingface.co/circlestone-labs/Anima/tree/main/split_files/diffusion_models
>anima-base-v1.0.safetensors
OH FUCK IT'S OUT
>>
>>108823966
I FELL FOR IT AGAIN
>>
My VR headset finally arrived in the mail. Was supposed to use it for software development but so far I've only had japanese women sit on my face.
>>
Is there any frontend which allows for writing novels? I know mikupad but it's abandoned.
>>
>>108824068
>nooo, i NEED to updoot the software!
>>
>>108824063
Try blade and sorcery
>>108824068
Mikupad is all you need.
>>
>>108824077
it crashes llama.cpp with gemma 4. might be user error I guess.
>>
>>108824079
Looks cool. Will dew.
>>
>>108824068
https://github.com/tealios/errata
>>
>>108824117
Why would someone use this "Bun, TanStack Start + React 19, Elysia, Zod v4, Vercel AI SDK v6, Tailwind v4 + shadcn/ui, Vitest" slop over a single html file
>>
>>108823966
Not sure if there's much difference. Maybe with more use.
>>
is it me or is the latest llama.cpp "re-reading" the entire context on each turn now?
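Related knob, from the server README row that shows up in the grep output earlier in the thread: --cache-reuse controls how much of the old KV cache gets reused via KV shifting instead of reprocessing. A sketch, with an arbitrary threshold, assuming the client keeps prompt caching on (cache_prompt in the request).
# reuse cached KV chunks of at least 256 tokens instead of reprocessing the whole prompt
./llama-server -m model.gguf --cache-reuse 256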
>>
>>108824192
>TanStack
Oh yeah. That thing had a massive hack yesterday didn't it?
>>
>>108824192
Better design + better UX + more features.
Novel writing software looks more like that than notepad.
>>
>>108824192
Keeping up with the latest FotM framework churn is the only job security webshitters have.
>>
>>108824246
"better UX" already lost me with the agentic nonsense lingo that is "fragments", just call it macros. I know claude orchestrated the entire thing judging from the "Stack" but come on.
>>
>>108824289
But it's okay when Orb does it, right?
>>
>>108824314
No I also think Orb is retarded both in execution (vibecoded) and the idea, why would you assume otherwise?
>>
File: old-ass-sdg-output.png (1.48 MB, 1280x960)
1.48 MB PNG
/sdg/ retard here with a 4070ti. Is Nemo + llama.cpp a good place to get started? Is improving performance as simple as "plug in more GPU and let llama.cpp scale resources"? I'm on linux for what it's worth.
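If you do go the Nemo route, the basic invocation is a one-liner; the quant file name is a placeholder and -ngl 99 offloads all layers to the 4070 Ti.
# single-GPU starter setup; shrink -c or pick a smaller quant if you run out of VRAM
./llama-server -m Mistral-Nemo-Instruct-2407-Q4_K_M.gguf -ngl 99 -c 16384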
>>
>>108824329
>luddite on /lmg/
>>
>>108824347
use qwen 3.6
>>
>>108824329
>why would you assume otherwise?
I don't remember anyone being hostile at Orb.
>>
File: 1755017220514339.png (274 KB, 657x1766)
274 KB PNG
>>108824068
Was vibeslopping my own with logprob selector, screenshot, autosaves, llama-server cache management, client-side phrase banning etc. but it shits its bed when trying to get a robust undo/redo system.
>>
>>108824348
>>108824373
>let's stall the output with several "agentic passes" that triples your time wasted instead of directly instructing the instruct model what to do
Not a big fan
>>
anything that uses those retarded cards is terrible slop for retards who can't write their own prompt
>>
>>108823966
Ah dammit, that's the next 16 hours down the drain
>>
>>108824390
You didn't really think that prompt engineering wasn't going to be automated into obsolescence, did you?
>>
>>108824068
https://github.com/akarshkashyap4-ui/NovelWriter
>>
>>108824450
They got agents writing the prompts for other agents, but they haven't figured out how to make agents that write the prompt for agents that write prompts for other agents
>>
What do you guys use LLMs for that isn't RP/ERP and vibe coding? Wanna expand my horizons. I've seen some people mention giving it access to obsidian but I'm not sure what you would even do with it.
>>
>>108824390
>that triples your time wasted
But it's okay when reasoning models do it?
>>
File: file.png (231 KB, 1263x1183)
231 KB PNG
nonlocal stuff but lol
they really quantized the shit out of the search ai thing huh?
>>
>>108824522
The agents ARE using reasoning as well, this must've been an epic zinger in your head
>>
>>108824522
NTA but a reasoning model doesn't force you to reload your kv when you're god only knows how deep in an RP.
That's the main dealbreaker with Orb for me, I can deal with slow gen times and waiting, but having to reprocess kv every single turn is hellish. Not for me. Still glad someone tried something new, even if it's not for me, though.
>>
>>108824532
I mean, they need it to run tens of thousands of times per second, it's probably both very small and extremely quantized.
>>
>>108824532
was expecting to see smooth criminal lol
>>
best model for anime?
also koboldcpp won't let me select a model, guess I'll just die
>>
>>108824532
https://www.youtube.com/watch?v=h_D3VFfhvs4
>>
anima v1 is huge
>>
>>108824559
...nothingburger
>>
>>108824545
But it was okay when SillyTavern forced you to regenerate the cache when it swapped a lorebook entry?
>>
>>108824567
Works fine for me since it's at a shallow depth
>>
>>108824548
i wonder how small it is
>>108824549
>>108824556
kek
>>
File: 1754911108976875.png (44 KB, 562x479)
44 KB PNG
>>108824495
I use it to manage my notes. I use obsidian. Shuffling information around, tagging notes, cross-linking references, etc. Obsidian is nice for that because it's all markdown, which is easy for models to work with and edit. It also has a CLI that's easy for agents to use and can control the UI to show you notes etc.

A workflow that I've adopted recently is that I have open speech-to-text transcription running pretty much all the time I'm alone in my room (aka most of the day). I'll just verbally muse through the things I'm working on or doing, and these get automatically appended in rough bullet form to my daily journal.

Then, at the end of the day, I'll review line by line with an agent, and dispatch it off to integrate any points of use into my general body of notes or task tracking.

It's a little tedious to start the habit but as a NEET I'm trying to keep a more ordered life. Forcing myself to think about things more deliberately helps me avoid wasting days away.
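The append step is about as simple as it sounds; a sketch with placeholder paths, reading already-transcribed lines from a file.
# timestamped bullets appended to today's daily note in the vault
VAULT="$HOME/obsidian/vault"
NOTE="$VAULT/daily/$(date +%F).md"
mkdir -p "$(dirname "$NOTE")"
while IFS= read -r line; do
  printf -- '- %s %s\n' "$(date +%H:%M)" "$line" >> "$NOTE"
done < transcript.txt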
>>
>>108824556
How were his music videos always so fucking good?
>>
>>108824665
Anything more than one single note that spans years and is full of checklists is "productivity hacking" bullshit that only wastes your time. It's the functional equivalent of female note-taking/handwriting in school.
>>
>>108824567
I don't use lorebooks. It started because of that, and then became a problem of the keyword not necessarily being tripped until something had already been said in chat that contradicts the lorebook entry, because it was spoken about obliquely.
It's why I'm of the opinion that a 'lorebook' system should be at post-history depth and use a structured vector search on a graph with configurable hops to avoid both problems.
It's a cunt of a thing to work on.
>>
>https://github.com/oobabooga/textgen
>gemma spazzes out in chat mode
>ignores my reply and just responds to the first message in chat-instruct mode
>can't use characters in instruct mode
Guess I'll go back to orb or ST until I figure out how to vibe code my own frontend...
>>
https://github.com/cchuter/llama.cpp/tree/feat/v4-port-cuda

I got this running on my 4090 + AM5 + saardows 11. 36T/s prompt processing. I am now at 3% preparation to being disappointed with the result of asking flash to write something similar to 400k tokens of hentai game script I pasted into it.
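For anyone else trying it, the build is the standard llama.cpp CUDA build pointed at that branch; repo and branch names come from the URL above.
git clone --branch feat/v4-port-cuda https://github.com/cchuter/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j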
>>
>>108824823
>11. 36T/s prompt processing
That is absolutely brutal, what quant are you running it at and with what batch size?
>>
>>108824823
v4 support is in?
>>
>>108824823
400,000 t / 36 t/s ≈ 11,111 s
11,111 s / 60 s/min ≈ 185 min
185 min / 60 min/h ≈ 3 h
>>
>>108824832
-ub 512 and default 2048

Btw is there some argument that makes it save checkpoints to SSD?
>>
>>108824823
It'll get better, MLA was also slow as shit for the first few months.
>>
i should get a 5090 for my main pc to do imgen while my server runs my llm
my 3090 is just too slow
>>
File: file.png (27 KB, 1177x208)
27 KB PNG
What drives a serious professional software developer to google the price of a GPU for someone in his repo?
>>
>>108824919
Would you have been able to make your post if he hadn't made his post first just a week ago? Hu? HU??? I don't think so.
>>
File: 1756225645109984.png (183 KB, 1410x888)
183 KB PNG
I like the idea of orb but it doesn't filter out enough slop to be worth the speed decrease desu.
>>
The coherency feels a bit better on average. However I've noticed that, at least on this prompt, it has a color bias, tending strongly towards blue, while the previous version for the same prompt had all kinds of hues it would output in a pretty equal distribution.
>>
>>108825012
what model?
>>
>>108824946
>Would you have been able to make your post if he hadn't made his post first just a week ago? Hu? HU??? I don't think so.
that's the real question
I apologize, but I will have to close this thread. Thank you for your effort.
>>
>>108825020
here >>108823966
>>
File: cringe-tps.png (29 KB, 800x160)
29 KB PNG
>>108824364
>>
>>108825048
maybe try the 35b then?
>>
>>108824665
Got to agree with >>108824686
Simple is better. I tried all kinds of note taking applications for productivity but they just added effort for no gain. Even markdown checklists are too much of a hassle because they quickly become messy and don't offer any kind of tracking or reminders. Even with your system, you still have to spend time every day reviewing your transcriptions, and it doesn't even spare you the mental burden of thinking about tasks you need to do later. You'll burn out eventually and give it up.
Doing it like my parents do is the only thing I found that worked: a simple spiral notebook, jotting down things as soon as they come to mind so I can forget about them and focus on what I need to do at that moment. The most review I do is at the end of the month, collecting the tasks I haven't completed and putting them into a caldav task list so I can prioritize them and set reminders.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.