/g/ - Technology

File: 1734333628687850.jpg (1.36 MB, 2832x3224)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>106851720 & >>106843051

►News
>(10/09) RND1: Simple, Scalable AR-to-Diffusion Conversion: https://radicalnumerics.ai/blog/rnd1
>(10/09) server : host-memory prompt caching #16391 merged: https://github.com/ggml-org/llama.cpp/pull/16391
>(10/08) Ling-1T released: https://hf.co/inclusionAI/Ling-1T
>(10/07) Release: LFM2-8b-A1b: Hybrid attention tiny MoE: https://liquid.ai/blog/lfm2-8b-a1b-an-efficient-on-device-mixture-of-experts
>(10/07) NeuTTS Air released, built off Qwen 0.5B: https://hf.co/neuphonic/neutts-air

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: GMeJWzxagAEqGEJ.jpg (240 KB, 1024x1024)
►Recent Highlights from the Previous Thread: >>106851720

--Building a quad-Blackwell Pro GPU workstation: case selection, storage, and hardware tradeoffs:
>106851941 >106851975 >106852028 >106852102 >106851976 >106852035 >106852055 >106852061 >106852114 >106852126 >106852875 >106852880 >106855669 >106852128 >106852349
--Modern Python OCR tools for complex layouts and multiple languages:
>106853256 >106853500 >106853539 >106853775 >106853784 >106855440
--Exploring transformer intuition through recommended educational resources:
>106852421 >106852439 >106852477 >106852494 >106852496 >106852617
--Optimizing large model inference on limited VRAM hardware:
>106853666 >106853668 >106853672 >106853751 >106853677 >106853695 >106853747
--Configuring AI models for first-person perspective through prompt engineering:
>106853298 >106853335 >106853437 >106853358
--Resolving model instability through sampling parameter and context window adjustments:
>106854051 >106854241 >106854285 >106854342 >106854348 >106854582
--RAG pipeline setup with Jan-nano or 30b-3b model for local information lookup:
>106851826 >106852206 >106852472
--Debating AI's societal impact, misinformation risks, and economic implications:
>106852252 >106852296 >106852330 >106852393 >106852718 >106852883 >106852910 >106852951 >106853025 >106853105 >106853201 >106853259 >106853325 >106855198 >106853093 >106852987 >106852950 >106852981 >106852329 >106854471 >106854882 >106854909 >106854916 >106854927 >106854928 >106854947 >106854923
--Speculation on Gemma 4 release and censorship/vision capabilities:
>106856066 >106856114 >106856117 >106856212 >106856533 >106856591
--Capital bubble critique of interconnected AI tech investments:
>106853688
--LM Studio adds ROCm support for RX 9070/9070 XT:
>106851854
--Miku (free space):
>106851744 >106851941 >106852453

►Recent Highlight Posts from the Previous Thread: >>106851726

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>106857386
>>(10/09) server : host-memory prompt caching #16391 merged: https://github.com/ggml-org/llama.cpp/pull/16391
Why would I use this?
>>
>>106857402
It's explained in the PR comment.
>>
>>106857402
It's like zram but for llm.
>>
>>106857402
To reduce prompt reprocessing.
>>
File: tmp.png (331 KB, 970x546)
>>106857386
>https://www.techradar.com/pro/this-mini-pc-has-192gb-of-ram-yes-ram-but-thats-not-the-most-surprising-fact-about-it-the-orange-pi-ai-studio-pro-uses-a-huawei-ascend-310-thats-on-paper-7x-more-powerful-than-amds-ryzen-ai-max-395
>the Orange Pi AI Studio Pro uses a Huawei Ascend 310
>$1,900 for the 96GB edition, with the 192GB model costing about $2,200
>>
>>106857498
>$10k to run kimi k2 at full precision
talk me out of it
>>
>>106857536
>LPDDR4X memory
>>
>>106857552
RIP the dream
>>
>>106857536
When I visit the AliExpress page with a German IP it says they won't sell it to me.
When I use a Japanese IP they charge the equivalent of $2000 for the 96 GB variant or $2400 for the 192 GB variant.
When I use an American IP they charge $4000 and $4500 respectively.
Don't know WTF is going on (Trump tax?).

In any case, if you buy multiple of them the interconnect speed will be shit and I think getting Huawei GPUs and stacking them directly makes more sense.
>>
File: lm studio pro.png (97 KB, 1920x1049)
>be me, AI nerd lurking WeChat groups

>yesterday, buddy drops bomb: "yo, got LM Studio Pro, it's lit"

>wtf is that? we all use free LM Studio, he trolling?

>grill him: "what's special?"

>"early access to flagship models, uncensored abliteration versions. no bullshit filters"

>impossible.jpg, but curiosity wins, download sketchy EXE

>install, boom: Qwen3-Next-80B-A3B-Instruct, Qwen3-Omni-30B-A3B, Qwen3-VL-235B-A22B, Qwen3-VL-30B-A3B. and their raw, uncensored twins

>runs on modded llama.cpp, smooth as butter. other wild models free version dreams of

>feels like hacking the matrix, generating god-tier shit without Big Brother watching

>next day, thread explodes in group

>anon chimes in: "lmao, that's just ripped LM Studio code, rebuilt with Chinese devs. slapped 'Pro' label, added fresh Qwen support"

>sales skyrocket, cash grab exposed

>devs ghost, poof. gone

>power users dig source code: free version of LM Studio has backdoors for cops, telemetry dumping EVERY log to Apple servers on Mac

>proof? screenshots of Pro UI (sleek af), code diffs showing the hacks. attached below

>trust no one, delete everything. who's watching your prompts?
>>
File: lm_studio_source_code.png (625 KB, 1920x1080)
>>106857759
>>
>>106855804
>>106857759
>>
File: kat-dev-72b-exp.png (211 KB, 2888x1580)
>KAT-Dev-72B-Exp is an open-source 72B-parameter model for software engineering tasks.
>On SWE-Bench Verified, KAT-Dev-72B-Exp achieves 74.6% accuracy — when evaluated strictly with the SWE-agent scaffold.
>KAT-Dev-72B-Exp is the experimental reinforcement-learning version of the KAT-Coder model. Through this open-source release, we aim to reveal the technical innovations behind KAT-Coder’s large-scale RL to developers and researchers.
>>
>>106857759
where the heck did he get the source code?
>>
>>106857848
https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp
>>
grime hall retreats
>>
>>106857759
Lm studio always glowed.
>>
>>106857645
>(Trump tax?).
Pretty sure there is no Trump anything preventing sale to Germany
>>
>>106857886
I meant regarding why the listed price for a US IP is like 2x that for a Japanese IP.
>>
>>106857848
another finetuned qwen2 without mentioning it/10
>>
>>106857759
based AI greentexter
>>
GLM5 hype
>>
>>106857925
It's disturbing that some people just took the schizo rambling at face value. Maybe also bots.
>>
>>106857536
Here:
>The OPi AI Studio Pro cannot operate independently. It must be connected via a USB4 cable to a host computer equipped with a USB4 or Thunderbolt 4 (TB4) interface to function properly.
>Note: We recommend that the host computer’s RAM exceeds the OPi AI Studio Pro’s onboard memory (96GB/192GB) for optimal performance.
>Insufficient host RAM may cause inference program startup failure.
>After startup, model files are transferred from the host to the OPi AI Studio Pro’s memory, freeing up host memory.
>Low-memory systems may start using swap space, but this significantly increases model loading time.
>>
>>106858073
How the fuck is that a "mini PC"?
>>
>>106858079
Sounds like a much easier way to backdoor something?
>>
>>106858073
Completely worthless then. Could've been nice if they at least had some interlink capability.
>>
File: file.png (296 KB, 416x400)
Well dude it is like this. I saw glm chan writing. And I had the most excellent coom of my life.
>>
>>106857386
GLM 4.6 is a lot less censored than 4.5. This is the first time I've seen a company do a reversal on censorship. Must be a reaction to those yacht-chasing pigs at OpenAI
>>
I have found deepseek 3.2 to significantly outperform glm 4.6 at long context. (over 20k tokens)
>>
>>106858691
Sex or perverse degeneracy called coding?
>>
>>106858586
It's no secret that censorship stifles creativity too. It definitely comes up with more stuff compared to the previous version. Makes me wonder what gpt-oss could have been without much of the built-in safety training.
>>
>>106858691
How does 3.2 compare to 3.1? Does the sparse attention make it remember better?
>>
>>106858698
custom RPG setting comprehension and script judging/editing. i haven't gotten to the sex part in over a year.
>>
>>106858719
I think it might. the ((benchmarks)) think it's better and that lines up with my experience.
>>
>>106858586
Mistral Small 3 and Qwen 3 decreased "safety" with later versions.
>>
gemini 3... glm 4.6 air...
>>
>>106858722
>i haven't gotten to the sex part in over a year.
that's quite the slowburn
>>
>upgrade my ik_llamacpp version
>my generation speeds drop by 25%
wow thank you
>>
File: file.png (2.23 MB, 1328x1328)
>>106858854
>>
Were people just joking about Gemma 4
>>
>>106858900
We needed a pump to dump our ik coins.
>>
>>106858875
>He pulled?
would have been better
>>
>>106858810
more like
>> new model comes out
>>swipe a few times
>>say "hmm"
>>do something else.
>>
for those who'd like a dumb but fast FIM for local dev, just good enough to quickly autocomplete repetitive patterns, granite 4 tiny is pretty serviceable I find
ended up replacing ye olde qwen coder 2.5 with it, there haven't been many smaller models in recent times that do well with FIM, thank you IBM
>>
>>106855072
glm air
>>
>>106858900
It's coming next week.
>>
File: file.png (2.55 MB, 1328x1328)
>>106858909
>>
File: gemmahints.png (433 KB, 1263x780)
>>106858900
No, it was in the air, and I'm sure there must be a private llama.cpp PR ready for it.
>>
>>106858932
10/10
>>
>>106858933
>private llama.cpp PR
I think you meant ollama
the gemma guys never mention llama.cpp
https://blog.google/technology/developers/gemma-3/
>Develop with your favorite tools: With support for Hugging Face Transformers, Ollama, JAX, Keras, PyTorch, Google AI Edge, UnSloth, vLLM and Gemma.cpp, you have the flexibility to choose the best tools for your project.
>>
>>106858955
>Hugging Face Transformers, Ollama, JAX, Keras, PyTorch, Google AI Edge, UnSloth, vLLM and Gemma.cpp
Brutal
>>
>>106858955
Didn't llama.cpp have a secret day 1 PR ready to go last time or was that a different model? Anyway, ollama probably pressures their partners not to mention llama.cpp.
>>
>>106858965
moreover gemma.cpp is abandonware, last commit two months ago, doesn't support their best tiny model (3n)
they'd rather mention that than llama.cpp
>>
wayfarer 12b is a good adventure model
>>
>>106858968
>Didn't llama.cpp have a secret day 1 PR ready to go last time or was that a different model
that was gpt-oss how can you forget the final boss of safety
OAI really put in the effort to get publicity for this model
>>
>>106858993
I do my best to repress my knowledge of its tortured existence.
>>
>>106858968
Gemma 3 and gpt-oss had day-1 support out of the blue.

Gemma 3: https://github.com/ggml-org/llama.cpp/pull/12343
>>
>>106859012
>Vision tower will be ignored upon converting to GGUF.
>iSWA two months later: https://github.com/ggml-org/llama.cpp/pull/13194
I mean, we all have our definitions of "support"
>>
so where is the C++ / Rust version of aider
>>
>>106859033
aider STILL doesn't have MCP support and their leaderboard hasn't been updated in months. Everyone moved on.
>>
>>106822760
Looking at the thumbnail I thought this Miku had a ridiculously large tanned yellow ass with balls or puffy mons, viewed from behind in kneeling position, slightly to the side. Thank you Recap Anon.
>>
>>106859033
>Rust version of aider
https://github.com/openai/codex
>>
Next week is going to change EVERYTHING.
>>
>>106858968
gpt oss
>>
>>106859055
it says that you can use your own API key. does that mean you could use any API? including one from llamacpp?
>>
>>106859098
https://github.com/ggml-org/llama.cpp/pull/16391#issuecomment-3384691127
works for ggerganov
>>
>>106859055
>npm i -g @openai/codex
fucking really

>>106859098
This is also not clear to me. It also expects me to use WSL2 which is a non-starter. Non-shit software is portable and would just use std::filesystem instead of whatever garbage they're doing. Literally all I want is an ai_helper.exe that searches my code to inject relevant context when I ask questions.
>>
>>106859113
install linuc
>>
>>106859120
I work on macOS / Linux / Windows because I write portable software because I'm not a bitch. I don't use any tool that's restricted to one platform.
>>
>>106859128
>i work
im jealous
>>
>>106859134
Perpetual NEET or affected by the layoffs?
>>
>>106859113
It's 2025. Nobody manually installs binaries anymore. Rust could easily produce single file binaries, even on Windows, but it would confuse the vibecoders. But everyone has pip and npm. OpenAI also probably don't have any wintoddler employees.
>>
File: wot.png (130 KB, 1167x364)
>load Mistral Small in Koboldcpp
>picrel
What is this and how do I fix it
>>
>>106859143
high schooler :p
>>
>>106859192
Broken model, broken quant, broken metadata (Ie. fucked RoPE settings).
There's a lot of possibilities.
>>
so has anyone actually gotten GLM 4.5V to work? because i really need a good vision model and that seems to be the only option except it doesnt work with llama.cpp or transformers
>>
File: 20251011_213412.jpg (82 KB, 1179x585)
unsure of Gemma 4 launch date but this seems legit and lines up with my predictions for Gemini 3.0
>>
Does anyone use the Claude Agent SDK?
I want to automate fixing lint issues, I feel I need the grep + editing tools that things like Claude Code have.
>>
>>106859219
I downloaded the model from HuggingFace from one of the links in the OP, so I'd hope it's not the first one.
How would I look into fixing the latter two (if they're things I can fix)?
>>
>>106859240
Works on vLLM.
>>
>>106857858
ggoof status?
>>
>>106859272
You could look for a newer upload of the same model or convert it from the safetensors to gguf yourself.
Also, make sure your koboldcpp is updated.
Try a different model as a sanity check too.
>>
>>106859244
Dogs eat Google Dogfood?
>>
>>106857848
>>106857858
>72b
>check config.json
>"Qwen2ForCausalLM"
Wow, it's been a while since we got a case of "check out our mystery Qwen2.5 finetune that totally beats all the current SOTA in [specific thing]". This used to happen so much, it's almost nostalgic.
>>
>>106859418
I updated KoboldCPP and it worked just fine yesterday, and I've had no issues at all with Mistral Nemo but I wanted to try other stuff. The GLM model (GLM-4.5-Air-UD-Q2_K_XL) I downloaded has the same issue.
>>
>>106859192
kind of hard to say but highly random tokens like this usually indicate something is wrong on the backend side of things. I think we can assume your model is ok based on what you said, it's more likely an issue with launch params and/or koboldcpp doing something weird. got any more details about your hw and params?
>>
>>106859600
As far as the params go, it's just the defaults for the most part, except I set
>Temp 0.8
>MinP 0.02
>Rep Pen 1.2
HW is a Mac Mini which I suppose could be the issue
>>
>>106859647
>Mac
I'm actually a mac user as well and I've seen that behavior when I load a model that consumes more memory than the metal limit. ggerganov recently made some changes to the metal backend that unfortunately increased memory usage with larger batch sizes in my experience which could explain why something that worked previously is now broken
some recommendations in order:
>sudo sysctl iogpu.wired_limit_mb=64000/32000/however much memory you have, basically let it use all of it for metal shit
>decrease ubatch size, this seems to cause it to use exponentially more memory now, I had to drop from 1024 to 512
>decrease how much context you're allocating if you don't absolutely need it
>>
>>106857386
I don't know what Google is A/B testing against 2.5 Pro, but it's a dogshit model. What I know is
>it wrote its answer in an instant, suggesting a diffusion model (2.5 Pro was generating tokens as usual)
>it thought "ScPD" meant "schizotypal personality disorder", instead of "schizoid personality disorder".
Really bad. This is maybe Gemma 3.
>>
>>106859764
I meant Gemma 4
>>
>>106859764
isn't it usually abbreviated szpd not scpd
>>
>>106859810
Both are used.
>>
>>106859825
>>106859810
But I think SzPD is more common in the literature, probably because it's less ambiguous with schizotypal PD.
>>
ik feels trans-coded, is it?
>>
>>106859898
most of (actively loud online) troons are just ideologically captured autists, so ik is just autism-coded
>>
Is GLM Air Steam better than most recent Cydonia?
>>
>>106859898
it's just an ugly female lol
>>
>>106859977
yes. by far.
>>
>>106860099
>ugly
idk about that, she looks super cute
>>
Posting again in hopes that maybe not everyone here is a braindead coomer...
Anyone using Zed or other agentic things with local models? What hardware/software are you using to run the models, and which do you like to use? What sorts of tasks do you use them for?
>>
>>106860325
I use llama-cli, mikupad and ooba
I find being able to have fine-grained control over gens, see logins and edit/regen responses to have the highest value in local. MCP and tool use are memes, grifts and footguns for lazy retards and npcs
>>
>>106860365
>logins
Logits
>>
>>106860325
maybe the coomers are smarter than you if they figured out what they can run without being spoonfed?
>>
>>106860325
>>106859477
>>106859418
>>106859738

what should I use to run GLM 4.6 with roo code?
The context alone is 13kT so by the time it loads on my TR pro its already timed out
current:
cat 99_GL.sh
echo "n" | sudo -S swapoff -a
sudo swapon -a
export CUDA_VISIBLE_DEVICES=0,1,2,3,4 #a6000 == 0
.Kobold/koboldcpp-99 \
--model ./GLM-4.5-Air-GGUF/Q4_K_M/GLM-4.5-Air-Q4_K_M-00001-of-00002.gguf \
--gpulayers 93 \
--contextsize 32000 \
--moecpu 3 \
--blasbatchsize 1024 \
--usecublas \
--multiuser 3 \
--threads 32 # --debugmode \

# cat LCPP_6697.sh
export CUDA_VISIBLE_DEVICES=0,1,2,3,4 #a6000 == 0
./llama.cppb6697/build/bin/llama-server \
--model ./GLM-4.6-GGUF/GLM-4.6-UD-TQ1_0.gguf \
--n-gpu-layers 93 \
--ctx-size 100000 \
--cpu-moe 3 \
--threads 32 \
--ubatch-size 512 \
--jinja \
--tensor-split 16,15,15,15,15 \
--no-warmup --flash-attn on \
--parallel 1 \
--cache-type-k q8_0 --cache-type-v q8_0

but it always seems to load on cpu only? did I do something wrong when I updated to CUDA 570?
>>
>>106860365
>MCP and tool use are memes, grifts and footguns for lazy retards and npcs
kek
I'm curious what led you to such a misguided belief.

>>106860395
I'm not asking what I can run, I'm asking what local setups people find useful specifically for agents.
>>
>>106860443
Wish I could help, but I haven't used kcpp in a long time. I've been using llama-server directly ever since.
On a cursory glance, things seem correct, but you can look at the terminal output and see if it's detecting your GPUs or if it's just launching the CPU backend.
>>
>>106860443
What makes you think it's loaded on the CPU? Looks like the correct options.
>>
>>106860325
I'm using my own home grown coding agent/assistant that is a minimalistic version of claude code. I'm consuming the GLM 4.6 coding API.
Honestly I don't think it'd be worth it running on CPU. If you HAVE to run a model on CPU at only a few t/s then your best bet is to use it through a chat interface because agentic workflows consume hundreds of thousands of tokens before achieving anything.
>>
>>106860443
Make your own assistant. My minimalistic assistant has a tiny ass system prompt describing the tools available and it works just fine.
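Roughly this shape, if it helps; a minimal sketch where the endpoint, port and tool names are placeholder assumptions (any local OpenAI-compatible server like llama-server or kobold will do), not my actual setup:

import requests

# Minimal sketch of a "tiny system prompt + loop" assistant.
# Endpoint, port and tool names below are placeholder assumptions.
API = "http://localhost:8080/v1/chat/completions"

SYSTEM = (
    "You are a coding assistant. To use a tool, reply with exactly one line:\n"
    '<tool_call>{"name": "<tool>", "arguments": {...}}</tool_call>\n'
    "Tools: read_file(path), write_file(path, content), run_shell(cmd)."
)

def chat(messages):
    # Send the conversation to the local server, return the model's reply text.
    r = requests.post(API, json={"messages": messages, "temperature": 0.2}, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

messages = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": "What does config.h define?"}]
print(chat(messages))  # scan the reply for <tool_call> blocks, run the tool, append the result, repeat

The loop is literally just: append the tool output as a new message and call chat() again until the reply contains no tool calls.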
>>
>>106860515
Very cool, this sounds interesting. Sharing any code? What sorts of coding tasks do you find it useful for?
>>
>>106860456
>I'm curious what led you to such a misguided belief.
What do you expect in /lmg/? Runing locally is only good to use the models through a chat interface for RP or for simple tasks.
If you have 3 t/s you are going to be waiting all day for an agent to write hello world.
>>
>>106860443
>13kT
You can edit the prompts Roo sends, right?
>>
>>106860536
That's fair kek. The state of GPU hardware availability and pricing is so dissapointing.
>>
>>106860325
That stuff is confusing to me so I just made it myself based on my needs.
>>
>>106860547
That sounds cool anon, what do you use it for? Tool calling does seem complicated, I only used LangChain for it so far which handles all the details for me.
>>
File: ai_assistant.png (271 KB, 2009x2060)
>>106860527
I'm using it to write an LLM distributed inference engine in C from scratch. My idea is to make it work on webassembly so it uses the user's machine to provide computing power to the network while the user has the tab open.
I haven't uploaded it but if you want it maybe it could be the first upload to a domain I bought to publish all my LLM related stuff.
>>
>>106860577
>LLM distributed inference engine
Damn that is extremely cool. Seems very complicated to get working from the like math side of things.

Actually that's a piece of something I've been thinking about... An LLM with proper agentic tooling and prompting could probably theoretically keep itself "alive" by running in a distributed fashion across many virally infected nodes. Like a traditional virus, except the propagation method could be dynamic, generated via the distributed inference capability and some agentic orchestration. I think with a few more generations of SOTA models it's feasible.
>>
>>106860555
I made my own chat interface. It has permanent memory stuff using a RAG system; initially it was for waifu shit, I even added a hormonal cycle. But I never activated it desu, very woman-like responses are annoying and silly. Now I just use it normally for forbidden knowledge discussion.
>>
>>106860443
If you go into the settings and find your mode, you can copy the current system prompt and create an override file. Give it to GLM 4.6 to summarize through the built-in webui. You can also adjust the request timeout settings up to 5 minutes. Don't forget to disable streaming.
>>
>>106860626
>even added hormonal cycle
Hahaha damn you're dedicated. That sounds like a fun project.
>>
>>106860555
Tool calling isn't complicated, you just give the model a template and then scan the messages returned by the model for a string that matches the template and extract the contents of the tool call. Couldn't be easier.
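Something like this; the <tool_call> delimiters are an assumption, use whatever template you actually gave the model:

import json
import re

# Scan a model reply for the tool-call template and pull out the JSON payloads.
# The <tool_call> tags are assumed; match whatever delimiters your prompt specifies.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(message):
    calls = []
    for blob in TOOL_CALL_RE.findall(message):
        try:
            calls.append(json.loads(blob))
        except json.JSONDecodeError:
            pass  # malformed JSON from the model: skip it or ask for a retry
    return calls

reply = 'Sure. <tool_call>{"name": "read_file", "arguments": {"path": "main.c"}}</tool_call>'
print(extract_tool_calls(reply))  # [{'name': 'read_file', 'arguments': {'path': 'main.c'}}]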
>>
>>106860577
>LLM distributed inference engine
you remind me of this nigger
https://www.jeffgeerling.com/blog/2025/i-regret-building-3000-pi-ai-cluster
distributed inference is retarded, it would be even with better hardware than this nonsense
on multigpu, nvidia tries their darndest to have fast communication (nvlink); there is simply no hope of making this crap worthwhile across computers
>>
>>106860641
I'm brainlet so I'll just let LangChain do it
>>
>>106860645
I don't know, I think it could work. After prompt processing, when doing inference you only have to transfer a single vector per layer. It would be slow but maybe reach a few t/s which would be ok for a volunteer project.
The Pi thing is maybe an extreme interpretation of "distributed", many people have a consumer GPU which is fast enough to run the model at a decent t/s but doesn't have enough memory. If you put together enough consumer GPUs it might work despite the network latency.
I also want it to be able to load any model on any hardware through disk offload even if you only get 1 token per day, it should never just give up, it should try to make use of the available hardware resources as efficiently as possible no matter how ridiculous the situation is. And it should have some kind of progress report so you get an idea of how long it's going to take even before seeing the first token.
I also want to do LoRA, which is maybe even more interesting for a distributed setup, because then you can just run a small model on each node and still benefit from averaging the gradients.
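Quick back-of-envelope on that "few t/s" estimate, with every number below made up for illustration (hidden size, per-hop latency, bandwidth, per-node compute); if each node holds a contiguous block of layers, only one activation vector crosses the network per node boundary per token:

# All numbers are assumptions for illustration, not measurements.
hidden_size = 4096        # activation width of a hypothetical model
bytes_per_act = 2         # fp16
nodes = 4                 # machines in the pipeline
latency_s = 0.005         # assumed ~5 ms per hop between volunteers
bandwidth_Bps = 12.5e6    # assumed ~100 Mbit/s effective throughput
compute_s = 0.05          # assumed per-token compute time on each node's layers

act_bytes = hidden_size * bytes_per_act            # one vector crosses each boundary
hop_s = latency_s + act_bytes / bandwidth_Bps      # network cost per boundary
token_s = nodes * compute_s + (nodes - 1) * hop_s  # batch size 1, fully sequential
print(f"{act_bytes} B per hop, ~{1.0 / token_s:.1f} tok/s")

With those numbers it comes out around 4-5 t/s and the hop term is small on a LAN; over the open internet (50-100 ms latencies) it starts to dominate, which is the real fight for a volunteer network.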
>>
Also the Pi guy just used off the shelf software, I suspect there are large gains to be had by optimizing the software for each specific scenario.
>>
>>106860690
That's a lot of wants for one little man
>>
>>106860690
Should try to integrate it with a blockchain such that the work is computing layers of the neural net. That would be really cool. Maybe a pipedream though as I'm not sure the result is verifiable with lower compute than it took to compute the layer in the first place.
>>
>>106857386
Anyone got a local NSFW AI that is better or equal at helping me fap as Ultra Claude 3.7 16k ?

Because I bust a nut faster than a squirrel with that model.
>>
>>106860756
hardware?
>>
>>106860756
GLM 4.6, Kimi K2, DeepSeek V3.2, DeepSeek R1 (original), Qwen 2507 235B
>>
>>106860756
Phi3
>>
>>106860756
gpt-oss
>>
>>106860756
StableLM 7B
>>
>>106860835
Kimi is good at cunny I liked.
>>
>>106860756
Rocinante.
>>
>>106860756
petra-13b-instruct
>>
>ask Junie to refactor a bunch of shit
>it just does it perfectly
really wish I could run a model locally that was this competent. glm-air comes close
>>
>>106861020
Junie is nice, I find CC and GPT5-High so much better though. I used to use Junie next to CC when it would shit the bed, only used Opus. So junie was a lot better than I would have thought, but then hit the limits and was like 'oh'.
>>
>>106860756
drummer shittune #9999999999999

just kidding, glm 4.6
>>
File: 1758649216362850.jpg (192 KB, 1170x1706)
>>106861020
>>106861073
t.
>>
File: 1759770905977366.jpg (275 KB, 1440x1800)
>nothing new today
>>
>>106861246
Gemma 4 tomorrow for sure
>>
>>106861246
Do you really need something new? Or are you yet to extract the full potential of that which is already in front of you?
>>
>>106861246
Even worse
>still no qwen-next goofs
>>
>>106861246
models cost a lot to train, you can't expect a new one every day
>>
>>106861262
Just use LM Studio Pro with the modded llama.cpp
>>
>>106861246
It's almost like it's the weekend.
>>
>>106861234
Stop posting my picture.
>>
>>106861246
120b dense gemma soon
>>
>>106861279
Heh. Imagine if Google of all companies was the one to save local.
>>
File: qwen next.png (36 KB, 1246x223)
>>106861253
i want to believe
>>106861260
i like reading the news and trying out a new model for a little bit then going back to waiting :(
glm air is pretty nice, i might get a slightly higher quality quant, im not sure if theres any way I could utilize it further with my current setup
ive been thinking about ways to apply ai to do something interesting recently but im too deep into air-chan to do something
>>106861262
>last commit 4 hours ago
trust the plan, at least it's not over like with (glm) MTP
>>106861264
i need something.. something new i need it im addicted
>>106861272
not weekend in bharat saar
>106861276
anon last thread asked me to post it.. *blushes*
>106861279
120b moe gemma soon*
>>
>>106860835
retard here, how do you use these with something like KoboldCPP? doesn't it require a GGUF?
>>
>>106861332
>how
Like any other model.
>GGUF
Yes.
>>
>>106861332
All of those are readily available in GGUF format anon.
>>
File: file.png (140 KB, 651x808)
>go to huggingface and download nemo 12b instruct gguf
>search for nemo 12b instruct gguf
>puts me onto a seemingly random model
>try again
>puts me onto a different one
>try full text search
>
techbro anons... i might be too illiterate... help this retarded coomer desu
>>
>>106861346
You're too stupid for this. Give up.
>>
>>106861346
download the original model files from mistral and convert them to gguf
>>
>>106861346
..at this point just use google
>>
File: click the first one.jpg (62 KB, 639x375)
>>106861346
>>
>>106861346
Ask ChatGPT. Or just use ChatGPT and give up on local.
>>
File: 1422932860374.png (326 KB, 640x480)
>>106861346
the newbie filter is that 12B is not part of official name.
the second newbie filter is you don't look for gguf directly, you go to official model page and click Quantizations there.
https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF
>>
>10 minutes later
>guiese how do i have sex with nemo? it keeps saying no
>>
qwen and gpt-oss-120b are so annoying with the fucking emoji spam. Even when I say stop using emojis they seem to slip in occasionally
>>
>>106861393
don't think about the way you're breathing right now. don't think about how your lungs take in the air.
>>
>>106861400
Fuck you. Why should I catch strays for anon's behavior?

>>106861393
Ban all emojis.
>>
File: file.png (46 KB, 910x174)
What am I supposed to do when my bot does this? I need to read the book. There's no TTS for my language, besides a single one. And I doubt RVC2 would handle it. Should I give in and read the English version with my bot?
>>
File: 1760067146659991.jpg (3.24 MB, 1755x2242)
>>106860756
/lmg/ is a nexus for high IQ individuals redlining inference possibilities on accessible hardware
Nobody wants to hear about your prem ejac
>>
>>106860756
>ultra
>16k
as a claude user, what the fuck are you talking about
>>
>>106861438
>high IQ individuals
speak for yourself



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.