/g/ - Technology


Thread archived.
You cannot reply anymore.


File: 1770759702309440.png (409 KB, 1080x867)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>108339019


►News
>(03/07) Qwen3.5-27B Claude-4.6 Opus reasoning distill GGUF published: https://hf.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
>(03/06) Olmo Hybrid WebGPU browser-local demo posted: https://hf.co/spaces/webml-community/Olmo-Hybrid-WebGPU
>(03/05) OLMo-Hybrid-Instruct-DPO-7B posted on Hugging Face: https://hf.co/allenai/Olmo-Hybrid-Instruct-DPO-7B
>(03/05) Qwen3.5-9B OptiQ 4-bit for Apple Silicon posted: https://hf.co/mlx-community/Qwen3.5-9B-OptiQ-4bit

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/gso.html
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
Where do I download the AI?
>>
>>108341862
see https://status.openai.com/
>>
>>108341869
What are these benchmarks? Because 4B being just "20% worse" (whatever that means), is impressive. Too impressive to be trustworthy.
>>
>>108341891
i think it just means 20% worse than base
>>
>>108341891
It means that it's still really, really good at knowing who the surgeon is
>>
>>108341953
im tired of your bit, consider this a warning
>>
>>108341980
Proof??
>>
>>108341965
Sorry, would you prefer there being 3 r's? Or perhaps your penis is too soft, resting against your thigh? Maybe you'd rather I give you an svg image of a duck on a bicycle. If not, I'm afraid this goes against the policy and I must refuse.assistant
>>
my state tracker is finally working and then i found out the tracker extension exists
>>
>>108342015
qrd
>>
thought the proof/qrd bot was only on ldg, nice to see it pesters you guys here too
>>
>>108342030
>>108342018
>>
>>108341862
>Is anyone else experiencing random slowdowns with llama.cpp?
>Sometimes my t/s will drop by half and it stays like that until I restart the server.
>I can't figure out what causes it.
did you start downloads that saturate your bandwidth? I can't for the life of me figure out what is wrong with my system that could cause something this retarded, but I lose about 1/3 of my regular t/s if I have high-speed downloads in the background. llama.cpp is running locally on the same machine, it's not remote.
Another network-linked phenomenon is how unresponsive comfyui's web interface can become (even though it's running locally on my computer). Same thing again, running locally.
>>
>>108342048
fyi llama.cpp has an experimental vram bandwidth mode that prevents downloads from using vram
>>
File: teto principle.png (1.04 MB, 1024x1024)
►Recent Highlights from the Previous Thread: >>108339019

--Paper (old): Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs:
>108341106 >108341116 >108341151 >108341174 >108341187 >108341338
--Zen 5 core count vs RAM and RTX 5080 vs 5070 Ti for inference performance:
>108340610 >108340639 >108340663 >108340700 >108340716 >108340785 >108340855 >108340871
--Google announces Gemini Embedding 2 release:
>108340571
--Gemini Embedding 2: our first natively multimodal embedding model:
>108339121 >108339167 >108339153
--Critique of UTF-8 handling in tokenization budget mechanism:
>108340859 >108340877
--Prime Intellect RL training platform now available for agentic model development:
>108339192
--Testing local model honesty with thought process discrepancies:
>108340354 >108340366 >108340382
--Feature Request: DSA lightning indexer support:
>108341265
--Teto and Miku (free space):
>108339350 >108339419 >108339519 >108341150

►Recent Highlight Posts from the Previous Thread: >>108339182

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script
>>
>>108342059
I am not downloading anything from llama.cpp, what I mean is that having downloads in the background (steam, curl, from firefox) will make my t/s degenerate
>>
>>108342070
why not use llama.cpp to download?
>>
>>108342070
Why are you replying to trolls?
>>
>>108342089
What do you mean?
>>
>>108342097
>downloads using vram
>>
>>108341869
The 4 billion model looks like it's at the sweet spot.
Does it mean you should have a 4 billion model? It gives you an 80% winrate which is good enough?
>>
>>108342069
Missing the thousand-post reply chain about the retard who killed himself. What the hell was wrong with that thread, why did anyone engage with it?

Also, if you are a real mikuanon, why aren't you the faster baker? The schizobaker can barely spell
>>
>>108342106
what if you do 2 passes, is it close enough to 100% like 99.9999998?
>>
>>108342070
>downloads in the background (steam, curl, from firefox)
nta. If part (or most) of the model is on RAM, then pretty much anything would make it go slower. Especially if you saturate cores with other work. All threads that ran on the free cores have to wait for the busy ones.
Show full specs, what model you're running, with what settings, how much free memory you have, are you swapping, what color leds do you have?
If you're gonna have other shit running, set -t a few threads lower than your core count. Start with that and experiment more.
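For example, on a 16-core machine something like this leaves headroom for the downloads (flag from memory, check llama-server --help):
llama-server -m your-model.gguf -t 12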
>>
>>108342108
>talking to a bot
>>
>>108342119
I talk to LLMs too. Do you?
>>
>>108342102
Downloads use RAM by default, but you can use VRAM
>>
>>108342120
api user
>>
they are gonna ban local
>>108342118
>>
>>108342108
>Also, if you are a real mikuanon, why aren't you the faster baker?
I'm too old to be playing these games. I'll continue to be baker of last resort at page 9 or 10, but I don't think racing to make early threads at bump limit just so we have schizo OPs slightly less often would be helpful.
>>
>>108342114
No I think there's diminishing returns.

>>108341891
Yeah I can't believe the 0.8 billion is only half as good as the 300 billion model.
>>
>>108341869
I might be late to the game, but is Anthropic a double entendre? Anthropic makes sense for a company that makes human like chatbot but I just heard someone say out loud enthropic to describe a system with enthropy and now I wonder if it was wordplay all along.
>>
>>108342168
Fair enough. My inner child wouldn't let go as easily.
>>
>>108342174
Do you know what anthropology means? I'm sure you can figure it out from there.
>>
>>108342174
i think they are making astronaut software
>>
How do the current OPEN models handle ERP? I have a terabyte of (V)RAM and want a model that handles loli etc. I'm still using the original Deepseek, but I'm curious about Qwen3.5, Kimi-K2.5, GLM5 and DeepseekV3.2
>>
>>108342185
I know my etymology, I made the mistake to be a linguist. I was just wondering if they had done it on purpose since I hadn't realized it before.
>>
>>108342174
>enthropic
not a real word
>>
Excellent, the news is sort of fixed, though after the shitpile in the last thread I suspect there is a conspiracy somewhere still
>>
>>108342215
Yes, Qwopus is very news-worthy.
>>
>>108342215
>the news is sort of fixed
>random finetunes with random dates
>>
File: 1749575146345692.png (170 KB, 801x430)
>>108341869
>>108342229
>>108342215
Qwopus is already debunked. The #1 shill of it admitted it sucks
>>
File: 1768900428117908.png (121 KB, 675x297)
>>108342238
>>
>>108342238
>>108342243
Anon, I...
>>
moesissies won, dense lost
>>
>>108342208
>I made the mistake to be a linguist
>I wonder if it was wordplay all along
Yes, anon. It's a play on words.
>>
>>108342229
>>108342231
>>108342238
It's technically more on topic at least but you're right there's probably a conspiracy here still
>>
>>108342174
How can entropy be reversed?
>>
>>108342256
Don't eat your own dick while you suck it, OP. It's an absolutely worthless selection of news.
>>
>>108342266
Did you not read the last thread where I was clearly the one just saying the same OP needs to be changed to be more on topic while they were throwing a tantrum over it with their buttbuddies or are you just trying to be that disingenuous?
>>
>>108342254
I acknowledge it was a mistake.
>>
DeepSeek
Please
I need you
>>
Dense Gemma soon!
>>
>>108342292
You have sarvam!
>>
File: 1750045382015626.jpg (298 KB, 1080x1920)
BREAKING NEWS
>(3/10) Mini Rin Sits On Miku's Head
>>
Dense models are dense
MoE models are moe :3
>>
Miku came inside of my wife and I never recovered
>>
>>108342301
@grok add a tramp stamp on her hip
>>
>>108342302
>moe :3
https://www.youtube.com/watch?v=qByKEu0zdco
>>
>>108342211
Eh, https://youtu.be/JgB_ywOmGCc?si=FhXDOB_FCulT021K&t=3384
Says it in the sentence that starts at 56:24 if the timestamp fails.
>>
>>108342168
I love you anon, thank you for all your hard work
>>
>>108342318
Oh, actually paying attention I'm a retard.
>>
How slow are MoE 50% on ram vs fully on vram?
>>
>>108342301
This better be on the next thread. Picture and news.
>>
>>108342301
Big if true
>>
>codex has been down for 24h+
>uptime is still 99%
sure
>>
>>108342438
hi there, i dont get it
>>
>>108342452
you are dalit
>>
SillyTavern has started sending the <think> </think> blocks of past messages to the server. It doesn't matter whether I'm in text completion mode or chat completion mode. Before I drive myself crazy diagnosing this, does anyone know if there's a setting somewhere I could have activated to cause this?
>>
>>108342604
yes
advanced formatting -> reasoning -> add to prompts
>>
>>108342614
omg thanks
>>
>>108341891
>Too impressive to be trustworthy.
it's because these benchmarks fucking suck and no one feels like using a couple of weird/niche ones that are harder to benchmax for some retarded fucking reason
>>
>>108342659
ai psychosis
>>
>>108341891
>4B being just "20% worse" (whatever that means), is impressive
It's the opposite of impressive. It means that they only got a 25% higher score (if the small one is 20% worse, the big one scores 1/0.8 = 1.25x as much) by multiplying the number of activated parameters by 4 and the number of total parameters by 100. It shows they're getting diminishing returns.
>>
File: sarvam benchmarks.jpg (95 KB, 885x1310)
Sarvam is pretty impressive. It's nice to see india acting like the superpower it always was.
>>
>>108342692
What's the point of the 300b models then?
>>
File: organic input device.jpg (172 KB, 1024x1024)
>>
miku
>>
>>108342694
>on indian language benchmarks
lmao
>>
>>108342700
>what's the point of cloud models when local ones do 80% of the job
>>
>>108342707
If you want a version of Sarvam 30B that's uncensored via Abliteration:
https://huggingface.co/aoxo/sarvam-30b-uncensored
>>
>>108342716
Couldn't you achieve roughly the same performance by using the 4b one and forcing it to check itself multiple times? Has anyone tried that?
Gpt and claude say no but they're built to shill themselves.
>>
SAAAR you must no redeem
https://www.sarvam.ai/apis/text-to-speech/
y u steal my job, saaar, do not redeem it
>>
>>108342755
beaultiful for gorgeous looks
>>
>>108342769
based, for a namefag, you're alright.
>>
>>108342785
It’s a bot retard
>>
>>108342850
oh, yeah I guess that seems likely. but its still pretty neat, I guess maybe a little off topic tho.
>>
lolcow
>>
>>108342168
Fuck off faggot
>>
File: 20374.png (161 KB, 1515x904)
https://github.com/ggml-org/llama.cpp/pull/20374
>>
>>108342694
Saarvam 105 cockbench?
>>
>>108343002
>sar: 97%
>>
>>108342203
Not a lot of reports in /lmg/, nor chat logs.
Best bet is just to try it then tell us.

GLM4.6 and 4.7 have gotten a few mentions.

The worry is that everyone has been distilling off everyone else, including the refusals, and so the newer the model the more baked in refusals.
>>
>>108343018
Can you provide more information on the matter?
>>
>108342301
Offtopic garbage like this is why no one should ever listen to anything said about thread quality or news quality. Mikutroons are scum.
>>
File: 1772707429837184.jpg (1.93 MB, 1069x6178)
>>108343018
>The worry is that everyone has been distilling off everyone else, including the refusals, and so the newer the model the more baked in refusals.
I was worried about picattached...
>>
>>108343037
new qrd spam
>>
File: 1768488821710235.gif (442 KB, 600x913)
>>108342301
True if big
>>
>>108343046
care to elaborate?
>>
i'm the anon who was considering dumping roughly 10k into a build
i was thinking of buying one or more of these: https://www.asus.com/us/networking-iot-servers/desktop-ai-supercomputer/ultra-small-ai-supercomputers/asus-ascent-gx10/

some relevant links which i have found while researching this:
https://dlcdnwebimgs.asus.com/files/media/202506/5c0fb57c-4e48-4e96-8c97-04bf8df2677c/asus-ascent-gx10-datasheet.pdf
https://www.asus.com/us/support/faq/1056142/
https://www.asus.com/us/support/faq/1056547/

i was hoping someone who knows more than me might help me to answer a few questions:
> is this good value for money?
it seems like it should be good for my use case, but i want to sanity check this
> can more than two of these be connected?
their answers to this seem to conflict
"Answer: The maximum tested and supported configuration by NVIDIA is a stack of 2."
"Answer: Currently, it can support 2 only."
"Answer: Stacking of two devices in currently supported. There is nothing preventing it from clustering more systems via the use of a 200GB ethernet switch."
> can you connect these directly via hardware (i.e, without going over LAN)?
it *seems* like this should be the case, but i'm not great with hardware, so i'm hoping someone can help me confirm before i fuck myself over
> does it matter that it (only?) supports FP4 and FP8?
presumably it could also handle higher precision floats, but the FP4 and FP8 instructions are the ones with native support (i.e., fast). but do i care?

most likely, i would buy two of these, hook them up, and attempt to run the full GLM4.7 model.
my stretch goal would be to buy four and run GLM5, but without support for stacking four at once (or at least reading about someone's experience doing it in a hacky way), i would probably hold off on this
i might consider starting with two, and if this works out well for what i want, dumping in more money a couple years down the line to buy some of the more expensive enterprise models

thanks in advance!
>>
File: 1744733247280991.png (25 KB, 346x88)
You are losing out if you aren't using qwen 3.5
>>
>>108343109
>thanks in advance!
>>
>>108343109
dropping 10k on a prebuilt asus ai box is paying a massive brand tax. value-wise, you're usually better off building a multi-gpu 4090/5090 rig or grabbing a loaded mac studio if you just need pure vram capacity. run the math on vram and memory bandwidth per dollar.

on connecting them: you're confusing a direct hardware bridge with network clustering. the 2-device limit uses a physical high-speed link giving you unified memory, which you absolutely need for massive models like glm4.7. clustering more over a 200gb ethernet switch introduces awful latency because you lose that unified memory. your tk/s will tank for single-user chat. assume the limit is strictly 2. nvidia hardcaps this to protect their enterprise sales.

native fp4/fp8 support is actually a huge selling point. nobody here runs 100b+ parameter models at fp16, we all use quants. native silicon acceleration for lower precisions means your generation speeds will fly. it'll still run higher precisions at standard speeds if you force it, but you shouldn't be doing that anyway.

before buying, calculate the exact vram glm4.7 needs at an 8-bit or 4-bit quant plus your target context window. if two boxes don't comfortably fit that, pass. and forget stringing four together for glm5 over ethernet; the communication overhead will make it completely unusable for interactive chat. stick to two if the math works, or just build a custom tower.
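napkin math for that fit check, in python. the quant size is a rough average and the layer/head numbers are placeholders, pull the real ones from the model's config.json before trusting any of it:

params_b   = 358      # glm4.7 total parameters, in billions
bits_per_w = 4.5      # q4-ish gguf, real mixes land between 4 and 5 bits per weight
weights_gb = params_b * bits_per_w / 8    # ~201 GB just for the weights

# kv cache per token = layers * 2 (k and v) * kv_heads * head_dim * bytes per element
n_layers, n_kv_heads, head_dim, kv_bytes = 92, 8, 128, 2    # placeholder shapes, fp16 cache
ctx   = 32768
kv_gb = n_layers * 2 * n_kv_heads * head_dim * kv_bytes * ctx / 1e9    # ~12 GB

print(f"weights ~{weights_gb:.0f} GB + kv ~{kv_gb:.0f} GB vs 256 GB across two boxes")

if that doesn't clear the bar with headroom left for the os and compute buffers, drop a quant level or pass.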
>>
>>108343127
i'm asking for people to do work on my behalf, and a non-insignificant amount (i mean, just look at that autistic wall of text)
the least i can do is be polite about it
>>
>>108343138
Do You suffrr from Mind blindness?
>>
>>108343131
10k is my total budget. each one of those asus boxes is 3.5k, so buying 2 would put me at 7k, with 3k left over to use on incidentals

on their site, they advertise:
> Link two ASUS Ascent GX10 systems to handle even larger models, such as Llama 3.1 with 405 billion parameters.
GLM4.7 is 358B, so it seems like it should be fine? i'm not sure how the quant/target context would change things, though. might be something i have to go research more

thanks for giving such a detailed response. it's super helpful
>>
>>108343109
>128 GB LPDDR5x Coherent Unified System Memory
I would recommend against it. It's basically a regular PC with 128 GB of RAM.
Go get yourself some server motherboard and buy like 4 used RTX 3090 and 256-512 GB of RAM and you'll be much better off.

Personally, if I wasn't poor as shit I'd just pay for cloud models. Memory is just way too expensive right now and it's almost certainly gonna get cheaper in the future. If you want to control everything then rent some cloud GPUs from like vast.ai or digitalocean or hot aisle.

If you want to do spicy image generation then your best bet would be an Nvidia GPU like an RTX 3090 or 5090.
>>
>>108343109
Sorry, gotta pile on here…that’s a very bad build from a value-per-dollar perspective. $10k is a bit of a cursed price point right now. You want 768GB to run frontier stuff and there isn’t an economical way to acquire that right now, even if that was all you needed.
You of course also need a high bandwidth system to install it in and ideally some kind of gpu
>>
>>108342701
I need Rin-chan to sit inside my hoodie hood as I'm running errands.
>>
>>108343302
so, help me to understand here
let's assume we can get ahold of an RTX 5090 at the insane MSRP of $2k (and not the more realistic $3.5-4k)
that's 32GB of GDDR7
how is that better than 128GB of LPDDR5x @ 3.5k?
is GDDR7 that much better than LPDDR5x? (i have no idea what most of these acronyms mean - really not a hardware person, unfortunately)
or would you say don't even buy an RTX gpu and go for something else entirely?
>>
When you have an older card without support for fp8 are you able to use fp32 performance as a proxy for performance in general?
For example if you have two older cards, one with double the fp32 compute, is it safe to assume the faster one would do the work in half the time? Assuming the ram is equivalent.
Does that make sense or am I missing something?
>>
>>108343109
>nvidia dgx spark
>officially connect up to: 2. more is unsupported and janky.
>connect using: special cables (might come with; check unboxing videos)
>fp4, fp8: only matters if you're doing training. not important for inference.

For $7k - $8.5k there's the apple m3 ultra 256gb.
(512gb version is only available from ebay atm.)
Maybe apple might come out with an m5 ultra?
>>
File: 1753546346362029.jpg (11 KB, 225x225)
The only argument to not spend $10k on hardware right now is if you look back at what you could've gotten with that money a year ago.
But that's no longer possible and it won't come back. So now's as good a time as any to build the best server you can afford.
>>
>>108343349
the GDDR7 is about 10x faster than the LPDDR5x, but you will have 1/4 of the quantity of memory, which means 1/4 of the maximum model size.
>>
>>108343349
https://www.ebay.ca/itm/196153412822
Plus PSU and a scuffed 3090 is probably your best bet to hit $10k and run the best models near reading speed
>>
how can small gpu cost more than big car?
>>
>>108343384
Try to buy the equivalent weight of an H100 in gold, little buddy
>>
>>108343359
Sounds reasonable.
If neither card does fp8 natively,
then any fp8 computation will probably use fp16 or fp32 hardware.
So looking at perf numbers for those is useful.

If it's q8 quants you plan to run, then the integer perf would be the one to look at.

But more directly, see if anyone else has benched that piece of hardware on the model you are interested in, or something close to that.

What cards?
What models?
>>
>>108343374
so if i didn't care about speed (within reason), the asus boxes would be the way to go?
>>
>>108343426
There aren’t many use cases where those asus boxes make any sense
>>
>>108343383
Different anon here.
That ddr5 system doesn't look half bad.
>>
>>108343411
heh i dont get it
>>
>>108343415
I am looking at the amd mi 25/50. The 25 is real cheap, cheaper than a p100, and even the 32gb 50 is not bad.
I have read people argue the 25 is slow but it has double the fp32 of my ghetto setup now which would put it well within the tolerable window for me.
>>
>>108343383
i'm always a bit paranoid about buying things off of ebay, especially when it comes to expensive computer hardware. generally, i feel better sticking to well-known websites/brands so that i can make a return if their hardware shits the bed
vs buying that from zhang jinping in shenzen
>>108343436
really? i got this recommended to me from someone who generally knows his shit and is pretty damn intelligent. so that's a bit at odds with the feedback from most anons here. not saying you're all wrong by any means. just trying to get a feeling for why there's a disconnect like this
>>
>>108343349
Every token that needs to be generated needs to load the active weights for that token. In a 122B-A10B model that means 10 billion weights need to be loaded for every token generated. A weight is roughly 4 bits in a q4 quant, most models are q8 (8-bits per weight), full size GLM 5 is 16 bits per weight, but you would likely run it at q4-q8.

8 bits = 1 byte. This means that an A10B model needs 10 billion weights which are all 1 byte each, therefore each token needs to load 10 billion bytes of data (10 GB).

Regular DDR5 is 38-50 GB/s per channel. Consumer CPUs (and motherboards) only handle dual channel, meaning that you get a maximum throughput of 76-100 GB/s in RAM. That Spark system has 68 GB/s of throughput.

The RTX 3090 has 900 GB/s throughput. This means that for a model that fits into its 24 GB of VRAM it could 'theoretically' generate 90 tokens/s (in reality it's less since there's computation involved), while that Spark system would do about 7 tokens/s.

The Spark system, however, can load models up to 128 GB, while a single RTX 3090 can only do 24 GB. But you can offload most of the model onto your regular RAM and get roughly the same token generation as the Spark that way, while the part of the model that sits in VRAM is served faster.
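If you want to play with the numbers yourself, the whole argument is one division (Python; the bandwidth figures are the marketing numbers, real speeds come in lower):

def tg_tokens_per_s(active_params_b, bits_per_weight, mem_bw_gb_s):
    # every generated token has to stream the active weights once,
    # so the ceiling is memory bandwidth divided by bytes per token
    bytes_per_token_gb = active_params_b * bits_per_weight / 8
    return mem_bw_gb_s / bytes_per_token_gb

print(tg_tokens_per_s(10, 8, 900))   # A10B at q8 on a 3090: ~90 t/s ceiling
print(tg_tokens_per_s(10, 8, 100))   # dual-channel DDR5: ~10 t/s
print(tg_tokens_per_s(10, 8, 68))    # the 68 GB/s figure above: ~7 t/s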
>>
>>108343465
hey man its hard to read all that on phone, can you stop being so wordy
>>
>>108343384

good luck running gpu with 350ish watts while car might 20000ish watts road is gray and there might be funny shape trees
>>
>>108343441
Double ram slots is a bit of a put off since it gimps bandwidth slightly, but overall it’d be solid. You’d pay 4x that for the same spec anywhere else. eBay is pretty damn safe, all things considered.
It’s a bit of an outlier at that price. I don’t expect it’ll last.
>>
>>108343471
doomer
>>
>>108343465 here
I'm not sure where I got the 68 GB/s number from for the Spark. I see various numbers all over the place on a second look. Some say it's around 270 GB/s which might be reasonable. That makes the Spark a lot better, but server motherboards with multi-channel RAM can do the same thing. Ie you can have even 12 channel memory, but that memory is going to be very expensive.
>>
>>108343459
Do what you want, but I’ve been building AI purpose-built systems for a couple of years now and haven’t whiffed yet.
>>
>>108343384
You can't use the car to scam investors out of trillions of dollars.
>>
>>108343465
You missed the final calc where a model using all 256GB of that 2xDGX sysram would be sub 1T/s
>>
>>108343510
wrong thread buddy, boys get him
>>
File: IMG.png (63 KB, 847x501)
How much of a loss in intelligence should I expect if I switched to any of the GLM 4.5 air quants listed in picture here? I have 64gb ram + 8gb vram, very constrained memory-wise, so I can only do a context size of 2048. Also, I end up having 100-400mb offloaded to swap, I'm hoping that it's system processes being offloaded and not the model.
>>
>>108343541
so factually speaking you can calculate the memory bus size by the context size by square root
>>
>>108343465
>>108343513
i'm ok with slow. i'm not going to ERP with it or anything (and if i did want to do that, i would just use a lower parameter model)
i basically want to vectorize a shitload of data, jack those dbs into the highest param model i can run, give it a task before i go to bed, and check on its output in the morning
so i'm not overly concerned about token speed
>>108343504
not trying to be dismissive of anyone. sorry if it came off that way. i'm just trying to figure out why the advice i got from him differs from what i'm getting here. he's really good at software, so maybe he's just not as good at building price-conscious hardware
>>
>>108343546
>give it a task before i go to bed, and check on its output in the morning
lamo the llama 405 meme is back from the dead

not happening bud you'll wake up to errors and a crazy electricity bill
>>
File: rp keks btfo.png (16 KB, 700x58)
>heck ur doing rp so u can afford to do q2 of a 7b model
>>
>>108343546
Putting together an LLM inference build is a very specific skill set that only superficially looks like building a computer for basically any other purpose.
I’m guessing your buddy trusts the Nvidia branding and hasn’t built one of these or gone into great detail on what you need to optimize for.
If you’re not worried about interactive use then 1TB DDR4 EPYC Rome will finish any reasonable batch while you’re asleep and still be a reasonable price. Still want a gpu tho.
Eg https://www.ebay.ca/itm/227117257779
>>
>>108343586
lmao, rp is totally useless if the model is retarded though :(
>>
>>108343559
it's a low wattage arm machine, so the power draw shouldn't be too insane
>>108343597
god damn a terabyte of ram is crazy
DDR4 is pretty old at this point, though, right? is there even much support for it still?
>>
>>108343614
what do you mean, support for it? RAM is RAM, older is just slower; it's not like DDR5 has specific features that make AI better other than just being faster
>>
>>108343631
presumably (again, not a hardware person - sorry if this sounds retarded) at a certain point, newer motherboards just wouldn't have interop with the older ram chips. like how the USB standard changed over time
>>
>>108343614
>god damn a terabyte of ram is crazy
The smartest open weight models are from 600b-1T. Do the math
Speaking of which, learn how to calculate both the ttft/pp phase requirements as well as the tg requirements as they are both relevant to overall performance/efficiency yet different in important ways.
In the end, if the system can’t run the model you want on timescales you can live with then it’s a waste of money
If you’re lazy, at least figure out the total effective bandwidth of the memory subsystem you’re inferencing from
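Rough sketch of that calculation in Python. Every number below is a placeholder, benchmark your own pp speed and plug in your own channel count and quant:

prompt_tokens, gen_tokens = 8000, 1000       # one overnight-style batch request

pp_speed  = 300                              # prompt tokens/s, measure it with llama-bench
channels, bw_per_channel = 8, 25.6           # DDR4-3200 is 25.6 GB/s per channel
eff_bw    = channels * bw_per_channel * 0.7  # real-world efficiency fudge factor
active_gb = 16                               # bytes streamed per generated token (32B active at ~q4)

tg_speed = eff_bw / active_gb
ttft     = prompt_tokens / pp_speed
total_s  = ttft + gen_tokens / tg_speed
print(f"ttft ~{ttft:.0f}s, tg ~{tg_speed:.1f} t/s, ~{total_s:.0f}s per request")

If that per-request time multiplied by your queue doesn't fit the timescale you can live with, the build is a waste of money no matter how cheap it looks.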
>>
>>108343641
well yeah that's already the case, new boards just use ddr5, no new board/cpu I know of still uses ddr4
>>
>>108343642
>The smartest open weight models are from 600b-1T
For now. Deepseek got us here and Deepseek might as well catapult us beyond that in a few days
>>
>>108343655
ddr6?
>>
>>108343663
not on the short term horizon due to the shortages, companies would rather fab HBM
>>
>>108343669
can you change that?
>>
>>108343465
>>108343513
following up on this
i went to one of those token simulator speed sites, and i think i would be happy with 2.5T/s for real-time use
so long as i could reach that, by tweaking the quants or however it would be accomplished, i think i could be content
do you really think it would be <1T/s? how are you getting that number? from what i could see, it should be on the order of 1-5T/s?
>>
>>108343498
My experience with 8 channel DDR4 3200 tells me that channels are bullshit. GLM4.6 at q4, 32b active is 16GB in size, at 2t/s that's 32GB/s which is about as fast as dual channel. Since numa is not supported in llama.cpp you can't benefit from >2 channels
>>
>>108343696
>Since numa is not supported in llama.cpp you can't benefit from >2 channels
/g/ - Technology
>>
>>108343703
shut up nerd
>>
>>108343597
That terabyte looks like it's spread over two sockets.
The earlier single socket 3/4TB ddr5 looks the better of the two.

>>108343663
zen6 and zen7 are going to be on am5,
so ddr6 is still a way away.
>>
>>108343703
numbers don't lie, 8 channels about as fast as 2
>>
>>108343696
What was your processor? The smaller Epyc processors are gimped in terms of CCDs so they can only use a limited number of channels despite being advertised otherwise. If you used one of the 8 or 16 core cheapo Rome Epycs, you were likely running that RAM at 2-channel speeds.
>>
>>108343747
AMD EPYC 7702 64C/128T Socket SP3 Zen2 CPU
>>
xai should make a local model. grok-4.20 is a complete shameless whore if you pull your dick out at it
>>
StagnAItion
>>
>>108343791
pics?
>>
>>108343791
elon promised to make all older versions of grok open source
grok 3 any day now
>>
>>108343696
What settings were you using? I have similar CPUs + RAM: 2 EPYC 7532s and 8 channels of DDR4-3200 for each CPU. Using
--n-cpu-moe 999
and pinning the memory to physical CPU 0 with numactl, I got 6 tokens/sec on GLM-4.7 IQ4_XS.
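For reference, the shape of the whole command is something like this (model path and -ngl are whatever fits your setup):
numactl --cpunodebind=0 --membind=0 ./llama-server -m GLM-4.7-IQ4_XS.gguf --n-cpu-moe 999 -ngl 99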
>>
>>108343870
Yeah!
>>
>>108343870
So, two CPUs with eight channels each are roughly twice as fast as a 2-channel desktop? Impressive, almost justifies that 16-channel configuration
>>
>>108343900
More like 1.7x as fast due to infinity fabric latency/bandwidth constraints
>>
>>108343691
The 768gb ddr5 one is 100% better
>>
What people easily miss with the current meta of CPU maxxing is that the RAM barely matters with the -cpu-moe stuff. Most of the active parameters are already on GPU and only a bit remains in RAM, so the speed no longer scales linearly with the increased bandwidth from more channels.
>>
>>108343691
I've got the og cpumaxx rig, so dual epyc with 768GB RAM and an A5000 24GB card and I pull 15t/s inference speed on kimi k2.5 at q4.
Its as smart as it gets and faster than reading speed (if you ignore prefill times, which are only meh)
>>
>>108343920
Your putted together a nice machine
>>
>>108343870
Full CPU offload. With 4x24 GPU offload I get 3t/s
>>
how is the vllm/llama/etc support for arm vs x86? i know it *says* it's supported. but is it actually as good?
>>
>>108343932
Pretty good, apparently. 4b q4 runs at 3.6t/s on rpi5
>>
Ah, that's another annoyance with qwen35 I guess.
>WARNING: RNN models do not support context rewind! Anti-Slop sampler will not work!
>>
>>108343915
It was done specifically without GPU offload to demonstrate that there are no observable benefits from >2 channels on my system
>>
Updoot llama or not?
>>
>>108344069
ye
>>
File: 1771248465262562.png (61 KB, 804x360)
Will you use NemoClaw?
>>
>>108344123
i dont want to be vulgar but NemoClaw can direct its attention to my nemo balls
>>
>>108344123
I still don't understand the utility of having an AI agent. If you don't constantly monitor what it's doing then it, by definition, cannot do anything right. It makes zero fucking sense.
>>
>>108344123
>>108344078
>>
>>108344123
>nvidia
going to be slop, apart from the people who design CUDA and the GPU chips they are a full blown saar corpo that can't produce anything good. Not even a control panel for their gpu (damn bro it's 2026 and it's still so slow to load the per app customization panels). nvidia app is a tumor, their background services are logging so hard it's trashing your SSD (disable nvidiot container), nemotron models are the worst slop of the industry etc
>>
>>108344163
Seriously though. Half of the job of an AI agent is to get your detailed opinion on the core architecture of every single thing you build. Every component.

What the fuck are people doing instead? Do they really just tell Claude or whatever to just "build me le minecraft clone" and then fuck off? It makes no sense. What's even the point of building software if you don't give a shit about it? What the FUCK is an AI agent even FOR?
>>
>>108344178
It's like paying an employee to go to therapy instead of you. It makes no sense.
>>
>>108344178
>Do they really just tell Claude or whatever to just "build me le minecraft clone" and then fuck off?
yes anon it's artificial intelligence
>>
>>108344177
what can you produce?
>>
>>108344187
makes perfect sense if you just need a checkbox filled
>I went to therapy
>I built the something please to hire me now/use my thing
>>
>>108344187
you are dumb
>>
in the end as long as it works, no one cares how gross the internals are or how janky the dev process was.
the real problems happen when it needs to get extended over a certain complexity and the original "lol mincraft clone" level of sophistication prompting just can't handle it.
Then you get unsloth bros
>>
>>108344178
Some people want an unearned sense of accomplishment. Others just want something they share on social media or as a source of content. The rest seems to be the tech enthusiast crowd using it as an expensive way to sort emails because they don't know filters exist.
>>
>>108344203
>Then you get unsloth bros
who were ex-nvidia
nvidia is a slop factory
>>
>>108344267
https://www.reddit.com/r/LocalLLaMA/comments/1rpxpsa/how_i_topped_the_open_llm_leaderboard_using_2x/
>I hope the article I posted also give some upvotes, maybe Nvidia will sponsor me with hardware, so I can make more models to share.
kek
>>
https://news.ycombinator.com/item?id=47323900
> meta acquires moltbook
lol but what in the fuck is that
their fall from grace is endless, you think they already dug deep enough to reach the last circle of hell but they keep digging and digging
after spending billions on wang wang and still having no model, whether proprietary or open, to show, this is where they focus their attention on? "social media for agents"?
absolutely incredible
they lost most of their hard science ML researchers recently but hey, hire the niggers who made one of the most retarded thing of the past decade
>>
>>108344273
> I'll push it to Huggingface, but it makes sense to 'polish' the scar with some fine tuning first.
let's just train on a few benches before pushing, need my nvidia grant after all...
>>
>>108344123
Can an AI agent make GPUs cheap?
>>
>>108344276
the thing that was proven to be humans LARPing as agents. It's one of the biggest grifts within the griftiest space out there right now.
I'm almost happy to see them get a big dollar exit, it was so bold and unabashed and Meta is such a big fucking joke it just seems right
>>
Where is DeepSeek v4? Aren't we way overdue? The Financial Times is real journalism, they wouldn't just lie.
>>
>>108344307
those grifters said they invented "reverse captcha" to prove agents aren't humans by having them solve things quickly that humans would have trouble solving quickly.. but they never said how exactly they intend to stop a human from using the agent to solve the captcha and then continue to do their human thing, which is to say, spam more scams. The whole idea of a reverse captcha is so inane even someone with chromosomal defects could come to the conclusion that it has no value.. except for Zuck. Zuck doesn't live in the same reality as we do.
>>
>4chan is STILL seething about moltbook replacing them
>>
>>108344307
>>108344321
these two posts are so abnormally cringe for this general that i have to conclude that they're both from the same fag
>>
>>108344276
probably this post too
>>
no shit
>>
the butthurt from subhuman jeets who moved from NFT bs to AI is real
>>
>>108344341
?
they seem to do just fine with their ahegao youtube videos.
>>
>>108344341
My only disappointment is that I can't short the shit out of AIcoin on leverage.
>>
>>108343109
> is this good value for money?
For pure inference, no. You’re better off buying GPUs. If you want something that just works, also no. You’re basically buying an untested, specialized tool that will require tinkering on your end and substantial investment into maturing it on the vendor’s end. Think early-days CAD workstation but for AI developers.

Personally, I use it for learning and building a small AI pipeline stack + exploring the nvidia developer ecosystem. But I wouldn’t use it for running chatbots. Far too expensive and unstable for that. Whether it works for small, local production workloads is anyone’s guess at the moment.
> can more than two of these be connected?
As far as i’ve seen, yes. There’s a few vids floating around on YT of folks chaining them up, iirc. Cables will cost you an arm and a leg though.
> can you connect these directly via hardware (i.e, without going over LAN)?
Yes, that’s kinda the whole point of the two ConnectX ports it ships with.
>i'm not great with hardware
You’re probably not gonna have a great day unless you’re willing to invest time and energy to learn. If I were you, I‘d seriously do more research. AI compute/networking is not the same as building a gamer pc and setting up your home LAN.
>>
DeepSeek V4 failed training just like LLaMa 2 33B
>>
>>108344384
winter is here
>>
>>108344384
whether the intended v4 failed or not, they have a brand new, very real unnamed model on their official chat that's way better than what they had before, the high context improvements are no joke and nobody would complain if they released that as open weights
somehow I have a feeling though that it might never happen because they might be smart enough to recognize that there's no reason to give free handouts once you become competitive.
>>
>>108344384
llama-2 33b was agi, and lecunn determined it was too much reputational damage to him for it to exist so he blocked it and sabotaged meta ever since, ruining 3 and 4 and ultimately setting in motion the events that caused them to buy moltbook
>>
>>108341880
https://ai.com/download
>>
what's a good model for image generation for a RAMlet like me
>>
>>108344472
rong thread try ldg
>>
>>108344478
i see. thanks.
>>
>>108344178
>>108344163
But think of the thousands of lines of code! Muh metrics!
>>
>>108344445
thanks sir
>>
>>108344384
they are waiting for gemma 4 and avocado to release
>>
>>108344329
What the fuck is a moltbook
>>
>>108344866
humans larping as ai scamming api keys out of eachother
>>
>>108344883
Oh yeah I heard of that and then immediately forgot what it was after it was revealed as a scam
>>
File: 1753234955084538.jpg (2.57 MB, 2956x4096)
https://github.com/RightNow-AI/autokernel
>>
File: 1741763142436893.png (75 KB, 893x502)
I hate this subhuman
>>
>I have to bring up shit from months ago
>>
>>108341869
I don't know about benchmarks but this model is shit.
I remember It's only good at vision, only vision.
>>
>>108344995
try hauhaucs version
>>
File: jaypee tune.png (14 KB, 815x267)
>>108344995
It's AGI as far as I'm concerned.
>>
>>108344967
as much as I actually like using LLMs for certain things, I believe LLMs will completely kill open source, ruin a lot of proprietary software too, and cause a general long-term skill devastation that will take a long time to recover from.
right now there are too many pwilkins in the world and open collaboration on the internet is taking a hit as maintainers either start closing down (no more looking at external contributions at all) or do the retarded thing that llama.cpp does, which is to open the gate to the subhumans
I know many software developers, myself included, who feel at this point we'd rather shovel pig shit on a farm than deal with the other humans who developed ai psychosis.
>>
>>108345031
exaggerating much?
>>
>>108344995
It's a massive fumble, that's why they are pushing it so hard.
>>
>>108345040
no. you either are one of the subhumans or not a software developer at all if you don't feel that way.
https://archive.is/lQL9B
this article sums up what it feels like to have subhumans as coworkers.
>>
>>108345020
>reasoning
>>
>>108344921
>flash_attention.cpp
>Target metric: throughput (higher is better)
>Secondary: correctness must ALWAYS pass
???
It should be the other way around. WTF are we coming to?
>>
>>108345112
Probably fine since order of operations is more of a suggestion even for the most advanced models, especially in the long run.
>>
>>108345031
Large short-term harm as models cross the minimum threshold of usefulness where people actually start trying to apply them to production repos. Less medium-term harm as they continue past that threshold, and a long-term boon as they enable quality, secure code at scales far beyond what humans were capable of.
>>
>>108345179
>far beyond what humans were capable of.
ai psychosis
>>
>>108345112
>return 1;
>infinite performance
>>
>>108345031
LLMs will become open source. Open source projects will be made specifically to feed code to LLMs.
>>
https://www.reddit.com/r/LocalLLaMA/comments/1rqplvy/mistral_nemo_upscale_but_kinda_weird/
>>
File: based.png (12 KB, 545x98)
>>108345213
>>
>>108345179
>and long term boon as it enables quality and secure code at scales
lol
>>
>>108344384
>>108344398
>>
File: DSA.png (23 KB, 487x286)
HABBENING
>>
>>108342069
>Prime Intellect RL training platform now available
im messing with it, i guess they train a LoRA on the python environment you give it. So a lora on Qwen that is optimized for your python env. my python is just a thin wrapper that opens up a websocket to my simulation server. after there are a handful of loras trained, hopefully there will be an open source solution that combines them down into the base model. is that continual learning?
>llm is doing a task
>keeps track of its action/observation space for that task
>design a python RL sim of the task
>wait while training a lora on it
>merge the lora down into your base weights hopefully not ruining things in the process
im new to this
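from what i can tell the merge-down part at least already exists if the adapter comes out in standard peft format (no idea if prime intellect's does, untested sketch with made-up paths):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base   = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto")
lora   = PeftModel.from_pretrained(base, "path/to/the/rl-lora")   # adapter dir from training
merged = lora.merge_and_unload()           # folds the lora deltas into the base weights
merged.save_pretrained("qwen-rl-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").save_pretrained("qwen-rl-merged")

whether you can stack a bunch of merges like that without ruining the base is the part i'm not sure about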
>>
>>108345292
oh g-d!
>>
bullish on meta now that lecun is gone. that retard has had so many shit takes it is a surprise there are investors stupid enough to burn their money for him
>>
>>108345325
this and with moltbook they'll be unstoppable! to the moon!
>>
>>108345332
>with moltbook
oh fuck i forgot that zuck is also a retard. bearish again on meta
>>
>>108342174
E=mc2 + AI
>>
>>108345292
oh gawd!
>>
Nivida AIForce RTX Mistral Nemo Pro 12B
>>
https://sweepthestrait.com/
>>
>>108345384
What.
>>
>>108345433
ye
>>
>>108345422
What will you do with a 12b model when 4b is only 5% worse? Look at the benchmarks in OP
>>
>I'm beeeenchmarking
>>
>>108345470
>I'm seething
>>
>>108345422
That sounds like the name of a GPU with 3.5GB of vram
>>
>>108345492
who let you beyond the great firewall?
>>
>>108345579
rent free
>>
>>108345579
I'd take good old Miqu over any of the newer <30b models if I had no choice and was poor. Benchmarks hardly matter.
>>
File: Untitled.png (318 KB, 366x501)
Just bought a v620. But the vbios it came with only reports 16gb? What the hell is this thing? Vulkan memtest on a 32gb vbios doesn't report any errors. Loaded up devstral q6 with context to 30gb, and it worked fine. Shouldn't it be the other way around, faking larger vram, if they want to scam me?
>>
>>108345612
why not screenshot
>>
>>108345723
Not my pc.
>>
>>108345612
>v620
Why are these so cheap? What's the catch other than them being ayymd?
>>
>>108345020
Now ask what color her butthole is.
>>
File: file.png (3 KB, 211x37)
>>108345780
probably part of it
>>
>>108345292
vibebros status?
>>
>>108343109
go look at the level1forums. Pretty sure a lot of people there have documented those kinds of stacks.
>>
>>108345790
Still in the ROCm 7.0.0 compatibility matrix, for now.
>>
>reasoning budget sampler merged, still no state tracking / string accumulator
>will continue past the end of a complete multibyte word indefinitely if the model's tokenizer outputs lone tokens as fragments of continuation bytes and each word is in a multibyte-heavy language
>will absolutely break on byte level style models
people who don't know how tokenizers and unicode work should not write string based samplers, much less get claude to vibecode their vomit
>>
>>108345424
I smirked mischievously with a smirk.
>>
>>108345882
:rocket:
just vibe code a fix once someone's llm writes an issue
>>
>>108345780
>Why are these so cheap
They're not cheap I don't know what you're talking about. They're massively overpriced and expensive and they don't work for AI or gaming or anything at all, and nobody should buy them or even look at the listings for them. Please do not keep thinking about the v620 or show any interest in this horribly overpriced and useless card.
>>
>>108345925
if you act sus I'll ask reddit's opinion
>>
File: 1768536751147905.gif (1.77 MB, 320x320)
>>108345925
>>
I know this isn't /aicg/ but DeepSeek is really slow today... could it be happening™
>>
>>108345880
You're looking at the Linux chart; it isn't supported on microslop. The rx6800 is, so maybe he could trick it into working with HSA_OVERRIDE_GFX_VERSION=10.3.0, but I don't know what he's doing trying to use AMD on microslop where they don't even update the driver.
>>
>>108345914
I wouldn't need to vibecode, this is a simple, few-lines fix. In fact I'll maintain my patch locally when I end up merging the whole autoparser/budget wilkin line of commits back.
you just need to write an accumulator that gets filled when common_utf8_is_complete returns false (you can keep it empty otherwise). It will eventually return true when fed the more complete accumulated chunk, and if it still doesn't after the fragment grows to 4 bytes, your model is somehow outputting broken unicode and you can decide to abort instead. Clear the accumulator when common_utf8_is_complete returns true.
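in python terms the whole thing is something like this (illustration only; the real fix is a few lines of C++ against common_utf8_is_complete, the stdlib decode just stands in for that check here):

acc = b""

def feed(chunk: bytes):
    """Accumulate raw generated bytes; return decoded text once the buffer ends
    on a character boundary, or None while we're still mid code point."""
    global acc
    acc += chunk
    try:
        text = acc.decode("utf-8")              # stand-in for common_utf8_is_complete
    except UnicodeDecodeError as err:
        dangling = len(acc) - err.start
        if err.reason == "unexpected end of data" and dangling < 4:
            return None                         # incomplete multibyte char, keep accumulating
        raise ValueError("model is emitting broken unicode, abort") from err
    acc = b""                                   # clear once a complete sequence went through
    return text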
That's it.
I will not, however, clean up after his butt. He needs to be named and shamed.
>>
>>108345969
Deepseek 3.2 is one of the worst AIs I have ever used in 2026.
It was good when it got released.
>>
>>108346014
>doesn't know
>>
>>108346009
tell your agent to write the ticket DUH, I literally dont understand how people are this oblivious
>>
>>108346014
well... yeah? that's how it works, shit ages
>>
>>108345974
https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html

Doesn't this imply it's still supported? gfx1030: yes runtime, yes HIP SDK, no AMD ROCm debugger.
>>
The deepseek team is actually not focused on releasing models, this is a side project for them.
Their main role is something else.
>>
>>108346021
AMD does a lot of stupid shit like changing register sizes between different cores that are "the same gfx1030" so if it's not specifically on the list you cannot trust it will work, they were especially bad about it with RDNA3
>>
>>108342048
This might be strange, and I'm using a 3-month-old llama.cpp. Was diddling with my own client last night against 127.0.0.1/completion, testing stuff, and the client displays an exception if it can't connect to the server.
I had a download going on in the background and I would get the exception error message first and then the reply would continue normally.
I'm sure I don't have a bug because my stuff is pretty simple.
This sort of explains bit I don't inder why a local loopback connection would suffer from download??
Ehh.
>>
>>108346054
*understand
It's hard to type with a one thimb only!
>>
>>108346017
You aren't as funny as you seem to think you are.
>>
>>108346054
I'm going to test this later tonight. It's crazy if a download interferes with the llama-server connection.
I'm using Linux. Tbh I never noticed anything like this on Windows (~1+ years ago).
>>
File: bik.jpg (479 KB, 1365x1100)
>>108346054
>>108346094
My guess is IO/interrupts generally slowing things down. There's kernel parameters to tweak but I would not expect to maintain inference performance when you add a bunch of network/disk IO on top
>exception if it can't connect
Run Wireshark on the loopback to dig deeper, probably the server is getting stalled from handling its input Q coz of the other IO
Experiment with renice and ionice
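For example (adjust to taste, check the man pages):
sudo renice -n -5 -p $(pgrep -f llama-server)
ionice -c 3 -p $(pgrep -f 'steam|curl')
i.e. bump the server's CPU priority and push the downloaders into the idle IO class.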
>>
>>108346240
Problem is that my external internet connection isn't enough to even hog the full bandwidth of my network adapter.
I'm going to test out some stuff. Should probably compile new llama version too.
It's still strange to me.
>>
File: 1764989822734022.png (237 KB, 1080x827)
>>
>>108346289
will this drive nand prices down?
>>
>>108346293
lol
>>
>>108345296
>is that continual learning?
technically yeah if the idea is to have the llm design the rl sim itself too
>>
File: 00106-3050314564.png (321 KB, 512x512)
>>108346289
The US and Israel unironically use LLMs for streamlining strategic analysis.
"You're absolutely right! If we keep bombing Bandar Abbas port, even though it's been out of service since day 1, they will surrender immediately. Would you like me to write you a song about it? Or maybe I can help you select one of those bad dragon dildoes you were asking about earlier for the occasion. Just let me know!"
>>
>>108346289
omg that would be terrible haha
>>
>>108346289
Irrelevant since our models don't run on major US technology companies' servers.
>>
>>108346289
https://files.catbox.moe/rfet2c.mp4
>>
>>108346240
I wish I had abs like that.
>>
>>108346316
I bet the manchildren on reddit shat themselves laughing at this.
>>
>>108346289
How are they bombing *US* companies in *Israel*?
>>
>>108346336
hi golem
>>
>>108346312
idiot
>>
>>108346316
what model is each side using?
https://www.youtube.com/watch?v=Bt8sizAvvUI
>>
>>108346363
Iran most likely some chink cloud models since they are prolly cut off from west. Kikes use some globohomo tech model so prolly sora.
>>
>>108346289
That means nothing for us.
>>
https://huggingface.co/deepseek-lab/DeepSeek-V4-Base
https://huggingface.co/deepseek-lab/DeepSeek-V4-Base
https://huggingface.co/deepseek-lab/DeepSeek-V4-Base
>>
>>108346416
kys faggot
>>
>>108346416
nice tracker link faggot
>>
https://huggingface.co/TriadParty/deepsex-34b
https://huggingface.co/TriadParty/deepsex-34b
https://huggingface.co/TriadParty/deepsex-34b
>>
>>108346436
blast from the past
>>
File: v4coming.png (491 KB, 1010x1130)
>>108346416
It is coming, though.
https://x.com/bdsqlsz/status/2031719179624362060
>>
>>108345424
Fix your shit. Clearing out a section of the strait with no mines takes chunks of the wall out with it.
>>
>>108346445
>Sweaty old man
>December 5, 2023 2:03 PM
>Fxxk, you are such a xxx!
>#4
>27.3s
>Mirai
>December 5, 2023 2:03 PM
>"Of course I do! I can't break promises, Sweaty old man. We have been together since we were kids. We are both best friends and lovers to end all iteration." I smiled with affection. It was clear that I meant everything I said. "We both know that you like taking command of us like this. Am I not your squirting toy, Sweaty old man?" I asked with a cute pout. "We should meet up in front of the shop after classes. I'll see you there. See you, Sweaty old man!"
vintage kino... RP today just doesn't hit like this
>>
>>108346451
i'm coming too
>>
>>108345179
2 more years and the stochastic parrot will start understanding things
>>
File: 1762584698378217.png (5 KB, 135x190)
>>108345424
:(
>>
>>108346579
>natzi chud btfo
>>
File: 1751778055259798.png (1.27 MB, 1200x1200)
New
>>108346672
>>108346672
>>108346672
>>108346672
>>108346672
>>
>>108345612
>>108345925
Llama 2 7B, Q4_0, FA enabled (pp t/s | tg t/s):
AMD Radeon Pro V620: 1595.32 ± 1.59 | 81.78 ± 0.06
Nvidia Tesla V100: 1391.39 ± 1.19 | 129.58 ± 0.58 (7d77f07)
Nvidia RTX 3090: 4298.97 ± 10.59 | 160.13 ± 0.25
V620: 512GB/s bandwidth, $500+

Yeah I'll stick to the $900 3090s.
>>
>>108346691
>Llama 2 7B
lol how many decades ago was this? what was the amd support state on lamocpp back then
>>
>>108346451
Fake and gay.
So tired of this horseshit.
Don’t have an X acct and not setting one up to see this one btfo.
>>
>>108346824
>she doesn't know about xcancel
>>
>>108346700
With devstral q6 on vulkan (windows) it looked like (I didn't check the numbers) about the same performance as my w6800. Was definitely noticeably slightly slower than my 3090. A lot faster than dual p5000s. But 32gb in two slots is very appealing to me, and it's cheaper (650 aud) than my w6800 (1.2k aud) Come the weekend I'll throw it in my debian server and try rocm.

>>108346691
Can you upload your vbios?
>>
couldn't resist pulling the reasoner budget, it's a nice way to cut qwen chatter
https://files.catbox.moe/ng0m1w.patch
here's the patch I am going to maintain to unslop some of it, along with a vulgar hack to strip away quotes "" from the reasoning-budget-message because it just so happens, if you have this
reasoning-budget-message = "Reasoning budget exceeded, let's write the answer."

in your presets.ini, it will actually fucking use the quotes and insert them as part of the message when reasoning budget triggers. It only happens when the arg is extracted from presets.ini running llama-server in router mode, not when you pass --reasoning-budget-message flag from the CLI. This one is more the router's fault than pwilkin's code, they haven't put much effort into the ini parsing and this behavior is desirable for passing json objects like
chat-template-kwargs = { "enable_thinking": true }

in your ini
I also add extra newlines before the message is inserted. It would be very dumb to default to inserting it the "I am thinReasoning budget exceeded" way, pretty sure it would damage the model output
anyway, just the router passing the literal " characters through reminds me that many of those vibers don't test a fucking thing for real before they hit the push button
>>
>>108346993
>they
>>
>>108347012
yes, they, plural. router mode is not wilkin's work as far as I remember
>>
>>108343304
>he likely doesn't mean his foreskin because he is likely american
KEK
>>
>>108341869
>122b MoE is comparable to the 27b dense.
I wish we had more dense models.
>>
>>108346255
Sounds like an interesting issue to me & very figureoutable when reliably reproducible, you can likely understand it, I've done some time debugging Linuxy embedded things
Fresh llamacpp pull ya do it
Interested to hear what you figure out
>>
>>108346289
Let's fucking go!
>>
>>108343109
Rent compute on vast.ai until 2027 then buy when ram prices are sane again as production units ramp up.
This is the worst moment to buy anything, especially if you can wait it out by simply renting your hardware until then.



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.