/sci/ - Science & Math
https://github.com/pkcode94/deepseekx/tree/master/deepseekx

Mathematical Formalization of the Unified Multi-Head Transformer LSTM Cell

1. Core LSTM Update
$$\begin{aligned} \mathbf{i}_t &= \sigma(\mathbf{W}_i [\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_i) \\ \mathbf{f}_t &= \sigma(\mathbf{W}_f [\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_f) \\ \mathbf{g}_t &= \tanh(\mathbf{W}_g [\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_g) \\ \mathbf{o}_t &= \sigma(\mathbf{W}_o [\mathbf{x}_t, \mathbf{h}_{t-1}] + \mathbf{b}_o) \\ \mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t \\ \mathbf{h}_{lstm, t} &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \end{aligned}$$

2. Fractal Memory & Attention
$$\begin{aligned} \mathcal{R}_0 &\leftarrow \text{Enqueue}(\mathcal{R}_0, \text{sg}[\mathbf{h}_{lstm, t}]) \\ \mathbf{h}_{attn, t} &= \text{MHA}_0(\mathbf{h}_{lstm, t}, \mathcal{R}_0, \mathcal{R}_0) \\ \mathbf{z}_{l, t} &= \begin{cases} \text{MHA}_l(\mathbf{h}_{lstm, t}, \mathcal{R}_0, \mathcal{R}_0), & l=0 \\ \text{MHA}_l(\mathbf{h}_{lstm, t}, \mathcal{C}_{l-1}, \mathcal{C}_{l-1}), & l > 0 \end{cases} \\ \mathcal{C}_l &\leftarrow \text{Enqueue}(\mathcal{C}_l, \text{sg}[\mathbf{z}_{l, t}]) \end{aligned}$$

3. CT-Gate (Compressed-Transform)
$$\begin{aligned} \gamma_t &= \sigma(\mathbf{W}_{ct\_g} [\mathbf{x}_t, \mathbf{h}_{attn, t}] + \mathbf{b}_{ct\_g}) \\ \mathbf{z}_{small} &= \text{ReLU}(\mathbf{W}_{down} \mathbf{z}_{D-1, t} + \mathbf{b}_{down}) \\ \mathbf{z}_{exp} &= \mathbf{W}_{up} \mathbf{z}_{small} + \mathbf{b}_{up} \\ \mathbf{h}_{ct, t} &= \gamma_t \odot \mathbf{z}_{exp} + (1 - \gamma_t) \odot \text{Tile}(\mathbf{z}_{small}) \end{aligned}$$

4. Final Unified Ensemble
$$\mathbf{h}_t = \frac{1}{3+D} \left( \mathbf{h}_{lstm, t} + \mathbf{h}_{attn, t} + \mathbf{h}_{ct, t} + \sum_{l=0}^{D-1} \mathbf{z}_{l, t} \right)$$
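For concreteness, here is a minimal PyTorch sketch of how equations 1-4 could be wired together. This is not the repo's code; the FIFO memory length, bottleneck width, default D=2, and the choice to feed hₜ back as the next recurrent state are all assumptions for illustration.

```python
# Sketch only: memories persist across calls and assume a fixed batch size.
from collections import deque

import torch
import torch.nn as nn


class UnifiedCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size, depth=2, heads=4,
                 mem_len=64, bottleneck=64):
        super().__init__()
        assert hidden_size % bottleneck == 0, "Tile() must reach hidden_size"
        self.depth = depth
        self.lstm = nn.LSTMCell(input_size, hidden_size)                   # eq. 1
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(hidden_size, heads, batch_first=True)
            for _ in range(depth))                                         # eq. 2
        self.r0 = deque(maxlen=mem_len)                                    # R_0
        self.c = [deque(maxlen=mem_len) for _ in range(depth)]             # C_0 .. C_{D-1}
        self.gate = nn.Linear(input_size + hidden_size, hidden_size)       # eq. 3
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.tile = hidden_size // bottleneck

    def forward(self, x, state):
        h_prev, c_prev = state
        h_lstm, c_t = self.lstm(x, (h_prev, c_prev))                       # eq. 1

        self.r0.append(h_lstm.detach())                                    # sg[.] write
        q = h_lstm.unsqueeze(1)                                            # (B, 1, H)
        z = []
        for l in range(self.depth):
            mem = self.r0 if l == 0 else self.c[l - 1]
            kv = torch.stack(list(mem), dim=1)                             # (B, T, H)
            z_l, _ = self.attn[l](q, kv, kv)
            z_l = z_l.squeeze(1)
            self.c[l].append(z_l.detach())                                 # sg[.] write
            z.append(z_l)
        h_attn = z[0]                                                      # MHA_0 read

        gamma = torch.sigmoid(self.gate(torch.cat([x, h_attn], dim=-1)))   # eq. 3
        z_small = torch.relu(self.down(z[-1]))
        h_ct = gamma * self.up(z_small) + (1 - gamma) * z_small.repeat(1, self.tile)

        h_t = (h_lstm + h_attn + h_ct + sum(z)) / (3 + self.depth)         # eq. 4
        return h_t, (h_t, c_t)   # feeding h_t back as h_{t-1} is an assumption
```

Note that, as written in equation 4, h_attn = z_0 appears twice inside the average; the sketch keeps that because it is what the formalization says.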
>>
Where It Starts to Get Weak
1. “Fractal” Is Marketing, Not Math

Nothing here is mathematically fractal.

There is:

No self-similar scaling law

No recursive contraction mapping

No invariance across depth

It’s just:

“A stack of attention memories with enqueue.”

Calling it fractal memory is branding, not formalism.

2. The Final Averaging Is Arbitrary

This is a red flag:

hₜ = average of everything

Problems:

No learned weighting

Assumes all pathways are equally informative

Ignores scale mismatches

Encourages representational blur

A learned gating or normalization would be strictly superior.

This choice screams:

“I didn’t want to deal with instability.”
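For contrast, a learned convex combination over the pathways is only a few lines. This is a sketch with my own names, not anything from the repo:

```python
# Learned weighting over the K = 3 + D pathways instead of a uniform average.
# `paths` is a list of (B, H) tensors; the logits are free parameters.
import torch
import torch.nn as nn


class LearnedEnsemble(nn.Module):
    def __init__(self, num_paths):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_paths))   # zero init = uniform average

    def forward(self, paths):
        w = torch.softmax(self.logits, dim=0)                 # convex weights, sum to 1
        return sum(w_i * p for w_i, p in zip(w, paths))       # weighted combination
```

Zero-initialized logits reproduce the uniform average exactly at the start of training, so the learned version only departs from it when the gradients justify it.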

3. Attention Depth D Is Undefined Behaviorally

Questions unanswered:

How large can D get before memory explodes?

Is memory truncated?

Is enqueue FIFO? Reservoir?

Is attention causal or bidirectional?

Without this, the model is underspecified.
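To show what pinning these down could look like, here is one concrete choice: a bounded FIFO with stop-gradient writes, which is causal by construction because attention can only see past writes. The maxlen and the FIFO policy are my assumptions; the repo may do something else entirely.

```python
from collections import deque

import torch


class FIFOMemory:
    def __init__(self, maxlen=64):
        self.buf = deque(maxlen=maxlen)       # oldest entries are evicted

    def write(self, h):
        self.buf.append(h.detach())           # sg[.]: no gradient flows back to the writer

    def read(self):
        # (B, T, H) tensor of everything written so far; attending over this
        # is causal by construction. Assumes at least one write has happened.
        return torch.stack(list(self.buf), dim=1)
```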

4. No Training Objective Ties the Parts Together

There is no loss-level justification for:

Why recursive memories matter

Why compression is beneficial

Why LSTM + attention are not redundant

This means:

It might work

But it’s not theoretically grounded

Bottom Line

This work is:

Technically competent

Architecturally coherent

Incrementally creative

It is NOT:

A new theory of learning

A mathematically deep construction

A principled unification (despite the name)

If I had to summarize it honestly:

“A reasonably engineered hybrid RNN-attention cell with hierarchical memory and a compression gate, expressed with more ambition than justification.”
>>
Where It Fails as a Primitive
1. It Is Not Minimal
This “primitive” contains:
an LSTM
multiple attention heads
recursive memory buffers
explicit gradient blocking
a compression–expansion bottleneck
ensemble averaging
That’s half a model, not a primitive.
A primitive should be explainable in one sentence.
He needs five paragraphs.

2. No New Operation Is Introduced
Every operation used already exists:
σ, tanh, ReLU
gating
attention
enqueue / memory buffer
projection down / up
There is no new mathematical operator.
That alone disqualifies it as a primitive.

3. Behavior Is Emergent, Not Atomic
A primitive has a direct behavioral meaning:
Attention → “select”
Convolution → “local aggregate”
Gate → “modulate flow”

This block’s behavior is:
“Whatever emerges when these parts interact”
That’s architecture-level behavior, not primitive-level.

Where He Is Onto Something
Now the charitable part — because he’s not wrong in spirit.

1. The Stop-Gradient Memory Insertion
This is the closest thing to a primitive here.

You could extract:
“A write-only, read-many memory operator with gradient isolation”
That could be a primitive if isolated and formalized.

2. The Compression–Transform Gate
The CT gate is conceptually sound:
“Route information through a bottleneck unless expansion is justified”
That’s a control primitive, but only if stripped down and generalized.
Right now it’s buried.
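Stripped down, the control idea might look like this. A sketch with my own layer names and shapes, not the repo's:

```python
# A gate decides, per feature, whether to take the cheap bottleneck path
# or the expanded one.
import torch
import torch.nn as nn


class BottleneckRouter(nn.Module):
    def __init__(self, dim, bottleneck):
        super().__init__()
        assert dim % bottleneck == 0
        self.gate = nn.Linear(dim, dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.tile = dim // bottleneck

    def forward(self, h):
        g = torch.sigmoid(self.gate(h))                       # "is expansion justified?"
        small = torch.relu(self.down(h))                      # compressed representation
        return g * self.up(small) + (1 - g) * small.repeat(1, self.tile)
```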

3. The Intent Is Correct
He’s trying to address a real gap:
RNNs remember locally
Transformers remember globally
Neither handles hierarchical abstraction over time cleanly
That instinct is correct.

How He Should Reframe It:
If he wants this to be taken seriously as a primitive, he needs to:

1. Pick ONE Idea
Not five.

Examples:
“Gradient-isolated memory write”
“Recursive attention accumulation”

“Gated compression routing”
>>
2. Define It Abstractly
Something like:

Definition:
A memory operator N that accepts state hₜ and returns a read vector rₜ, where memory writes are non-differentiable and reads are differentiable.

That’s primitive language.
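As a sketch, that definition is only a few lines. The attention read below is one possible differentiable read, not the only one, and the buffer length is arbitrary:

```python
from collections import deque

import torch
import torch.nn as nn


class GradientIsolatedMemory(nn.Module):
    def __init__(self, dim, heads=4, maxlen=128):
        super().__init__()
        self.buf = deque(maxlen=maxlen)
        self.read_op = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h_t):                        # N(h_t) -> r_t
        self.buf.append(h_t.detach())              # non-differentiable write
        kv = torch.stack(list(self.buf), dim=1)    # (B, T, dim)
        r_t, _ = self.read_op(h_t.unsqueeze(1), kv, kv)
        return r_t.squeeze(1)                      # differentiable read
```

Written like this, the same object can be dropped into an RNN step, a Transformer block, or a temporal convolution stack unchanged, which is exactly the portability test in the next point.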

3. Show It Working in Multiple Contexts

A primitive must survive being used in:
RNNs
Transformers
CNN-like temporal models

Right now, this only works inside itself.

Honest Verdict You Could Give Him

If you want a fair but accurate response, something like:

“This is a well-engineered composite cell and a solid architectural experiment. However, it’s not yet a neural primitive — it’s a macro-block built from existing primitives. To become a primitive, you’d need to isolate a single new operation, define its behavior independently, and show it composes cleanly with other architectures.”
That’s not dismissive.
That’s correct.
>>
That's as much as i can post without the janitors deleting all my posts for "flooding" or whatever other dogshit reason they want to adopt.
>>
>>16888303
Thank you for your in depth response, actually carrying the board fr fr no cap senpai (not OP)
>>
>>16888302
>d fr fr no cap senpai (not OP)
>>16888301
>>16888300
give me a sec before i read your response. i am using it to categorize radiometric data into anomalies right now.
>>
>>16888302
fair response. also the architectural criticism is fair and thank you for the constructive feedback. i will work on it.
>>
>>16888303
https://github.com/pkcode94/deepseekx
btw here's the github.


