RNN with hierarchical attention

 The paper "Hierarchical Reasoning Model" has recently been released. In shows that a recurrent neural network can be used for LLM.

The big thing missing is that they still rely on RoPE and the transformer architecture, so handling large context windows with high precision is still limited.

They also mention that HRM is Turing complete, and while it is much closer to being Turing complete, I would argue that to be fully Turing complete the system should also be able to use infinite memory.

But it is very hard to train a model end to end for that, since it has to make complex decisions about how to use that memory.

This article will describe a model where infinite memory is not solved; I imagine that functionality can be bolted on later with RL, using context-space thinking.

But the underlying end-to-end trained model will have recurrent thinking over very big context windows.

We also have to design performance into the training. When the model is released to end users it should run fast and efficiently. If done correctly, training will also be faster, making the end result even more capable within the same training budget.

This optimization can probably be bolted on and be part of the end-to-end training.

For the actual architecture we don't use transformer attention or positional encodings like RoPE.

 

Instead we pair up adjacent tokens and calculate the next-'layer' token from each pair. We use an RNN, so the output from one layer is the input to the same neural network.

Think of it as mip-mapping: each layer halves the number of tokens. The final output of this pyramid is then the input to the next thinking step.
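A minimal sketch of how I read this pairwise pyramid, with one shared merge network applied at every layer. Names like PairReduce and d_model are mine, and the code assumes the number of tokens is a power of two; it is an illustration, not a reference implementation.

```python
import torch
import torch.nn as nn

class PairReduce(nn.Module):
    """One shared network that merges two child tokens into one parent token."""
    def __init__(self, d_model: int):
        super().__init__()
        self.merge = nn.Sequential(
            nn.Linear(2 * d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, pairs: torch.Tensor) -> torch.Tensor:
        # pairs: (batch, n_pairs, 2, d_model) -> (batch, n_pairs, d_model)
        b, p, _, d = pairs.shape
        return self.merge(pairs.reshape(b, p, 2 * d))

def pyramid(tokens: torch.Tensor, cell: PairReduce) -> torch.Tensor:
    """Repeatedly pair up neighbours and apply the same cell (RNN-style reuse)
    until a single summary token remains. Assumes length is a power of two."""
    x = tokens
    while x.size(1) > 1:
        b, n, d = x.shape
        x = cell(x.view(b, n // 2, 2, d))  # one layer of the pyramid
    return x[:, 0]  # top of the pyramid

# usage: 8 tokens of width 16 reduced to one summary vector
cell = PairReduce(d_model=16)
summary = pyramid(torch.randn(2, 8, 16), cell)
print(summary.shape)  # torch.Size([2, 16])
```

Because the same cell is reused at every layer, gradients from the final output flow back through the whole pyramid, which is the end-to-end training described next.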

And we backpropagate from the final output token through all of these pyramids.

Each token pair is processed in latent space using a mixture of experts (MoE), to make it more efficient.
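If I read the MoE part correctly, a switch-style top-1 router over each concatenated pair would look roughly like this; PairMoE, n_experts and the top-1 routing choice are my assumptions rather than details from the post.

```python
import torch
import torch.nn as nn

class PairMoE(nn.Module):
    """Route each concatenated token pair to one of a few small expert networks."""
    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.d_model = d_model
        self.router = nn.Linear(2 * d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * d_model, d_model), nn.GELU())
             for _ in range(n_experts)]
        )

    def forward(self, pairs: torch.Tensor) -> torch.Tensor:
        # pairs: (batch, n_pairs, 2 * d_model)
        gate = self.router(pairs).softmax(dim=-1)   # (batch, n_pairs, n_experts)
        weight, idx = gate.max(dim=-1)              # top-1 routing per pair
        out = pairs.new_zeros(*pairs.shape[:2], self.d_model)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # scale by the gate weight so the router still gets a gradient
                out[mask] = expert(pairs[mask]) * weight[mask].unsqueeze(-1)
        return out
```

Only one small expert runs per pair, so the per-token compute stays low while the total parameter count grows with the number of experts.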

Another optimization is autoregression: when processing the first layers in the pyramid we don't give the network any information about where in the pyramid it is, but when we get close to the top we tell it which layer it is at and how many layers are left.
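One way this conditioning could look, sketched under the assumption that only the last few layers (a hypothetical top_k cutoff) receive a learned embedding of the current layer and the layers remaining; DepthConditioning and its parameters are illustrative names.

```python
import torch
import torch.nn as nn

class DepthConditioning(nn.Module):
    """Add depth information only near the top of the pyramid."""
    def __init__(self, d_model: int, max_layers: int, top_k: int = 3):
        super().__init__()
        self.top_k = top_k
        self.layer_emb = nn.Embedding(max_layers, d_model)
        self.remaining_emb = nn.Embedding(max_layers, d_model)

    def forward(self, x: torch.Tensor, layer: int, total_layers: int) -> torch.Tensor:
        remaining = total_layers - layer
        if remaining > self.top_k:
            return x  # early layers: no positional information at all
        idx = torch.tensor(layer, device=x.device)
        rem = torch.tensor(remaining, device=x.device)
        return x + self.layer_emb(idx) + self.remaining_emb(rem)

# usage: only the last few layers receive depth information
cond = DepthConditioning(d_model=16, max_layers=12)
x = torch.randn(2, 4, 16)
x = cond(x, layer=10, total_layers=12)   # remaining=2 <= top_k, so conditioned
```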

To make this work for later thinking steps, each step just adds its output as a new thinking token in a thinking section of the context. In that way previous thoughts can be mostly reused, since only one 'side' of the pyramid has changed.

So only the top and one side of the pyramid are recalculated at each thinking step, and the complexity is n plus a constant, where n is the size of the context window.
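To illustrate the reuse, here is a toy (non-neural) version where the pyramid is stored like a segment tree, so writing one new thinking token only recomputes its ancestors, one node per layer, instead of rebuilding the whole pyramid. The merge lambda stands in for the shared pair network and the power-of-two size is an assumption of the sketch.

```python
class Pyramid:
    def __init__(self, size, merge, fill=0.0):
        # size must be a power of two (pad the context up to it)
        self.size = size
        self.merge = merge
        self.nodes = [fill] * (2 * size)   # nodes[1] is the top of the pyramid
        for i in range(size - 1, 0, -1):
            self.nodes[i] = merge(self.nodes[2 * i], self.nodes[2 * i + 1])

    def set_token(self, index, value):
        """Write one token and recompute only the changed 'side' of the pyramid."""
        i = self.size + index
        self.nodes[i] = value
        while i > 1:
            i //= 2
            self.nodes[i] = self.merge(self.nodes[2 * i], self.nodes[2 * i + 1])

    def top(self):
        return self.nodes[1]

# usage: writing a new thinking token touches one node per layer, not the whole pyramid
pyr = Pyramid(8, merge=lambda a, b: (a + b) / 2)
pyr.set_token(3, 1.0)   # new thinking token written into slot 3
print(pyr.top())
```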
