RNN LLM with small models

A lot of avenues have been explored, such as RWKV and its descendants.

This proposal is very similar, but adds some additional elements to make the recurrent part shine.

The main takeaway is that instead of having a lot of transformer blocks, we have just one transformer block for entering thinking mode, one for exiting, and an RNN block for doing the thinking.
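
A minimal sketch of this stack, in PyTorch-style Python. The class name, sizes, and the choice of a weight-tied transformer layer as the thinking block are all just illustrative:

```python
import torch.nn as nn

class RecurrentThinker(nn.Module):
    """Sketch: one entry block, one weight-tied thinking block applied
    repeatedly, one exit block. All names and sizes are illustrative."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.enter = make()       # enters thinking mode
        self.think = make()       # the single recurrent thinking block
        self.exit_block = make()  # exits thinking mode

    def forward(self, x, n_steps):
        h = self.enter(x)
        for _ in range(n_steps):  # same weights every round: an RNN over depth
            h = self.think(h)
        return self.exit_block(h)
```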

We have different sizes of the RNN thinking block, and use RL as the last step of training to unlock them.

The ideal end result would be a Turing-complete LLM, but that is too hard to train. So this proposal is a middle ground, where the thinking block, due to its recurrent nature, will be able to generalize to skills and knowledge not in the training set.

We still use MoE, and a key idea in the algorithm is that we remove the typical 30-60 transformer blocks and rely even more on mixture of experts.

Why is it powerful? Instead of having a fixed number of steps, where each transformer block gets closer and closer to a solution, the model can stop quickly when the problem is easy, and, due to generalization, it can keep thinking for extremely hard problems.

The novel idea is how this is trained. We take inspiration from MoE, where there is gating and only the parameters used in the forward pass are backpropagated through.

What we do during training is run through different numbers of thinking steps and simply choose the best one. If multiple step counts give the same result, we use the lowest count. This means we keep optimizing for the best result, even though we don't know in advance how many thinking rounds are needed to converge.
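
Here is a minimal sketch of that selection, assuming the `RecurrentThinker` above and a per-sample `loss_fn` (both hypothetical names):

```python
def best_round_loss(model, x, target, loss_fn, max_steps=8):
    """Try every number of thinking rounds and keep the winner."""
    losses = []
    h = model.enter(x)
    for _ in range(max_steps):
        h = model.think(h)
        losses.append(loss_fn(model.exit_block(h), target))
    # min() keeps the first minimum, so ties go to the fewest rounds.
    best = min(losses, key=lambda l: l.item())
    return best  # best.backward() trains only the rounds actually used
```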

 

You can think of the algorithm as thinking along until, at some point, the thinking has produced something that snaps to the correct result when combined with the final output block.

When training, we don't need to know how many rounds are needed for this snapping to happen, since we know the end result: we just think up to the maximum and see which round snapped best.

But at test time we don't know when to stop thinking and run the last block.

This is where RL comes in. We freeze all parameters of the network and add an additional selector block. This block's only purpose is to detect when a snap should occur.
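
A sketch of what that could look like: freeze the backbone, then bolt on a tiny head (the pooling and head shape are my assumptions) that scores whether the current thinking state is ready to snap:

```python
import torch
import torch.nn as nn

def freeze(model):
    for p in model.parameters():
        p.requires_grad_(False)  # the trained network stays fixed

class SnapSelector(nn.Module):
    """Hypothetical selector head, trained with RL: scores whether the
    current thinking state is ready to be handed to the exit block."""

    def __init__(self, d_model=512):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, h):
        # h: (batch, seq, d_model) -> probability that we should stop now
        return torch.sigmoid(self.score(h.mean(dim=1)))
```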

This works not only for this type of RNN with a single thinking block, but also when we have thinking blocks of different sizes.

We use the same training pattern, except we need to add some additional measures to make sure all network sizes are trained. The important part is that all network sizes are connected together during training in different combinations, so a large block can sit in the recurrence with a smaller one.
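
One way to arrange that, assuming all block sizes share the same state width so they can be chained directly (different widths would need small adapter layers):

```python
import random

def sample_schedule(blocks, max_rounds=8):
    """blocks: e.g. {"small": small_block, "large": large_block}.
    Returns a random chain, so every size trains in every position
    and learns to pass its state to every other size."""
    names = list(blocks)
    n = random.randint(1, max_rounds)
    return [blocks[random.choice(names)] for _ in range(n)]

def run_schedule(h, schedule):
    for block in schedule:  # e.g. large -> small -> small -> large
        h = block(h)
    return h
```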

 

As mentioned, we know how to train, but not how to run inference at test time, since we don't know what combination and order of small/large blocks will work for a given sample. And here we use RL again.

The RL policy only makes simple choices: which thinking block size comes next, and whether the thinking is done. We know that a solution exists (assuming the model has generalized), so the RL is all about finding it.
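
A sketch of that policy and the resulting test-time loop; the action set, pooling, and round cap are illustrative:

```python
import torch
import torch.nn as nn

ACTIONS = ["small", "large", "stop"]  # illustrative action set

class ThinkPolicy(nn.Module):
    """Hypothetical RL policy: picks the next block size, or stops."""

    def __init__(self, d_model=512):
        super().__init__()
        self.head = nn.Linear(d_model, len(ACTIONS))

    def forward(self, h):
        return torch.distributions.Categorical(logits=self.head(h.mean(dim=1)))

def think(model, policy, blocks, x, max_rounds=32):
    """Test-time loop, batch size 1 for simplicity."""
    h = model.enter(x)
    for _ in range(max_rounds):  # hard cap so inference always terminates
        action = ACTIONS[policy(h).sample().item()]
        if action == "stop":
            break
        h = blocks[action](h)  # run the chosen thinking block once
    return model.exit_block(h)
```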

This doesn't make the architecture Turing complete, since we would still need the memory tape. It also lacks a way to say "I don't know", which is another critical feature.

 

But it does get us much closer to something useful, since the generalization is much more powerful than in existing LLMs with a fixed number of blocks.

 

This means that we can hopefully build on top of the trained model with prompt engineering and maybe more RL to make it Turing complete.
