How to create super intelligence

With the advent of thinking models, or test-time compute as it is called, AI has reached a milestone. It could probably be called AGI, and continuing to refine this path can give very strong models.

But it is still not very close to the theoretical maximum of how good these models can get.

To get a little closer, here are six ingredients for achieving super intelligence. The underlying reason why this will probably work can be found in this paper:

https://arxiv.org/html/2305.17026v3 "How Powerful are Decoder-Only Transformer Neural Models?"

If the transformer architecture is Turing complete (meaning it can run like a computer program, and thus solve any computable problem), then we can probably also assume that the architecture can have Turing-like properties, running not as a program, but as reasoning.

In other words, when the number of layers and parameters rises, we can probably simulate something that is more complex than a circuit (and we already know we can simulate a circuit of any size).

We have already seen this effect with the reasoning models spearheaded by OpenAI's o1 series. But even that is very far from the theoretical optimum. So let's see all the ingredients we need. There are of course other ways super intelligence can be achieved, but this is just one way.

Ingredients

Context reasoning

Let the AI think in the context before a result is returned. This is already widely used, and is achieved by using reinforcement learning.
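
To make the flow concrete, here is a minimal Python sketch of the idea, under the assumption that the model marks its hidden reasoning with <think> tags; generate_token and toy_generate are made-up stand-ins for a real sampler, not any particular API.

    def answer_with_reasoning(prompt, generate_token, max_tokens=512):
        # reasoning phase: thinking tokens are appended to the context,
        # but they are not part of the answer the user sees
        context = prompt + "<think>"
        for _ in range(max_tokens):
            token = generate_token(context)
            context += token
            if token == "</think>":      # the model decides when it is done thinking
                break
        # answer phase: the visible result is produced from the thought-filled context
        answer = ""
        for _ in range(max_tokens):
            token = generate_token(context + answer)
            if token == "<end>":
                break
            answer += token
        return answer

    # toy stand-in "model": thinks one step, then answers 42
    def toy_generate(ctx):
        if "</think>" not in ctx:
            return "</think>"
        return "<end>" if ctx.endswith("42") else "42"

    print(answer_with_reasoning("What is 6*7? ", toy_generate))   # prints 42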

Context cleaning

During reasoning, the AI can choose to clean up its own context window. This can be done as simply as having a range that gets replaced with some text.
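
As a minimal sketch, assuming the context is just a list of tokens and that the model can emit a (start, end, replacement) instruction (both assumptions made for the example, not an existing API), the cleanup could look like this:

    def clean_context(tokens, start, end, replacement):
        # collapse tokens[start:end] into a short summary chosen by the model
        return tokens[:start] + [replacement] + tokens[end:]

    context = ["step1:", "try", "x=3", "...", "(long", "dead", "end)", "step2:", "x=5", "works"]
    context = clean_context(context, 1, 7, "<summary: x=3 was a dead end>")
    print(context)
    # ['step1:', '<summary: x=3 was a dead end>', 'step2:', 'x=5', 'works']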

Global database

A global read-only database that the LLM can look things up in, with the result added to the context window. Knowledge then lives in this database instead of in the transformer weights.
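
A minimal sketch of the lookup, where a Python dict stands in for what would in practice be a large external store (SQL, key-value, or a search index); the <lookup: ...> convention is an assumption made up for the example:

    GLOBAL_DB = {
        "boiling point of water": "100 °C at 1 atm",
        "speed of light": "299,792,458 m/s",
    }

    def lookup(query):
        # read-only: the model can query but never modify this store
        return GLOBAL_DB.get(query, "<not found>")

    def handle_lookup_request(context, query):
        # the result goes into the context window instead of living in the weights
        return context + [f"<lookup: {query} -> {lookup(query)}>"]

    ctx = handle_lookup_request(["What", "is", "the", "speed", "of", "light?"], "speed of light")
    print(ctx[-1])   # <lookup: speed of light -> 299,792,458 m/s>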

Local database

A database or vector store that can be written to. This works in tandem with the cleaning: sometimes the reasoning done in the context window should be used later, and sometimes there is no need and it should be completely gone.
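
A minimal sketch of such a store, where a trivial word-overlap score stands in for real vector embeddings; the class and its methods are invented for illustration:

    class LocalStore:
        def __init__(self):
            self.entries = []                 # reasoning snippets written by the model

        def write(self, text):
            self.entries.append(text)

        def retrieve(self, query, k=1):
            # word overlap as a placeholder for embedding similarity
            def score(entry):
                return len(set(entry.lower().split()) & set(query.lower().split()))
            return sorted(self.entries, key=score, reverse=True)[:k]

    store = LocalStore()
    store.write("Tried factoring 91: 91 = 7 * 13, both prime.")
    # ...the span above can now be cleaned out of the context window...
    print(store.retrieve("prime factors of 91"))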

In context code execution

A way for the reasoning part to run code as part of the reasoning, so a coding language that is very fast to compile and efficient to run. It should be implemented in such a way that it can also control the flow of the reasoning. Imagine that the reasoning comes up with some number of steps it needs to do to solve the problem, and then it makes a program that goes through these steps, where some of the steps are validation steps that validate the previous calculations. The reasoning is then forced to go through this, which mitigates hallucinations where a completely different result is achieved.
It can also be used for simple math or statistics, so the model does not have to use weights for that.
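
A minimal sketch of that control flow, where the plan and its validation checks are assumed to have been written by the reasoning phase; the step format is invented for the example:

    def run_plan(steps):
        # executing the plan forces every validation to pass before the flow continues
        results = {}
        for step in steps:
            if step["kind"] == "compute":
                results[step["name"]] = step["fn"]()
            elif step["kind"] == "validate":
                if not step["check"](results):
                    raise ValueError(f"validation failed at {step['name']}")
        return results

    plan = [
        {"kind": "compute",  "name": "area",       "fn": lambda: 12 * 7},
        {"kind": "validate", "name": "area>0",     "check": lambda r: r["area"] > 0},
        {"kind": "compute",  "name": "total",      "fn": lambda: 12 * 7 * 3},
        {"kind": "validate", "name": "consistent", "check": lambda r: r["total"] == r["area"] * 3},
    ]
    print(run_plan(plan))   # {'area': 84, 'total': 252}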

Model selection

Not all tasks for the AI are going to be the same difficulty, so changing the model during reasoning seems meaningful. This can be achieved by having a specialized model decide, before each token is chosen, which reasoning model should be used. The model selection works especially well with the in-context code execution, since the execution can make sure validation steps are done. Since that is known, a much smaller reasoning model can be used for most parts, and a big model is only used for validation. If the small model cannot solve the problem, a bigger model is chosen. But since validation is always done by a big model, there is no risk in trying the fastest one first.
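
Here is a minimal sketch of the escalation logic, simplified to per-task selection rather than the per-token selection described above; the model functions are toy stand-ins:

    def solve_with_escalation(task, models):
        # `models` is ordered from smallest/fastest to biggest
        big = models[-1]
        for model in models:
            candidate = model(task)
            # validation is always done by the big model, so trying small first is safe
            if candidate == big(task):
                return candidate, model.__name__
        # if nothing passed validation, trust the big model's own answer
        return big(task), big.__name__

    def small_model(task):  return "quick guess for " + task
    def big_model(task):    return "careful answer for " + task

    print(solve_with_escalation("hard task", [small_model, big_model]))
    # ('careful answer for hard task', 'big_model')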

Test time

After training the models, we have an extremely capable super intelligence.

  • Fast - model selection makes sure the fastest model that can solve the job is used. Small context windows mean it will be fast. Code execution means some reasoning steps can be replaced with traditional computing. A database lookup is much faster than having the knowledge in weights.
  • Accurate - Most knowledge will be in the database. The code execution makes it much easier to be correct. The training is done using reinforcement learning, so hallucinations occur only in extreme corner cases.
  • Capable - The model selection chooses a big model when problems are hard to solve. The reasoning uses the global knowledge database and will put information into the context window as needed. At the same time the context window is kept small, since it can be cleaned up, which further enhances reasoning capabilities. And the local database ensures that reasoning can be retrieved again when it is relevant at a later point.

Training time

The end result is that we have a series of reasoning models of different sizes, and a selection model. But how do we actually train it?

That is probably a more open question, since doing it efficiently is probably very valuable. But the expensive way is just to use the existing best model to generate training data.

So the big model will be given question/answer pairs, and will then provide the thinking steps, including setting up the global and local databases.

For the model selection the big model also just chooses what it thinks is appropriate. Some self-validation can be done, since the in-context code execution and the database usage can actually be run.

After the training data is generated, a standard transformer model (or any other appropriate architecture) can just use that as normal token prediction.
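
A minimal sketch of the data-generation step, assuming the existing strong model is available as a callable big_model(prompt); the prompt wording and the trace format are assumptions made for illustration:

    def generate_traces(qa_pairs, big_model):
        traces = []
        for question, answer in qa_pairs:
            prompt = (
                "Question: " + question + "\n"
                "Known answer: " + answer + "\n"
                "Write the full reasoning trace that leads to this answer, including "
                "database lookups, local-store writes and in-context validation code."
            )
            traces.append({"question": question,
                           "trace": big_model(prompt),
                           "answer": answer})
        return traces

    # traces = generate_traces(qa_pairs, big_model)
    # the smaller reasoning models and the selection model are then trained on
    # these traces with ordinary next-token prediction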

The above is almost how thinking models are already produced, but with some extra steps for the additional components.

Conclusion

Test-time reasoning has already been shown to be extremely capable. But it can be improved tremendously, since some big components have not been implemented yet. If they are implemented, the price for accurate and fast AI could be much lower.

