Zero day super intelligence
Current state
Grok 4 has been released and improves on synthetic benchmarks by a large margin. But when it comes to actual coding, it falls short.
When I am coding, I want fast and correct responses that change only the minimal amount in an existing large code base.
Gemini Pro and Claude are on par with each other there, and Grok 4 brings no improvement for that use case.
One of the reasons it does well on benchmarks is its prompt engineering with an agent specialization framework:
- Analysis Agent focuses on data interpretation
- Synthesis Agent combines multiple perspectives
- Verification Agent cross-checks reasoning accuracy
- Communication Agent translates findings coherently
While this is interesting, it doesn't really solve the benchmarks completely; it just enhances the capabilities of the underlying LLM and is still limited by the underlying structure.
What if we could have an LLM that we could just train, and it would get 100% accuracy on all benchmarks while also being optimized for speed?
This is the idea of zero day super intelligence: you train your model, and out comes super intelligence.
Model
The first thing we need to establish is the model we will train.
It consists of different elements that together ensure both intelligence and speed.
The model works exactly like existing LLMs: you provide some context and it continues from it.
So one element is the context window.
Next we have what I call the neural state. This is the short-term memory, and it is the input to the internal models.
Then we have the internal state. This is a longer-term memory used for the duration of the prompt.
We can then make an action loop with the following actions:
- read from context
- read/write internal state
- read/write neural state (thinking)
- complete output (final step)
Whenever we perform an action, except the final step, the neural state gets updated. Updating is done with patching, so the internal models only have to output a minimal number of tokens for each iteration.
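To make the moving parts concrete, here is a minimal sketch of the state and the action loop. The names, the token-id representation, and the three callables are my own assumptions for illustration, not a fixed design.

    from dataclasses import dataclass, field
    from enum import Enum, auto

    class Action(Enum):
        READ_CONTEXT = auto()   # read a slice of the context window
        RW_INTERNAL = auto()    # read/write the internal state (longer-term, per prompt)
        RW_NEURAL = auto()      # read/write the neural state (thinking)
        COMPLETE = auto()       # final step: produce the output

    @dataclass
    class State:
        context: list[int]                                             # the context window (token ids)
        neural: list[int] = field(default_factory=list)                # short-term memory, input to the internal models
        internal: dict[str, list[int]] = field(default_factory=dict)   # longer-term memory for the duration of the prompt

    def run(state: State, select_action, apply_action, apply_patch) -> list[int]:
        """The action loop: choose an action, perform it, patch the neural state, repeat."""
        while True:
            action = select_action(state.neural)                  # a selector model decides
            result = apply_action(action, state)                  # an action expert model executes it
            if action is Action.COMPLETE:
                return result                                     # final output; no neural-state update
            state.neural = apply_patch(state.neural, result)      # patch: minimal number of tokens per iteration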
With the elements above we have an AI Turing machine. The internal state acts like unbounded memory that the AI can jump around in, and the neural state decides where to jump.
By doing this we have replaced a lot of logic that is currently hand-written in LLM systems with an approach where the algorithm for handling a big context window is learned through training.
Then we have a lot of models: a number of selector models in different sizes, and for each action a number of action expert models, which vary along two dimensions, size and knowledge.
So we have m + (n × o) total models, where m is the number of selector models, n the number of actions, and o the number of expert models per action. All the models are standard transformer models whose context window is the size of the neural state, plus additional input for what is read from the context/internal state for those actions.
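As an illustration of the m + (n × o) count, with purely made-up numbers:

    # Purely illustrative numbers for the m + (n * o) model count.
    m = 4    # selector models, in four different sizes
    n = 4    # actions: read context, read/write internal state, read/write neural state, complete
    o = 6    # action expert models per action (different sizes and knowledge areas)

    total_models = m + n * o
    print(total_models)   # 4 + (4 * 6) = 28 models in this made-up setup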
The flow is that given the neural state, the selector model chooses an action and a corresponding action model. We have a special DontKnow token that the model can return, and if it does, a bigger selector model is chosen. Since only one token needs to be returned, finding the right size is fast.
And since we have many different models to choose from, in many different sizes, we get a fast response when that is possible and a thoughtful response when that is needed.
The DontKnow token can be used by any model and acts like an exception that can happen at any time. When it is returned, a bigger selector model is chosen, until the biggest has been selected. And if that doesn't work, the LLM will just respond with a hard-coded phrase saying it doesn't know the answer.
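A minimal sketch of that escalation, assuming the selector models are held in a list ordered from smallest to largest and that DontKnow is a sentinel value any of them can return:

    DONT_KNOW = "<DontKnow>"   # assumed sentinel for the special DontKnow token

    def select_with_escalation(neural_state, selectors):
        """Try selector models from smallest to largest, escalating on every DontKnow."""
        for selector in selectors:                 # ordered by increasing size
            choice = selector(neural_state)        # a single token: which action and action model to use
            if choice != DONT_KNOW:
                return choice
        return None   # even the biggest selector gave up; caller answers with the hard-coded phrase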
The reasoning behind this is twofold: primarily it optimizes for performance inside the LLM, but it also optimizes for correctness when the LLM is used agentically.
With all this implemented and trained, we should have a very capable LLM:
- The size and complexity of the context window it can handle are almost unlimited, since the model itself has figured out how to analyze it in optimal ways.
- The difficulty of the problems it can solve is very high, since it can use a lot of memory for storing intermediate calculations, it can do simple calculations fast, and when a harder sub-problem needs to be answered it switches briefly to a larger model.
- Its understanding of the context window is very good, since analysis is put into the internal state, so anything that can be split into smaller pieces or combined from smaller pieces can be handled. It can go through the context window, or part of it, as many times as needed if it gains new insights while thinking.
- Performance should be really good, since the smallest model that can handle a task is chosen.
All in all, it will be a revolution compared to existing models when coding in tools like Cursor, where you want a tight loop between the end user and the AI to avoid the AI implementing things that are not intended.
It should get close to 100% on all synthetic benchmarks, since it really is just a Turing machine with a thinking head, and we know Turing machines are capable of solving any computable problem. This is also why it is called zero day super intelligence: after training you will have something that is super intelligent and fast from one day to the next.
Training
We know that if we get training to work, we should get super intelligence.
But is that even possible?
The hard part is to avoid getting stuck in local minima when doing RL.
An example could be the DontKnow token: the model could simply return DontKnow on most data and memorize the precise answers for the rest. That wouldn't lead to generalization, and the model would only be able to answer what is in the training data.
The solution is simply to hold off on the DontKnow functionality until after the model has generalized. At that point it has a lot of generalized and correct knowledge, which will be kept, since losing it would give a lower score.
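One simple way to hold the DontKnow token back, sketched here under the assumptions that it has a fixed vocabulary id and that we gate it on a training-step threshold:

    import torch

    DONT_KNOW_ID = 50257              # assumed vocabulary id for the DontKnow token
    ENABLE_DONT_KNOW_AFTER = 100_000  # assumed step count after which the token is allowed

    def mask_dont_know(logits: torch.Tensor, step: int) -> torch.Tensor:
        """Before the threshold, set the DontKnow logit to -inf so it can never be sampled."""
        if step < ENABLE_DONT_KNOW_AFTER:
            logits = logits.clone()
            logits[..., DONT_KNOW_ID] = float("-inf")
        return logits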
A much harder problem is that every time we switch between models, we go through the neural state, which means we lose backpropagation.
Also, the problems that need to be solved require big jumps in functionality, which will have cascading effects on the calls to the following models.
To solve all of this, we use a slew of different techniques:
- We use SL (supervised learning) on the completion step, since that is possible for the final step. This means that we are at least as good as a normal LLM in that step, giving us a good starting point.
- We update the weights of each model independently. This way we know the final result actually reflects the improvement and isn't obscured by other changes.
- We use separate models for each iteration of the loop. This is for the same reason as above: to actually make the model continuously improve and to avoid recursive effects.
- We remember the output of each internal LLM and the neural state for each iteration and training example (see the sketch after this list). This is a key point, since it lets us reset all models in a new configuration while still keeping the training results so far. So the important part is no longer the model weights, but the intermediate training data.
- We combine the intermediate training data and use SL to re-configure models. By combining data from multiple thinking steps, we generalize to an even higher number of thinking steps than we trained with (remember we have a new model for each step).
- Since all we are interested in is the intermediate training data, we can train on smaller data sets with a very small model. That way we know that we still generalized, and after we have generalized thinking for that problem, it can be combined into a bigger model.
- We can also generate intermediate training data with a powerful LLM, since the translation into weights will be done when combining.
- We don't care if the intermediate training data doesn't match exactly, since RL will converge it; translation is not a hard task for RL.
- After we have trained with an independent model for each thought step, we switch to the final configuration as described in the model section. Since all models will have the same intermediate training data, we can now use RL to make small changes that improve recursively through the thinking process. The same model will be used in multiple thinking steps, but we can ensure that the updated weights have the correct effect by keeping the step size sufficiently small. Remember, at this point we are not trying to let the LLM figure out any new insights, but just optimizing what it already knows and distributing that knowledge to different models for the best performance.
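To make the intermediate-training-data idea a bit more concrete, here is a sketch of what the recorded steps could look like. The record fields and the helper are assumptions; the point is only that once every (neural state, action, output) tuple is stored per step and per example, any new model configuration can be re-fitted with SL from those records instead of being trained from scratch.

    from dataclasses import dataclass

    @dataclass
    class StepRecord:
        example_id: str          # which training example the step belongs to
        step: int                # index in the thinking loop
        neural_state: list[int]  # token ids of the neural state fed to the model
        action: str              # the action the selector chose
        output: list[int]        # token ids the action model produced (the patch)

    # The cache of intermediate training data; the weights stop being the
    # source of truth, these records are.
    trajectory_cache: list[StepRecord] = []

    def sl_pairs_for(action: str) -> list[tuple[list[int], list[int]]]:
        """Collect (input, target) pairs for re-fitting whichever model handles
        `action` in a new configuration, using plain SL on the cached steps."""
        return [(r.neural_state, r.output) for r in trajectory_cache if r.action == action]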
Training data
We have two types of training data: data that is just completed, and data that we know has only one correct result.
The latter is there to make the LLM think logically, since we want a very clear signal for that part. We should be able to do without it, but the training would take much longer. And given how easy the data is to get and the benefit it provides, including it makes sense.
It can be math and logic problems, and problems related to understanding a large and complex context window.
The other type is just normal completion, and any data can be used. For the final step model we of course just use SL, but for the RL we use an LLM as a judge.
RL is also used to align with the model spec, which should be enough to keep the final step model from going out of spec.
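A minimal sketch of what an LLM-as-a-judge reward for the RL phase could look like; the judge interface, the prompt wording, and the 0-10 scale are all assumptions:

    # The `judge` object and its generate() method are hypothetical stand-ins.
    def judge_reward(prompt: str, completion: str, judge) -> float:
        """Ask a judge LLM to score a completion and normalize the score to [0, 1]."""
        verdict = judge.generate(
            "Rate the following answer from 0 to 10 for correctness and usefulness. "
            "Reply with the number only.\n\n"
            f"Question:\n{prompt}\n\nAnswer:\n{completion}\n\nScore:"
        )
        try:
            return min(max(float(verdict.strip()) / 10.0, 0.0), 1.0)
        except ValueError:
            return 0.0   # an unparseable verdict gives no reward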