ASI framework strategy

The previous experiments have shown that some of the critical components for ASI are possible: https://hardai-omnia.blogspot.com/2025/08/asi-experiments.html.

But this page tries to better explain the overarching framework, since many pieces are still missing in the implementation. 

I am open to ideas about what to train next.

To be precise about what we are describing: an LLM that works exactly like a normal LLM, except that it is extremely good at retaining details and using them from a very large context window, and when training it will optimize both for correctness and performance.

But it is limited just like a normal LLM, and should be put into an agentic framework. So agentic frameworks that do RAG, coding, connect to databases and so on are still very much needed.

But this LLM will use whatever you put in the context window in the best way possible. 

We should probably also touch a little bit on how AI companies will make money in the future.

There is a race to the bottom for some AI. If you want a recipe or a birthday card, even very small models running on a personal computer or even a smartphone are capable of that.

So AI companies want to compete where the money is - and that is where people want to pay a premium.

Are you a CEO and want to make a decision that affects the whole company - you probably want to pay a lot for AI.

Are you having a medical issue - again, you probably want to pay a premium, to be certain that you get the best advice.

Are you in a lawsuit - you probably want a good AI, depending on how critical it is.

And are you voting - you probably want to pay a premium to be able to know which politicians will actually improve your life, and which are just blowing hot air.

There are three main ways AI companies can gain a competitive edge and get customers that will pay a premium.

  1. Have a good model
  2. Pay a lot to train that model (both initially and to keep it updated)
  3. Have quite a large computer. MoE can scale to many trillions of parameters, and those experts can be split out onto a large number of servers. Not all experts will be used equally, so some form of LRU can be employed to make it cost effective (sketched below). This means that if you want to run a model of the same size on a local machine, it will be very expensive due to memory requirements.

Points 1 and 2 above can probably be done to a large degree in some open source way, but point 3 will be extremely cost inefficient if you try to run it on a private server.
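As a rough illustration of point 3, here is a minimal sketch of how an inference server could keep only the most recently used experts resident in memory. The class and parameter names are hypothetical, not part of any existing implementation.

```python
from collections import OrderedDict

class ExpertCache:
    """Keeps the most recently used experts in memory (LRU eviction).

    Rarely used experts are loaded on demand from slower storage or a
    remote server, so total expert size can far exceed local memory.
    """

    def __init__(self, load_expert, max_resident=8):
        self.load_expert = load_expert      # callback: expert_id -> weights
        self.max_resident = max_resident    # how many experts fit in memory
        self.resident = OrderedDict()       # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)     # mark as recently used
        else:
            if len(self.resident) >= self.max_resident:
                self.resident.popitem(last=False)    # evict least recently used
            self.resident[expert_id] = self.load_expert(expert_id)
        return self.resident[expert_id]
```

A provider with many servers can shard expert ids across machines and route requests to the machine that already holds the expert; a single private server would instead thrash such a cache, which is the cost inefficiency mentioned above.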

We should probably also touch a little on robotics. The LLM will be perfect as the central thought center online. This means robots will always need to be online, and some other model needs to be used for things that need fast reactions, and for dead reckoning.

With that said, the model is very efficient, so it could be possible to put it in robots that need to work in hostile environments like warfare or space - in situations where the robot is big enough that it can run a scaled-down model.

Main parts

My take is that there are three main parts that are hard to solve, and when they are solved, we can build additional functionality on top.

Turing complete

The first part is to be Turing complete. It is not enough just to be able to do loops and ifs; we must also be able to have expandable memory.

Luckily the idea I got working is quite simple: just have the concept of soft attention and hard attention, and use that on both the context window and any state that is generated while thinking.

So when the LLM makes its first thought it has attention on the beginning and end, and depending on that it shifts the attention for each thought, generates thinking state as needed, and sets attention for that as well.

To make this even more powerful, a co-processor can be added that can do math. Or it can force the thinking into a specific algorithm.
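A minimal sketch of the idea, assuming a single attention pointer per thought. The state layout and function names are my own illustration, not the trained architecture from the experiments.

```python
import numpy as np

def think_step(context, thinking_state, pointer, step_fn):
    """One thought: read around the current attention pointer, update state.

    `pointer` is the hard attention (an index into context + thinking state),
    and the soft weights decay with distance from it. `step_fn` stands in for
    the learned transition; `thinking_state` starts as np.array([]).
    """
    memory = np.concatenate([context, thinking_state])
    distances = np.abs(np.arange(len(memory)) - pointer)
    soft_weights = np.exp(-distances)              # soft attention around the pointer
    soft_weights /= soft_weights.sum()
    read = soft_weights @ memory                   # weighted read from memory

    new_pointer, new_state_chunk = step_fn(read)   # model decides where to look next
    thinking_state = np.append(thinking_state, new_state_chunk)  # expandable memory
    return thinking_state, new_pointer
```

The key point is that the thinking state can grow without bound, which is what gives expandable memory and hence Turing completeness under these assumptions.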

Performance optimizable

The next important thing is that the model needs to be optimizable for performance. The lower bound for how fast it can do a simple thought should be extremely low, and it should then scale gradually up to big and long thoughts.

We have three key ways to optimize: the size of a thought, how long to think, and how much to think in parallel.
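A minimal sketch of how those three knobs could be exposed as a configuration; the field names are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class ThinkingBudget:
    """The three performance knobs mentioned above (names are illustrative)."""
    thought_size: int = 64       # how much state a single thought carries
    max_thoughts: int = 16       # how long to think before answering
    parallel_paths: int = 1      # how many thought paths to explore at once

    def cost(self) -> int:
        # Rough proxy for compute: the knobs multiply.
        return self.thought_size * self.max_thoughts * self.parallel_paths

# A cheap reflex answer versus an expensive deliberate one.
fast = ThinkingBudget(thought_size=16, max_thoughts=1, parallel_paths=1)
deep = ThinkingBudget(thought_size=256, max_thoughts=512, parallel_paths=8)
```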

Trainable

It goes without saying that the model also needs to be trainable. If you make a model that has a really good forward pass but cannot be trained, it is really not usable.

Getting all of these parts working together is probably the hardest part - and as shown in the previous post, there is a good indication that it is actually possible.

Secondary parts

When the main parts are working, we want to add additional requirements. My guess is that they can extend the existing code without having to come up with a new novel algorithm.

Model specification

It goes without saying that the AI needs to follow a model specification - probably the open source one OpenAI is maintaining. My suggestion is to just generate the <system><user><assistant> pattern in the training data, and make sure all assistant answers adhere to the model specification.

So most of the training data is auto-regressive on the full data (except some header with metadata about the data) and doesn't care about the model specification. The model specification is then used when the model is used as an assistant.
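A minimal sketch of how the two kinds of training records could be laid out as text; the exact header fields and tags are assumptions, not a fixed format.

```python
def raw_record(source, date, content):
    # Autoregressive data: a small metadata header followed by the raw text.
    return f"<meta source={source!r} date={date!r}>\n{content}"

def assistant_record(system, user, assistant):
    # Assistant data: the answer must already adhere to the model specification.
    return f"<system>{system}</system><user>{user}</user><assistant>{assistant}</assistant>"
```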

It is quite important that the model follows the model specification - if we let it run uncontrolled, it can very easily do nefarious things to achieve a goal.

I am not that worried that it only follows the specification when used as an assistant - since that is how it will be used when people use it for some kind of agentic work or when it is used to provide some service.

Jailbreaking

In general jailbreaking is really not that interesting - if people want to use the model in unintended ways, they can just download some open source model. But some things can be done to prevent jailbreaking. We again use the <system><user><assistant> pattern - and add training data with jailbreaking techniques and how to respond to them.
The Turing complete RNN architecture should be able to figure out when jailbreaking is being attempted, and stop itself.

I don't know

A simple extension is for the LLM to have a specific token it can output when it is in doubt. This is something that boosts correctness and performance tremendously, since early wrong thoughts can be aborted immediately and another path to the solution can be tried.

And for the end user, it also means that wrong answers are much rarer.
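A minimal sketch of how such an uncertainty token could be used to abort a thought path and try another; `model.think` and the `IDK` token are placeholders, not a real API.

```python
IDK = "<i_dont_know>"   # hypothetical reserved token for "I am in doubt"

def solve(model, prompt, max_attempts=4):
    """Try several thought paths; abort a path as soon as the model signals doubt."""
    for attempt in range(max_attempts):
        path = []
        for token in model.think(prompt, seed=attempt):   # placeholder streaming API
            if token == IDK:
                break            # abort this path early instead of finishing a wrong answer
            path.append(token)
        else:
            return "".join(path)  # the path completed without doubt
    return None                   # every attempt gave up; surface that to the user
```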

Hallucinations

The model is trained just like any other model - if you have more data than you have variables it will start to generalize. This also means that it will hallucinate just like any other model.

So it is up to the user or agentic framework to use the LLM in the right way, and sources have to be verified. What makes it special is that if you put something in the context window, it will be much better at retaining details.

So no special code needs to be changed in the model for this to work. The combination of the binary expression of data instead of embeddings (a huge increase in the compactness of thoughts) and the gradual change of state during thoughts will make almost any (sensible) problem shallow in the eyes of the LLM - as long as it is in the context window.
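As a rough illustration of what "binary expression of data instead of embeddings" could mean: each character is fed to the model as its raw bit pattern rather than looked up in a wide learned embedding table. This is my reading of the phrase, sketched under that assumption.

```python
def char_to_bits(ch: str) -> list[int]:
    # 8 bits per byte of the character, instead of a learned embedding vector.
    return [int(b) for byte in ch.encode("utf-8") for b in f"{byte:08b}"]

print(char_to_bits("A"))   # [0, 1, 0, 0, 0, 0, 0, 1] - 8 inputs instead of thousands of floats
```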

Self-aware performance metrics

One thing that is quite important before it can be useful in a very general sense is for the model to be self-aware about its performance characteristics. The use case is that it gets a hard problem, and instead of churning on it for days, it should be aware that it will probably take days and make a sensible choice: either explaining this to the user, or solving part of the problem and notifying the user.

I am not sure exactly how this is done. But since the LLM can understand the context window, that could itself contain information about the performance of external systems (how much it costs to run some code, etc.).

Training updates

It is quite important that the model can be retrained with new data. If we look at the big picture, some use cases can use old data. But sometimes people will pay a premium to have recent data, and the money for AI companies will be in the premium.

This is especially important for scientific progress, as described later.

Mixture of experts

The model is an RNN, and it is anticipated that it will run in an MoE setting. Using a gated mixture of experts and the binary expression of numbers, we can have practically infinite experts. To be able to do the performance optimization, we want experts of many different sizes. The training is done by first having a training run with only one expert (for each size). Then that expert is copied to two and trained again with gating. That continues until some number is reached, let's say 10, and from there on not all experts are used in the backpropagation.

So if we have 10,000 experts that are 10 gigabytes each, we are looking at 100 terabytes of experts.
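A minimal sketch of the copy-and-double schedule described above; `train` and the expert objects are placeholders for the real training code.

```python
import copy

def grow_experts(expert, train, target=10):
    """Start from one trained expert, repeatedly copy and retrain with gating.

    `expert` is an already trained single expert; `train(experts)` is assumed
    to train the gated mixture in place. Returns the final list of experts.
    """
    experts = [expert]
    while len(experts) < target:
        # Double the population by copying the existing experts, then let
        # gating differentiate them during the next training run.
        experts = experts + [copy.deepcopy(e) for e in experts]
        experts = experts[:target]      # do not overshoot the target count
        train(experts)
    return experts
```

Beyond the target count, only the experts the gate actually selects would receive gradients, which is what keeps backpropagation affordable with very many experts.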

Layered training

The training uses a lot of different ways to get to the final model.

The first one is supervised learning with the RNN and a fixed number of thoughts. This uses backpropagation through the thoughts, and gating is used to make some hard decisions soft.

The next layer is curriculum learning, where thoughts are defined. So instead of backpropagation having to figure out some complicated algorithm, we have already figured that out in advance, and the training just needs to understand when to use the algorithm.

This is then combined with the first way to train, so instead of using a fixed number of thoughts to get the final result, it just needs to use some mini-thoughts to produce one curriculum-defined thought.

This combination is what makes the LLM extremely powerful. The RNN-based supervised learning can only solve low-complexity problems - but if a hard problem is broken into a lot of low-complexity problems, it can suddenly learn it.

This is then combined with the third layer: we run RL interleaved with the training. This makes the thoughts faster and better.
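A minimal sketch of how the three layers could be interleaved in one training loop; the phase functions and batch iterators are placeholders for the real procedures.

```python
def layered_training(model, raw_batches, curriculum_batches, rl_env, rounds=100):
    """Interleave the three layers described above (all callables are placeholders).

    1. Supervised learning through a fixed number of thoughts.
    2. Curriculum batches where the intermediate thoughts are already defined.
    3. RL that rewards faster and more correct thought sequences.
    """
    for _ in range(rounds):
        model.supervised_step(next(raw_batches), n_thoughts=8)            # layer 1
        model.supervised_step(next(curriculum_batches), n_thoughts=1)     # layer 2: one predefined thought
        model.rl_step(rl_env, reward=lambda ok, steps: ok - 0.01 * steps) # layer 3: speed matters
    return model
```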

This layering is quite an important point for understanding why we achieve ASI. If we think of autoregressive training, we try to predict the next character, and we have some thoughts before outputting that character. Each training round we do, the thoughts get better, and it will relentlessly try to understand exactly what thought is needed to output just that character. And as soon as a thought 'clicks', that character will be very fast to generate.

The next layer relies on the fact that the model itself will get better than the best existing LLMs. When it does, it can start generating synthetic training data used for the next iteration.

Curriculum learning

As mentioned above, a big part of the LLM is the synthetically generated curriculum data. So let's dive a little bit into how that works.

The idea is that you let an LLM generate a specification for thought transformations. The transformations can be end-to-end, so a transformation has some input prompt and generates the final output.

But they can also be random - so instead you say: generate a random step of this algorithm, where that step has some random input state, and the output is the state transformation.

When we have these training sets, we run a simulator to validate that it actually works. So we have more input data than variables on a small model, and we need to see that we get perfect accuracy. When that is achieved, we know the curriculum works, and we can use the training data in the full model.
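A minimal sketch of that validation step: train a deliberately small model on the generated curriculum and accept the curriculum only if accuracy is perfect. The helper names are assumptions.

```python
def validate_curriculum(curriculum, make_small_model, train, evaluate, min_examples_per_param=2):
    """Accept a generated curriculum only if a small model can learn it perfectly.

    Requiring more examples than parameters forces generalization rather than
    memorization; perfect accuracy then indicates the curriculum is consistent.
    """
    model = make_small_model()
    if len(curriculum) < min_examples_per_param * model.num_parameters:
        raise ValueError("not enough examples to rule out memorization")
    train(model, curriculum)
    return evaluate(model, curriculum) == 1.0   # perfect accuracy required
```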

Autoregressive vs system/user/assistant

As mentioned before, we have training data either as raw data or in the system/user/assistant pattern. For raw data we give a specification of where it is from, like

'Reddit comment on date XXX' or 'PDF file from http://xyz// on date XXX'.

The autoregressive training then starts at a random place in the content and tries to predict characters from that point. It does not need to predict all characters, but maybe just a single character or the next section.

For the system/user/assistant pattern, it will always try to predict characters when the assistant answers, and it outputs characters until the end. And the assistant answer of course follows the model specification.

Since we have more data than we have variables, this will let the model learn about everything in an autoregressive manner, and use that when it needs to output as an assistant.
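A minimal sketch of how the two loss masks could differ; the record format matches the hypothetical one sketched earlier.

```python
import random

def loss_mask(record):
    """Return which character positions contribute to the loss (1) or not (0).

    Raw records: predict from a random start point onward.
    Assistant records: predict only the assistant's answer.
    """
    mask = [0] * len(record)
    if record.startswith("<meta"):
        start = random.randrange(len(record))
        for i in range(start, len(record)):
            mask[i] = 1                    # autoregressive from a random position
    else:
        start = record.find("<assistant>") + len("<assistant>")
        for i in range(start, len(record)):
            mask[i] = 1                    # only the assistant answer is predicted
    return mask
```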

Scientific progress

This is not something that needs to be coded into the model, but something that will happen if it gets updated regularly - combined with a good way to tag training data. The idea is that it takes a lot of time to make a paper. If it is an algorithm, a lot of algorithms need to be tried and compared until the best one is found. And if it is in another field, a lot of experiments need to be run. So a paper takes a lot of effort to produce - but when it is published it takes only very little effort to understand.

So the LLM is trained on all peer-reviewed papers (which have been tagged in that way), and it will learn that peer-reviewed papers are better to base thoughts on (as explained in the layered training). So when new papers are published and the model is retrained, it can base a new paper on all new ideas since the last time it was trained.

This is much more effective than adding papers to the context window, which would be the alternative if there were no regularly updated LLM.
