ASI - Experiments
Problem - coding on large code bases
I have been doing a lot of vibe coding, and current models deteriorate in quality when the problem gets too large and too detailed. This makes complete sense if you know how current transformers work, and is well described here:
https://research.trychroma.com/context-rot
The LLM simply glosses over details, and when it then rewrites some code, it forgets some instructions or previous details in the code, and you as a human will have to keep adding these things back.
It is possible for the framework to do patches - but the problem is the same. The LLM is not smart enough to do a good patch.
I have a lot of experience with this, and if it were solved my life as a coder would be much easier. So I assume it is not a solved problem.
So this page is a description of an LLM that should be able to handle large code bases, have deep knowledge of existing algorithms, and be smart enough to solve the problem asked.
I have made some small experiments with the most problematic pieces, and they have shown no signs of any fundamental problem that would make this hard or impossible to solve. It is just a matter of engineering: putting more capability into the algorithm, generating more synthetic training data and, of course, training it a lot.
The idea consists of different components, which, when put together, give the final result.
Component - BC-model
One of the problems I encountered was that a lot of variables were used for embeddings and input to LLMs. This is solved by a concept I call the Binary Composable model, or BC-model for short.
A BC-model is a model that takes binary encodings of things like chars, numbers and switches and produces an output. The things can be composed however you want, so you can use the same BC-model for different things.
In that way we have an abstract building block we can use for multiple purposes, and it is especially suited for use inside an RNN framework.
So an example is that we use a bit to describe whether an item is a character or a number, and we could have these two kinds of input to the BC-model.
An example could be that "12345 plus 6789 equals" is encoded as "n12345c cpclcucsc n6789c cecqcucaclcsc".
So every number is marked with one marker and every letter is marked with another, and the model knows how to give a result.
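To make this concrete, here is a minimal sketch of how such binary, composable inputs could be built; the exact bit layout (two switch bits plus 32 value bits) is my own assumption for illustration, not the actual implementation.

```python
# Minimal sketch of a BC-model input encoding. The bit layout (2 switch bits
# followed by 32 value bits) is an assumption for illustration only.

def encode_item(item):
    """Encode one character or one number as a list of 0/1 features."""
    if isinstance(item, int):
        type_bits = [1, 0]                                  # switch: item is a number
        value_bits = [(item >> i) & 1 for i in range(32)]   # 32-bit binary value
    else:
        type_bits = [0, 1]                                  # switch: item is a character
        value_bits = [(ord(item) >> i) & 1 for i in range(8)] + [0] * 24
    return type_bits + value_bits                           # 34 binary inputs per item

def encode_prompt(text):
    """Turn '12345 plus 6789 equals' into a sequence of binary vectors."""
    vectors = []
    for word in text.split():
        if word.isdigit():
            vectors.append(encode_item(int(word)))          # a whole number is one item
        else:
            vectors.extend(encode_item(ch) for ch in word)  # one item per letter
    return vectors

inputs = encode_prompt("12345 plus 6789 equals")            # fed to the BC-model one item at a time
```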
In other words, you can reuse the same model as long as the output is the same. It should also scale with data, so having more data/cases should not make the model deteriorate.
So a BC-model is just a specification for some minimum requirements for a model, and the actual implementation can be improved independently of the rest of the system.
In the experiment I have done, I took a U-Net with a lot of skip connections and combined it with FiLM. This means that we can calculate an alpha and a beta value that we multiply and add to each variable in our neural net. In that way it can, based on the input, change the way it thinks.
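As a rough illustration of the FiLM part (the U-Net and skip connections are left out, and all layer sizes here are assumptions), a PyTorch-style sketch could look like this:

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """One hidden layer whose activations are modulated by FiLM: a conditioning
    input (e.g. the operation switch bits) produces an alpha (scale) and a beta
    (shift) per feature, so the same weights can behave differently per input."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.film = nn.Linear(cond_dim, 2 * dim)     # predicts alpha and beta

    def forward(self, x, cond):
        alpha, beta = self.film(cond).chunk(2, dim=-1)
        h = torch.relu(self.fc(x))
        return alpha * h + beta                      # multiply and add per feature

# Illustrative use: hidden features plus a 2-bit switch choosing addition vs multiplication.
block = FiLMBlock(dim=128, cond_dim=2)
x = torch.randn(4, 128)                              # features from earlier layers
cond = torch.tensor([[1.0, 0.0]] * 4)                # switch bits: "this is an addition"
out = block(x, cond)                                 # shape (4, 128)
```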
The result of this implementation is that I can do addition on numbers between 0 and 1,000,000,000 (32-bit numbers) at the same time as it does multiplication of numbers between 0 and 1,000,000,000.
I stopped training after a few minutes, when it achieved 100% accuracy on the addition case. We do not need to show that it can do every task perfectly, just that it can discern between cases, so it can do the easy cases correctly even while still working on the hard ones.
The repercussions of a good BC-model cannot be overstated. It means that there is no need to do any embedding, since the network learns it implicitly. This makes it possible to give much bigger pieces of the context window at once, which the neural network itself learns to understand. You can pack a lot of different cases into the same neural network, also reducing the number of input variables needed. In the algorithm proposed later, that actually means reducing it by a combinatorial amount, which is a big help in achieving ASI more easily.
The BC-model should probably also use a mixture of experts to be able to scale. We can use a gated mixture of experts, and start by training with only one expert, then split to two, then four and so on. When we get above a threshold, let's say 10, we only back-propagate to the best 10 networks. In that way we can continue to split up without learning getting slower. We of course back-propagate through the selector, so which expert handles what, and what each expert knows, is learned in combination.
We could imagine 10,000 experts in this way, making the BC-model very capable across a large number of different tasks.
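A sketch of what that growth schedule could look like; the splitting-by-copying and the top-k routing below are my interpretation of the idea, and the per-sample loop is kept naive for clarity:

```python
import copy
import torch
import torch.nn as nn

class GrowingMoE(nn.Module):
    """Gated mixture of experts that starts with one expert and is split over
    time. Only the top-k experts per input get gradients, so training does not
    slow down as the expert count grows; the selector is always trained."""
    def __init__(self, dim, k=10):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim)])
        self.gate = nn.Linear(dim, 1)
        self.k = k

    def split(self):
        """Duplicate every expert (1 -> 2 -> 4 -> ...); the copies diverge with training."""
        self.experts.extend(copy.deepcopy(e) for e in list(self.experts))
        old = self.gate
        self.gate = nn.Linear(old.in_features, len(self.experts))
        with torch.no_grad():                          # copies start with the old gate scores
            self.gate.weight.copy_(old.weight.repeat(2, 1))
            self.gate.bias.copy_(old.bias.repeat(2))

    def forward(self, x):
        scores = self.gate(x)                          # (batch, n_experts)
        k = min(self.k, len(self.experts))
        top_scores, top_idx = scores.topk(k, dim=-1)
        weights = torch.softmax(top_scores, dim=-1)    # gradients flow through the selector
        rows = []
        for b in range(x.size(0)):                     # naive per-sample routing
            rows.append(sum(weights[b, s] * self.experts[top_idx[b, s]](x[b])
                            for s in range(k)))
        return torch.stack(rows)
```

Training would then alternate between fitting the current experts and calling split() whenever progress plateaus.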
You might be wondering what the difference is between a GRU cell and a BC-model, and it is mostly about the defined capabilities. The BC-model supports binary encoding of the input, and also the ability to combine items for more dynamic gating.
Continuous thinking with RNN
Another experiment I did was to see how an RNN could generalize. We already have the HRM paper (https://arxiv.org/abs/2506.21734), so it is not surprising that it was possible.
What I did was go through the context window one character at a time, on repeat, producing a thought token. Both the context window and the thought window would be continuously summarized.
This was then put into a fixed number of thinking steps (in my case 50), and after the last step a model was run to produce an output token.
Backpropagation is done across the full thinking process for all output tokens, which is maybe overkill, but it shows that deep backpropagation with an RNN is possible.
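A minimal sketch of that loop, with GRU cells standing in for the actual cells; the 50 thinking steps follow the description above, while the sizes and the embedding are my assumptions:

```python
import torch
import torch.nn as nn

class ContinuousThinker(nn.Module):
    """Sketch of the loop described above: repeatedly scan the context one
    character at a time into a context summary, update a thought summary for a
    fixed number of thinking steps, then emit one output token."""
    def __init__(self, vocab=256, dim=128, think_steps=50):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)          # could be replaced by binary encoding
        self.context_cell = nn.GRUCell(dim, dim)       # continuously summarizes the context
        self.thought_cell = nn.GRUCell(dim, dim)       # continuously summarizes the thoughts
        self.readout = nn.Linear(dim, vocab)
        self.think_steps = think_steps

    def forward(self, context_ids, n_output_tokens):
        ctx = torch.zeros(1, self.embed.embedding_dim)
        thought = torch.zeros_like(ctx)
        logits = []
        for _ in range(n_output_tokens):
            for _ in range(self.think_steps):          # fixed number of thinking steps (50)
                for ch in context_ids:                 # one character at a time, on repeat
                    ctx = self.context_cell(self.embed(ch).unsqueeze(0), ctx)
                thought = self.thought_cell(ctx, thought)
            logits.append(self.readout(thought))       # the loss on every output token
        return torch.stack(logits)                     # backpropagates through all the thinking

model = ContinuousThinker()
question = torch.tensor([ord(c) for c in "Repeat 'tv3oE5h76I'"])
out = model(question, n_output_tokens=10)              # train with cross-entropy on the targets
```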
The test I did was
"How many 'r's are in '6e9Sz5iq0r'" - so trying count a random letter in a random string.
And the other one was
"Repeat 'tv3oE5h76I'".
Both could be trained at the same time, and the method produced 100% correct results.
This shows that an RNN can learn complex behavior from completely supervised learning.
Supervised thinking layer
While the continuous thinking with an RNN is impressive, it only works with small context windows. And it also cannot learn complex behavior where you need to make choices throughout the thinking.
So my proposal is that you make a very expressive state that is used as input to the RNN cell, and the RNN cell then produces a new state.
We can then define through training exactly what the state should be as input and output. This can be combined with additional possibilities, so choices can also be defined.
The example below is from the counting case, where a mathematical co-processor is available (it also works without the co-processor, but then it takes longer to train and is slower).
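As a rough illustration of what such a supervised thinking state could look like for the counting case (every field name below is made up for illustration; the real state layout is whatever you choose to supervise):

```python
# Hypothetical thinking trace for "How many 'r's are in '6e9Sz5iq0r'?".
# "coproc" marks a call to the mathematical co-processor (here, an increment).
trace = [
    {"step": 0, "pos": 0, "char": "6", "target": "r", "count": 0, "coproc": None},
    {"step": 1, "pos": 1, "char": "e", "target": "r", "count": 0, "coproc": None},
    # ... one step per character ...
    {"step": 9, "pos": 9, "char": "r", "target": "r", "count": 0, "coproc": "add(count, 1)"},
    {"step": 10, "pos": 10, "char": None, "target": "r", "count": 1, "coproc": None, "emit": "1"},
]
```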
We can define different cases and build a simulator to check that each case produces the correct end result. In that way we know the training cases lead to the correct final result.
We then train the LLM through supervised learning to think, and we can use an RNN to learn each thinking step. This is especially important if you want to gate different inputs: it can learn how to gate correctly over multiple steps.
So the idea is that we use the best currently available LLM we can find to generate synthetic test cases of how to think, and then we train on that in a supervised manner.
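A minimal sketch of that training loop, assuming the teacher's traces have already been encoded as fixed-size state vectors; the GRU cell, the state size of 64 and the MSE loss are all placeholders:

```python
import torch
import torch.nn as nn

# Supervised thinking: a stronger teacher LLM generates traces of consecutive
# states, and the thinking cell is trained to reproduce every step.

cell = nn.GRUCell(input_size=64, hidden_size=64)   # stands in for the thinking cell
optim = torch.optim.Adam(cell.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_on_trace(trace_states):
    """trace_states: list of state vectors (size-64 tensors) produced by the teacher."""
    hidden = torch.zeros(1, 64)
    loss = 0.0
    for state, next_state in zip(trace_states[:-1], trace_states[1:]):
        hidden = cell(state.unsqueeze(0), hidden)
        loss = loss + loss_fn(hidden, next_state.unsqueeze(0))   # supervise every thinking step
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```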
That doesn't exclude normal autoregressive learning, since we can simply have a thinking mode that produces output characters.
The results are not surprising: in a few seconds we can teach the LLM to generalize the two cases, counting and repeating.
Final experimental results
After 36 experiments, using the current dreadful state of vibe coding, I got all the pieces put together.
So I can take any question like "How many r's in 'xyzr'" and it will reply with the number. The length of the context window can be up to a few gigabytes, though it of course takes time to answer that. But I tested it with a 100 MB context window and got a 100% correct answer.
For comparison, the same question fails on ChatGPT 5 with a context window of just a few hundred characters.
To test that it can also handle generative questions, I trained it on generating names. And it could do both tasks at once, with 100% accuracy.
Additional framework besides the LLM
It could be interpreted that if we just build the LLM described above we will have ASI. That is not the case, but having a good LLM is a cornerstone. While in theory the LLM can handle extremely large context windows (gigabytes), it will still start from scratch, and it may not have the best tools at its disposal.
So RAG, where we preprocess the data and select the data that is needed, is still extremely important.
Agentic LLMs that have resources like databases, disk, running programs and of course MCP are still needed. The agentic part will also handle task management, and maybe a large part of the thinking process.
But when you put things in the context window for the LLM, you know it will do its best to solve them. This is especially true for science, where a good chunk of scientific papers can be put in the context to get an understanding of them. And also for coding, where a lot of files have to be in the context at the same time to be able to give the correct answer.
We also use reinforcement learning, both for the LLM itself and for the LLM in the context of the scaffolding around it.
What we want to optimize for is of course correctness, speed and price. When the LLM thinks, it can either do it in big chunks, in many small thoughts, or in parallel. When doing the SL training we let it do many of these things, and then we can run RL to make it choose the best option at any moment, to optimize for the three requirements.
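A toy example of what such a reward could look like; the weights are arbitrary placeholders that would be tuned in practice:

```python
def reward(correct, seconds, dollars, w_speed=0.01, w_price=1.0):
    """Reward correctness, penalize wall-clock time and compute cost."""
    return (1.0 if correct else 0.0) - w_speed * seconds - w_price * dollars

# The RL policy then chooses between one big thinking chunk, many small thoughts,
# or parallel thoughts, whichever maximizes this reward for the task at hand.
```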
The ability for 'I don't know'.
The first experiment mentioned was the ability to do addition at the same time as multiplication. But multiplication gets harder as the numbers get larger. So if the model had the ability to detect small numbers and give the correct result for those, and to bail out for large numbers, it would not feed a wrong result into calculations based on that number, or give one to the user.
This should increase correctness a lot, since these errors compound. There are probably many ways to achieve this, with RL being the most obvious choice. But since we have a co-processor, we could also do the same thing as they did for the moon landing: run the same computation on 3 computers. If at least 2 of the computers produced the same result, it continued with that result. But if all 3 produced something different, it would stop.
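A sketch of that majority vote, where the independent computations could come from the co-processor plus, say, two differently trained heads (the setup here is illustrative):

```python
def vote_or_bail(computations, *args):
    """Run the same question through several independent computations and keep
    the result only if at least two of them agree; otherwise bail out with
    'I don't know' instead of propagating a possibly wrong result."""
    results = [compute(*args) for compute in computations]
    for r in results:
        if results.count(r) >= 2:
            return r
    return None   # the caller treats None as "I don't know"
```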
You may not want an LLM that just answers "I don't know" all the time. But this is where RL shines, and it will minimize the number of times you see it. And since errors compound the more thoughts that are combined, this improvement will increase what types of problems the LLM can solve, simply by letting it think correctly for longer.
In my experiment above with a 100,000,000-character context window, the precision was good enough since it only needed to understand binary numbers. But we could easily imagine that each count had a little error, and then we wouldn't be able to train on a few samples and expand to billions.
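A quick way to see why per-step precision matters at this scale (the per-step accuracy below is just an assumed number):

```python
# Even 99.9999% accuracy per character (one error per million) makes a
# 100,000,000-character pass essentially certain to contain errors somewhere.
p_step = 0.999999
p_whole_pass = p_step ** 100_000_000
print(p_whole_pass)   # ~3.7e-44, i.e. effectively zero
```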
Different sized models
With the proposed solution we can adjust with RL how long a model thinks. But an optimization on top of that is to train models of different sizes.
In that way we can adjust how much it thinks, how wide it thinks and how much it thinks in parallel, getting an LLM that does the best it can in any situation. That is especially important for tasks where a lot of small repetitive things must be done (and we want them done in context).
The case for large datacenters
A lot of people seem to think we can get AGI at home, and that you just need a decently sized machine. It is probably possible to get very high quality that way, but if you want the best LLM, one that has up-to-date data, broad knowledge and can handle very complex issues, that is almost only possible in big datacenters.
The reason is the time it takes to load the model into GPU memory. When it is loaded into memory for inference the weights are fixed, and if we use a mixture of experts with tens of thousands of experts, a lot of them have to be loaded to give an answer.
By having a big datacenter we can use an LRU or similar caching pattern to make sure that the most-used experts stay loaded. Some experts are maybe called much less than others, and the big datacenter makes sure that no matter what distribution we have across them, you will get a fast reply.
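A toy sketch of that caching pattern, where load_expert is a placeholder for the real weight-loading code:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache for mixture-of-experts weights: keep the most recently used
    experts resident in GPU memory and evict the least recently used ones."""
    def __init__(self, capacity, load_expert):
        self.capacity = capacity
        self.load_expert = load_expert
        self.cache = OrderedDict()                    # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)         # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)        # evict the least recently used expert
            self.cache[expert_id] = self.load_expert(expert_id)
        return self.cache[expert_id]
```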
Scientific progress and continuous training
The way I imagine this ASI making scientific progress is through scientific papers. The idea is that it loads new relevant scientific papers into the context window, and uses an agentic framework to produce a new paper (doing experiments etc.).
The paper is then published and peer reviewed, and used for training the LLM. In other words, new ideas are made possible by building on all human knowledge every day (with whatever new research has been done since the previous day), and converging these ideas into scientific results is done with an agentic process.
The reason is that getting scientific results usually involves a lot of failures to find the thing that works, or expensive experiments. So you fail and try new things in an agentic manner until you succeed, and when you succeed you publish the result. In that way the LLM only needs to trust and learn the final result, without going through all the details it took to get there.
Next steps
Currently we are only seeing incremental improvements on coding tasks for frontier LLMs. It could be that AI companies have not put any groundbreaking technology in, since the cost of training, and thus the risk, is high.
The AI companies probably know what they are doing, and I am guessing it is something like this.
Say we have a yearly budget of $3 billion. Then we can use $1 billion for finding a new model, $1 billion for training it from the ground up, and $1 billion for continuous training (every day).
Then we have a lot of researchers who split $1 billion. Let's say half of it goes to salaries and the rest to compute, with the amount varying between researchers depending on how much you believe in each researcher.
Each researcher then splits the compute money into 4 piles: one pile for continuous training, doing all the micro experiments; one pile for a daily experiment to see if it works on larger data; one pile for a weekly experiment to see if it works on even larger data; and one pile for monthly experiments. The results of the monthly experiments are then compared across all researchers, and a new frontier model is chosen, which is then trained with the $1 billion budget.