Strawberry I-don't-know, and an agent implementation
Reinforcement with 'i-don't-know'
OpenAI has just released strawberry/o1, and we are now very close to AGI.
It uses pretty much the same technique I have already outlined in a previous blog post.
Create synthetic data with trains of thought, where each thought is a Q/A. This turns the problem into a reinforcement learning problem, where each answer can be evaluated and scored.
One crucial thing they unfortunately didn't do was to mark wrong answers as i-don't-know.
The algorithm would then be to train a small neural net in the first round, find all the answers it gets wrong, and then train again with a bigger neural net. Scoring-wise, a wrong answer is worse than saying i-don't-know, which in turn is worse than the correct answer.
Since the final neural net is bigger than the one that was originally trained, it should have a high chance of answering i-don't-know when it needs to.
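To make the scoring idea concrete, here is a minimal sketch in Python. The reward values, the relabeling step, and the model interfaces are my own assumptions about how such a scheme could look, not anything OpenAI has described:

```python
# Hypothetical reward ordering: wrong answer < i-don't-know < correct answer.
REWARD_CORRECT = 1.0
REWARD_IDK = 0.0        # assumed neutral score for admitting uncertainty
REWARD_WRONG = -1.0

def score(answer: str, expected: str) -> float:
    """Score a single Q/A thought in a chain."""
    if answer.strip().lower() == "i-don't-know":
        return REWARD_IDK
    return REWARD_CORRECT if answer == expected else REWARD_WRONG

def relabel_with_small_model(dataset, small_model):
    """Round one: run a small model over the synthetic Q/A data and relabel
    every question it gets wrong as i-don't-know, so the bigger model trained
    in round two learns when it should admit uncertainty."""
    relabeled = []
    for question, expected in dataset:
        answer = small_model(question)   # small_model: question -> answer (assumed interface)
        target = expected if answer == expected else "i-don't-know"
        relabeled.append((question, target))
    return relabeled
```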
Agent
Having a strong LLM is just one part of AGI. The AGI also needs to be able to take actions.
I have implemented a wrapper in the following way:
The overall idea is that you start a conversation with the AI, and the AI runs in the background to solve your problem. While the AI is running it can ask questions to the user, and the user can also add additional information to the conversation at any point. It feels very much like having a human assistant.
To do this it consists of multiple programs that talk to each other through the console. Each program has a thread that takes input and runs independently of the work the program does. This means that information can pass down/up through the stack independently of the work being done. The work does not need to be interrupted, since it writes its progress into info memory. When the process is then done with a step, it can view the new information and react to it.
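As a minimal sketch of one such program, assuming Python and a plain stdin listener (the class and field names are my own, not the actual implementation):

```python
import threading
import queue

class Program:
    """One program in the stack: does its work step by step, while a listener
    thread collects console input independently of that work."""

    def __init__(self):
        self.incoming = queue.Queue()   # information passed down/up through the stack
        self.info_memory = []           # progress the work writes as it goes
        threading.Thread(target=self._listen, daemon=True).start()

    def _listen(self):
        # Blocks on input, but only in this thread, so the work is never interrupted.
        while True:
            self.incoming.put(input())

    def run(self, steps):
        for step in steps:
            self.info_memory.append(step())      # do the work, record progress
            while not self.incoming.empty():     # when a step is done, look at new info
                self.react(self.incoming.get())

    def react(self, message: str):
        self.info_memory.append(f"note from outside: {message}")
```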
The implementation has some interesting ideas that can maybe be inferred from the picture above. One interesting one is that the context window is kept as small as possible at all times. This means that a task has no information when it starts, but it has the ability to search for and add additional information from other tasks or elsewhere. Only the minimum amount of information needed to succeed is added to the task.
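A rough sketch of that minimal-context idea, where `search_other_tasks` stands in for whatever lookup mechanism is available (both the helper and the prompt format are assumptions of mine):

```python
def run_task(goal: str, llm, search_other_tasks):
    """A task starts with no context and pulls in only what it needs to succeed."""
    context = []                                  # empty when the task starts
    needed = llm(f"List what you need to know to do: {goal}")
    for query in needed.splitlines():
        found = search_other_tasks(query)         # info from other tasks or elsewhere
        if found:
            context.append(found)                 # add only the minimum required
    return llm(f"Goal: {goal}\nContext:\n" + "\n".join(context))
```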
Another thing is that tasks are retried if their success criteria fail. This means:
If the tasks are correct, and the success criteria are evaluated correctly, then the overall plan will always succeed as long as every step can succeed at least once.
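A retry loop along those lines might look like the following sketch (the names and the retry bound are assumed, not taken from the actual implementation):

```python
def run_step(step, success_criteria, max_retries=10):
    """Retry a step until its success criteria pass (bounded here for safety)."""
    for _ in range(max_retries):
        result = step()
        if success_criteria(result):
            return result
    raise RuntimeError("step kept failing its success criteria")

def run_plan(plan):
    """If every step can succeed at least once, the whole plan succeeds."""
    return [run_step(step, check) for step, check in plan]
```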
The last thing is the response improver, which is very much tied to the OpenAI strawberry. The first time a request is made to an LLM, it goes to a very small and fast LLM. The response improver then runs in the background and improves responses in the order where it makes the most sense; that could be improving the overall plan, or steps that are currently failing.
That means the original task you start as a human never actually completes. But you can always see the latest response, how good the AI thinks the response is, and how much work has gone into it so far.
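A sketch of how such a response improver could be wired up, assuming a small fast model and a stronger slow one, with a priority queue deciding what to rework next (all names and quality numbers here are illustrative assumptions):

```python
import heapq

class ResponseImprover:
    """Answer immediately with a fast LLM, then keep improving answers in the
    background, starting with whatever currently matters most (the overall plan,
    or a step that is failing)."""

    def __init__(self, fast_llm, strong_llm):
        self.fast_llm = fast_llm
        self.strong_llm = strong_llm
        self.latest = {}   # task id -> (estimated quality, response, passes of work so far)
        self.todo = []     # priority queue of (-priority, task id, prompt)

    def first_response(self, task_id, prompt, priority=1.0):
        response = self.fast_llm(prompt)             # quick, rough first answer
        self.latest[task_id] = (0.3, response, 1)    # low assumed quality, one pass of work
        heapq.heappush(self.todo, (-priority, task_id, prompt))
        return response

    def improve_once(self):
        """One background pass: rework the highest-priority response with the strong model."""
        if not self.todo:
            return
        neg_priority, task_id, prompt = heapq.heappop(self.todo)
        _, _, passes = self.latest[task_id]
        better = self.strong_llm(prompt)
        self.latest[task_id] = (0.8, better, passes + 1)  # assumed higher quality after rework
        heapq.heappush(self.todo, (neg_priority / 2, task_id, prompt))  # lower priority next time
```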