Strawberry I-don't-know, and an agent implementation
Reinforcement with 'i-don't-know'

OpenAI has just released strawberry/o1, and we are now very close to AGI. It uses pretty much the same technique I have already outlined in a previous blog post: create synthetic data with chains of thought, where each thought is a Q/A pair. This turns the problem into a reinforcement learning problem, where each answer can be scored.

One crucial thing they unfortunately didn't do was to mark wrong answers as i-don't-know. The algorithm would then be: train a small neural net in the first round, find all the answers it gets wrong, relabel those as i-don't-know, and train again with a bigger neural net. Scoring-wise, a wrong answer is worse than saying i-don't-know, which in turn is worse than a correct answer. Since the final neural net is bigger than the one that was originally trained, it should have a high chance of answering i-don't-know when it needs to. (A minimal code sketch of this two-stage scheme is in the appendix at the end of the post.)

Agent

Having a strong LLM is just one part of AGI. The AGI also needs to be able t...
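Appendix: a sketch of the two-stage training

To make the two-stage idea from the first section concrete, here is a minimal sketch in Python. Every name in it is a hypothetical stand-in rather than a real API: train_model stands for an actual fine-tuning run, answer for an actual inference call, and the reward values are arbitrary except for the ordering wrong < i-don't-know < correct.

```python
IDK = "i-don't-know"

# Illustrative rewards; only the ordering wrong < IDK < correct matters.
REWARD = {"correct": 1.0, "idk": 0.0, "wrong": -1.0}


def score(prediction, gold):
    """Score one answer: a wrong answer is worse than admitting ignorance."""
    if prediction == gold:
        return REWARD["correct"]
    if prediction == IDK:
        return REWARD["idk"]
    return REWARD["wrong"]


def train_model(dataset, size):
    """Stand-in for a real training run; here just a lookup table."""
    return dict(dataset)


def answer(model, question):
    """Stand-in for inference: return the learned answer or a guess."""
    return model.get(question, "wild guess")


def relabel(dataset, model):
    """Replace every answer the first-round model gets wrong with IDK."""
    return [(q, gold if answer(model, q) == gold else IDK)
            for q, gold in dataset]


def two_stage_training(dataset):
    small = train_model(dataset, size="small")  # round 1: small net
    stage2 = relabel(dataset, small)            # its mistakes become IDK
    return train_model(stage2, size="big")      # round 2: bigger net
```

Calling two_stage_training on a list of (question, answer) pairs gives back a bigger model whose training targets say i-don't-know exactly where the small model failed, so the scoring above never has to reward a confident wrong answer.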