Auto generalization
My current strategy is two-legged. The first leg is to find papers with benchmarks that current LLMs struggle with, and show that my Turing-complete LLM can solve them. The second is to show that it can also solve challenges that current LLMs are already good at.
One of those things is that LLMs are really good at answering questions if they have enough knowledge of a problem. If we ask ChatGPT how it has knowledge about cities, it answers this:
" I rely on knowledge encoded from geographic and linguistic data up to my training cutoff (mid-2024), which includes global place name databases such as Geonames, Wikidata, national census gazetteers, and academic or development sources (e.g., UN and World Bank geographic datasets).
So when you mentioned “Boulma,” I recognized it as matching entries from Burkina Faso, where several small villages with that name appear in those official datasets.
I didn’t search the web — it’s based on general geographic knowledge and structured data I was trained on."
So given the same data point multiple times, and having more data points than variables, it starts to generalize and memorize knowledge. I went to Google Maps to see if there was a small village anywhere in the world it didn't know of, and I didn't find one. It was also very hard, but not impossible, to make it hallucinate about cities that don't exist. Usually it would reply:
"There isn’t a well-documented town or city named Tahouali in standard global or regional geographic databases."
But to replicate that behaviour we need a lot of training time. So what if, instead of having a lot of knowledge, we had only very little and forced the LLM to find the structure?
In this test I had only 28 cities and their corresponding countries, along with different ways of asking about them:
"Madrid is in which country?",
"Where is Moscow located?",
"What nation is Vienna part of?",
"Where is Berlin located?",
"What nation is Beijing part of?"
The test set contained the same cities, but asked about them in different ways. When trained normally with an auto-regressive RNN, the test accuracy was 0 to 5%. That makes sense, since the network just learns to memorize the question and supply the answer.
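To make the setup concrete, here is a minimal sketch of such a split, assuming training and test use disjoint phrasing templates over the same city/country pairs. The template lists and helper names are my own illustration, not the actual data pipeline:

```python
import random

# Illustrative subset; the experiment uses 28 city/country pairs.
PAIRS = [("Madrid", "Spain"), ("Moscow", "Russia"), ("Vienna", "Austria"),
         ("Berlin", "Germany"), ("Beijing", "China")]

# Same knowledge, different expressions: train and test share every pair
# but use disjoint phrasing templates.
TRAIN_TEMPLATES = ["{city} is in which country?",
                   "Where is {city} located?"]
TEST_TEMPLATES = ["What nation is {city} part of?"]

def make_set(templates, pairs):
    """Expand each (city, country) pair through every template."""
    return [(t.format(city=city), country)
            for city, country in pairs
            for t in templates]

train_set = make_set(TRAIN_TEMPLATES, PAIRS)
test_set = make_set(TEST_TEMPLATES, PAIRS)
random.shuffle(train_set)
```

A plain auto-regressive model can score near 100% on train_set and still fail on test_set, because nothing forces it to separate the city/country fact from the phrasing it was seen in.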
So how can we get the test accuracy up, without having more training data?
This is solved by separating knowledge from representation. We generate different sets for different purposes: each set contains all the knowledge (in this case the city/country pairs), but each set expresses that knowledge differently.
We have 4 different sets:
- General training
- Producer
- Generalizer
- Reinforcement
And then, of course, a test set that is not used during training at all.
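As a sketch of that construction (my own; the post does not show how the sets are built), each of the five sets could draw from its own pool of phrasing templates while covering all of the pairs, reusing make_set and PAIRS from the sketch above:

```python
# Hypothetical construction: every set covers all city/country pairs,
# but each set draws from its own disjoint pool of phrasings.
TEMPLATE_POOLS = {
    "general_training": ["{city} is in which country?"],
    "producer":         ["Where is {city} located?"],
    "generalizer":      ["What nation is {city} part of?"],
    "reinforcement":    ["Which country is {city} in?"],
    "test":             ["Name the country that {city} lies in."],
}

sets = {name: make_set(pool, PAIRS)
        for name, pool in TEMPLATE_POOLS.items()}
```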
The core idea is that supervised learning is much easier than reinforcement learning, so we get as close as we can with supervised learning, and only in the last phase do we use reinforcement.
The architecture is only for user/assistant-style LLMs. The autoregression runs until the assistant needs to answer. Then we have an information wall, which is just a vector whose hardness we can adjust with a single variable. After the wall, a producer network produces the actual tokens of the response.
The idea is that when the wall is hit, some handover needs to happen between understanding the question and producing the answer. The phases exist to get the two models to co-operate, and phase 5 then optimizes for the hardest wall, which means the generalizer and the producer need to co-operate with minimal information exchange between them.
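A minimal PyTorch sketch of that layout follows. The post does not specify the wall's exact mechanism, so the wall() method here is an assumption: it attenuates the single handover vector and adds noise scaled by one hardness value.

```python
import torch
import torch.nn as nn

class WalledLM(nn.Module):
    """Generalizer reads the question, the wall degrades the handover
    vector, and the producer writes the answer token by token."""

    def __init__(self, vocab_size, hidden=128, wall_strength=0.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.generalizer = nn.GRU(hidden, hidden, batch_first=True)
        self.producer = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)
        self.wall_strength = wall_strength  # 0 = open, grows toward infinity

    def wall(self, h):
        # Assumed mechanism: attenuate the handover state and add noise
        # proportional to the wall strength, so a harder wall lets less
        # information through.
        noise = torch.randn_like(h) * self.wall_strength
        return h / (1.0 + self.wall_strength) + noise

    def forward(self, question_ids, answer_len):
        x = self.embed(question_ids)
        _, h = self.generalizer(x)      # understand the question
        h = self.wall(h)                # squeeze the handover through the wall
        # Autoregressively produce the answer from the walled state,
        # starting from a zero "begin answer" embedding.
        tok = torch.zeros(question_ids.size(0), 1, self.embed.embedding_dim,
                          device=question_ids.device)
        logits = []
        for _ in range(answer_len):
            out, h = self.producer(tok, h)
            step = self.head(out)
            logits.append(step)
            tok = self.embed(step.argmax(-1))  # greedy feedback
        return torch.cat(logits, dim=1)
```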
The wall can be pictured as a valve on the information flow: as the strength of the wall increases, the flow of information through it decreases. The strength is just a value from 0 to infinity, and the purpose of the reinforcement learning is to push that value as high as possible.
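As a toy stand-in for that objective (my simplification, not the actual reinforcement learning algorithm), a feedback loop can raise the wall strength of the sketched WalledLM whenever accuracy on the reinforcement set holds:

```python
def harden_wall(model, eval_accuracy, target=0.95, step=0.1, max_strength=10.0):
    # Greedy schedule: raise the wall while accuracy on the reinforcement
    # set stays above the target, and back off once it breaks.
    while model.wall_strength < max_strength:
        if eval_accuracy(model) >= target:
            model.wall_strength += step   # harder wall, less information
        else:
            model.wall_strength = max(0.0, model.wall_strength - step)
            break
    return model.wall_strength
```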
When trained on these sets, we got the following test accuracies:
- Phase 1 and 2: ~5%
- Phase 3: ~15%
- Phase 4: ~60%
- Phase 5, beginning: 90%, after the RL has optimized for the new set but before the wall has started to have an effect.
- Phase 5, end: 100%, after the wall has forced generalization.
Conclusion
In conclusion, we have a model that can be trained to generalize without needing more data points than variables, which a usual machine learning model requires. This is not a requirement for the Turing-complete LLM to be groundbreaking, but it can accelerate how fast it achieves usefulness in different areas.