Building Universal Learning Machines

Intro

The first several pages will attempt to describe the design in everyday language, keeping numbers to a minimum and avoiding formulas and jargon. I apologize in advance for my loose use of language and imperfect analogies.

The second section is for those with a technical background. There are no doubt errors of various kinds and superior optimizations for elements of the system. Feedback would be most welcome – please send to michael@aliem.ai

How Big is a UML Brain?

Background

AGI has long been the dream of AI, but there is still no clear, practical path to produce it.

I’ve long been frustrated by the brute-force approaches to AI that have dominated, and continue to dominate, the field. After tens of billions of dollars poured into autonomous vehicles and ever-larger LLMs, it’s time for a different approach. There is no need to spend hundreds of millions of dollars on each iteration.

If we are going to spend this amount of capital, then after a decade we should actually have real, life-altering technology. As fast as LLMs have been progressing, I have not seen a fundamental shift in approach, and so I have my doubts about when I’ll have autonomous vehicles, robots, and smart agents that can do real work. I have little confidence that OpenAI, Google, Meta, or any of the other companies will achieve “AGI”, for a few reasons:

  • No solvable definition of “AGI” in logical and mathematical terms
  • Inability to make safety guarantees
  • Questionable scalability of the approach
  • Lack of introspection
  • Lack of explicit control
  • The slow pace of fundamental progress

Brute Force will Fail

Let’s put a scale to this. Assume that the human rate of data production today (about 100 zettabytes per year) is roughly equivalent to the data all life has produced per year since life began (as a lower bound). This is likely woefully conservative: an animal’s genetic information is about a gigabyte of storage, and every breeding cycle remixes it. There are an estimated 10^28 microbes on the planet and an estimated 20 quintillion animals. Even as a lower bound, that is roughly 3e32 bytes processed by evolution versus about 1e23 bytes produced by humans per year now, so we would need to be at least 10 orders of magnitude more efficient than evolution. The animals alone represent 1.6e37 bytes of data processed over the last 800 million years. To produce the same amount of data, it would take:

$$\text{total data} = 20\ \text{quintillion animals (avg population)} \times 1\ \text{GB per genome} \times 800\ \text{million years} \approx 1.6\times10^{37}\ \text{bytes}$$

$$\text{years for human data to match} = \frac{\text{total data}}{100\ \text{zettabytes/year}} = \frac{1.6\times10^{37}\ \text{bytes}}{10^{23}\ \text{bytes/year}} \approx 1.6\times10^{14}\ \text{years}$$

That’s about 160 trillion years just to produce an amount of data similar to what evolution has already processed. With our puny processing power, how can we believe that we can provide the data necessary for generalization just by brute-forcing it? But the problem is even worse: our distribution of data does not reflect reality, it reflects us. We should expect AI tools to be exceptionally good at parroting us but not good at novel problem solving, self-generalization, or creating novel data, let alone insights.
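To make the back-of-envelope arithmetic above explicit, here is a minimal sketch that reproduces the numbers; the population, genome size, and timescale are the rough assumptions stated in the text, not measured values.

```python
# Back-of-envelope comparison: data "processed" by evolution vs. data produced by humans.
# All inputs are the rough assumptions from the text above, not measured values.

ANIMALS = 20e18                 # ~20 quintillion animals as an average population
GENOME_BYTES = 1e9              # ~1 GB of genetic information per animal
YEARS_OF_EVOLUTION = 800e6      # ~800 million years, assuming ~1 breeding cycle per year
HUMAN_BYTES_PER_YEAR = 100e21   # ~100 zettabytes produced by humans per year

total_evolution_bytes = ANIMALS * GENOME_BYTES * YEARS_OF_EVOLUTION  # ~1.6e37 bytes
years_to_match = total_evolution_bytes / HUMAN_BYTES_PER_YEAR        # ~1.6e14 years

print(f"Evolution (animals only): {total_evolution_bytes:.1e} bytes")
print(f"Years for humanity to match at today's rate: {years_to_match:.1e}")
```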

So the idea that we can provide all the relevant data needed to generalize seems unlikely, and the further hope that, somehow, seeing a large enough swathe of data will lead to the emergent ability to take over the training process itself seems wilder still.

This is why I do not put much stock in the Scaling Hypothesis. Sure, in theory, with truly infinite data you would know everything and generalize perfectly, but this is simply not practical. If anything embodies the Scaling Hypothesis it is evolution, and that process is continual and progressive.

Now, don’t get me wrong: current trends are productive and useful. I just don’t see them becoming AGI.

If we are going to truly move towards autonomy and a parallel stack of silicon intelligence, we need to reframe our approach, or else wait yet more decades and spend yet more billions. Eventually we will get there.

Reframing the Problem

Producing an auto-progressive system, which I’ll call a Universal Learning Machine, is the right way to frame the goal at a high level. The core question is: how do you kick-start the tiniest self-learning loop, and how do you scale it up? To answer this we need to define the dimensions of scaling and what the self-learning loop is.

A simple layman’s definition is that intelligence exists between sensor inputs and actuator outputs. This seems obvious, but it matters because your model of the world is constrained by the dimensions in which you can view it, which are in turn constrained by your sensor inputs. From this perspective, learning, and hence intelligence, should scale with the dimensions the machine can perceive in the world and with its ability to act on those dimensions to learn how to causally change them in the environment. This simple definition will entirely change the way we think about moving towards AGI, for a few key reasons.

For the intuition simply consider humans. What makes our intelligence “general”? Our ability to add arbitrary senses through machines like microscopes, infrared cameras, etc., the building of models from that sense input, and the ability to add novel actuators to perturb the world. I call this Universal Learning.

Now, consider a lesser version of this where instead of creating new sensors and actuators we focus only on the integration of known sensors and actuators. I call this Bounded Universal Learning. So, the test of Bounded Universal Learning is simply that you could plug in a new sense modality and paired actuator and the machine could learn to manipulate the modality and build a progressively more accurate model over time.
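The test of Bounded Universal Learning can be made concrete as an interface contract: any new sense/actuator pair that satisfies the interface can be plugged in at runtime, and the machine passes if its prediction error on that modality trends downward as it keeps acting and observing. The sketch below is only illustrative; the Sensor, Actuator, and BoundedLearner names are placeholders, not an existing API.

```python
from typing import Protocol, Any

class Sensor(Protocol):
    def read(self) -> Any: ...               # returns an observation in this modality

class Actuator(Protocol):
    def act(self, command: Any) -> None: ... # perturbs the environment in this modality

class BoundedLearner:
    """Passes the Bounded Universal Learning test if, for any sensor/actuator pair
    plugged in at runtime, its prediction error on that modality keeps decreasing."""

    def __init__(self):
        self.modalities = {}  # name -> (sensor, actuator, model)

    def plug_in(self, name: str, sensor: Sensor, actuator: Actuator, model) -> None:
        self.modalities[name] = (sensor, actuator, model)

    def step(self, name: str, command: Any) -> float:
        sensor, actuator, model = self.modalities[name]
        actuator.act(command)                    # perturb the world in this modality
        observed = sensor.read()                 # see what actually happened
        error = model.update(command, observed)  # compare with the model's prediction and learn
        return error                             # should trend downward over time
```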

I’ll walk through a set of key insights towards solving this problem.

Humans are Weak Universal Learning Machines

These operating dimensions are directly established by the senses and actuators the actor has evolved with, because they form the “container” of all its cognition. Contrary to popular belief, we should not see humans as a universally general intelligence, because our reference frame, our “inner world”, our internal “physics engine”, does not encompass all levels of existence; it encompasses a middle world of existence. We are able, with great effort, to contort our processing to learn about other levels of existence, but clearly we are not suited to it, and it is likely that we understand far less than we think we do.

Humans are an example of a weak Universal Learning Machine. We can learn and abstract because we have the ability to build new senses and actuators via tools. Our observable abstractions are, quite literally, tools that compress actions and sensors that collect information from hidden dimensions. This allows us to build more general models of the world, use those models to expand the ways we can manipulate the world, and so on in a virtuous cycle.

Basically, we take in sensor data, model the environment, project future environment possibilities, notice incongruity, invent new sensors, incorporate them, re-model, re-project, and so on. We do the same with tools to abstract actions away. This unique abstraction ability ultimately lets us progressively enhance our capabilities.

The reason we are a “weak” version is that we cannot plug an electromagnetic sensor directly into our eyes and see the entire electromagnetic spectrum. Instead we have to play telephone through our existing senses and use analogy to theorize about how these mysterious non-native dimensions work: we indirectly adapt our morphology by learning to interpret a non-built-in sense through another sense, piping it through our limited native set to form our model of reality. Consider the orders of magnitude more effort that goes into understanding non-native dimensions like electromagnetic waves, nuclear forces, and so on.

Ditto for actuators.

Note, technically we could invent the technology to modify our biology to become strong universal learning machines in the distant future.

Learning is Bounded by Model-Disproving Experiments

At the end of the day the Scaling Thesis is the lazy man’s proxy for something else — edge experiences. For the intuition simply consider that seeing the same scenario 100 times will have diminishing returns. All learning happens when you have model disproving experiences. Therefore, at best the Scaling Thesis is a proxy for these experiences.

This leads us to our first assertion: learning is bounded by model-disproving experiments. In other words, efficient learning is inherently adversarial. We already see the fusion of RLHF and other ways of giving feedback to models becoming central. This is intuitive.
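As a concrete illustration of treating model-disproving experiences as the unit of learning, here is a minimal sketch that filters a batch of experiences down to the ones the current model got wrong; the threshold, the model interface, and the experience format are illustrative assumptions.

```python
import numpy as np

def select_disproving_experiences(model, experiences, surprise_threshold=1.0):
    """Keep only the experiences that disprove the current model, i.e. where the
    prediction error exceeds a surprise threshold, and order them worst-first."""
    kept = []
    for state, action, outcome in experiences:
        predicted = model.predict(state, action)
        surprise = float(np.linalg.norm(np.asarray(outcome) - np.asarray(predicted)))
        if surprise > surprise_threshold:           # the model was wrong enough to learn from
            kept.append((state, action, outcome, surprise))
    return sorted(kept, key=lambda e: e[-1], reverse=True)
```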

This raises the question: what bounds model-disproving experiments?

Disproving Experiments are Bounded by Dimensions of Sense/Actuator Pairings

Consider the task of sorting red and blue balls that are otherwise identical. If you cannot see color, how could you learn the success criterion of placing red balls in the red basket and blue balls in the blue basket?

To disprove your model of the world you need to increase the dimensions with which you view the world. Now, not all dimensions matter equally in every frame of reference. Clearly, if you are operating at the microscopic layer the dimensions worth perceiving will be different than at the human “middle world” level.

Still, the dimensions you experience the world with are sense/actuator pairs. If you see color you can manipulate it and learn tasks that have to do with color. This means our learning is ultimately bounded by sense/actuator pairings which means we should focus on multi-modal models.

This crystallizes our first “law”: building Bounded Learning Machines (BLMs) depends on the ability to adversarially experiment with and test their model of the world by (1) adding new sensors or interpreting sensor data differently and (2) adding new actuators or using existing actuators differently.
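The ball-sorting example can be made concrete with a toy dataset: if the color dimension is missing from the observation, no amount of data lets a learner beat chance, while adding the dimension back makes the task trivial. This is a self-contained illustration, not part of the proposed system.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
color = rng.integers(0, 2, size=n)   # 0 = red, 1 = blue: the dimension that decides the task
size = np.full(n, 1.0)               # every other observable dimension is identical...
weight = np.full(n, 5.0)             # ...so it carries no information about the correct basket

# Observer A senses only (size, weight). Since those are identical across balls, the best
# any learner can do is guess the majority class: roughly 50% accuracy, forever.
ceiling_without_color = max(color.mean(), 1 - color.mean())

# Observer B has a color sensor, so the sorting rule is directly observable and learnable.
ceiling_with_color = 1.0

print(f"Accuracy ceiling without the color dimension: {ceiling_without_color:.2f}")
print(f"Accuracy ceiling with the color dimension:    {ceiling_with_color:.2f}")
```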

Disproving Experiments are Bounded by Novel Experiences

Let’s assume we accept all modalities into our model. We can universally inject arbitrary inputs through a Universal Transformer or similar device and encode the relationships between them. The limiting factor then becomes not the amount of experience we can accumulate (the scaling thesis) but the number of novel experiences. But what are the dimensions of novelty? How do we learn to seek out these novel experiences? How do we bias towards producing them?

There are two parts to this. First, we will never be able to seek novel experiences out perfectly, so we need to learn efficiently whenever we do detect a disproving experience. Second, we need to do our best to maximize the diversity of our training experiences. We can think of many ways to do this with the following dials:

  • using different morphologies (if embodied)
  • using many modalities
  • sourcing as much training and feedback as possible

We see these techniques happening now. If we threw truly infinite data at such a model it would produce AGI; the question is simply one of efficiency, time, and so on. The scaling thesis is, after all, effectively what evolution used to produce humanity.
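One concrete way to bias data collection toward novelty, assuming experiences can be embedded into vectors (for example by the model’s own encoder), is to score each candidate by its distance to everything already seen and collect the most distant ones first. The function names below are illustrative.

```python
import numpy as np

def novelty_scores(seen_embeddings: np.ndarray, candidate_embeddings: np.ndarray) -> np.ndarray:
    """Score each candidate experience by its distance to the nearest already-seen experience.
    A larger score means the candidate is more novel and more valuable to collect next."""
    # pairwise distances, shape (num_candidates, num_seen)
    dists = np.linalg.norm(candidate_embeddings[:, None, :] - seen_embeddings[None, :, :], axis=-1)
    return dists.min(axis=1)

def pick_most_novel(seen_embeddings: np.ndarray, candidate_embeddings: np.ndarray, k: int = 10):
    """Return the indices of the k most novel candidate experiences."""
    scores = novelty_scores(seen_embeddings, candidate_embeddings)
    return np.argsort(scores)[::-1][:k]
```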

Learning from Language is Learning by Proxy

There is no need for the machine to operate at our level of complexity to start. We should not begin by worrying about language, which sits at the top of the food chain: it is the most abstract modality and is a symbolic proxy for a tiny assortment of underlying models. The key point is simple:

  • intelligence precedes language

Birth the Baby then Grow the Baby up

A Learning Machine simply:

  • constructs a model of the world by integrating new sensor data (including entirely new types of senses) or by reinterpreting existing sensor data
  • tests the model by integrating new actuators or using old ones in novel ways to perturb the model of the world in new ways
  • repeat

What we don’t see in current systems is self-error detection and resolution: models assessing for themselves when an internal mini-model of something doesn’t seem quite right. For intuition, consider how people ask clarifying questions to check their understanding. We are able to hold multiple potential models as true at once and select or discard them based on new evidence.
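A minimal sketch of holding several candidate mini-models as simultaneously plausible and discarding them as evidence arrives; the likelihood-style weighting and the pruning threshold are illustrative choices, not part of the design.

```python
import math

class HypothesisSet:
    """Maintain several candidate mini-models of the same phenomenon, re-weight them as
    new observations arrive, and drop the ones the evidence has effectively disproved."""

    def __init__(self, hypotheses, prune_below=1e-3):
        # each hypothesis is assumed to expose predict(x) -> predicted numeric outcome
        self.weights = {h: 1.0 / len(hypotheses) for h in hypotheses}
        self.prune_below = prune_below

    def observe(self, x, outcome):
        # penalize hypotheses in proportion to how badly they predicted the outcome
        for h in self.weights:
            error = abs(h.predict(x) - outcome)
            self.weights[h] *= math.exp(-error)
        total = sum(self.weights.values())
        self.weights = {h: w / total for h, w in self.weights.items()}
        # discard hypotheses that new evidence has effectively ruled out
        self.weights = {h: w for h, w in self.weights.items() if w > self.prune_below}

    def best(self):
        return max(self.weights, key=self.weights.get)
```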

No matter how much data exists, we will never have efficient auto-progression without this capability. In essence, what I’m describing here is the auto-regressive nature of models. That is an important component of learning, but there is another layer that defines which data and experiences are generated to drive those auto-regressive model-formation algorithms.

We need to auto-generate experiential data, and the known way to do that is by having an environment, walking around, and testing things. For intuition, consider a baby: crawling around, putting everything in its mouth, touching everything, climbing everything. It is exploring its environment and learning a model of it.

Rather than try to build AGI straight out of the gate we should focus on making a baby and then growing it up.

The Dichotomy of Acting and Learning

While we are acting, we want to minimize unexpected results: what we predict to happen should happen. While learning, we want to maximize unexpected results: what we predict to happen should NOT happen. An unexpected result should then trigger a round of learning that progressively increases training on steps of that kind.

To accomplish this we have to auto-generate our own test cases by theorizing the fault line: which dimension did we fail to model appropriately? We generate a set of test cases, use trial and error to learn a better model, and repeat. Then we continue on with the learning process.

At the same time, we should also have opportunistic learning. While we are acting, we should note when a mistake was made and file it away for later training, but we still need to bias towards always taking the best action in the moment.
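A sketch of the acting/learning dichotomy as a simple control loop: in acting mode the agent always takes its best known action and merely files surprises away, while in learning mode it replays those mistakes and generates variations along the suspected fault line. The mode switch, queue, and model methods here are illustrative placeholders.

```python
from collections import deque

class ActLearnAgent:
    def __init__(self, model, surprise_threshold=1.0, max_queue=10_000):
        self.model = model
        self.surprise_threshold = surprise_threshold
        self.mistakes = deque(maxlen=max_queue)   # opportunistically filed for later training

    def act(self, state):
        """Acting mode: minimize the unexpected. Always take the best known action."""
        action = self.model.best_action(state)
        predicted = self.model.predict(state, action)
        return action, predicted

    def observe(self, state, action, predicted, outcome):
        """While acting, note mistakes but keep acting on the best available policy."""
        surprise = self.model.error(predicted, outcome)
        if surprise > self.surprise_threshold:
            self.mistakes.append((state, action, outcome))

    def learn(self):
        """Learning mode: maximize the unexpected. Replay filed mistakes, generate
        variations along the suspected fault line, and retrain on them."""
        while self.mistakes:
            state, action, outcome = self.mistakes.popleft()
            for test_case in self.model.propose_variations(state, action):
                self.model.train_on(test_case)
            self.model.train_on((state, action, outcome))
```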

Preliminary Technical Design Study

1. Abstract

Our current AI approaches are at best inefficient and at worst uncertain ever to reach “AGI”. The first problem is that there is no formal definition of “AGI”. To avoid the baggage of that term I propose a new device called a Universal Learning Machine. A Universal Learning Machine is able to invent new sense/action pairs and efficiently perturb the environment to build progressively more general models of the world. We call it “Universal” because it can learn at any frame of reference by generating sense/action pairs and pruning them down to the contextually relevant ones. This means it could be equally intelligent at the quantum scale as at the “middle world” and larger scales.


This is in contrast to a Bounded Learning Machine, like a cat. A cat is unable to learn to read an oscilloscope or to use tools; it cannot learn outside the dimensions it has available to integrate with. It is still able to incorporate new sense/action pairs, but they need to be given to it by a human operator.


We refer to the shared subset of traits as a Learning Machine, which is effectively a Bounded Learning Machine. This is where we will focus our discussion. What separates a Learning Machine from current ML techniques is its ability to “intrude” on the learning process when it has insufficient data to generalize. It can do this through direct testing when embodied, or through requests to human operators when disembodied.

This enables the machine to seek out and optimize its own active learning. We take this process one step further and propose adversarial active learning: whenever we train a model we should combine model generation (traditional techniques) with synthetic data simulation. This training process requires three inter-related models that train adversarially:

  • the synthetic data generator aka the imagination
  • the actor aka the traditional model
  • the expectation policy

There will then be two modes: the “learning” mode and the “acting” mode.
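A structural sketch of how the three inter-related models and the two modes could fit together; every class, method, and environment call below is an illustrative placeholder for the design described above, not an existing API.

```python
class Imagination:
    """Synthetic data generator: proposes scenarios intended to surprise the actor."""
    def propose(self, context):
        raise NotImplementedError

class Actor:
    """The traditional model: maps observations to actions and trains on outcomes."""
    def act(self, observation):
        raise NotImplementedError
    def train_on(self, scenario, outcome):
        raise NotImplementedError

class ExpectationPolicy:
    """Predicts what should happen; its error signal is the surprise that drives training."""
    def expected_outcome(self, observation, action):
        raise NotImplementedError
    def surprise(self, expected, observed):
        raise NotImplementedError

def learning_step(imagination, actor, expectation, environment):
    # Learning mode: the imagination tries to maximize surprise, while the actor and
    # expectation policy try to drive it back down -- the adversarial loop.
    scenario = imagination.propose(environment.context())
    observation = environment.reset_to(scenario)
    action = actor.act(observation)
    expected = expectation.expected_outcome(observation, action)
    observed = environment.step(action)
    surprise = expectation.surprise(expected, observed)
    actor.train_on(scenario, observed)
    return surprise   # fed back to the imagination as its reward: high surprise = good proposal

def acting_step(actor, environment):
    # Acting mode: simply take the best action; surprises are logged elsewhere for later learning.
    return environment.step(actor.act(environment.observe()))
```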

2. Table of Contents

  1. Abstract
  2. Table of Contents
  3. Background
  4. The Bounded Learning Machine
    1. Overview
    2. Acting Mode
      1. Goals Through Foveated Target States
      2. Reverse Kinematics Imagination for Planning
      3. Expectation Function for Selecting
    3. Learning Mode
    4. The Model — Universal Transformer
    5. Imagination — Progressive Self-Prompting
    6. Expectation Function — Contextualized Loss Function
  5. Conclusions
  6. Future Work

3. Background

4. The Bounded Learning Machine

5. Conclusions

6. Future Work

SLAM for N-Dimensional Spaces

The goal here is to learn hierarchical representations for N-dimensional domains. Learning multiple levels of representation lets us build self-prompting models similar to LLMs. An LLM can produce a high-level plan because it can represent information at a higher level based on the input from the user; the user effectively sets the reference frame. Given a high-level reference frame, a model can then self-prompt, recursing down to the lowest level and taking action. For LLMs we can train on a large corpus of data for planning purposes. How could we do something similar for spatial reasoning? We could do it manually. How is this done in games with dynamic resolution drop-out? Perhaps that is the answer. What if the output of our transformer were two-fold: a representation of the world at the highest level it is capable of, and a representation of what is seen through its eyes at that point? From this perspective it is like a meet-in-the-middle situation.

How would we test this idea? First we need to understand how games handle this today: how does the whole map degrade when we zoom out? In games this is done with Level of Detail (LOD) techniques. What we might do is train the model to do SLAM on its own by having it produce a projected environment in simulation and then applying a loss between the projected environment and the real one to see whether they match up. In fact, the model should be able to predict the camera view of anything within reach of the scene that it projects, including its own sensor input, even though it does not have control over the whole scene.
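A minimal sketch of that self-supervised test, assuming a PyTorch-style model with two heads: one emitting a coarse, zoomed-out scene summary and one predicting the camera view from a given pose, trained only with a reconstruction loss against the real frame. The dimensions and architecture are placeholder choices.

```python
import torch
import torch.nn as nn

class TwoHeadedWorldModel(nn.Module):
    """Encodes sensor input into a latent scene, then (1) summarizes the scene at a coarse,
    zoomed-out level and (2) predicts the raw camera view for a given pose. Only the view
    prediction is directly supervised; the coarse map must become useful because the view
    is decoded from it."""

    def __init__(self, obs_dim=1024, pose_dim=6, latent_dim=256, coarse_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.coarse_head = nn.Linear(latent_dim, coarse_dim)     # zoomed-out scene representation
        self.view_head = nn.Sequential(                          # ego-view prediction
            nn.Linear(coarse_dim + pose_dim, 512), nn.ReLU(), nn.Linear(512, obs_dim)
        )

    def forward(self, observation, pose):
        latent = self.encoder(observation)
        coarse = self.coarse_head(latent)
        predicted_view = self.view_head(torch.cat([coarse, pose], dim=-1))
        return coarse, predicted_view

def training_step(model, optimizer, observation, pose, actual_view):
    """Loss: does the projected environment match what the camera actually sees?"""
    _, predicted_view = model(observation, pose)
    loss = nn.functional.mse_loss(predicted_view, actual_view)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```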

Strong Universal Learning Machines

Testing For (Weak) Universality

Testing for Strong Universality

[Needs Review] Starting Points & Problem Decomposition

[Needs Review] The Challenges

Though this process may appear linear, it has common sub-systems that are non-linear. For example, we need both to learn from stored memories during “sleep” and to use stored memories as a template while learning in real time. That real-time learning will impact how we build-sim-select-act-learn, but the sequence is still a useful scaffold to tie the process together.

It’s worth addressing how current approaches, like the scaling hypothesis, are unlikely to produce a Universal Learning Machine. See The Fallacies of AI.

The Approach

The Roadmap

Experimental Results

Implications

Prior Drafts

🔁[Archive] Universal Learning Machines Whitepaper