My goal in this dev log was to make the simplest possible first custom environment and take the first baby step towards auto-curriculum generation.
My long-term goal is to show how we can frame the problem so that feedback signal and speed are maximized, and thus, by extension, sample efficiency, letting us train smart systems at 100-1000x lower cost in both compute and time.
This will lead to exponential growth that will look like nothing at first but, I believe, will rapidly overtake companies that are spending O($100M) on brute-force training today.
Demo
Federated learning:
Curriculum generation in the barest sense of the term: we generate the simplest possible synthetic scenarios (moving the block around randomly). To be a true curriculum, we would need a higher-level skill built from a Russian doll of nested behaviors. The basic modification that makes this a "curriculum" is starting the block on top of the agent and, on each intersect, moving the block a bit farther away.
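The escalation rule above can be sketched in a few lines. This is an illustrative sketch, not the demo's actual code; class and parameter names are my own assumptions:

```python
import math
import random

class EscalatingCurriculum:
    """Simplest possible 'curriculum': the block starts on top of the
    agent, and each time the agent intersects it, the block respawns a
    bit farther away. Hypothetical names, not from the demo code."""

    def __init__(self, step=0.5, max_radius=10.0):
        self.radius = 0.0            # block begins on the agent
        self.step = step             # how much farther after each success
        self.max_radius = max_radius

    def on_intersect(self):
        # Agent reached the block: push the next spawn a bit farther out.
        self.radius = min(self.radius + self.step, self.max_radius)

    def spawn_block(self, agent_xy):
        # Place the block at the current radius in a random direction.
        angle = random.uniform(0.0, 2.0 * math.pi)
        return (agent_xy[0] + self.radius * math.cos(angle),
                agent_xy[1] + self.radius * math.sin(angle))
```

Capping the radius keeps the hardest scenario bounded; everything else about difficulty comes from where the block respawns.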
Discussion
It is central to my thesis that we can project goals onto the environment and use those projections to provide unambiguous geometric feedback in real time, creating the tightest possible feedback loop with the highest possible signal.
I’m not there yet; for now I’m simply learning to use the tools available to me.
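As a sketch of what that geometric feedback could eventually look like (not something implemented yet, and the function name is my own): a dense reward equal to how much closer each step moved the agent to the projected goal.

```python
import math

def geometric_feedback(prev_xy, curr_xy, goal_xy):
    """Dense, unambiguous reward: the decrease in distance to the
    projected goal this step. Positive = progress, negative = regress.
    Illustrative sketch only, not the demo's actual reward."""
    prev_d = math.dist(prev_xy, goal_xy)
    curr_d = math.dist(curr_xy, goal_xy)
    return prev_d - curr_d
```

Because the signal is computed from geometry every step, there is no ambiguity and no waiting for a sparse terminal reward.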
In this case the agent doesn’t get any aid on how to move to the target, but you can see that it quickly generalizes once it figures out how to achieve a reward. In just a few minutes, a “mini-brain” is trained from scratch that can see and navigate towards targets.
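The setup described above amounts to a sparse-reward environment: the agent observes the target’s relative position but is rewarded only on intersecting it. A minimal sketch, with hypothetical names and a gridworld stand-in for the actual demo environment:

```python
import random

class SparseTargetEnv:
    """Minimal sketch: the agent 'sees' the target's relative position
    but gets reward ONLY on intersecting it. Not the demo's actual
    environment; names and geometry are assumptions."""

    ACTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}  # up/down/right/left

    def __init__(self, size=10):
        self.size = size
        self.reset()

    def reset(self):
        self.agent = (self.size // 2, self.size // 2)
        self.target = (random.randrange(self.size), random.randrange(self.size))
        return self._obs()

    def _obs(self):
        # Observation: target position relative to the agent.
        return (self.target[0] - self.agent[0], self.target[1] - self.agent[1])

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x = min(max(self.agent[0] + dx, 0), self.size - 1)
        y = min(max(self.agent[1] + dy, 0), self.size - 1)
        self.agent = (x, y)
        done = self.agent == self.target
        reward = 1.0 if done else 0.0  # sparse: no shaping toward the target
        return self._obs(), reward, done
```

The only learning signal is the terminal 1.0, which is exactly why the quick generalization is notable.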
That’s amazing. I haven’t had this much fun since my first program.
Decision
What matters most is proving out the core concepts of auto-didactic learning: agents that can
- generate their own curriculum (theorize and imagine)
- self-interrupt (like a pseudo consciousness)
- generate regressively easier curricula
- request help for training
- decide to test in the “real”
- interleave with samples from the “real” world (experiment)
The next step will be to add unambiguous overlays for path traveling. They can be overlaid algorithmically and randomly with dropout. This will be one form of a “tutor” under 2b.
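A sketch of what such an overlay could look like, under my own assumptions (function names, straight-line paths, and a uniform dropout rate are all illustrative): generate a waypoint path algorithmically, then randomly drop waypoints so the agent can’t lean on the tutor completely.

```python
import random

def straight_line_waypoints(start, goal, n=10):
    """Evenly spaced points from start to goal (a simple algorithmic overlay)."""
    (x0, y0), (x1, y1) = start, goal
    return [(x0 + (x1 - x0) * i / n, y0 + (y1 - y0) * i / n)
            for i in range(1, n + 1)]

def path_overlay(waypoints, dropout=0.3, rng=random):
    """'Tutor' overlay with dropout: each waypoint is independently
    removed with probability `dropout`, so the hint is never complete."""
    return [wp for wp in waypoints if rng.random() >= dropout]
```

Annealing `dropout` upward over training would gradually wean the agent off the tutor.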