This post is Part 2/3 describing our roadmap to self-hosting AIs. In the last post, I presented a prototype neural programmer. Here, I unpack the takeaways from that experiment: 1) the critical need to add theorem proving to deep learning, 2) inventiveness, and 3) generalization.
1. Deep learning needs theorem proving to transition to the next stage
Meaningfully combining theorem proving with gradient descent will unlock applications that go beyond the “one second rule”1. In particular, it can open up avenues for AI apps requiring reasoning over structured documents.
Pinball networks would do well over any structured documents, not just programs
In structured documents, a misplaced token can change the meaning of the entire document. Programs are the stereotypical example, but these qualities are not limited to programming. Observational datasets such as time-series logs or financial data have them; so do tabular relational data such as schedules, and legal contracts. Applying pinball networks to domains outside of programming is an open area of investigation.
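To make that token-sensitivity concrete, here is a toy sketch (my example, not from the prototype) where swapping a single token silently changes a function's meaning:

```python
def count_nonneg(xs):
    # Counts entries that are >= 0.
    return sum(1 for x in xs if x >= 0)

def count_pos(xs):
    # One token changed (>= became >): zeros are now excluded,
    # silently altering the semantics of the whole function.
    return sum(1 for x in xs if x > 0)

print(count_nonneg([0, 1, -2]))  # 2
print(count_pos([0, 1, -2]))     # 1
```

A continuous model that is "almost right" about every token can still be entirely wrong about the program.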
Theorem proving adds “treble” to deep learning’s “bass”
Theorem proving brings the details, while the network brings the intuition. Programs have both high-level organizational structure and fine-grained details that dictate the semantics of the whole; the combination addresses both. It is unlikely that either one by itself can successfully reason about structured documents:
Can a deep network by itself write functioning code? The answer to this is as yet unknown. I would conjecture some form of discrete reasoning is necessary and pure continuous-domain optimizations will not suffice.
Can theorem proving by itself write code? As prior results showed, it might, but not scalably. So in practice the answer to this question is no. The deep network brings the “human connection” by limiting the search to candidates that humans comprehend and have written in the past.
Theorem proving provides the lower bound: any structure that is lower in the partial ordering is invalid and eliminated by the constraints. The deep network provides the upper bound: while the space of expressions, statements, control flow, functions, and programs is infinite, the learnt network limits itself to structures similar to those seen in the wild.
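A minimal sketch of this division of labor, with a toy three-token grammar standing in for the constraint solver and an invented lookup table standing in for the learned model:

```python
from itertools import product

TOKENS = ["x", "+", "1"]

def is_valid(expr):
    # Lower bound (stand-in for the theorem prover): a well-formed
    # expression is operand, operator, operand.
    return (len(expr) == 3 and expr[0] in ("x", "1")
            and expr[1] == "+" and expr[2] in ("x", "1"))

def network_score(expr):
    # Upper bound (stand-in for the learned network): prefer
    # expressions resembling code seen in training.
    seen_in_wild = {("x", "+", "1"): 0.9, ("1", "+", "x"): 0.4}
    return seen_in_wild.get(tuple(expr), 0.0)

# Constraints eliminate invalid structures; the network ranks the rest.
candidates = [list(p) for p in product(TOKENS, repeat=3)]
valid = [c for c in candidates if is_valid(c)]
best = max(valid, key=network_score)
print(best)  # ['x', '+', '1']
```

Of 27 candidate token sequences, the constraints keep 4; the network picks the one most like human-written code.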
2. Human readability vs machine invention
The other important consideration is the tradeoff between human-readability and machine invention. Pinball networks can be architected to be network-heavy or constraint-heavy; the current prototype is network-heavy. Making it network-heavy produces more interpretable results. Making it constraint-heavy enables it to produce programs from first principles over atomic programming tokens, which leads to code that can be hard for humans to understand. In 2009, my pure constraint-based generator would take 100ms to build tiny programs that took me 10-20 minutes to understand. In 2017, I believe scalability and readability go hand in hand. I should probably revise this old tweet to “Unscalable AI produces cryptic but correct programs: No scare needed”. In 2019, let us see if we are ready for “Scalable AI produces readable programs: Still no scare needed.”
scared AND elated. My system is generating proofs that are at the limit of my reasoning capabilities!— _saurabh (@_saurabh) April 23, 2009
3. Generalization
The last and very important consideration is the existence, or lack, of generalization. What if the network is simply memorizing, at best doing a fuzzy hash comparison? Pinball networks have considerations similar to DNNs2 3, and are possibly slightly better because of their “snap-to-grid” properties.
We need to generalize enough so that the usual benefits of encoding in a high-dimensional continuous space are not lost.
E.g., we should have the ability to do arithmetic over programs. An input of [linked list, reverse] + [array, sorting] - [array, reverse] should give linked list sorting, even if the dataset only contained array sorting.
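The hoped-for behavior is analogous to word-vector analogies. A sketch with invented 2-D embeddings, one coordinate for the data structure and one for the algorithm:

```python
# Hypothetical embeddings; real ones would be learned, not hand-set.
emb = {
    ("array", "reverse"):       [0.0, 0.0],
    ("array", "sorting"):       [0.0, 1.0],
    ("linked_list", "reverse"): [1.0, 0.0],
    ("linked_list", "sorting"): [1.0, 1.0],
}

def vec_op(u, v, w):
    # Elementwise u + v - w.
    return [a + b - c for a, b, c in zip(u, v, w)]

def dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

query = vec_op(emb[("linked_list", "reverse")],
               emb[("array", "sorting")],
               emb[("array", "reverse")])

# Nearest neighbour in embedding space recovers the analogous program.
nearest = min(emb, key=lambda k: dist2(emb[k], query))
print(nearest)  # ('linked_list', 'sorting')
```

If the embedding space factors data structure and algorithm into separable directions, the arithmetic lands on the unseen combination.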
In the last post, the neural programmer showed a hint of generalization when it built a linked list
struct for a previously unseen combination of list algorithms. That gives us hope, but a lot more work is needed.
Should we be avoiding memorization?
I am conflicted about how much I want the system to memorize. If we could hit the sweet spot where memorized knowledge generalizes, I would be happy. E.g., humans can easily construct a topological sort over a graph of strings if they have previously memorized the algorithm over integers. Vanilla deep networks have an unlimited capacity to memorize3. Pinball networks’ memorization ability is smaller, but still worth worrying about.
In this prototype I tried hard to make exact memorization virtually impossible. Dataset generation deliberately introduces noise: 1) semantics-breaking transformations are permitted during mutation, and 2) semantics-preserving but syntactically distinct mutants force the neural programmer to learn semantic patterns rather than syntactic ones. E.g., the programmer has to understand that function order usually does not make a difference.
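One semantics-preserving mutation of the second kind can be sketched as follows (the function strings are placeholders; the prototype's actual mutation machinery is richer):

```python
import random

def shuffle_functions(functions, seed=0):
    # Semantics-preserving mutation: reorder top-level function
    # definitions. Program behaviour is unchanged, but the token
    # sequence differs, so exact memorization of the text fails.
    mutated = list(functions)
    random.Random(seed).shuffle(mutated)
    return mutated

program = ["def push(s, x): ...", "def pop(s): ...", "def reverse(s): ..."]
mutant = shuffle_functions(program)
print(mutant)
```

The mutant is a permutation of the original, so any model that scores it differently from the original is keying on syntax, not semantics.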
Deep learning desperately needs an acceptable quantitative measure for generalization. Something akin to:

number of unseen *valid* variants / number of training variants

over each class.
Validity is impossible to define for images or audio but is trivially available for programs, which makes programming a great test-bed.
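A sketch of how such a score might be computed; the per-class dictionary shape and names are my assumptions, not an established metric:

```python
def generalization_score(unseen_valid, training):
    # Per-class ratio of unseen *valid* variants the model produces
    # to the training variants it saw for that class.
    return {c: len(unseen_valid[c]) / len(training[c]) for c in training}

training = {"list_reverse": ["v1", "v2", "v3", "v4"]}
unseen_valid = {"list_reverse": ["v5", "v6"]}
print(generalization_score(unseen_valid, training))  # {'list_reverse': 0.5}
```

The hard part is not the ratio but the validity check in the numerator, which is exactly what programs make cheap.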
Programming is a great test-bed for generalization ability
Suppose the deep network can draw broad strokes, giving us the right components of the mix. The earlier prototype gives hope that it can: when asked for a program to reverse + filter over a linked list, it rightly outputs a struct data structure at the top. The network gets us in the approximate vicinity of the solution, and that is useful. Have theorem proving be informed by these approximate solutions, and we can “snap-to-grid” to precise code.
Imagine asking a human to sketch a bedroom containing a window, a bed, and an airplane. I would bet most humans would draw elements inspired by their memory, and would ensure internal consistency in the scene using implicit constraints. E.g., the windows will not be on the floor, the bed will have four legs, and the plane will be a toy instead of a real 747. Similarly in programming, it is rare for day-to-day programmers to invent new algorithms. Our aim is to have the neural programmer be in the 25-50th percentile of human programmers. Should be doable.
The presence of hierarchical organization, abstraction, and modularity makes program synthesis a very unique domain in which to study the generalization of deep networks. Generalization for the neural programmer will get a post of its own later.
Follow us on sparkz_ai
Come work with us. Email email@example.com
Pretty much anything that a normal person can do in <1 sec, we can now automate with AI.— Andrew Ng (@AndrewYNg) October 19, 2016
“Understanding deep learning requires rethinking generalization”. Same in pinball networks as in DNNs. Possibly slightly better because pinball networks have snap-to-grid properties.