2/3: Theorem proving elevates deep learning.
This post is Part 2 of 3 describing our roadmap to self-hosting AIs. In the last post, I presented a prototype neural programmer. Below, I unpack the takeaways from that experiment, discussing 1) the critical need to add theorem proving to deep learning, 2) inventiveness, and 3) generalization.
1. Deep learning needs theorem proving to transition to the next stage
Meaningfully combining theorem proving with gradient descent will unlock applications that go beyond the “one second rule”^{1}. In particular, it can open up avenues for AI apps requiring reasoning over structured documents.
Pinball networks should do well on any structured document, not just programs
In structured documents, a misplaced token can change the meaning of the entire document. Programs are the stereotypical example of structured documents, but these qualities are not limited to programming. Observational datasets such as time-series logs or financial data have them; so do tabular relational data such as schedules; and legal contracts are similarly structured. Applying pinball networks to domains outside of programming is an open area of investigation.
Theorem proving adds “treble” to deep learning’s “bass”
Theorem proving brings the details, while the network brings the intuition. Programs have high-level organizational structure and fine-grained details that together dictate the semantics of the whole; the combination addresses both. It is unlikely that either one by itself can successfully reason about structured documents:

Can a deep network by itself write functioning code? The answer is as yet unknown. I would conjecture that some form of discrete reasoning is necessary and that pure continuous-domain optimization will not suffice.

Can theorem proving by itself write code? As prior results showed, it might, but not scalably, so in practice the answer is no. The deep network brings the “human connection” by limiting the search to programs that humans comprehend and have written in the past.
Theorem proving provides the lower bound: any structure that is lower in the partial ordering is invalid and eliminated by the constraints. The deep network provides the upper bound: while the space of expressions, statements, control flow, functions, and programs is infinite, the learnt network limits itself to structures similar to those seen in the wild.
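The interplay of the two bounds can be sketched in a few lines. This is a hypothetical illustration, not the actual pinball-network pipeline: the function names (`synthesize`, `propose`, `valid`) and the toy expression "programs" are invented, with the network stubbed as a ranked candidate list and the prover stubbed as an exact input/output check.

```python
def synthesize(spec, propose, valid, k=100):
    """Return the first network-proposed program that the checker accepts."""
    for program in propose(spec, k):   # upper bound: a few human-like candidates
        if valid(program, spec):       # lower bound: must provably meet the spec
            return program
    return None

# Toy instantiation: "programs" are arithmetic expressions in x; the
# "network" returns a ranked list, and the "prover" is an exact check
# against input/output examples.
spec = {"inputs": [1, 2, 3], "outputs": [2, 4, 6]}
ranked = ["x + 1", "x * 2", "x ** 2"]
result = synthesize(
    spec,
    propose=lambda s, k: ranked[:k],
    valid=lambda p, s: all(eval(p, {"x": i}) == o
                           for i, o in zip(s["inputs"], s["outputs"])),
)
# The first candidate fails the check; the second is accepted.
```

The point of the sketch: neither component alone finds `x * 2`. The proposer has no notion of correctness, and the checker alone would face an infinite expression space.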
2. Human readability vs machine invention
The other important consideration is the trade-off between human readability and machine invention. Pinball networks can be architected to be network-heavy or constraint-heavy; the current network is network-heavy. Making it network-heavy produces more interpretable results. Making it constraint-heavy enables it to produce programs from first principles over atomic programming tokens, which leads to code that can be hard for humans to understand. In 2009, my pure constraint-based generator would take 100 ms to build tiny programs that took me 10-20 minutes to understand. In 2017, I believe scalability and readability go hand-in-hand. I should probably revise this old tweet to “Unscalable AI produces cryptic but correct programs: no scare needed.” In 2019, let us see if we are ready for “Scalable AI produces readable programs: still no scare needed.”
scared AND elated. My system is generating proofs that are at the limit of my reasoning capabilities!
— _saurabh (@_saurabh) April 23, 2009
3. Generalization
The last, and very important, consideration is the existence or lack of generalization. What if the network is simply memorizing, and at best doing a fuzzy hash comparison? Pinball networks have considerations similar to DNNs^{2} ^{3}, possibly slightly better because of their “snap-to-grid” properties.
We need to generalize enough so that the usual benefits of encoding in a highdimensional continuous space are not lost.
E.g., we should have the ability to do arithmetic over programs. An input of [linked list, reverse] + [array, sorting] − [array, reverse] should give linked list sorting, even if the dataset only contained array sorting.
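This kind of arithmetic is a word2vec-style analogy in program-embedding space. Here is a minimal sketch under an invented assumption: that a program embeds as the sum of its concept vectors. The vectors are random placeholders; a real system would learn the embeddings jointly with the neural programmer.

```python
import numpy as np

# Toy concept vectors (invented for illustration).
rng = np.random.default_rng(0)
concepts = {c: rng.normal(size=16)
            for c in ["linked list", "array", "reverse", "sorting"]}

def embed(*parts):
    # Assumption: a program embeds as the sum of its concept vectors.
    return sum(concepts[p] for p in parts)

query = (embed("linked list", "reverse")
         + embed("array", "sorting")
         - embed("array", "reverse"))

# Decode by snapping the query point to the nearest known program.
programs = {name: embed(*name) for name in
            [("array", "sorting"), ("array", "reverse"),
             ("linked list", "reverse"), ("linked list", "sorting")]}
best = min(programs, key=lambda n: np.linalg.norm(programs[n] - query))
```

Under the additive assumption the array and reverse vectors cancel exactly, so the query lands on linked-list sorting even though that combination never had to appear in training.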
In the last post, the neural programmer showed a hint of generalization when it built a linked list struct for a previously unseen combination of list algorithms. That gives us hope, but a lot more work is needed.
Should we be avoiding memorization?
I am conflicted about how much I want the system to memorize. If we could hit the sweet spot where whatever is memorized can also be generalized, I would obviously be happy. E.g., humans can easily construct a topo sort over a graph of strings if they have previously memorized the algorithm over integers. Vanilla deep networks have an unlimited capacity to memorize^{3}. Pinball networks’ memorization capacity is smaller, but still worth worrying about.
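The topo-sort analogy is easy to make concrete: the memorized algorithm is parametric over the node type, so the exact same code sorts a graph of integers and a graph of strings. A small sketch using the standard library (the two example graphs are invented):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each graph maps a node to its prerequisites. Same algorithm, two
# different node types -- this is the generalization we want memorization
# to come with.
int_graph = {3: {1, 2}, 2: {1}}
str_graph = {"deploy": {"build", "test"}, "test": {"build"}}

int_order = list(TopologicalSorter(int_graph).static_order())
str_order = list(TopologicalSorter(str_graph).static_order())
```

A network that memorized topo sort only as a pattern over integer tokens would fail on the string graph; one that captured the underlying dependency structure would not.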
In this prototype I tried hard to make exact memorization virtually impossible. Dataset generation deliberately introduces noise: 1) semantics-breaking transformations were permitted during mutation, and 2) semantics-preserving but syntactically distinct mutants force the neural programmer to learn semantic patterns rather than syntactic ones. E.g., the programmer has to understand that function order usually does not make a difference.
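The function-order example is the simplest semantics-preserving mutation to write down. A sketch (not the prototype's actual mutator; the helper name and sample source are invented):

```python
import ast
import random

def shuffle_functions(source, seed=0):
    """Semantics-preserving, syntax-changing mutation: permute top-level
    function definitions. Valid as long as the functions only refer to
    each other at call time, not at definition time."""
    tree = ast.parse(source)
    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    rest = [n for n in tree.body if not isinstance(n, ast.FunctionDef)]
    random.Random(seed).shuffle(funcs)
    tree.body = funcs + rest
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+

src = "def f(x):\n    return g(x) + 1\n\ndef g(x):\n    return 2 * x\n"
mutant = shuffle_functions(src, seed=1)  # same behaviour, possibly different text
```

Training on such pairs pushes the network toward the invariant (call-graph semantics) rather than the surface form (definition order).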
Deep learning desperately needs an acceptable quantitative measure for generalization, something akin to (number of unseen *valid* variants) / (number of training variants) over each class.
Validity is impossible to define in images/audio but is trivially available in programs, which makes programming a great testbed.
Programming is a great testbed for generalization ability
Suppose the deep network can draw broad strokes, giving us the right components of the mix. There is hope it will succeed, as evidenced in the prototype earlier: when asked for a program to reverse + filter over a linked list, it rightly outputs a struct data structure at the top. The network gets us into the approximate vicinity of the solution, and that is useful. Have theorem proving be informed by these approximate solutions, and we can “snap-to-grid” to precise code.
Imagine asking a human to sketch a bedroom containing a window, a bed, and an airplane. I would bet most humans would draw elements inspired by their memory, and would ensure internal consistency in the scene using implicit constraints: the windows would not be on the floor, the bed would have four legs, the plane would be a toy instead of a real 747. Similarly in programming, it is rare for day-to-day programmers to invent new algorithms. Our aim is to have the neural programmer be in the 25th-50th percentile of human programmers. Should be doable.
The presence of hierarchical organization, abstraction, and modularity makes program synthesis a unique domain in which to study the generalization of deep networks. Generalization for the neural programmer will get a post of its own later.
Get involved
Follow us at @sparkz_ai
Come work with us. Email saurabh@sparkz.org
References

1. “Pretty much anything that a normal person can do in <1 sec, we can now automate with AI.” — Andrew Ng (@AndrewYNg), October 19, 2016. See also HBR: “What Artificial Intelligence Can and Can’t Do Right Now.” ↩
2. “Opening the black box of Deep Neural Networks via Information” and talk video. ↩
3. “Understanding deep learning requires rethinking generalization”. Same in pinball networks as in DNNs; possibly slightly better because pinball networks have snap-to-grid properties. ↩