This post is Part 1/3 describing our roadmap to self-hosting AIs.

By 2019, the Sparkz marketplace will host an AI that can write C, Python, Java code like humans. We are aiming for it to be at the 25-50th percentile of human programmers. The current state-of-the-art in AI will not get there, and we need to invent new techniques. Below I describe initial steps in that direction.

Pinball networks

Recent work in learning to learn (i.e., discovering hyper-parameters1, or architectures2 3 4) should not be surprising to anybody in the field. The more important next step is whether we can break out of the conceptual sandbox of a deep network? I.e., can a neural programmer write C or Python code? We need that for an unbiased comparison against human programmers. Techniques beyond pure continuous domain optimization are needed5.

I conjecture that we need to augment continuous gradient descent with discrete theorem proving. To evaluate this combination, I built a toy prototype.

Let’s call the new architecture pinball networks. Pinball machines have both gradients and step changes, so seems appropriate. This architecture searches in the discrete domain using theorem proving, in the form of SMT6. I like Z37 as the solver for this part. The architecture searches in the continuous domain using stochastic gradient-descent variants typical of deep networks.

Human programmers can pick up some level of programming ability in a few weeks to months of learning. Let us observe an AI learning to do the same. The video below shows the AI learning over 36 hours. For clarity it only shows two of the many algorithms the AI is simultaneously learning. The video looks prettier as a high-res mp4 (8 MB) or animated gif (9 MB).

Initially, between hours 0-5, it generates gibberish. Soon after at about 6 hours, it stumbles upon the right form for code. It starts getting some details correct in the latter half at about the 25 hour mark, and eventually generates human-readable code after 36 hours.

It takes 3 steps to get here:

  1. Build an algorithms dataset to train pinball networks. Re-purposing techniques from optimizing compilers, we synthetically expand to a dataset of billions training examples. We apply (mostly) semantics preserving transformations to mutate source code. E.g., if (C) then B1 else B2 can always be replaced with if (!C) then B2 else B1.
  2. Build a neural compiler. A neural compiler is a continuous domain approximation of a proper compiler.
  3. Build a neural programmer. We use the neural compiler in a GAN framework to train a neural programmer.

Dataset and neural compiler

I built a bare-bones lexer, parser for C 8. Transformation rules are applied post-lexing. Dataset creation takes about a day on a 24 core machine. I am very liberal in rule application. We are OK with semantics-breaking transformations because we can test for correctness later. We approximate functional equivalence using testing against a test suite. Seed programs come with labels such as compiles, data structure, algorithm, space complexity, and time complexity. Mutants are generated using transformation rules. Compilation and testing allows appropriate labels to be recomputed. I can envision the full scope of static analysis being applied to algorithmically infer labels. For instance, you could apply worst-case execution time analysis to infer time-complexity. Variants of type inference algorithms could generate type-varying mutants. E.g., BST over objects from a seed of BST over floats.

Now that we have a proper dataset, we can train our neural compiler. I trained the neural compiler for about 24 hours. We get 97% precision. Clearly on the prediction of “does compile” its performance is upper bounded by that of a native compiler. There are other benefits though. It can “predict” functional equivalence, space, and time complexity in P-time! Undecidable problems all. Of course its “prediction” are nothing but fuzzy hashes against what it has seen before. No theoretical results are being harmed in the making of this. Additionally, there will be biases encoded as well. Suppose, we trained it over every program from Knuth, CLR, and Stack Overflow. That will still leave it limited to understanding the minuscule fraction of all algorithms and data structures that humans have invented. Yet at the same time, it is cool to see it predict:

{
  "compiles": "false",
  "algorithm": "search",
  "data structure": "binary tree",
  "time": "O(log n)"
}

If you wanted to go crazy, you would take predicted output, and spit out machine code for the best known optimized implementation of the same. Imagine the potential of cloning 100x engineers: undergrad code => neural compiler => Jeff Dean code9 :)

Neural programmer

Once we have a neural compiler we plug it into a GAN setup and train a neural programmer. Results are better than anticipated.

For this prototype experiment, I trained the system over a few basic algorithms. This version has programs in C. A Python front-end is in the works8. Basic unit operations are covered: arithmetic (commutativity, associativity), expressions (boolean and arithmetic), statements, loops and conditional. Three data structures are covered: arrays, linked-lists, and trees. Many algorithms over those data structures are covered: sorting in O(n^2) and O(n log n), list reversal, delete, filter, map; tree traversal inorder, preorder, postorder, and search.

This version of the neural programmer does not care about descriptive names and comments. So alongside each generated code, is another manually edited version with two modifications: A) search-replace of function names with a more human-readable name, B) comments added. Both of these can be easily automated using a ReadabilityNet10.

Below are some programs generated by the neural programmer11. For each, there are 2 code dumps. The one on the left is the raw output from the neural programmer, and on the right is the exact same program but with the two readability edits describe earlier. You can click on the images below for full resolution code.

  1. 1st algorithms assignment: Input = {array sort, O(n log n)}

    code code
  2. 1st data structures assignment: Input = {binary tree preorder traversal}

    code code
  3. 2nd data structures assignment: Input = {linked list delete}

    code code
  4. 1st interview assignment: Input = {binary tree search, binary tree inorder traversal} – Here we have an under-specified input. Also there was no training program that did a traversal and search together in any data structure. So our expectation are low.

    code code

Each one of the generated programs is broken 🤖 . The non-obvious takeaway is that they are not hopelessly broken. In fact, the output is much more encouraging than I anticipated for v0 of the neural programmer 🤔 . Just the fact that anything resembling programs comes out is a big win!

Now, lets have some fun and confuse this programmer. Let us give it combination of data structures and algorithms, some that make sense to a human programmer and some less so.

  1. First, we look at a few inputs that would make sense to a human programmer. Below is the raw output for {linked list filter, linked list reverse} and {linked list reverse, linked list delete}. The only legible part of the output is the struct linked list definition at the top, followed by gibberish.

    code code
  2. Next, we give it input that would be insufficient even for a human programmer. Below is the raw output for {binary tree preorder traversal, linked list delete} and {linked list delete, array merge sort}. Neither of these specs makes sense without understanding of data structure combinations and need further elaboration. The output programs look like keywords hurled at a wall.

    code code

Next updates

The current neural programmer trains in 50 hours using 4 NVIDIA M60s and learns 14 algorithms over 3 data structures. It is a medium sized network. We use standard techniques (dropout, batch norm etc) to prevent overfitting. I believe that with 10x larger architecture, this network can learn all known algorithms and all known data structures and be able to mix and match them “intelligently”.

Related work

This is not the first attempt at program synthesis. In fact, there is a very rich body of literature going way back. I should know, I wrote a PhD on the topic12. The important consideration is the input to the synthesizer/programmer.

Counter-intuitively, we should aim for the input to be as vague as possible. My preference in decreasing order is Noise > English text > IO examples > Declarative formal specs > Program traces.

People have build some incredible systems all along the spectrum. Going from right to left: Program traces are great as input when we have observations on each step of some hidden process13 14 15 16 17. Declarative formal specs are helpful when building mission-critical applications or hardware12. Instead of writing specs, humans might find it easier to sketch programs18 and have the synthesizer fill out the holes in the code. Input-output IO examples are useful for cases where only end-to-end function can be observed19. Text comes closest to what professional programmers do in their day jobs. Noise is what they take as input on weekends. Some very interesting work has shown the feasibility of program synthesis from Text20 21 22 23 and Noise24.

Our neural programmer here takes Text as input. Our toy prototype shows feasibility of synthesis from keywords. Converting English problem definitions into keywords is straight-forward. The first attempt should be a simple statistical approach: extract coding-specific words (TF-IDF weighted). Later on, we could build a more elaborate net: mine text from programming books and online tutorials to (paragraphs → keyword) mappings. At that point synthesizing programs from text of programming contests, or interview questions, should be very much within reach.

Get involved

Follow us on sparkz_ai

Come work with us. Email saurabh@sparkz.org


References

  1. “Deep Learning Hyperparameter Optimization with Competing Objectives” 

  2. “Neural Architecture Search with Reinforcement Learning” 

  3. “Learning Transferable Architectures for Scalable Image Recognition” 

  4. AutoML: Using Machine Learning to Explore Neural Network Architecture 

  5. Francois Chollet (Keras) discusses program synthesis as important for the future of deep learning 

  6. SMT stands for Satisfiability Modulo Theories. Think of SMT as specialized theorem proving. Most SMT solvers are built using propositional SAT-solving using DPLL for search. They cleverly mix theory modules into the engine that reason about arithmetic, uninterpreted functions, memory, arrays and pointers. 

  7. Z3 is an industrial grade SMT solver available under the MIT license. Another equally compelling solver is CVC v3,4

  8. We have a hand-written C frontend right now. A python is to come very soon. A move to LLVM’s front-end is imminent.  2

  9. Reddit: “gcc -O4 sends your code to Jeff Dean for a complete rewrite” 

  10. It would be trivial to train a separate network that correlates ASTs to likely “local” names. We could mine Github for AST patters that go with function and variable names, and comments that show in the vicinity. 

  11. The obligatory caveat about cherry-picking holds. For each code fragment, the neural programmer generated 4 samples and I picked the one that “looked best”. When the programs start looking much much better, we’ll use the compiler to automatically “pick best” for us. Then if many compile equally well, we’ll use testing to get a numerical score for correctness. 

  12. In 2010, I showed that you could go from program verification to program synthesis, i.e., if automated systems can read programs, they can write them as well. This leveraged SMT-solvers, which around 2005 became industrial strength. CVC, Z3 are the front-runners. This line of work has decades long history going all way back to Manna and Waldinger ‘79 2

  13. Vinyals, Meire, and Navdeep proposed Pointer Networks, which they show can learn algorithms for convex hulls, triangulations, and TSP. They do this by using attention mechanisms to allow the output to contain references to the input. 

  14. In Neural GPUs learn algorithms Kaiser and Sutskever proposed a parallel architecture that can learn small bit algorithms, e.g., reversing, copying, counting, duplicating, multiplication. 

  15. In Hybrid computing using a neural network with dynamic external memory Graves et al. present the a differential neural computer (DNC) which has a differential memory and controller to allow end-to-end gradients to flow, allowing it to learn algorithms directly from data. 

  16. Zaremba et al. attempted Learning simple algorithms from examples by training a controller using Q-learning. Their RNN controller once trained can copy, reverse, walk, add, and do single digit multiplication. 

  17. Program synthesis by Sketching is a great technique that balances human and machine abilities to generate programs. The human writes the “broad-strokes” in the form of a sketch of the program with holes in it, while the SMT-based solver synthesizes all the difficult values that go in the holes. 

  18. Deepcoder defines a DSL over components and has a neural net search over combinations. Neuro-symbolic Program Synthesis recursively expands a program tree over a DSL while ensuring consistency against IO examples. Flashfill is the most widely used approach in this space, by virtue of being in every Excel install. Synthesizing string manipulating functions is getting to scale using Learning to Learn Programs from Examples 

  19. Neelakantan et al.’s Neural programmer learns a model that induces a program over a DSL, where the DSL operations reflect the query types anticipated. 

  20. Neural Enquirer trains a network to answer questions posed in English for table QA queries. It learns by supervision over end-to-end examples of queries. 

  21. In Learning to compose neural networks for question answering Andreas et al. describe ways to map text QA queries into executable functional programs to run as queries over structured data. 

  22. Program Synthesis Using Natural Language converts input English text into programs in a DSL for text manipulation. 

  23. Karpathy et al.’s char-rnn was a great demonstration of generating code-like text. It was obviously a syntactic learner by design. By introducing a prefix tag (e.g., the specific algorithm of each function) it could be extended to encode more semantics. Conceivably, by building a RNN variant of pinball networks we could bring theorem proving reasoning to it, and be at par with results of my prototype.