Code is written for machines to execute and for humans to maintain. To be maintainable, a piece of code must be understandable by humans. This is conventional wisdom in the software engineering world, and it shows up as an emphasis on coding conventions, human-readable comments throughout the code, and an idiomatic approach to software development (a.k.a. design patterns). This in turn implies that the software people develop and maintain tends to have a certain “repetitive and predictable structure”. [1]

In the words of Hindle et al. [2]:

“Programming languages, in theory, are complex, flexible and powerful, but programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models and leveraged for software engineering tasks.”
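To make "predictable statistical properties" concrete, here is a minimal sketch of a statistical language model over code tokens. It is a toy bigram model with add-one smoothing, not the setup used in the paper; the tokenizer, the tiny training snippet and the names `tokenize` and `BigramModel` are all assumptions made for illustration. Cross-entropy (in bits per token) measures how surprised the model is by a token sequence: lower means more predictable.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Crude lexer (an assumption): identifiers, integers, or any single
    # non-space character. Multi-character operators such as "+=" are split.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


class BigramModel:
    """Toy bigram model with add-one (Laplace) smoothing."""

    def __init__(self, tokens):
        self.vocab = set(tokens)
        # Count each token only where it has a successor, so the counts
        # line up with the bigram counts.
        self.unigrams = Counter(tokens[:-1])
        self.bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(self, prev, tok):
        # The extra +1 in the vocabulary size reserves probability mass
        # for tokens never seen during training.
        v = len(self.vocab) + 1
        return (self.bigrams[(prev, tok)] + 1) / (self.unigrams[prev] + v)

    def cross_entropy(self, tokens):
        # Average number of bits needed per token transition.
        pairs = list(zip(tokens, tokens[1:]))
        return -sum(math.log2(self.prob(p, t)) for p, t in pairs) / len(pairs)


# Hypothetical training "corpus": two near-identical loops.
train = """
for i in range(n):
    total += a[i]
for j in range(m):
    count += b[j]
"""
model = BigramModel(tokenize(train))

# A new loop written in the same idiom, and the same kind of tokens
# in an order nobody would write.
familiar = tokenize("for k in range(n):\n    total += a[k]")
scrambled = tokenize("] range += for ( a : in n total k ) [ k")

h_familiar = model.cross_entropy(familiar)
h_scrambled = model.cross_entropy(scrambled)
print(f"familiar:  {h_familiar:.2f} bits/token")
print(f"scrambled: {h_scrambled:.2f} bits/token")
```

Even with a training set this small, the idiomatic loop scores a lower cross-entropy than the scrambled tokens. With a realistic corpus and a higher-order model the gap would be expected to widen, which is the property the quote refers to.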

Now combine these two characteristics of software (a repetitive and a predictable structure) with two recent trends, and these insights become actionable.

First, over the last five years or so there has been a proliferation of online collaborative development platforms such as GitHub, which have made an increasingly large number of code corpora publicly available. This data can now serve as the building blocks for complex statistical models that learn useful (in a sense yet to be defined) repetitive patterns across programming languages, domains, and even library usage.

Second, and almost simultaneously, distributed computing platforms and big data technology stacks have evolved and matured, making statistical modelling of terabytes of data more tractable.

Putting these proverbial two and two together, it can be argued that the time is now ripe to apply machine learning (and NLP in particular) to this huge amount of publicly available software and deliver solutions that ease software development tasks.

KodeBeagle at Imaginea is a step in this direction.

[The next few posts will talk about the current workings of KodeBeagle, what’s currently in development and our future vision for it.]


[1] Abram Hindle, Earl Barr, Mark Gabel, Zhendong Su, Prem Devanbu, “On the Naturalness of Software”.
[2] Ibid.