To get started with a new codebase, we mostly want some idea about

  1. go to guys
  2. important files
  3. related files (functionality wise)
  4. code flows and third party dependencies

This information does not need to be complete or precise; it just needs to suggest starting points. With this in mind, I stitched together a few programs (github) using Spark. As a test, I ran them on the Spark codebase itself.


go to guys

In an over-simplified view, these are the people who have been closest to the code, meaning they have more commits than the others. But "more commits" is relative, so we would like to see a histogram of commits and then run top-n queries. All we have to do is read the git log, load it into a collection, plot a histogram of commit frequency, find the top n committers and, if possible, allow filtering by module name to find module experts.

With Spark this is easy. For the histogram, distSql is a query like select count(*) as colCount from $tableName group by ...
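The original snippet is not shown here, so below is a hedged sketch of the shape of the computation on plain Scala collections; the log format (one author per commit, as from git log --pretty=format:%an) and the sample data are assumptions, not from the post. On Spark the same chain runs on an RDD, and distSql would run over the commits registered as a table.

```scala
// Sample stand-in for the parsed git log: one author name per commit.
val gitLog = Seq(
  "Matei Zaharia", "Reynold Xin", "Matei Zaharia",
  "Patrick Wendell", "Matei Zaharia", "Reynold Xin")

// commits per author; on Spark: logRdd.map((_, 1)).reduceByKey(_ + _)
val commitCounts: Map[String, Int] =
  gitLog.groupBy(identity).map { case (a, cs) => a -> cs.size }

// top-n committers; on Spark: countsRdd.top(n)(Ordering.by(_._2))
def topN(counts: Map[String, Int], n: Int): Seq[(String, Int)] =
  counts.toSeq.sortBy { case (_, c) => -c }.take(n)

// histogram of commit frequency: bucket -> number of authors in that bucket
def commitHistogram(counts: Map[String, Int], bucketSize: Int): Map[Int, Int] =
  counts.values.groupBy(c => (c / bucketSize) * bucketSize)
    .map { case (b, cs) => b -> cs.size }

println(topN(commitCounts, 2)) // highest-commit authors first
```

The histogram tells us where the "more commits than others" cutoff actually sits before we pick n.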

The stats package helps with generating histograms, and a small customization of mine turns the driver web UI into an editable workbench, so we can plot these histograms and run the top-10 query very easily. The top 10 committers when I ran the analysis (about a month ago) were:

Matei Zaharia, Patrick Wendell, Reynold Xin, Tathagata Das, Ankur Dave, Mosharaf Chowdhury, Reynold Xin, Joseph E. Gonzalez, Prashant Sharma, Aaron Davidson

This result makes sense in a way: most of the Spark community knows most of these names, and they own one or more modules in the Spark ecosystem.

important files
All files are not equal. We will use the number of times a file has changed (its code churn) as an indicator of importance, with the commit RDD as our input.
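The post's code isn't shown, so here is a hedged local sketch of the churn count; the Commit case class and the data are made-up stand-ins for the commit RDD. On Spark the same thing is commitRdd.flatMap(_.files).map((_, 1)).reduceByKey(_ + _) followed by a top-n.

```scala
// Stand-in for one entry of the commit RDD (shape is an assumption).
case class Commit(author: String, files: Seq[String])

val commits = Seq(
  Commit("a", Seq("SparkContext.scala", "RDD.scala")),
  Commit("b", Seq("RDD.scala")),
  Commit("c", Seq("SparkContext.scala", "DAGScheduler.scala", "RDD.scala")))

// churn = number of commits that touched each file
def churn(cs: Seq[Commit]): Map[String, Int] =
  cs.flatMap(_.files).groupBy(identity).map { case (f, xs) => f -> xs.size }

// most-churned files first
def topChurn(cs: Seq[Commit], n: Int): Seq[(String, Int)] =
  churn(cs).toSeq.sortBy { case (_, c) => -c }.take(n)

println(topChurn(commits, 2))
```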

The results are:
SparkContext.scala, SparkBuild.scala, pom.xml, RDD.scala, DAGScheduler.scala, Utils.scala, BlockManager.scala, Executor.scala, PairRDDFunctions.scala, Broadcast.scala, ..., Master.scala

We have SparkContext, the entry point for everything interesting; the DAGScheduler for high-level scheduling; RDD, the distributed collection; and other pieces. Most Spark users would agree that these are some of the more important classes (at least one of the top 10 committers, Prashant Sharma, who is a colleague, seems to agree "in a way"; knowing him, "in a way" is as good as it gets).

related files (functionality wise)
By now most of you are bored and wondering whether I have ever heard of ETL and the comfort with which databases can run such queries. Well, for the next part I need to run a frequent-pattern algorithm: let's take FPGrowth and find the files that get changed together. The idea is that each commit contains a subset of files, and these files are likely to be related, since there is a direct relationship between a commit and a piece of functionality.
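The post feeds the per-commit file lists to MLlib's FPGrowth (org.apache.spark.mllib.fpm.FPGrowth, which takes an RDD of transactions and a minimum support). As a self-contained, hedged stand-in, the sketch below mines only frequent pairs of co-changed files, the simplest version of the same idea; the commit data is made up.

```scala
// Each commit is a transaction: the set of files it touched (sample data).
val commits = Seq(
  Set("HiveContext.scala", "SQLConf.scala", "pom.xml"),
  Set("HiveContext.scala", "SQLConf.scala"),
  Set("RDD.scala", "pom.xml"))

// all unordered file pairs inside one commit
def filePairs(files: Set[String]): Seq[(String, String)] =
  files.toSeq.sorted.combinations(2).collect { case Seq(a, b) => (a, b) }.toSeq

// pairs co-changed in at least minSupport commits
// (FPGrowth generalizes this to frequent itemsets of any size)
def frequentPairs(cs: Seq[Set[String]], minSupport: Int): Map[(String, String), Int] =
  cs.flatMap(filePairs).groupBy(identity)
    .map { case (p, xs) => p -> xs.size }
    .filter { case (_, n) => n >= minSupport }

println(frequentPairs(commits, 2))
```

Minimum support plays the same role here as FPGrowth's setMinSupport: it throws away coincidental co-changes.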

One of the frequent patterns is
ReflectionUtils.scala SparkSQLSessionManager.scala small_kv.txt beeline SparkSQLCLIService.scala SQLConfSuite.scala SparkSQLEnv.scala SparkSQLDriver.scala HiveThriftServer2.scala CliSuite.scala spark-sql TestUtils.scala SparkSQLCLIDriver.scala spark-shell.cmd scalastyle HiveThriftServer2Suite.scala SparkSQLOperationManager.scala commands.scala SQLConf.scala SparkSubmitArguments.scala spark-shell SparkSubmit.scala .gitignore HiveContext.scala HiveQuerySuite.scala SQLQuerySuite.scala run-tests pom.xml SparkBuild.scala

Most of these seem to be related to Spark SQL, with Hive thrown in between, which is actually pretty correct. There are many such frequent patterns, and we could even run queries to filter on the basis of some of the important files we observed before. For example, I may want to find the frequent patterns for DAGScheduler; since the patterns are mapped as a table, this is a simple query.
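The query itself is not shown in the post; with the patterns registered as a table (registerTempTable in the Spark 1.x API), it would be a select plus a filter on the file name. A hedged local sketch of the same filter, over made-up patterns:

```scala
// Sample frequent patterns (each a list of co-changed files); made up.
val patterns = Seq(
  Seq("ApplicationPage.scala", "BlockManager.scala", "Utils.scala",
      "DAGScheduler.scala", "RDD.scala", "SparkContext.scala"),
  Seq("HiveContext.scala", "SQLConf.scala", "pom.xml"))

// patterns that mention a given file; with Spark SQL this is roughly
//   sql("select files from patterns").filter(_.getString(0).contains(file))
def patternsWith(file: String): Seq[Seq[String]] =
  patterns.filter(_.exists(_.contains(file)))

println(patternsWith("DAGScheduler"))
```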

One of the results of this is

ApplicationPage.scala BlockManager.scala Utils.scala DAGScheduler.scala RDD.scala SparkContext.scala

For effective scheduling, knowledge of the block locations and the RDD lineage should help, so this group seems to make sense in a way.

For the above code, I could have used a UDF or filtered in SQL, but since Spark lets me combine SQL with map and filter, I can choose how to express such filtering logic. And because the SQL runs on a SchemaRDD, Spark understands the filter operation chained after the query and can optimize the execution up to the collect stage.

code flows and third party dependencies
It would be great if we could look at the code and see who calls whom, the methods and the code flows. Things like these are better represented as a graph, and if we ran the PageRank algorithm on it, that could also give us some insight into the more interesting methods. With Spark, creating a graph is easy too: with the Spark jar as input and a few lines using ASM, we can build a call graph.
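GraphX can run this directly (graph.pageRank(tol) once the call graph is built). As a hedged, self-contained illustration, here is the classic power-iteration PageRank on a tiny made-up method-call graph, using the same scheme as Spark's own PageRank example (ranks start at 1.0, damping factor 0.85); the method names are invented.

```scala
// Toy call graph: (caller, callee) edges between methods; names made up.
val calls = Seq(
  ("Foo.run", "Util.log"), ("Bar.run", "Util.log"),
  ("Util.log", "Util.fmt"))

def pageRank(edges: Seq[(String, String)], iters: Int = 20): Map[String, Double] = {
  val nodes = (edges.map(_._1) ++ edges.map(_._2)).distinct
  val out = edges.groupBy(_._1).map { case (s, es) => s -> es.map(_._2) }
  var ranks = nodes.map(_ -> 1.0).toMap
  for (_ <- 1 to iters) {
    // each node sends rank/outDegree to its callees
    val contribs = out.toSeq
      .flatMap { case (s, dsts) => dsts.map(d => d -> ranks(s) / dsts.size) }
      .groupBy(_._1).map { case (n, xs) => n -> xs.map(_._2).sum }
    ranks = nodes.map(n => n -> (0.15 + 0.85 * contribs.getOrElse(n, 0.0))).toMap
  }
  ranks
}

val methodRanks = pageRank(calls)
// heavily-called utility methods float to the top of the ranking
```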

The graph has about 0.7 million vertices and 1.2 million edges. But I can easily query whether Spark has a direct dependency on, say, the log4j packages, which is as simple as checking whether any edges connect a Spark vertex to a log4j vertex. No discussion of code flow is complete without a flowery picture of the code graph, so here is our flowery picture of the methods from just the graphx package.
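The log4j check above can be sketched on a plain edge list; the sample edges are made up, and in the real pipeline they come from ASM (a ClassReader driving a MethodVisitor whose visitMethodInsn callback reports each callee's owner class). On GraphX the same test is a filter over graph.triplets.

```scala
// (caller class, callee class) edges in JVM internal-name form; sample data.
val edges = Seq(
  ("org/apache/spark/Logging", "org/apache/log4j/LogManager"),
  ("org/apache/spark/rdd/RDD", "org/apache/spark/SparkContext"))

// is there any direct edge from one package prefix into another?
// GraphX version (roughly):
//   graph.triplets.filter(t => t.srcAttr.startsWith(from) &&
//                              t.dstAttr.startsWith(to)).count > 0
def dependsOn(es: Seq[(String, String)], from: String, to: String): Boolean =
  es.exists { case (src, dst) => src.startsWith(from) && dst.startsWith(to) }

println(dependsOn(edges, "org/apache/spark", "org/apache/log4j")) // true
```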


The nice part about Spark is that, with small modifications to my existing code, I can make the same solution work for big data. Spark was created for Big Data, but since it offers a seamless transition between programmatic crunching, SQL querying and graph data, it is very effective for a wide array of analyses.