Analyzing Spark with Spark

When starting with a new codebase, we mostly want some idea about

  1. go-to guys
  2. important files
  3. related files (functionality wise)
  4. code flows and third party dependencies

This information does not need to be complete or precise; it is just a guide to the starting points. With this in mind, I stitched together a few programs (github) using Spark. As a test, I ran them on the Spark code base.


go-to guys
In an oversimplified version, we can assume that these are the people who have been closest to the code, meaning they have more commits than others. But "more commits" is relative, so we would like to see a histogram of commits and then run top-n queries. So all we have to do is read the git log, load it into a collection, plot a histogram of commit frequency, find the top n committers and, if possible, allow filtering by module name to find module experts.

With Spark this is as easy as

val logReader = new BufferedReader(new InputStreamReader(new FileInputStream("./data/detailed.spark.git.log")))
val commitIterator = GitLogProcessor.iterate(logReader)
val commitRDD = sc.parallelize(commitIterator.toArray)

For the histogram, where distSql is a query like select count(*) as colCount from $tableName group by

val doubleCounts = sqlContext.sql(distSql).map { row => row(0).asInstanceOf[Long].toDouble }.collect()
val stater = new DoubleRDDFunctions(sc.parallelize(doubleCounts))

The stats package helps us generate histograms. With a little customization of mine, which turns the driver web UI into an editable workbench, we can plot these histograms and run the top 10 query very easily. The top 10 committers when I ran the analysis (about a month ago) were

Matei Zaharia, Patrick Wendell, Reynold Xin, Tathagata Das, Ankur Dave, Mosharaf Chowdhury, Reynold Xin, Joseph E. Gonzalez, Prashant Sharma, Aaron Davidson

This result makes sense in a way: most of the Spark community knows most of these names, and they own one or more modules in the Spark ecosystem.
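
As a rough illustration of the steps above (count commits per author, build the frequency histogram, take the top n, optionally filter by module), here is a minimal sketch that uses plain Scala collections instead of RDDs so that it runs without a cluster. The Commit case class and all function names are assumptions made for this example; they are not the types produced by GitLogProcessor.

```scala
object CommitStats {
  // Hypothetical minimal commit record; a real parsed git log entry carries more fields.
  final case class Commit(author: String, files: Seq[String])

  // Number of commits per author.
  def commitCounts(commits: Seq[Commit]): Map[String, Int] =
    commits.groupBy(_.author).map { case (author, cs) => author -> cs.size }

  // Top n authors by commit count; ties are broken alphabetically for stable output.
  def topCommitters(commits: Seq[Commit], n: Int): Seq[(String, Int)] =
    commitCounts(commits).toSeq.sortBy { case (author, count) => (-count, author) }.take(n)

  // Histogram of commit frequency: commit count -> number of authors with that count.
  def histogram(commits: Seq[Commit]): Map[Int, Int] =
    commitCounts(commits).values.groupBy(identity).map { case (count, as) => count -> as.size }

  // Same ranking, restricted to commits touching a path that contains `module`.
  def moduleExperts(commits: Seq[Commit], module: String, n: Int): Seq[(String, Int)] =
    topCommitters(commits.filter(_.files.exists(_.contains(module))), n)
}
```

On an RDD, the per-author count would typically be a map to (author, 1) followed by reduceByKey(_ + _), with the final sort done on the much smaller result.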

important files
Not all files are equal. We will use the number of times a file has changed (code churn) as an indicator of its importance, with the commit RDD as our input.

val fileGroups = sqlContext.sql(s"select metrics from commits").map(row => row(0).asInstanceOf[Seq[String]]).collect()
val fileGroupsRDD = sc.parallelize(fileGroups.flatMap(identity).map{new Churn(_)})

The results are
SparkContext.scala, SparkBuild.scala, pom.xml, RDD.scala, DAGScheduler.scala, Utils.scala, BlockManager.scala, Executor.scala, PairRDDFunctions.scala, Broadcast.scala, ..., Master.scala

We have SparkContext, the entry point for everything interesting; DAGScheduler, for high-level scheduling; RDD, the distributed collection; and other pieces. Most Spark users would probably agree that these are some of the more important classes. At least one of the top 10 committers (Prashant Sharma, who is a colleague) seems to agree “in a way”; knowing him, “in a way” is as good as it gets.
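
The churn count itself is simple enough to sketch without Spark. The example below is only an illustration under assumptions: each commit is reduced to a list of file paths, and FileChurn is a made-up name that has nothing to do with the Churn class used above.

```scala
object FileChurn {
  // Each inner Seq holds the file paths touched by one commit.
  // A file is counted at most once per commit, and only its file name is kept.
  def churn(commitFiles: Seq[Seq[String]]): Map[String, Int] =
    commitFiles
      .flatMap(_.distinct)
      .map(path => path.substring(path.lastIndexOf('/') + 1))
      .groupBy(identity)
      .map { case (file, hits) => file -> hits.size }

  // The n most frequently changed files, most changed first (ties alphabetical).
  def mostChanged(commitFiles: Seq[Seq[String]], n: Int): Seq[String] =
    churn(commitFiles).toSeq
      .sortBy { case (file, count) => (-count, file) }
      .take(n)
      .map(_._1)
}
```

Keeping only the file name matches the way the results above are reported, at the cost of merging files that share a name across modules.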

related files (functionality wise)
By now most of you are bored and wondering whether I have ever heard of ETL and the comfort with which databases can run queries. Well, for the next part I need to run a frequent pattern algorithm. Let’s take FPGrowth and find the files that get changed together. The idea is that each commit contains a subset of files, and these files are likely to be related because a commit usually corresponds to one piece of functionality.

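// frequentPatterns is a placeholder name (not from the original code) for the
// space-separated file groups produced by FPGrowth
val fPairRDD = sc.parallelize(frequentPatterns.map(fPair => FPair(fPair, fPair.count(_ == ' ') + 1)))
// register this list of groups of files that are changed together as a table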

One of the frequent patterns is
ReflectionUtils.scala SparkSQLSessionManager.scala small_kv.txt beeline SparkSQLCLIService.scala SQLConfSuite.scala SparkSQLEnv.scala SparkSQLDriver.scala HiveThriftServer2.scala CliSuite.scala spark-sql TestUtils.scala SparkSQLCLIDriver.scala spark-shell.cmd scalastyle HiveThriftServer2Suite.scala SparkSQLOperationManager.scala commands.scala SQLConf.scala SparkSubmitArguments.scala spark-shell SparkSubmit.scala .gitignore HiveContext.scala HiveQuerySuite.scala SQLQuerySuite.scala run-tests pom.xml SparkBuild.scala

Most of these seem to be related to SparkSQL, with Hive thrown in between, which is actually pretty accurate. There are many such frequent patterns, and we could even run queries that filter on some of the important files we observed before. For example, I may want to find the frequent patterns for DAGScheduler. Since this is mapped as a table, it is as simple as

sqlContext.sql("select fileGroup from fpairs where groupSize > 5 order by groupSize desc").filter(row => row(0).asInstanceOf[String].contains("DAGScheduler")).collect

One of the results of this is

ApplicationPage.scala BlockManager.scala Utils.scala DAGScheduler.scala RDD.scala SparkContext.scala

For effective scheduling, knowledge of block locations and RDD lineage should help, so this group seems to make sense in a way.

For the above code, I could have used a UDF or filtered in SQL, but the way Spark lets me combine SQL with map and filter means I can choose how to express such filtering logic. Since sql returns a SchemaRDD, which is itself an RDD, the filter that follows is chained onto the query lazily, and nothing is executed until the collect stage.
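
To make the "files that change together" idea concrete, here is a deliberately simplified stand-in for FPGrowth: it only counts pairs of files that appear in the same commit and keeps the pairs seen in at least minSupport commits. FPGrowth finds larger groups and scales far better; this sketch, with its made-up names, is only meant to show the shape of the computation.

```scala
object CoChange {
  // For every unordered pair of files, the number of commits that touch both.
  def pairCounts(commitFiles: Seq[Seq[String]]): Map[(String, String), Int] =
    commitFiles
      .flatMap(files => files.distinct.sorted.combinations(2).map { case Seq(a, b) => (a, b) })
      .groupBy(identity)
      .map { case (pair, hits) => pair -> hits.size }

  // Pairs that co-occur in at least minSupport commits, most frequent first.
  def frequentPairs(commitFiles: Seq[Seq[String]], minSupport: Int): Seq[((String, String), Int)] =
    pairCounts(commitFiles).toSeq
      .filter { case (_, count) => count >= minSupport }
      .sortBy { case (pair, count) => (-count, pair) }

  // Files that frequently change together with the given file.
  def relatedTo(file: String, commitFiles: Seq[Seq[String]], minSupport: Int): Seq[String] =
    frequentPairs(commitFiles, minSupport).collect {
      case ((a, b), _) if a == file => b
      case ((a, b), _) if b == file => a
    }
}
```

Calling relatedTo with "DAGScheduler.scala" plays the same role as the contains("DAGScheduler") filter in the query above.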

code flows and third party dependencies
It would be great if we could look at the code and see who calls whom, the methods and the code flows. Things like these are better represented as a graph, and running the PageRank algorithm on it could also give us some insight into the more interesting methods. Well, with Spark, creating a graph is also easy. With the Spark jar as input and a few lines using ASM, we can create a graph.

val methodNodes = FileClassReader.processJar(sc.getClass.getProtectionDomain.getCodeSource.getLocation.getFile(), false)
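// calledMethods is a placeholder name (not from the original code) for the list of methods a node invokes
sMethodNodes.foreach { mN =>
  val mName = deriveName(mN.methodName)
  edgeRaw.addAll(mN.calledMethods.map(m => Edge(methodMap.get(m), methodMap.get(mName), "invokedBy")))
}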
val vertexRDD = sc.parallelize(methodMap.toList).map(x=> (x._2, x._1))
val edgeRDD = sc.parallelize(edgeRaw)
val graph = Graph(vertexRDD, edgeRDD)
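
For readers who have not met PageRank before, here is a minimal power-iteration sketch on a toy call graph, written with plain Scala collections. It is an illustration only: edges are assumed to be (caller, callee) pairs so that heavily called methods rank high, and the damping factor and iteration count are arbitrary defaults. On the real graph one would use the pageRank operation that GraphX provides rather than this loop.

```scala
object MiniPageRank {
  // edges are (caller, callee) pairs; rank flows from caller to callee.
  def pageRank(edges: Seq[(String, String)],
               iterations: Int = 20,
               damping: Double = 0.85): Map[String, Double] = {
    val nodes = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val outLinks = edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    val n = nodes.size.toDouble
    var ranks = nodes.map(node => node -> 1.0 / n).toMap
    for (_ <- 1 to iterations) {
      // Rank held by nodes without outgoing edges is spread evenly over all nodes.
      val dangling = nodes.filterNot(outLinks.contains).map(ranks).sum
      // Every node splits its current rank equally among the nodes it points to.
      val contributions = edges
        .map { case (src, dst) => dst -> ranks(src) / outLinks(src).size }
        .groupBy(_._1)
        .map { case (dst, cs) => dst -> cs.map(_._2).sum }
      ranks = nodes.map { node =>
        node -> ((1 - damping) / n + damping * (contributions.getOrElse(node, 0.0) + dangling / n))
      }.toMap
    }
    ranks
  }
}
```

The ranks always sum to 1, so they can be read as the share of "attention" each method receives from the rest of the code.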

The graph has about 0.7 million vertices and 1.2 million edges. I can now easily query whether Spark has a direct dependency on, say, the log4j packages, which is as simple as checking whether any edges connect a Spark vertex with a log4j vertex. No discussion of code flow is complete without a flowery picture of the code graph, so here is our flowery picture of methods from just the graphx package.
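
The dependency check described above boils down to a prefix test on both ends of each edge. A minimal sketch, assuming vertices are fully qualified method names and using a hypothetical Call case class instead of the GraphX Edge type:

```scala
object DependencyCheck {
  // A call edge between two fully qualified method names (caller invokes callee).
  final case class Call(caller: String, callee: String)

  // True if any method under fromPackage directly invokes a method under toPackage.
  def dependsOn(calls: Seq[Call], fromPackage: String, toPackage: String): Boolean =
    calls.exists(c => c.caller.startsWith(fromPackage) && c.callee.startsWith(toPackage))

  // The distinct callers responsible for that dependency, handy for drilling down.
  def dependents(calls: Seq[Call], fromPackage: String, toPackage: String): Seq[String] =
    calls.collect {
      case Call(caller, callee) if caller.startsWith(fromPackage) && callee.startsWith(toPackage) =>
        caller
    }.distinct
}
```

On the GraphX graph, the same predicate could be applied to the edge triplets, where both endpoint names are available alongside each edge.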


The nice part about Spark is that with small modifications to my existing code, I can make the same solution work for big data. Spark was created for big data, but because it offers a seamless transition between programmatic crunching, SQL querying and graph data, it is very effective for a wide array of analyses.

Mobile Web ListView for Bidirectional Infinite Scroll with Lazy loading

Recently we were exploring an infinite-scroll listview that would support thousands of items on mobile browsers without a performance impact in terms of memory and processor load. It should be similar to the listview/tableview implementations in the Android/iOS app frameworks. The web app should also be lightweight: no heavy framework, just components that serve the purpose without bloated code.

The technical requirements of listview that suits the purpose are:

  1. Bidirectional infinite scroll with lazy loading
  2. List elements should be recycled and fixed based on the device screen resolution
  3. Smooth animation of scroll and fling
  4. Lightweight and modular to fit into any framework
  5. Handle memory efficiently to work on low end devices as well
  6. When the list item is an image, thumbnail should be displayed
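
Requirements 1 and 2 come down to a small amount of arithmetic: from the scroll offset, work out which items are on screen, and map each of them onto a fixed pool of reusable elements sized from the screen height. The sketch below shows that arithmetic only. It assumes fixed-height items and a non-negative scroll offset, the names are made up, and it is not how RAD.js or any other library discussed here is implemented; the same calculation ports directly to JavaScript.

```scala
object RecyclingWindow {
  // Inclusive range of item indices that need an element for the current scroll position.
  // overscan = extra rows kept above and below the viewport so flings stay smooth.
  def visibleRange(scrollTop: Int, viewportHeight: Int, itemHeight: Int,
                   itemCount: Int, overscan: Int = 2): Range = {
    require(itemHeight > 0, "itemHeight must be positive")
    val first = math.max(0, scrollTop / itemHeight - overscan)
    val last = math.min(itemCount - 1, (scrollTop + viewportHeight - 1) / itemHeight + overscan)
    first to last
  }

  // Fixed number of reusable elements needed for a given screen height:
  // rows that fit, plus one for a partially visible row, plus the overscan rows.
  def poolSize(viewportHeight: Int, itemHeight: Int, overscan: Int = 2): Int =
    (viewportHeight + itemHeight - 1) / itemHeight + 1 + 2 * overscan

  // Which pooled element displays a given item; items `pool` apart reuse the same element.
  def slotFor(itemIndex: Int, pool: Int): Int = itemIndex % pool
}
```

Because the visible range never holds more items than the pool, two items on screen at the same time never compete for the same element, and memory use stays constant however long the list grows.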

Several JavaScript frameworks were available, but not many matched the above criteria. Hence we had a choice between writing our own listview and finding one that fit our requirements.

After filtering these frameworks, we narrowed our focus to the JavaScript libraries that support a listview with infinite scroll:

  1. infinity.js by Airbnb
  2. JQuery iScroll infinite
  3. RAD.js infinite scroll

Infinity.js was not suited for mobile as it was heavy both in terms of the framework dependency and memory.

JQuery iScroll infinite was considered good. But on further analysis and usage, we realized that the code is not easily customizable. Hence we had to abandon this approach.

Lastly, the RAD.js toolkit gave us the flexibility to reuse just a listview without the burden of a heavy framework, and the ability to modify the code suited our constraints well. RAD.js covers most aspects of a mobile-friendly listview with infinite scrolling.

The table below captures the observations of running the RAD.js integrated listview on various mobile platforms and browsers.

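Platform  Version         Device                 Browser          Supported
iOS       8.x             iPhone 5S              Safari           Yes
iOS       7.x             iPhone 4               Safari           Yes
iOS       7.x             iPad                   Safari           Yes
Android   5.0 (L)         LG Nexus 5             Chrome           Yes
Android   5.0 (L)         LG Nexus 5             Firefox          Yes
Android   5.0 (L)         LG Nexus 5             Android Browser  N/A
Android   4.4.2 (KitKat)  Samsung Galaxy S4      Chrome           Yes
Android   4.4.2 (KitKat)  Samsung Galaxy S4      Firefox          Yes
Android   4.4.2 (KitKat)  Samsung Galaxy S4      Android Browser  Yes
Android   4.1.2 (JB)      Samsung S2             Chrome           Yes
Android   4.1.2 (JB)      Samsung S2             Firefox          Yes
Android   4.1.2 (JB)      Samsung S2             Android Browser  Yes*
Android   4.0.2 (ICS)     Samsung Galaxy Tab     Chrome           Yes
Android   4.0.2 (ICS)     Samsung Galaxy Tab     Firefox          Yes
Android   4.0.2 (ICS)     Samsung Galaxy Tab     Android Browser  Yes*
Android   2.3.3 (GB)      LG-Optimus/HTC Aspire  Firefox          Yes
Android   2.3.3 (GB)      LG-Optimus/HTC Aspire  Android Browser  Yes
Android   2.3.3 (GB)      LG-Optimus/HTC Aspire  Chrome           N/A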


*The CSS transform used for scroll animation during rearrangement of list items requires hardware acceleration, and the stock Android browsers on ICS and JB fall short on this front. An alternative strategy has to be found to support these browsers.


Moving to the Cloud

Cloud can save you time and money; it also has the potential to change the way you do business.

View this infographic to know more…
For more information,


Key highlights related to Mobile Technology

Here are a few highlights that are related to Mobile Technology. You will see some interesting statistics in this infographic.

Read on…


Key Cloud Computing Statistics


In the infographic, we find some key cloud computing statistics that highlight the growth and adoption trends of this strategic technology.

Learn more about Cloud spending, Cloud adoption, Cloud data and Cloud impacts from this infographic.

Cloud Computing

Thanks to


Android App development – 5 points to consider

Read some cool tips and tricks in Android App development from our experts in the following infographic.

We worked with a social shopping company to create an Android app to accompany their already successful iPhone version. Working with their amazingly talented team, we guided them towards creating an app that genuinely adapted to the Android platform instead of being a mere copy of the iPhone version.

5 points for developing Android App

Using PaaS and going beyond

In this infographic, you will find details of a survey that suggests a rapidly growing demand for PaaS architecture from organizations looking for faster development and deployment cycles.
Please go through the infographic in detail …


Using PaaS - results from a Survey

For more details on the solution, you can visit


Test Link setup on Windows:



Configuration Setup:


Old path:

$tlCfg->log_path = '/var/testlink/logs/';

$g_repositoryPath = '/var/testlink/upload_area/';


Updated path:

$tlCfg->log_path = 'D:/xampp/htdocs/testlink-1.9.7/logs/';

$g_repositoryPath = 'D:/xampp/htdocs/testlink-1.9.7/upload_area/';


Cloud Security – Part 2: “Security with multiple tenants sharing same infrastructure”

Welcome to part 2 of the blog series ‘Cloud Security’. In Part 1 of the series, we raised some important questions about security in the cloud. In this post, we would like to answer one of the most important questions we encounter when we talk about cloud security.

“How secure is my data when multiple tenants share the same infrastructure?”

Well, this is a tricky question that keeps cropping up. In this post, we put the different components in perspective to see which areas need to be addressed. First of all, why do multiple tenants share the same infrastructure? The answer is that organizations want price and performance advantages, and thus end up sharing the same infrastructure.

Let us understand the term ‘multi-tenancy’. It simply means that many tenants share the same resources, which turns out to be very efficient and scalable. In IaaS, tenants share infrastructure resources like hardware, servers, and data storage devices. In SaaS, tenants use the same application, which means that the data of multiple tenants is likely stored in the same database and may even share the same tables. When it comes to security, the risks of multi-tenancy must be addressed at all layers.


Shared Premises / Shared Data centers:

In a ‘shared premises’ context, a dedicated rack is the safest unit you can own. However, we need to ensure that the power cables are secure and that redundant power paths are available. Likewise, we should check that the network cables are secure and that redundant network paths are available. A point to note is that the rack is always locked, and cameras monitor the rack and can play back footage for a determined period of time.



In a ‘shared racks’ context, by contrast, there is always an element of risk, as multiple tenants have access to the rack. Ideally, it would be a managed service with access given only to the service provider. Doing so ensures that untrained or semi-trained hands cannot affect the services of a co-tenant.

Shared Hardware:

In an instance where one cannot afford dedicated hardware, one has to settle for one of the following:

Out of the above, a separate VM is the next most secure option. To ensure that the VM is secure, we first need to encrypt the VM image and make sure that a BIOS password is in force so that no one can tamper with the boot order. For additional security, we need to ensure that a boot loader password is in force.


Looking at the shared hardware scenario, there are other elements we need to be careful about, such as disks, processors, memory, hypervisors, etc.

Let us look at each one of them in detail:


First, we need to ensure that the disk is encrypted with a key recorded by the administrator, and that no user-end encryption is enabled. This arrangement is common, and it is there to facilitate data recovery in case the employee is not available to recover sensitive or important data. Another important security measure is to dispose of or reassign a disk only after due cleanup.


We need to ensure that the processor has a secure ring architecture, so that the hypervisor operates in a higher security zone than the VMs.


When multiple tenants share the same infrastructure, we need to check the OS especially carefully for extra security. We need to provide jails / chrooted environments for the different tenants, so that one cannot see the other’s data.


A hypervisor or virtual machine monitor (VMM) is a piece of computer software, firmware or hardware that creates and runs virtual machines. A hypervisor’s main job is to map traffic from VMs to the underlying VM host hardware so that it can make its way through the data center and out to the Internet, and vice versa. As the hypervisor intercepts all traffic between VMs and VM hosts, it is the natural place to introduce segmentation for the resources of IaaS tenants whose VMs might be housed within the same VM host or VM host cluster. We should not give the VMs direct access to any devices.


Another major security concern in virtualized infrastructure is that the machines are owned by different customers. This raises the risk of many types of breaches, such as unauthorized connection monitoring, unmonitored application login attempts, and malware propagation.

VM segmentation and isolation are an absolute requirement for VMs containing regulation- and compliance-sensitive data such as employee details, customer information, etc. Most regulatory mandates and standards, such as ISO 27001, SAS 70, Payment Card Industry Data Security Standard (PCI DSS), SSAE 16 and Health Insurance Portability and Accountability Act (HIPAA), require that access be limited to a business’s need to know, and that control policies be put in place to block unwarranted access.

Hope this post has answered the question completely. If you have any further queries, do not hesitate to contact us. You can also comment / share your observations about the topic here.

We are waiting…

Security in the Cloud – Part 1

Security in the Cloud 

We will publish a series of blog posts on Cloud Security. This is the first blog post in the series.

One of the ‘security-as-a-service’ providers conducted a survey of their 2,200 customers about cyber-attacks. The results are startling: they reveal that cyber-attacks on cloud environments are increasing at an alarming rate as more and more enterprises move their data to the public cloud. According to the report, as enterprises transfer their data and processing activities to the cloud, traditional on-premises cyber-attacks have moved to the cloud as well. The report highlights a 14 percentage point year-on-year increase in brute force attacks, while vulnerability scans on cloud setups have risen by 17 percentage points year-on-year. More info about the report can be found here.


Indeed, enterprises and businesses have always been reluctant to move away from traditional IT and adopt the cloud model. They have always been skeptical about data security, and their doubt about whether data is protected to the same level as in an on-premises setup is genuine.


This topic brings us to a very important point: who controls the data that is hosted in the cloud? Before the public cloud came into the picture, enterprise data was safe within the premises and IT had complete control over it. Now, with the cloud, data is under the organization’s control, but it rests elsewhere physically and is managed by someone else.


Questions such as the following arise:


You can share your answers / ideas / solutions in the comment box.

We are waiting for your response…