Problem Statement

Cluster the GitHub repositories and find the influential (important) actors using centrality.

Data Set

The dataset is in the form of JSON dumps of GitHub activity of various repositories and users. Please find the sample files here

Solution

The solution attempts to cluster the repositories based on their description and language using the k-means algorithm, and finds the important actors in the repository clusters using eigenvector centrality.

Why Clustering

Organizing data into clusters reveals the internal structure of the data, such as common interests (for example, repositories in the machine learning field).

Algorithm for parsing the data for clustering

  1. Parse the GitHub JSON data and insert it into the repository table.
  2. Delete repositories based on the following conditions (outlier removal):
    1. delete from repository where length(description) < 5;
    2. delete from repository where description like '%my first%';
    3. delete from repository where length(description) < 30 and description like '%test%';
    4. delete from repository where description like '%first%' and length(description) < 50;
    5. delete from repository where name like '%demo%';
    6. delete from repository where name like '%sample%' and length(description) < 50;
    7. delete from repository where description like '%sample app%' and length(description) < 50;
  3. Retrieve the repository descriptions from the repository table and clean the data (a sketch follows after this list):
    1. split each description into words using the characters ( _-/,)
    2. remove the special characters from the words
    3. remove stop words like the, a, an, etc.
    4. stem the words using the NLTK Python library
  4. Insert all these words into the keywords table.
  5. Select the distinct words and their counts from the keywords table and insert them into the clean_keywords table.
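
A minimal sketch of the cleaning in step 3, assuming the descriptions have already been read from the repository table; the function name and regular expressions are illustrative, and the NLTK stopword corpus has to be downloaded once with nltk.download('stopwords').

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))   # requires nltk.download('stopwords')

def clean_description(description):
    # split the description into words on the characters ( _-/,)
    words = re.split(r'[ _\-/,]+', description.lower())
    cleaned = []
    for word in words:
        # remove any remaining special characters
        word = re.sub(r'[^a-z0-9]', '', word)
        # drop empty tokens and stop words (the, a, an, ...)
        if word and word not in stop_words:
            # stem the word with NLTK's Porter stemmer
            cleaned.append(stemmer.stem(word))
    return cleaned

print(clean_description("My machine-learning/NLP sample_app for text classification"))
```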

Implementation

The Python implementation of the above algorithm can be found here

Algorithm for preparing the input (sparse matrix) for the k-means algorithm

  • Calculate the term frequency and document frequency of every word.
  • Prepare a sparse matrix of size (number of repositories) x (number of words).
  • Iterate over the repositories:
    • get the words with document frequency > 5, plus the repository language
    • calculate the TF-IDF using the formula tf*log(N/df)

      tf – term frequency (number of times the word appears in the repository)
      df – document frequency (number of repositories containing the word)
      N – number of repositories
    • make an entry in the sparse matrix at (repoidIndex, wordIndex) with the TF-IDF value

Run the k-means algorithm on the resulting matrix (a sketch follows below).
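
A rough sketch of the matrix construction and clustering step, under the assumption that repo_words maps each repository index to its cleaned word list and doc_freq holds the document frequency of each word (both built from the keyword tables); the helper names and the value of k are illustrative.

```python
import math
from scipy.sparse import lil_matrix
from scipy.cluster.vq import kmeans2

def build_tfidf_matrix(repo_words, doc_freq, vocabulary):
    n_repos = len(repo_words)
    word_index = {word: i for i, word in enumerate(vocabulary)}
    matrix = lil_matrix((n_repos, len(vocabulary)))
    for repo_idx, words in enumerate(repo_words):
        for word in set(words):
            df = doc_freq[word]
            if df <= 5:                      # keep only words with document frequency > 5
                continue
            tf = words.count(word)           # term frequency within this repository
            # TF-IDF = tf * log(N / df)
            matrix[repo_idx, word_index[word]] = tf * math.log(n_repos / df)
    # the repository language could be appended as an extra feature column here
    return matrix

def cluster_repositories(matrix, k=20):
    # SciPy's k-means works on dense arrays, so the sparse matrix is densified here
    centroids, labels = kmeans2(matrix.toarray(), k, minit='points')
    return labels
```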

Implementation

The Python implementation of the above algorithm can be found here. The database scripts can be found here

Used Libraries

SciPy for clustering and the sparse matrix
NumPy for manipulating the data
NLTK for stemming

Results

Please find the clustering results here

Observations

1. People used JavaScript, CoffeeScript, Ruby, Objective-C, and Haxe for developing Chrome extensions. The most used language is JavaScript.

2. People used PHP, Python, Java, Shell, and Ruby for developing OpenShift applications.

3. Some cluster graphs are not connected (connectivity was checked by creating a graph of the repository owners), so we can recommend that users in a cluster follow these repositories and repository owners to learn more about their shared interests.
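
A minimal sketch of that connectivity check, assuming follow_edges holds the (follower, followed) pairs among the owners of one cluster's repositories:

```python
import networkx as nx

def cluster_is_connected(owners, follow_edges):
    # undirected graph over the repository owners of one cluster
    graph = nx.Graph()
    graph.add_nodes_from(owners)
    graph.add_edges_from(follow_edges)
    return nx.is_connected(graph)
```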

Eigenvector Centrality

Eigenvector centrality is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. Google’s PageRank is a variant of the Eigenvector centrality measure.
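
In the standard formulation, the centrality x_v of a node v is proportional to the sum of the centralities of its neighbours (for a follower graph, its in-neighbours):

```latex
x_v = \frac{1}{\lambda} \sum_{u \in N(v)} x_u
\qquad\Longleftrightarrow\qquad
A\,x = \lambda\,x
```

Here A is the adjacency matrix of the network and λ its largest eigenvalue, so the centrality scores are given by the principal eigenvector of A.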

Algorithm for creating the whole user connected graph for calculating centrality

  1. Take the cluster of similar repositories (read from file) and find their owners.
  2. Add the repository owners to the unExploredQueue.
  3. Loop over the unExploredQueue until it is empty.
  4. Add the user to the explored map if the user is not yet explored.
  5. Find the following and followers for the user.
  6. Add these nodes to the graph.
  7. Create incoming edges from the user's followers, e.g. if user1 has followers user2 and user3, then the edges (user2, user1) and (user3, user1) are added to the graph.
  8. Create outgoing edges to the users the user follows, e.g. if user1 follows user2 and user3, then the edges (user1, user2) and (user1, user3) are added to the graph.
  9. The loop ends when the unExploredQueue is empty.
  10. Find the eigenvector centrality of the created user connected graph using the Python networkx library (see the sketch after this list).
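
A sketch of this construction, assuming hypothetical helpers get_followers(user) and get_following(user) that return lists of user names from the parsed GitHub data; only the seed owners are drained from the queue here, exactly as in the steps above.

```python
from collections import deque
import networkx as nx

def build_user_graph(repo_owners, get_followers, get_following):
    graph = nx.DiGraph()
    explored = set()
    unexplored_queue = deque(repo_owners)
    while unexplored_queue:                      # loop until the queue is empty
        user = unexplored_queue.popleft()
        if user in explored:                     # skip already explored users
            continue
        explored.add(user)
        # incoming edges: follower -> user (nodes are added automatically)
        graph.add_edges_from((follower, user) for follower in get_followers(user))
        # outgoing edges: user -> followed user
        graph.add_edges_from((user, followed) for followed in get_following(user))
    return graph

# eigenvector centrality of the user connected graph
# (for directed graphs networkx uses the in-edges, i.e. the followers)
# centrality = nx.eigenvector_centrality(graph, max_iter=1000)
```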

Algorithm for creating the user connected graph for a single cluster for calculating centrality

  • Take the cluster of similar repositories (read from file), find its owners and add them to the userSet.
  • For every user in the userSet, find the users that user follows and add them to the clusteredUserSet.
  • For every user in the clusteredUserSet,
    • retrieve the users they follow and add them to the followinglist
    • for each followuser in the followinglist, check whether it is present in the clusteredUserSet; if so, create the outgoing edge (user, followuser) and add it to the graph, using the same follower-to-followed convention as in the whole-graph algorithm (a sketch follows below)
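
A sketch of the per-cluster graph construction, reusing the hypothetical get_following(user) helper from the previous sketch:

```python
import networkx as nx

def build_cluster_graph(cluster_owners, get_following):
    # users followed by the owners of the cluster's repositories
    clustered_users = set()
    for owner in cluster_owners:
        clustered_users.update(get_following(owner))

    graph = nx.DiGraph()
    graph.add_nodes_from(clustered_users)
    for user in clustered_users:
        for followed in get_following(user):
            # keep only edges that stay inside the cluster's user set
            if followed in clustered_users:
                graph.add_edge(user, followed)   # follower -> followed
    return graph
```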

Used Libraries

NetworkX for calculating the centrality
Matplotlib for plotting the graphs

Implementation

The Python implementation of the above algorithms can be found here

Results

Please find the results here

Generated Plots

Eigenvector centrality graph for whole user connected graph

Degree centrality graph for whole user connected graph

Plot Properties

The X-axis indicates the centrality value.
The Y-axis indicates the number of followed users.
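
A minimal Matplotlib sketch matching these plot properties, assuming centrality maps each user to their eigenvector centrality and followed_counts maps each user to the number of followed users; both names are illustrative.

```python
import matplotlib.pyplot as plt

def plot_centrality(centrality, followed_counts):
    users = list(centrality)
    x = [centrality[user] for user in users]
    y = [followed_counts.get(user, 0) for user in users]
    plt.scatter(x, y)
    plt.xlabel('Centrality value')
    plt.ylabel('Number of followed users')
    plt.title('Eigenvector centrality for the whole user connected graph')
    plt.show()
```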

Observations

1. The eigenvector centrality value is high for some actors even though their number of followers is low, which means these actors are highly influential:

Actor       Centrality value    Followers
hcilab      0.226               477
tblobaum    0.154               109
Marak       0.151               856

2. The degree centrality value, in contrast, depends directly on the number of users connected to an actor.