## Problem Statement

Cluster GitHub repositories and find the influential (important) actors using centrality

## Data Set

The dataset is in the form of JSON dumps of GitHub activity for various repositories and users. Please find the sample files here

## Solution

The solution clusters the repositories based on their description and language using the k-means algorithm, and finds the important actors in the repository clusters using eigenvector centrality

## Why Clustering

Organizing data into clusters reveals the internal structure of the data, such as common interests. For example, repositories in the machine learning field form one such cluster.

## Algorithm for parsing the data for clustering

- parse the GitHub JSON data and insert it into the repository table
- delete repositories that match the following conditions (outlier removal):
- `delete from repository where length(description) < 5;`
- `delete from repository where description like '%my first%';`
- `delete from repository where length(description) < 30 and description like '%test%';`
- `delete from repository where description like '%first%' and length(description) < 50;`
- `delete from repository where name like '%demo%';`
- `delete from repository where name like '%sample%' and length(description) < 50;`
- `delete from repository where description like '%sample app%' and length(description) < 50;`

- retrieve the repository description from the repository table and clean the data
- split the description into words on the characters space, `_`, `-`, `/` and `,`
- remove the special characters from the words
- remove stop words such as the, a, an, etc.
- stem the words using the **nltk python library**

- insert all these words into the keywords table
- select the distinct words and their counts from the keywords table and insert them into the clean_keywords table
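The cleaning steps above can be sketched in Python. The sample descriptions and the inline stop-word list are hypothetical stand-ins for illustration (the article reads descriptions from the repository table and would use a fuller stop-word list); only the stemming via nltk's `PorterStemmer` follows the article directly:

```python
import re

from nltk.stem import PorterStemmer

# Hypothetical sample descriptions standing in for rows of the repository table.
descriptions = [
    "A machine-learning library for python",
    "Chrome extension/toolbar for bookmarking",
]

# Small inline stop-word list for illustration; a fuller list would be used in practice.
STOP_WORDS = {"the", "a", "an", "for", "and", "of", "to", "in"}

stemmer = PorterStemmer()

def clean(description):
    """Split on space, _, -, / and ,; drop special characters and stop words; stem."""
    words = re.split(r"[ _\-/,]+", description.lower())
    words = [re.sub(r"[^a-z0-9]", "", w) for w in words]   # remove special characters
    words = [w for w in words if w and w not in STOP_WORDS]
    return [stemmer.stem(w) for w in words]

# The resulting word lists are what gets inserted into the keywords table.
keywords = [clean(d) for d in descriptions]
```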

## Implementation

The Python implementation of the above algorithm can be found here

## Algorithm for preparing the input (sparse matrix) for the k-means algorithm

- calculate the term frequency and document frequency of each word
- prepare a sparse matrix of size (number of repositories) × (number of words)
- iterate over the repositories:
- get the words with document frequency > 5, plus the repository language
- calculate the TF-IDF using the formula tf * log(N/df)

where

- tf: term frequency (the number of times the word appears in one repository's description)
- df: document frequency (the number of repositories containing the word)
- N: the total number of repositories

- make an entry in the sparse matrix at (repoidIndex, wordIndex) with the TF-IDF value

Finally, run the k-means algorithm.
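Assuming the cleaned keyword lists are already available, the sparse-matrix construction and clustering can be sketched with SciPy. The repository data below is hypothetical, and the article's df > 5 cutoff is lowered to df > 1 to fit the toy data:

```python
import math
from collections import Counter

import numpy as np
from scipy.sparse import lil_matrix
from scipy.cluster.vq import kmeans2

# Hypothetical cleaned keyword lists, one per repository.
repo_words = [
    ["machin", "learn", "python"],
    ["machin", "learn", "cluster"],
    ["chrome", "extens", "javascript"],
    ["chrome", "extens", "bookmark"],
]

N = len(repo_words)
vocab = sorted({w for words in repo_words for w in words})
word_index = {w: i for i, w in enumerate(vocab)}

# Document frequency: number of repositories containing each word.
df = Counter(w for words in repo_words for w in set(words))

# Sparse matrix of shape (repositories, words) holding tf * log(N / df).
matrix = lil_matrix((N, len(vocab)))
for repo_id, words in enumerate(repo_words):
    tf = Counter(words)
    for w, count in tf.items():
        if df[w] > 1:  # the article filters on df > 5; the toy data uses df > 1
            matrix[repo_id, word_index[w]] = count * math.log(N / df[w])

# Cluster the repositories; kmeans2 needs a dense array.
np.random.seed(0)  # make the k-means++ initialisation reproducible
centroids, labels = kmeans2(matrix.toarray(), k=2, minit="++")
```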

## Implementation

The Python implementation of the above algorithm can be found here. Find the database scripts here

## Used Libraries

- SciPy for clustering and the sparse matrix
- NumPy for manipulating the data
- NLTK for stemming

## Results

Please find the clustering results here

## Observations

1. People used JavaScript, CoffeeScript, Ruby, Objective-C, and Haxe for developing Chrome extensions. The most used language is JavaScript

2. People used PHP, Python, Java, Shell, and Ruby for developing OpenShift applications

3. Some cluster graphs are not connected (connectivity was checked by building a graph of the repository owners), so we can recommend that users in a cluster follow these repositories/owners to learn more about their shared interest

## EigenVector Centrality

Eigenvector centrality is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. Google’s PageRank is a variant of the Eigenvector centrality measure.

## Algorithm for creating whole user connected graph for Calculating Centrality

- Take the cluster of similar repositories (read from file) and find its owners.
- Add the repository owners to the unExploredQueue.
- loop over the unExploredQueue until it is empty:
- add the user to the explored map if the user is not already explored
- find the following and followers of the user
- add these nodes to the graph
- create incoming edges for the user's followers, e.g. if user1 has followers user2 and user3, then the edges (user2, user1) and (user3, user1) are added to the graph
- create outgoing edges for the users the user follows, e.g. if user1 follows user2 and user3, then the edges (user1, user2) and (user1, user3) are added to the graph
- the loop ends when the unExploredQueue is empty
- find the eigenvector centrality of the resulting user graph using the Python networkx library
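The breadth-first traversal above can be sketched with networkx. The in-memory `followers_of`/`following_of` maps are hypothetical stand-ins for the GitHub follower/following lookups:

```python
from collections import deque

import networkx as nx

# Hypothetical follower/following data standing in for GitHub lookups.
followers_of = {
    "user1": ["user2", "user3"],
    "user2": ["user3"],
    "user3": [],
}
following_of = {
    "user1": ["user3"],
    "user2": ["user1"],
    "user3": ["user1", "user2"],
}

def build_user_graph(cluster_owners):
    """Explore users breadth-first from the cluster owners, adding follow edges."""
    graph = nx.DiGraph()
    unexplored = deque(cluster_owners)
    explored = set()
    while unexplored:
        user = unexplored.popleft()
        if user in explored:
            continue
        explored.add(user)
        # Incoming edges: each follower points at the user it follows.
        for follower in followers_of.get(user, []):
            graph.add_edge(follower, user)
            if follower not in explored:
                unexplored.append(follower)
        # Outgoing edges: the user points at each account it follows.
        for followed in following_of.get(user, []):
            graph.add_edge(user, followed)
            if followed not in explored:
                unexplored.append(followed)
    return graph

graph = build_user_graph(["user1"])
centrality = nx.eigenvector_centrality(graph, max_iter=1000)
```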

## Algorithm for creating user connected graph for single cluster for Calculating Centrality

Take the cluster of similar repositories (read from file), find its owners, and add them to the userSet.

Then:

- for every user in userSet, find the users they follow and add them to the clusteredUserSet
- for every user in clusteredUserSet:
- retrieve the users they follow and add them to the followingList
- for each followUser in followingList, check whether it is present in clusteredUserSet; if so, create an outgoing edge (user, followUser) and add it to the graph
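A minimal sketch of the single-cluster graph construction, again with a hypothetical in-memory `following_of` map in place of the real GitHub data; only edges whose endpoints both fall inside the clustered user set are kept:

```python
import networkx as nx

# Hypothetical "who follows whom": following_of[u] lists the users u follows.
following_of = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["alice"],  # dave is outside the cluster and should be excluded
}

def build_cluster_graph(owners):
    # userSet: owners of the repositories in one cluster.
    user_set = set(owners)
    # clusteredUserSet: the owners plus everyone an owner follows.
    clustered = set(user_set)
    for user in user_set:
        clustered.update(following_of.get(user, []))
    # Add only edges whose endpoints are both inside clusteredUserSet.
    graph = nx.DiGraph()
    graph.add_nodes_from(clustered)
    for user in clustered:
        for followed in following_of.get(user, []):
            if followed in clustered:
                graph.add_edge(user, followed)
    return graph

graph = build_cluster_graph(["alice", "bob"])
```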

## Used Libraries

- networkx for calculating the centrality
- matplotlib for plotting the graphs

## Implementation

The Python implementation of the above algorithm can be found here

## Results

Please find the results here

## Generated Plots

### Eigenvector centrality graph for whole user connected graph

### Degree centrality graph for whole user connected graph

## Plot Properties

X-axis indicates the centrality value

Y-axis indicates the number of followed users

## Observations

1. The eigenvector centrality value is high for some actors even though their number of followers is low, which means these actors are highly influential

Actor | Centrality value | Followers
---|---|---
hcilab | 0.226 | 477
tblobaum | 0.154 | 109
Marak | 0.151 | 856

2. The degree centrality value depends directly on the number of users an actor is connected to

Very interesting concept and approach. After reading the article and reviewing the source, I am at a loss to explain the observations.

Assuming that Actor is synonymous with the GitHub account ID, the only central actor whose GitHub activity matches their centrality would be Marak. The other two actors seem significantly less central, almost insignificant.

Have I misunderstood the observations?

@Peter

Your observation is absolutely correct, and that is the point we are trying to make here.

Even though they are not very active, they have a high score because their followers are themselves high-scoring. Informally, the followers of the followers of the other two actors are very large in number, and that contributes to their centrality value. It is not their activity on GitHub that is reflected in the centrality value but their followership.