Hadoop (Stack Overflow)

 

Abstract : gaming.stackexchange.com is a question-and-answer community focused on gaming. A group of its users regularly contributes answers and helps others with their queries.

The aim of this paper is to find the top 20 users in this community and their average scores.

DataSet : The dataset can be downloaded from https://archive.org/download/stackexchange

Download the gaming.stackexchange.com dataset from the above link.

 

After you extract the downloaded archive, you get a set of XML files, including the Badges.xml and Posts.xml files used here.

Each file corresponds to a table, and the tables are related to one another.

For our use case, we need to find the top 20 users and the average of their scores.

 

Score : A score is a point of appreciation given by users who got help with a particular issue.

 

 

So here we can correlate the two tables, Posts and Badges.

We can use Hadoop to convert the XML files to CSV format, using a MapReduce job written in Java.

 

Badges.xml 

Posts.xml

 

Develop a Driver class and a Mapper class to process the XML files (Badges, Posts).

 

Driver Class

 

Mapper Class
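
As a rough sketch of the per-line extraction such a Mapper could perform: it assumes each record in the dump is a single-line `<row ... />` element carrying attributes such as `OwnerUserId` and `Score` (Posts) or `UserId` and `Name` (Badges). The class name `RowToPipe` and its methods are hypothetical and use only the Java standard library; in a Hadoop job, `toPipeLine` would be called on each input line from the Mapper's `map` method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowToPipe {
    // Matches name="value" attribute pairs inside a <row ... /> element.
    private static final Pattern ATTRIBUTE = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    /** Returns the raw value of the named attribute, or null if it is absent. */
    public static String attribute(String xmlLine, String name) {
        Matcher m = ATTRIBUTE.matcher(xmlLine);
        while (m.find()) {
            if (m.group(1).equals(name)) {
                return m.group(2);
            }
        }
        return null;
    }

    /**
     * Joins the requested attribute values with '|'.
     * Returns null for lines that are not <row> elements or that lack
     * one of the requested attributes, so the caller can skip them.
     * XML entities (for example &amp;) are left undecoded in this sketch.
     */
    public static String toPipeLine(String xmlLine, String... names) {
        if (!xmlLine.trim().startsWith("<row")) {
            return null; // XML declaration, root element, or blank line
        }
        StringBuilder out = new StringBuilder();
        for (String name : names) {
            String value = attribute(xmlLine, name);
            if (value == null) {
                return null; // e.g. a post with no OwnerUserId
            }
            if (out.length() > 0) {
                out.append('|');
            }
            out.append(value);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample row, for illustration only.
        String post = "  <row Id=\"1\" PostTypeId=\"1\" Score=\"12\" OwnerUserId=\"7\" />";
        System.out.println(toPipeLine(post, "OwnerUserId", "Score")); // prints 7|12
    }
}
```

Returning null for unusable lines keeps the skip decision in one place: the Mapper only writes a record when the helper returns a non-null string.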

 

 

Reducer Class

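Not required. The job only extracts fields from each row, so no aggregation step is needed and it can run as a map-only job.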

Run the Mapper jobs on Badges.xml and Posts.xml to extract the following fields.

Score and OwnerUserId – Posts.xml

 

Console output for Posts.xml

 

 

UserId and Name – Badges.xml

Console output for Badges.xml

Posts.txt, separated by the pipe (|) character

Badges.txt

MySQL

  1. Create a database in MySQL (Games).
  2. Create the tables posts and badges.
 

SQOOP

Now export the pipe-separated data into MySQL using Sqoop.

 

Execute a join on both tables.
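
To make the join logic concrete, here is a plain-Java sketch of what such a join could compute. It is an in-memory illustration, not the SQL actually run in MySQL. It assumes Posts.txt lines have the form `OwnerUserId|Score`, Badges.txt lines have the form `UserId|Name`, and that "top" means highest average score; the class `TopUsers`, the method `topByAverage`, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class TopUsers {
    /** Returns up to 'limit' lines of the form "userId|averageScore", best average first. */
    public static List<String> topByAverage(List<String> posts, List<String> badges, int limit) {
        // Inner-join side: only users that appear in the badges data are kept.
        Set<String> badgeUsers = new HashSet<>();
        for (String line : badges) {
            badgeUsers.add(line.split("\\|", 2)[0]);
        }

        // userId -> {sum of scores, number of posts}
        Map<String, long[]> totals = new HashMap<>();
        for (String line : posts) {
            String[] fields = line.split("\\|");
            String userId = fields[0];
            if (!badgeUsers.contains(userId)) {
                continue; // no matching badge row, so the join drops this post
            }
            long[] t = totals.computeIfAbsent(userId, k -> new long[2]);
            t[0] += Long.parseLong(fields[1]);
            t[1] += 1;
        }

        // Sort by average descending; ties are broken by user id for a stable order.
        List<Map.Entry<String, long[]>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort((a, b) -> {
            int byAverage = Double.compare(average(b.getValue()), average(a.getValue()));
            return byAverage != 0 ? byAverage : a.getKey().compareTo(b.getKey());
        });

        List<String> report = new ArrayList<>();
        for (Map.Entry<String, long[]> e : ranked.subList(0, Math.min(limit, ranked.size()))) {
            report.add(String.format(Locale.ROOT, "%s|%.2f", e.getKey(), average(e.getValue())));
        }
        return report;
    }

    private static double average(long[] sumAndCount) {
        return (double) sumAndCount[0] / sumAndCount[1];
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<String> posts = Arrays.asList("7|10", "7|20", "9|4", "3|100");
        List<String> badges = Arrays.asList("7|Teacher", "7|Student", "9|Editor");
        System.out.println(topByAverage(posts, badges, 20)); // prints [7|15.00, 9|4.00]
    }
}
```

Because a user can hold several badges, the sketch collects badge user IDs into a set first, so each post is counted once rather than once per badge. The average itself would be the same either way.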

The final output, listing the users and their average scores, is shown below.

Final Report.csv