Hadoop (Stack Overflow)


Abstract: gaming.stackexchange.com is a question-and-answer community dedicated to gaming. Its users contribute by answering other users' questions.

The aim of this paper is to find the top 20 users in this community and their average scores.

Dataset: The dataset can be downloaded from https://archive.org/download/stackexchange

Download the gaming.stackexchange.com dataset from the above link.


After you extract the downloaded archive, the files below are displayed.

The tables above are related to one another.

For this use case, we need to find the top 20 users and the average of their scores.


Score: A score is a point of appreciation given by users who got help with a particular issue.



To do this, we can correlate the two tables Posts and Badges.

We can use Hadoop to convert the XML files to CSV format, using MapReduce written in Java.





Develop a Driver class and a Mapper class for processing the XML files (Badges, Posts).


Driver Class


Mapper Class
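
As a minimal sketch of the extraction step, the plain-Java helper below pulls two attributes out of one row of the XML and emits them as pipe-separated text. It assumes the Stack Exchange dump layout, in which each record is a single <row ... /> element on its own line with attributes such as Score, OwnerUserId, UserId and Name. The class name RowParser, its method names, and the sample rows in main are hypothetical and are not taken from the original job. In a Hadoop job, logic of this kind would be called from the Mapper's map() method for each input line.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowParser {
    // Matches name="value" attribute pairs inside one <row ... /> element.
    private static final Pattern ATTRIBUTE = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    // Returns the value of the named attribute, or empty if the line lacks it.
    // XML entities such as &amp; are left undecoded in this sketch.
    static Optional<String> attribute(String line, String name) {
        Matcher m = ATTRIBUTE.matcher(line);
        while (m.find()) {
            if (m.group(1).equals(name)) {
                return Optional.of(m.group(2));
            }
        }
        return Optional.empty();
    }

    // Joins two attributes with a pipe. Returns null for lines to skip:
    // the XML header, the enclosing tag, or rows missing either attribute.
    static String toPipe(String line, String first, String second) {
        if (!line.trim().startsWith("<row")) {
            return null;
        }
        Optional<String> a = attribute(line, first);
        Optional<String> b = attribute(line, second);
        if (!a.isPresent() || !b.isPresent()) {
            return null;
        }
        return a.get() + "|" + b.get();
    }

    // Assumed column order: Score|OwnerUserId for posts, UserId|Name for badges.
    static String postToPipe(String line) {
        return toPipe(line, "Score", "OwnerUserId");
    }

    static String badgeToPipe(String line) {
        return toPipe(line, "UserId", "Name");
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        String post = "<row Id=\"1\" PostTypeId=\"1\" Score=\"12\" OwnerUserId=\"7\" />";
        String badge = "<row Id=\"3\" UserId=\"7\" Name=\"Teacher\" />";
        System.out.println(postToPipe(post));   // prints 12|7
        System.out.println(badgeToPipe(badge)); // prints 7|Teacher
    }
}
```

Returning null for unusable lines lets the caller simply skip them, which suits a map-only job where every surviving line is written straight to the output.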



Reducer Class

Not required: the job only extracts fields from each row, so a map-only job is sufficient.

Run the Mapper jobs on Badges.xml and Posts.xml to extract the following fields:

Score and OwnerUserId from Posts.xml


Console output for Posts.xml



UserId and Name from Badges.xml

Console output for Badges.xml

Posts.txt, separated by the pipe character (|)



  1. Create a database in MySQL (Games).
  2. Create the tables posts and badges.


Now export the pipe-separated data into MySQL using Sqoop.


Execute a join on the two tables.
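
To illustrate what such a join computes, here is a plain-Java sketch that works in memory rather than in MySQL. It is not the query used in the original run, which is not shown. It assumes posts lines of the form Score|OwnerUserId and badges lines of the form UserId|Name, joins them on the user id, averages the score per user, and ranks users by that average. The class name TopUsers, the method names, the ranking rule, and the sample data are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TopUsers {
    // Average score per user id, from lines of the assumed form "Score|OwnerUserId".
    static Map<String, Double> averageScores(List<String> postLines) {
        Map<String, long[]> sumAndCount = new HashMap<>();
        for (String line : postLines) {
            String[] parts = line.split("\\|");
            if (parts.length != 2) {
                continue; // skip malformed lines
            }
            long[] acc = sumAndCount.computeIfAbsent(parts[1].trim(), k -> new long[2]);
            acc[0] += Long.parseLong(parts[0].trim()); // running sum of scores
            acc[1]++;                                  // number of posts
        }
        Map<String, Double> averages = new HashMap<>();
        for (Map.Entry<String, long[]> e : sumAndCount.entrySet()) {
            averages.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    // Inner join on user id: keeps only users that also appear in the badges
    // lines ("UserId|Name"), sorted by average score, highest first.
    static List<Map.Entry<String, Double>> topUsers(
            List<String> postLines, List<String> badgeLines, int limit) {
        Set<String> badgeUsers = new HashSet<>();
        for (String line : badgeLines) {
            badgeUsers.add(line.split("\\|", 2)[0].trim());
        }
        List<Map.Entry<String, Double>> rows = new ArrayList<>();
        for (Map.Entry<String, Double> e : averageScores(postLines).entrySet()) {
            if (badgeUsers.contains(e.getKey())) {
                rows.add(e);
            }
        }
        rows.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return rows.size() > limit ? rows.subList(0, limit) : rows;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<String> posts = List.of("10|7", "20|7", "5|8", "100|9");
        List<String> badges = List.of("7|Teacher", "8|Student");
        for (Map.Entry<String, Double> e : topUsers(posts, badges, 20)) {
            System.out.println(e.getKey() + "|" + e.getValue());
        }
    }
}
```

In the sample data, user 9 has posts but no badge row, so the inner join drops that user even though the average score is the highest.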

The final output, listing the users and their average scores, is displayed below.

Final Report.csv