What is MapReduce?
Put simply, MapReduce is a programming model and framework for processing very large amounts of data, often unstructured, in parallel across many machines. The simplicity of that definition should not obscure how broadly MapReduce can be applied.
There are two steps to process data using MapReduce:
a. The user specifies a map function that processes a key/value pair to generate a set of intermediate key/value pairs.
b. The user then specifies a reduce function that merges all intermediate values associated with the same intermediate key.
Take counting word occurrences across a set of documents as an example. The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.
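The word-count flow described above can be sketched in a few lines of JavaScript. This is a minimal single-process illustration, not a distributed implementation: the names `wordCountMap`, `wordCountReduce` and `runMapReduce`, as well as the two sample documents, are assumptions made up for this example.

```javascript
// Map: emit a (word, 1) pair for every word found in one document.
function wordCountMap(docName, contents) {
  const pairs = [];
  for (const word of contents.toLowerCase().split(/\W+/)) {
    if (word) pairs.push([word, 1]); // skip empty strings produced by split
  }
  return pairs;
}

// Reduce: sum together all counts emitted for one word.
function wordCountReduce(word, counts) {
  return counts.reduce((sum, c) => sum + c, 0);
}

// Hypothetical driver: run map on every input, group the intermediate
// values by key, then run reduce once per key.
function runMapReduce(inputs, mapFn, reduceFn) {
  const groups = new Map();
  for (const [key, value] of inputs) {
    for (const [k, v] of mapFn(key, value)) {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    }
  }
  const result = {};
  for (const [k, values] of groups) {
    result[k] = reduceFn(k, values);
  }
  return result;
}

// Hypothetical sample input: two tiny documents.
const wordCounts = runMapReduce(
  [
    ["doc1", "the quick brown fox"],
    ["doc2", "the lazy dog and the fox"],
  ],
  wordCountMap,
  wordCountReduce
);

console.log(wordCounts);
```

In a real MapReduce system the grouping step in the middle is performed by the framework across machines; here it is just an in-memory `Map`.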
Details of the above process are in Section 3.1 of the MapReduce paper (the PDF listed under References).
MapReduce implementation in MongoDB:
a. Create a collection in MongoDB and insert some records.
- Created a collection named “mapReduceCollection” and inserted four records, each with three fields: “cust_id”, “amount” and “status”.
b. Apply the MapReduce function to the collection
- Applied the “mapReduce” function to “mapReduceCollection”. It first executes the query (status: “A”) to filter the records. The filtered records are then passed to the map function, which emits key/value pairs. The values for each key are then reduced, and the results are written to the “order_totals” collection.
c. See the aggregated result
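The three steps above can be sketched as a self-contained JavaScript simulation that runs without a MongoDB server. Several things here are assumptions: the four sample records are invented (the original records are not shown in the text), the choice of “cust_id” as key and “amount” as value is a guess at what the map function did, and `simulateMapReduce` is a hypothetical local stand-in for the real call. In the mongo shell, the same `mapFn` and `reduceFn` would typically be passed as `db.mapReduceCollection.mapReduce(mapFn, reduceFn, { query: { status: "A" }, out: "order_totals" })`.

```javascript
// Step a: hypothetical sample records with cust_id, amount and status.
const mapReduceCollection = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" },
];

// Local stand-in for the emit() function MongoDB provides to map functions.
let emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// Map: runs once per record, with `this` bound to that record.
function mapFn() {
  emit(this.cust_id, this.amount);
}

// Reduce: combine all amounts emitted for one cust_id.
function reduceFn(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Step b: hypothetical simulation of query -> map -> group -> reduce.
// Unlike MongoDB, this sketch also calls reduce for keys with a single
// value; for a sum the result is the same.
function simulateMapReduce(docs, map, reduce, query) {
  emitted = [];
  docs
    .filter((doc) => Object.keys(query).every((f) => doc[f] === query[f]))
    .forEach((doc) => map.call(doc));

  const groups = new Map();
  for (const [key, value] of emitted) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  // Output records use the { _id, value } shape of a map-reduce result.
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: reduce(key, values),
  }));
}

// Step c: view the aggregated result (the simulated "order_totals").
const orderTotals = simulateMapReduce(
  mapReduceCollection, mapFn, reduceFn, { status: "A" }
);
console.log(orderTotals);
```

With these sample records, the record with status “D” is filtered out by the query, so only the three status “A” records contribute to the per-customer totals.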
References:
- http://research.google.com/archive/mapreduce.html
- http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
- http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
- http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Mapper
- http://docs.mongodb.org/manual/core/map-reduce/#map-reduce