We have been working with Hadoop for the last couple of years, but we still find it tough to get other people in our company started on it. I came up with this blog as a starting point, and it was kinda popular internally, so I am moving it here now.
"Introducing Hadoop in 20 pages" is a concise document that gives just the right information in the right amount before you go in depth in this field. It is intended as a first and shortest guide to both understanding and using Map-Reduce for building distributed data processing applications.
Topics covered:
- Introduction to Hadoop.
- What is Map-Reduce and how does it work? (With an example of how to write an algorithm.)
- What is Hadoop streaming? (A great tool for a newbie.)
- What is HDFS and where is it most suitable?
- Serialization in Hadoop – “how to go about it” and why not use Java serialization?
- Distributed cache.
- Job scheduling in Hadoop.
Appendix 1A: Avro serialization and its benefits over standard techniques.
Appendix 1B: Documented examples from the Hadoop repository.
Patches:
- Accepted: MAPREDUCE-3360, MAPREDUCE-3532, MAPREDUCE-3316, MAPREDUCE-3708, MAPREDUCE-3723, MAPREDUCE-3212, HADOOP-7971, MAPREDUCE-3686, HDFS-2725, MAPREDUCE-3952
- Available: MAPREDUCE-3504, MAPREDUCE-3115, MAPREDUCE-3131, MAPREDUCE-3870, MAPREDUCE-2493
- Involved: MAPREDUCE-3070, MAPREDUCE-3354, MAPREDUCE-3193, MAPREDUCE-3204, HADOOP-7726, MAPREDUCE-3140, MAPREDUCE-3494