Hadoop Installation as a Single-Node Cluster

 

Environment Setup

  1. sudo apt-get update
  2. sudo apt-get install default-jdk

Creating a Group and a User (hadoop and hduser)

  1. sudo addgroup hadoop
  2. sudo adduser --ingroup hadoop hduser (enter a password and the other details when prompted)
  3. sudo adduser hduser sudo (give hduser admin/sudo privileges)
  4. sudo apt-get install openssh-server

Log in as hduser, generate an SSH key for hduser, and add it to the authorized keys

  1. su hduser (enter the password set above)
  2. ssh-keygen -t rsa -P ""
  3. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  4. ssh localhost (type yes to add localhost to the known hosts; the login should now work without a password)
  5. exit

HADOOP

(Download, installation and configuration)

  1. wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
  2. tar -xvzf hadoop-2.7.1.tar.gz
  3. sudo mv hadoop-2.7.1 /usr/local/hadoop
  4. sudo chown -R hduser /usr/local/hadoop (make hduser the owner of the Hadoop directory)
  5. Edit the ~/.bashrc file and append the following lines at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

 

source ~/.bashrc

 

sudo vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh (set JAVA_HOME to the Java installation path)
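For example, assuming the same OpenJDK 7 installation referenced in ~/.bashrc above, the JAVA_HOME line in hadoop-env.sh would be changed to:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64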

 

Edit the core-site.xml file (in /usr/local/hadoop/etc/hadoop/, like the rest of the configuration files below)

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
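Note that in core-site.xml, and in the other XML files edited below, the <property> blocks go inside the file's existing <configuration> element, for example:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>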

 

Edit the hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
</property>

 

Edit yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

 

Edit the mapred-site.xml

  • cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
  • Add the following property:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

sudo mkdir -p /usr/local/hadoop_tmp

sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode

sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode

sudo chown -R hduser /usr/local/hadoop_tmp
Formatting the NameNode

hdfs namenode -format

Starting the Daemons

start-dfs.sh (starts the HDFS daemons: NameNode, SecondaryNameNode and DataNode)

start-yarn.sh (starts the YARN daemons: ResourceManager and NodeManager)

 

jps (shows the services that are up)

25424 NameNode

26246 NodeManager

25883 SecondaryNameNode

26606 Jps

26118 ResourceManager

25586 DataNode
Wordcount example with Python

  1. wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
  2. hadoop fs -mkdir /wordcount
  3. hadoop fs -ls /
  4. hadoop fs -copyFromLocal ./pg2701.txt /wordcount/mobydick.txt
  5. hadoop fs -ls /wordcount/mobydick.txt
  6. head -n1000 pg2701.txt | ./mapper.py | sort | ./reducer.py (local test of the mapper and reducer; see the sketches after this list)
  7. hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -mapper "python $PWD/mapper.py" -reducer "python $PWD/reducer.py" -input "/wordcount/mobydick.txt" -output "/wordcount/outputs"
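The steps above call ./mapper.py and ./reducer.py but do not list them. A minimal sketch of the two Hadoop Streaming scripts, assuming the usual word-count behaviour (the mapper emits one "word<TAB>1" pair per word, the reducer sums the counts for each word from the sorted mapper output), could look like this:

mapper.py

#!/usr/bin/env python
# mapper.py - read lines from stdin and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

reducer.py

#!/usr/bin/env python
# reducer.py - read sorted "word<TAB>count" pairs from stdin and sum the counts per word
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, _, count = line.strip().partition("\t")
    try:
        count = int(count)
    except ValueError:
        continue  # skip lines that do not carry a numeric count
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = count

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Make both scripts executable (chmod +x mapper.py reducer.py) before running step 6; after step 7 finishes, the job output can typically be read with hadoop fs -cat /wordcount/outputs/part-*.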