Hadoop: Installation

This tutorial describes how to install Hadoop.

  • Download a Hadoop release from here: http://archive.apache.org/dist/hadoop/core/hadoop-
  • Unzip the file to "/home/bude/[USERNAME]/RAID/hadoop"
  • Modify the files "core-site.xml" and "mapred-site.xml" (inside the conf sub-directory) as described below, using "/home/bude/[USERNAME]/RAID/hadoop/temp" as the temporary directory
  • Check that the JAVA_HOME environment variable is set (e.g. in the ~/.bashrc file) and points to your Java installation directory.
    • If not, add this to your ~/.bashrc file: export JAVA_HOME=/usr/lib/jvm/java-6-sun
  • Add this to your ~/.bashrc file: export HADOOP_INSTALL=/home/bude/[USERNAME]/RAID/hadoop
  • Add Hadoop to your path by adding this to your ~/.bashrc file: export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
  • Create the sub-directory "/home/bude/[USERNAME]/RAID/hadoop/temp"
  • Close the current terminal (if you were using one) and open a new one, so that the .bashrc file gets read again
  • Check that Hadoop can be found by typing: hadoop version
    The output should look like this:

    hduser@ubuntu:~$ hadoop version
    Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940
    Compiled by hortonfo on Fri Oct  7 06:20:32 UTC 2011

  • Format the HDFS file system by running: hadoop namenode -format
  • Start Hadoop by running: start-all.sh
    The output should look like this:

    hduser@ubuntu:~$ start-all.sh
    starting namenode, logging to /home/cocktail2/tconrad/hadoop/libexec/../logs/hadoop-tconrad-namenode-pitu.out
    starting datanode, logging to /home/cocktail2/tconrad/hadoop/libexec/../logs/hadoop-tconrad-datanode-pitu.out
    starting secondarynamenode, logging to /home/cocktail2/tconrad/hadoop/libexec/../logs/hadoop-tconrad-secondarynamenode-pitu.out
    starting jobtracker, logging to /home/cocktail2/tconrad/hadoop/libexec/../logs/hadoop-tconrad-jobtracker-pitu.out
    starting tasktracker, logging to /home/cocktail2/tconrad/hadoop/libexec/../logs/hadoop-tconrad-tasktracker-pitu.out

  • Check that the Hadoop daemons are running by typing: jps
    The output should look like this:

    tconrad@pitu:~$ jps
    2287 TaskTracker
    2149 JobTracker
    1938 DataNode
    2085 SecondaryNameNode
    2349 Jps
    1788 NameNode

  • Run the Hadoop "grep" example (as provided in the quick start guide)
    • You should first create the input directory inside the HDFS file system (don’t worry about the output directory – Grep will create it)
      hadoop fs -mkdir /test_grep
    • Now copy a few XML files into it as the input for the Grep example:
      hadoop fs -put conf/*.xml /test_grep
    • We are now ready to run the example:
      hadoop jar hadoop-*-examples.jar grep /test_grep output 'dfs[a-z.]+'
    • Some remarks about the above command line:
      • The first argument (jar) tells hadoop to run the main method of a JAR file.
      • The second argument is the JAR to use. The shell expands hadoop-*-examples.jar to a single JAR file, which is easier than typing the full file name because that name contains the Hadoop version number.
      • The remaining parameters are: the input folder (containing the XML files to be scanned), the output folder for the resulting list and finally the regular expression.
    • Tip: The regular expression shown above produces only a single match. You may want to change it to something more interesting so that you get more results. Don’t forget to delete the output directory before re-testing.
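
The re-test cycle from the tip can be sketched as a small shell function. This is only a sketch: rerun_grep is a hypothetical helper name (not part of Hadoop), it assumes HADOOP_INSTALL is exported as described above, and it assumes a 0.20-era release where "hadoop fs -rmr" is the recursive delete.

```shell
# Sketch only. rerun_grep is a hypothetical helper, not a Hadoop command.
# Usage: rerun_grep REGEX
rerun_grep() {
    regex=$1

    # Delete the results of the previous run; the job refuses to start
    # when the output directory already exists. "|| true" keeps going
    # on the very first run, when there is nothing to delete.
    hadoop fs -rmr output || true

    # Same command as in the tutorial, with the regular expression swapped.
    # The glob is expected to match exactly one examples JAR.
    hadoop jar "$HADOOP_INSTALL"/hadoop-*-examples.jar \
        grep /test_grep output "$regex" || return 1

    # Print the matches directly from HDFS. The pattern is quoted so that
    # Hadoop, not the local shell, expands it.
    hadoop fs -cat 'output/part-*'
}
```

For example, rerun_grep 'fs[a-z.]+' would search the copied XML files for strings starting with "fs" instead of "dfs".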

conf/core-site.xml - add the following in between the "<configuration> ... </configuration>" tags

  <description>A base for other temporary directories.</description>

  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
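
The two descriptions above match Hadoop's standard hadoop.tmp.dir and fs.default.name properties, so the complete entries probably look like the output of the following sketch. The helper name core_site_properties is hypothetical, and the default hdfs://localhost:9000 address is an assumption (a common single-node choice), not a value taken from this tutorial. The temporary directory is the one named in the installation steps.

```shell
# Sketch only. core_site_properties is a hypothetical helper that prints
# the <property> blocks to paste into conf/core-site.xml.
# Usage: core_site_properties TMP_DIR [FS_URI]
core_site_properties() {
    tmp_dir=$1                           # e.g. /home/bude/[USERNAME]/RAID/hadoop/temp
    fs_uri=${2:-hdfs://localhost:9000}   # assumed default, adjust as needed

    # Unquoted EOF so that $tmp_dir and $fs_uri are expanded.
    cat <<EOF
<property>
  <name>hadoop.tmp.dir</name>
  <value>$tmp_dir</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>$fs_uri</value>
  <description>The name of the default file system.</description>
</property>
EOF
}
```

Calling core_site_properties "/home/bude/[USERNAME]/RAID/hadoop/temp" (with your user name filled in) prints a fragment that can be pasted between the configuration tags.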

conf/mapred-site.xml - add the following in between the "<configuration> ... </configuration>" tags

  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.</description>
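
This description matches Hadoop's standard mapred.job.tracker property, so the complete entry probably looks like the output of the following sketch. The helper name mapred_site_properties is hypothetical, and the default localhost:9001 address is an assumption (a common single-node choice), not a value taken from this tutorial.

```shell
# Sketch only. mapred_site_properties is a hypothetical helper that prints
# the <property> block to paste into conf/mapred-site.xml.
# Usage: mapred_site_properties [HOST:PORT]
mapred_site_properties() {
    tracker=${1:-localhost:9001}   # assumed default, adjust as needed

    # Unquoted EOF so that $tracker is expanded.
    cat <<EOF
<property>
  <name>mapred.job.tracker</name>
  <value>$tracker</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>
EOF
}
```

Running mapred_site_properties without arguments prints the fragment with the assumed default; pass a different host:port pair if your job tracker should listen elsewhere.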

Last updated

Wednesday, 08 May 2013