When you implement Apache Hadoop in a production environment, you'll need multiple server nodes. If you are just exploring distributed computing, you might want to play around with Hadoop by installing it on a single node.
This article explains how to set up and configure a single-node standalone Hadoop environment. Note that you can also simulate a multi-node Hadoop installation on a single server using a pseudo-distributed Hadoop installation, which we'll cover in detail in the next article of this series.
The standalone Hadoop environment is a good place to start, and a good way to make sure your server is set up with all the prerequisites to run Hadoop.
1. Create a Hadoop User
You can download and install Hadoop as root. However, it is recommended to install it as a separate user. So, log in as root and create a user called hadoop.
# adduser hadoop
# passwd hadoop
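If you are scripting this setup, you may want to check whether the user already exists before calling adduser. A minimal sketch; the user_exists helper is our own function, not a standard command:

```shell
# Check for a user before creating it (run as root on the target server).
user_exists() { id "$1" >/dev/null 2>&1; }

if user_exists hadoop; then
  echo "hadoop user already exists"
else
  echo "hadoop user not found, safe to run adduser"
fi
```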
2. Download Hadoop Common
Download Apache Hadoop Common and move it to the server where you want to install it. Alternatively, use wget to download it directly to your server.
# su - hadoop
$ wget http://mirror.nyi.net/apache//hadoop/common/stable/hadoop-0.20.203.0rc1.tar.gz
Make sure Java 1.6 is installed on your system.
$ java -version
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.7) (rhel-1.39.1.9.7.el6-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
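If you are not sure where Java is installed, you can derive the JAVA_HOME value (needed in a later step) from the full path of the java binary. A small sketch; the JDK path below is only an example, and on a live system you would obtain it with `readlink -f "$(command -v java)"`:

```shell
# Example path; on a real system use: JAVA_BIN=$(readlink -f "$(command -v java)")
JAVA_BIN=/usr/java/jdk1.6.0_27/bin/java

# Strip the trailing /bin/java to get the JDK root directory.
JAVA_HOME=${JAVA_BIN%/bin/java}
echo "JAVA_HOME=$JAVA_HOME"
```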
3. Unpack under hadoop User
As the hadoop user, unpack the package.
$ tar xvfz hadoop-0.20.203.0rc1.tar.gz
This will create the hadoop-0.20.204.0 directory.
$ ls -l hadoop-0.20.204.0
total 6780
drwxr-xr-x.  2 hadoop hadoop   4096 Oct 12 08:50 bin
-rw-rw-r--.  1 hadoop hadoop 110797 Aug 25 16:28 build.xml
drwxr-xr-x.  4 hadoop hadoop   4096 Aug 25 16:38 c++
-rw-rw-r--.  1 hadoop hadoop 419532 Aug 25 16:28 CHANGES.txt
drwxr-xr-x.  2 hadoop hadoop   4096 Nov  2 05:29 conf
drwxr-xr-x. 14 hadoop hadoop   4096 Aug 25 16:28 contrib
drwxr-xr-x.  7 hadoop hadoop   4096 Oct 12 08:49 docs
drwxr-xr-x.  3 hadoop hadoop   4096 Aug 25 16:29 etc
Modify the hadoop-0.20.204.0/conf/hadoop-env.sh file and make sure the JAVA_HOME environment variable points to the location of the Java installed on your system.
$ grep JAVA ~/hadoop-0.20.204.0/conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.6.0_27
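Rather than editing hadoop-env.sh by hand, you can set JAVA_HOME with a one-line sed. The sketch below operates on a temporary copy so it is safe to try anywhere; on the server you would point it at your real conf/hadoop-env.sh, and the JDK path is only an example:

```shell
# Work on a temporary stand-in for conf/hadoop-env.sh.
CONF=$(mktemp)
echo '# export JAVA_HOME=/usr/lib/j2sdk1.5-sun' > "$CONF"

# Replace the commented-out default with the real JDK location.
sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/usr/java/jdk1.6.0_27|' "$CONF"

RESULT=$(grep '^export JAVA_HOME' "$CONF")
echo "$RESULT"
rm -f "$CONF"
```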
4. Test Sample Hadoop Program
In a single-node standalone installation, you don't need to start any Hadoop background processes. Instead, just call ~/hadoop-0.20.204.0/bin/hadoop, which executes Hadoop as a single Java process for testing purposes.
This example program ships with Hadoop, and it is shown in the Hadoop documentation as a simple way to verify that the setup works.
First, create an input directory, where all the input files will be stored. In a real Hadoop environment, this might be the location where all the incoming data files arrive.
$ cd ~/hadoop-0.20.204.0
$ mkdir input
For testing purposes, add some sample data files to the input directory. Let us just copy all the XML files from the conf directory to the input directory, so these XML files will serve as the data files for the example program.
$ cp conf/*.xml input
Execute the sample Hadoop test program. This is a simple Hadoop program that simulates grep: it searches for the regex pattern "dfs[a-z.]+" in all the input/*.xml files and stores the output in the output directory.
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
When everything is set up properly, the sample Hadoop test program displays the following messages on the screen while it executes.
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
12/01/14 23:38:46 INFO mapred.FileInputFormat: Total input paths to process : 6
12/01/14 23:38:46 INFO mapred.JobClient: Running job: job_local_0001
12/01/14 23:38:46 INFO mapred.MapTask: numReduceTasks: 1
12/01/14 23:38:46 INFO mapred.MapTask: io.sort.mb = 100
12/01/14 23:38:46 INFO mapred.MapTask: data buffer = 79691776/99614720
12/01/14 23:38:46 INFO mapred.MapTask: record buffer = 262144/327680
12/01/14 23:38:46 INFO mapred.MapTask: Starting flush of map output
12/01/14 23:38:46 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/01/14 23:38:47 INFO mapred.JobClient:  map 0% reduce 0%
...
This will create the output directory with the results as shown below.
$ ls -l output
total 4
-rwxrwxrwx. 1 root root 11 Aug 23 08:39 part-00000
-rwxrwxrwx. 1 root root  0 Aug 23 08:39 _SUCCESS

$ cat output/*
1 dfsadmin
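As a sanity check, you can reproduce what the Hadoop grep example computes using plain Unix tools. The sketch below builds its own one-file sample in a temporary directory, so the match it finds (dfs.replication) differs from the dfsadmin result above, which came from the real conf files:

```shell
# Create a throwaway input directory with one sample XML file.
WORKDIR=$(mktemp -d)
cat > "$WORKDIR/sample.xml" <<'EOF'
<property><name>dfs.replication</name><value>1</value></property>
EOF

# Extract every match of dfs[a-z.]+ and count occurrences,
# mirroring what the MapReduce grep job writes to part-00000.
RESULT=$(grep -hEo 'dfs[a-z.]+' "$WORKDIR"/*.xml | sort | uniq -c | sort -rn)
echo "$RESULT"
rm -rf "$WORKDIR"
```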
The source code of the example programs is located under the src/examples/org/apache/hadoop/examples directory.
$ ls -l ~/hadoop-0.20.204.0/src/examples/org/apache/hadoop/examples
-rw-rw-r--. 1 hadoop hadoop  2395 Jan 14 23:28 WordCount.java
-rw-rw-r--. 1 hadoop hadoop  8040 Jan 14 23:28 Sort.java
-rw-rw-r--. 1 hadoop hadoop  9156 Jan 14 23:28 SleepJob.java
-rw-rw-r--. 1 hadoop hadoop  7809 Jan 14 23:28 SecondarySort.java
-rw-rw-r--. 1 hadoop hadoop 10190 Jan 14 23:28 RandomWriter.java
-rw-rw-r--. 1 hadoop hadoop 40350 Jan 14 23:28 RandomTextWriter.java
-rw-rw-r--. 1 hadoop hadoop 11914 Jan 14 23:28 PiEstimator.java
-rw-rw-r--. 1 hadoop hadoop   853 Jan 14 23:28 package.html
-rw-rw-r--. 1 hadoop hadoop  8276 Jan 14 23:28 MultiFileWordCount.java
-rw-rw-r--. 1 hadoop hadoop  6582 Jan 14 23:28 Join.java
-rw-rw-r--. 1 hadoop hadoop  3334 Jan 14 23:28 Grep.java
-rw-rw-r--. 1 hadoop hadoop  3751 Jan 14 23:28 ExampleDriver.java
-rw-rw-r--. 1 hadoop hadoop 13089 Jan 14 23:28 DBCountPageView.java
-rw-rw-r--. 1 hadoop hadoop  2879 Jan 14 23:28 AggregateWordHistogram.java
-rw-rw-r--. 1 hadoop hadoop  2797 Jan 14 23:28 AggregateWordCount.java
drwxr-xr-x. 2 hadoop hadoop  4096 Jan 14 08:49 dancing
drwxr-xr-x. 2 hadoop hadoop  4096 Jan 14 08:49 terasort
5. Troubleshooting Issues
Issue: “Temporary failure in name resolution”
While executing the sample Hadoop program, you might get the following error message.
12/01/14 23:34:57 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-root/mapred/staging/root-1040516815/.staging/job_local_0001
java.net.UnknownHostException: hadoop: hadoop: Temporary failure in name resolution
        at java.net.InetAddress.getLocalHost(InetAddress.java:1438)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:815)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
        at java.security.AccessController.doPrivileged(Native Method)
Solution: Add the following entry to the /etc/hosts file, containing the IP address, the FQDN (fully qualified domain name), and the hostname.
192.168.1.10 hadoop.sureshkumarpakalapati.in hadoop
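To confirm the fix, you can check that the hostname now maps to the expected IP address. The sketch below runs against a temporary copy of the entry so it works anywhere; on the real server, inspect /etc/hosts directly or run getent hosts hadoop:

```shell
# Work on a temporary stand-in for /etc/hosts.
HOSTS=$(mktemp)
printf '192.168.1.10   hadoop.sureshkumarpakalapati.in   hadoop\n' > "$HOSTS"

# Look up the IP address recorded for the short hostname "hadoop".
RESULT=$(awk '$3 == "hadoop" {print $1}' "$HOSTS")
echo "$RESULT"
rm -f "$HOSTS"
```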