Your first Hadoop Map-Reduce Job


Introduction

Hadoop Map-Reduce is a framework for parallel processing of large data sets (YARN-based in Hadoop 2.x; the Hadoop 1.x release used in this tutorial runs on the older JobTracker/TaskTracker architecture). If you are new to Hadoop, first visit here. In this article, I will help you quickly get started with writing the simplest Map-Reduce job: the famous "WordCount" MR job, which is the first one for 90% of people (if not more).

WordCount is a simple application that counts the number of occurrences of each word in a given input set.

This code example is from the MapReduce tutorial available here. You can check out the source code directly from this small GitHub project I created.
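Before diving into the Hadoop API, it helps to see the core computation in plain Java. The following is a local, single-machine sketch with no Hadoop dependency (the class name `LocalWordCount` is illustrative, not part of the tutorial's code):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class LocalWordCount {
    // Count whitespace-separated words, preserving first-seen order.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The same sample data used later in this tutorial (file01 + file02).
        System.out.println(count("Hello World Bye World Hello Hadoop Goodbye Hadoop"));
        // {Hello=2, World=2, Bye=1, Hadoop=2, Goodbye=1}
    }
}
```

MapReduce splits exactly this computation into two phases, so it can run on many machines: a map phase that emits (word, 1) pairs, and a reduce phase that sums the pairs for each word.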

Step 1. Install and start Hadoop server

In this tutorial, I assume your Hadoop installation is ready. For a Single Node setup, visit here.

Start Hadoop:

amresh@ubuntu:/home/amresh$ cd /usr/local/hadoop/
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/start-all.sh
amresh@ubuntu:/usr/local/hadoop-1.0.2$ sudo jps

6098 JobTracker
8024 Jps
5783 DataNode
5997 SecondaryNameNode
5571 NameNode
6310 TaskTracker

(Make sure NameNode, DataNode, JobTracker, TaskTracker, SecondaryNameNode are running)

If NameNode is not running, try formatting it and restarting Hadoop. (Warning: formatting the NameNode erases all data in HDFS, so only do this on a fresh setup.)

amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop namenode -format
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/stop-all.sh
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/start-all.sh

Step 2. Write Map-Reduce Job for Wordcount

Map.java (Mapper Implementation)


package com.impetus.code.examples.hadoop.mapred.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable one = new IntWritable(1);

    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
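For each input line, the mapper emits one (word, 1) pair per token. Stripped of the Hadoop types, the emission logic can be sketched like this (the class `MapSketch` is illustrative only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapSketch {
    // Mimics Map.map(): split one line and emit a (word, 1) pair per token.
    public static List<String> emit(String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add("(" + tokenizer.nextToken() + ", 1)");
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(emit("Hello World Bye World"));
        // [(Hello, 1), (World, 1), (Bye, 1), (World, 1)]
    }
}
```

Note that the real mapper reuses a single `Text` object (`word`) instead of allocating a new one per token; reusing Writable objects is a common Hadoop idiom to cut down on garbage collection.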

Reduce.java (Reducer Implementation)


package com.impetus.code.examples.hadoop.mapred.wordcount;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
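For each word, the reducer receives an iterator over all the counts emitted for that word and simply sums them. In plain Java (again an illustrative sketch, not Hadoop code):

```java
import java.util.Arrays;
import java.util.Iterator;

public class ReduceSketch {
    // Mimics Reduce.reduce(): sum every count seen for one key.
    public static int sum(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // "World" was emitted twice by the mapper, each time with count 1.
        System.out.println(sum(Arrays.asList(1, 1).iterator())); // 2
    }
}
```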

WordCount.java (Job)


package com.impetus.code.examples.hadoop.mapred.wordcount;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount
{
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
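Notice that the job sets `Reduce` as both the combiner and the reducer. Reusing the reducer as a combiner is safe here only because addition is associative and commutative: summing per-mapper partial sums gives the same result as summing all the raw 1s at the reducer. A quick plain-Java check of that property (illustrative class name):

```java
import java.util.Arrays;
import java.util.List;

public class CombinerCheck {
    public static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        // Raw mapper output for one word: two 1s from one split, one 1 from another.
        List<Integer> splitA = Arrays.asList(1, 1);
        List<Integer> splitB = Arrays.asList(1);
        // Combining each split first, then reducing the partial sums...
        int combined = sum(Arrays.asList(sum(splitA), sum(splitB)));
        // ...equals reducing everything at once.
        int direct = sum(Arrays.asList(1, 1, 1));
        System.out.println(combined == direct); // true
    }
}
```

A reducer that is not associative and commutative (computing an average, for instance) cannot be reused as a combiner this way.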

Step 3. Compile and Create Jar file

I prefer Maven for building my Java projects. You can find the POM file here and add it to your project; it makes sure the Hadoop jar dependency is available.

Just Run:

amresh@ubuntu:/usr/local/hadoop-1.0.2$ cd ~/development/hadoop-examples
amresh@ubuntu:/home/amresh/development/hadoop-examples$ mvn clean install

Step 4. Create input files to copy words from

amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -mkdir /home/amresh/wordcount/input
amresh@ubuntu:/usr/local/hadoop-1.0.2$ vi file01 (contents: Hello World Bye World)
amresh@ubuntu:/usr/local/hadoop-1.0.2$ vi file02 (contents: Hello Hadoop Goodbye Hadoop)
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -copyFromLocal file01 /home/amresh/wordcount/input/
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -copyFromLocal file02 /home/amresh/wordcount/input/
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -ls /home/amresh/wordcount/input/

Found 2 items
-rw-r--r-- 1 amresh supergroup 0 2012-05-08 14:51 /home/amresh/wordcount/input/file01
-rw-r--r-- 1 amresh supergroup 0 2012-05-08 14:51 /home/amresh/wordcount/input/file02

Step 5. Run the Map-Reduce job you wrote


amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop jar ~/development/hadoop-examples/target/hadoop-examples-1.0.jar com.impetus.code.examples.hadoop.mapred.wordcount.WordCount /home/amresh/wordcount/input /home/amresh/wordcount/output
amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -ls /home/amresh/wordcount/output/

Found 3 items
-rw-r--r-- 1 amresh supergroup 0 2012-05-08 15:23 /home/amresh/wordcount/output/_SUCCESS
drwxr-xr-x - amresh supergroup 0 2012-05-08 15:22 /home/amresh/wordcount/output/_logs
-rw-r--r-- 1 amresh supergroup 41 2012-05-08 15:23 /home/amresh/wordcount/output/part-00000

amresh@ubuntu:/usr/local/hadoop-1.0.2$ bin/hadoop dfs -cat /home/amresh/wordcount/output/part-00000

Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2


6 thoughts on "Your first Hadoop Map-Reduce Job"

  1. Hi Amresh,
    How did you run the WordCount.java program? Are you using a Java IDE on Ubuntu?
    I was trying to run it through the terminal but I got the error below.

    yuvraj@yuvraj-Inspiron-N4050:~/Hadoop$ javac -classpath /usr/local/hadoop/hadoop-1.1.2/hadoop-core-1.1.2.jar WordCount.java
    WordCount.java:23: error: cannot find symbol
    conf.setMapperClass(Map.class);
    ^
    symbol: class Map
    location: class WordCount
    WordCount.java:24: error: cannot find symbol
    conf.setCombinerClass(Reduce.class);
    ^
    symbol: class Reduce
    location: class WordCount
    WordCount.java:25: error: cannot find symbol
    conf.setReducerClass(Reduce.class);
    ^
    symbol: class Reduce
    location: class WordCount
    3 errors

    Please help to resolve this issue.

    • I ran it from the command line as well as from Eclipse. Your code has a compilation problem; did you make sure the Hadoop jars are available on the classpath for your program?

    • If you are trying to compile on the console, then you have to include $HADOOP_HOME/lib/commons-cli.jar in your classpath.

      Compile it like this:

      javac -classpath $HADOOP_HOME/hadoop-core.jar:$HADOOP_HOME/lib/commons-cli.jar WordCount.java
