1. Overview

In this tutorial we walk through one of the most common problems solved with distributed computing frameworks: the Hadoop MapReduce WordCount example, implemented in Java.
For a Hadoop developer with a Java skill set, the WordCount example is the classic first step in the Hadoop development journey.
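Conceptually, the mapper turns every line of input into (word, 1) pairs, the framework groups those pairs by word, and the reducer sums each group. For example, the line "to be or not to be" is mapped to (to,1), (be,1), (or,1), (not,1), (to,1), (be,1), which reduces to (be,2), (not,1), (or,1), (to,2).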

2. Development environment

Java       : Oracle JDK 1.8
Hadoop     : Apache Hadoop 2.2.0
IDE        : Eclipse
Build Tool : Gradle 3.5

3. Sample Input

In order to experience the power of Hadoop (MapReduce and HDFS), the input data size should be massive. In this tutorial, however, we use small input files so the example stays easy to follow.

For this tutorial, we use the following text files (UTF-8 encoded) as input:
Input File 1: The Adventures of Sherlock Holmes, by Arthur Conan Doyle.
Input File 2: The Return of Sherlock Holmes, by Arthur Conan Doyle.
Input File 3: A Lesson to Fathers by F. Anstey.

4. Solution

Using core MapReduce

We will use the following three Java files to explain the Hadoop MapReduce WordCount example:
WordCountDriver.java
WordCountMapper.java
WordCountReducer.java

4.1 Build File: build.gradle

apply plugin: 'java'

description = """Hadoop MapReduce WordCount Example Using Java"""

sourceCompatibility = 1.7
targetCompatibility = 1.7

/* In this section you declare where to find the dependencies of your project */
repositories {
    jcenter()
    mavenCentral()
}

dependencies {
    compile group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.2.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.2.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.2.0'
}
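The hadoop jar command used in section 5 does not name a driver class on the command line, so it relies on the jar's manifest declaring one. The build file above does not configure that; a minimal sketch of the missing piece (assuming the package and class name used in this tutorial) is:

jar {
    manifest {
        // Lets "hadoop jar HadoopWordCount.jar ..." run without naming the class
        attributes 'Main-Class': 'com.javadeveloperzone.bigdata.hadoop.WordCountDriver'
    }
}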

4.2 Driver Code: WordCountDriver.java

package com.javadeveloperzone.bigdata.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration configuration = getConf();

    // Define the word count job and the classes it uses
    Job job = Job.getInstance(configuration, "WordCountJob");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);

    // Output types emitted by the reducer (and, by default, the mapper)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // First argument: HDFS input path; second argument: HDFS output path
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and block until it finishes
    job.waitForCompletion(true);
    return job.isSuccessful() ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int result = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
    if (result == 0) {
      System.out.println("Job Completed successfully...");
    } else {
      System.out.println("Job Execution Failed with status::" + result);
    }
  }
}
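Because the driver extends Configured and is launched via ToolRunner, generic Hadoop options are parsed from the command line before run() is invoked. For example, the number of reduce tasks could be overridden at submission time without changing any code (a sketch; it assumes the jar manifest names the driver class as set up in section 4.1):

hadoop jar HadoopWordCount.jar -D mapreduce.job.reduces=2 /input /output/HadoopWordCount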

4.3 Mapper Code: WordCountMapper.java

package com.javadeveloperzone.bigdata.hadoop;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

  private static final IntWritable countOne = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split each input line on a single space and emit (word, 1) per token
    String[] words = value.toString().split(" ");
    for (String string : words) {
      word.set(string);
      context.write(word, countOne);
    }
  }
}
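Note that split(" ") keeps punctuation attached to words and emits an empty token wherever two or more spaces occur in a row, which shows up in the sample output in section 7. A more robust tokenization is sketched below; it is an optional variation, not what was used for the run logged in section 6.

    // Optional variation of the map() body: split on runs of whitespace and
    // skip empty tokens, so consecutive spaces and tabs do not emit "".
    String[] words = value.toString().split("\\s+");
    for (String string : words) {
      if (string.isEmpty()) {
        continue; // leading whitespace still yields one empty leading token
      }
      word.set(string);
      context.write(word, countOne);
    }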

4.4 Reducer Code: WordCountReducer.java

package com.javadeveloperzone.bigdata.hadoop;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum the counts for each word. Summing value.get() (rather than merely
    // counting the values) keeps the reducer correct even if a combiner runs.
    int total = 0;
    for (IntWritable value : values) {
      total += value.get();
    }
    context.write(key, new IntWritable(total));
  }
}
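Because the reducer simply sums its input values, it can also act as a combiner, pre-aggregating counts on the map side and shrinking the shuffle. This is an optional improvement; the run logged in section 6 does not use it (note Combine input records=0 there). To enable it, add one line to WordCountDriver.run(), after setReducerClass():

    job.setCombinerClass(WordCountReducer.class); // pre-aggregate counts per mapper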

4.5 Copy files from local file system to HDFS

Copy the input files from the local file system to HDFS.
Make sure your Hadoop cluster is up and running.
I have used the following commands to copy the files from the local file system to HDFS.

hdfs dfs -copyFromLocal The-Adventures-of-Sherlock-Holmes.txt /input
hdfs dfs -copyFromLocal The-Return-of-Sherlock-Holmes.txt /input
hdfs dfs -copyFromLocal A-Lesson-to-Fathers.txt /input
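
These commands assume the /input directory already exists in HDFS. If it does not, create it before running the copy commands above, and then verify that the files landed there:

hdfs dfs -mkdir /input
hdfs dfs -ls /input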

5. Build & Run Application

Now build the jar file that we are going to submit to the Hadoop cluster.
Once the jar has been built, we can use the following command to run the Hadoop word count job on the cluster.

hadoop jar HadoopWordCount.jar /input /output/HadoopWordCount
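
With the build file from section 4.1, gradle clean build produces the jar under build/libs, named after the project; the name HadoopWordCount.jar above assumes the archive has been named (or renamed) accordingly. The output directory must not already exist, or the job will fail. If the jar's manifest does not declare a main class, pass the driver class explicitly:

gradle clean build
hadoop jar HadoopWordCount.jar com.javadeveloperzone.bigdata.hadoop.WordCountDriver /input /output/HadoopWordCount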

6. Output Log

18/02/03 12:07:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/03 12:07:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/03 12:07:51 INFO input.FileInputFormat: Total input paths to process : 3
18/02/03 12:07:52 INFO mapreduce.JobSubmitter: number of splits:3
18/02/03 12:07:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1517639785506_0001
18/02/03 12:07:53 INFO impl.YarnClientImpl: Submitted application application_1517639785506_0001
18/02/03 12:07:53 INFO mapreduce.Job: The url to track the job: http://javadeveloperzone:8088/proxy/application_1517639785506_0001/
18/02/03 12:07:53 INFO mapreduce.Job: Running job: job_1517639785506_0001
18/02/03 12:08:00 INFO mapreduce.Job: Job job_1517639785506_0001 running in uber mode : false
18/02/03 12:08:00 INFO mapreduce.Job:  map 0% reduce 0%
18/02/03 12:08:08 INFO mapreduce.Job:  map 100% reduce 0%
18/02/03 12:08:16 INFO mapreduce.Job:  map 100% reduce 100%
18/02/03 12:08:19 INFO mapreduce.Job: Job job_1517639785506_0001 completed successfully
18/02/03 12:08:19 INFO mapreduce.Job: Counters: 49
  File System Counters
    FILE: Number of bytes read=3969779
    FILE: Number of bytes written=8363679
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=1907611
    HDFS: Number of bytes written=260998
    HDFS: Number of read operations=12
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
  
  Job Counters 
    Launched map tasks=3
    Launched reduce tasks=1
    Data-local map tasks=3
    Total time spent by all maps in occupied slots (ms)=18008
    Total time spent by all reduces in occupied slots (ms)=5491
    Total time spent by all map tasks (ms)=18008
    Total time spent by all reduce tasks (ms)=5491
    Total vcore-seconds taken by all map tasks=18008
    Total vcore-seconds taken by all reduce tasks=5491
    Total megabyte-seconds taken by all map tasks=18440192
    Total megabyte-seconds taken by all reduce tasks=5622784
  Map-Reduce Framework
    Map input records=39226
    Map output records=350295
    Map output bytes=3269183
    Map output materialized bytes=3969791
    Input split bytes=372
    Combine input records=0
    Combine output records=0
    Reduce input groups=23987
    Reduce shuffle bytes=3969791
    Reduce input records=350295
    Reduce output records=23987
    Spilled Records=700590
    Shuffled Maps =3
    Failed Shuffles=0
    Merged Map outputs=3
    GC time elapsed (ms)=256
    CPU time spent (ms)=8710
    Physical memory (bytes) snapshot=987660288
    Virtual memory (bytes) snapshot=2785767424
    Total committed heap usage (bytes)=757596160
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  File Input Format Counters 
    Bytes Read=1907239
  File Output Format Counters 
    Bytes Written=260998
Job Completed successfully...

7. Output(Portion)

Once the job completes successfully, you will get output that looks like the listing below. The words still carry quotes and punctuation because the mapper splits on a single space only; for the same reason, the count on the first line (11824, with an empty key) is most likely the empty token produced wherever consecutive spaces appear in the input.

  11824
"'A	1
"'About	1
"'Absolute	1
"'Ah!'	2
"'Ah,	2
"'Ample.'	1
"'And	10
"'Arthur!'	1
"'As	1
"'At	1
"'Because	1
"'Breckinridge,	1
"'But	1
"'But,	1
"'But,'	1
"'Certainly	2
"'Certainly,'	1
"'Come!	1
"'Come,	1
"'DEAR	1
"'Dear	2
"'Dearest	1
"'Death,'	1
"'December	1
"'Do	3
"'Don't	1
"'Entirely.'	1
"'For	1
"'Fritz!	1
"'From	1
"'Gone	1
"'Hampshire.	1
"'Have	1
"'Here	1
"'How	2
"'I	22
"'If	2
"'In	2
"'Is	3
"'It	7
"'It's	1
"'Jephro,'	1
"'Keep	1
"'Ku	1
"'L'homme	1
"'Look	2
"'Lord	1
"'MY	2
"'May	1
"'Most	1
"'Mr.	2
"'My	4
"'Never	1
"'Never,'	1

8. Source Code

You can download the source code of this Hadoop MapReduce WordCount example from our Git repository; it can serve as boilerplate code for writing more complex Hadoop MapReduce programs in Java.

