

Table of Contents

1. Overview
2. Development environment
3. Sample Input
4. Solution
5. Build & Run Application
6. Output Log
7. Output (Portion)
8. Source Code
9. References
1. Overview
In this article we walk through the problem most commonly used to introduce distributed computing frameworks: the Hadoop MapReduce WordCount example, implemented in Java.
For a Hadoop developer with a Java skill set, the WordCount example is the first step in the Hadoop development journey.
2. Development environment
Java : Oracle JDK 1.8
Hadoop : Apache Hadoop 2.2.0
IDE : Eclipse
Build Tool : Gradle 3.5
3. Sample Input
Hadoop (MapReduce and HDFS) really shows its strength on massive inputs, but for learning purposes we use small input files here.
For this tutorial, we use the following text files (UTF-8 encoded) as input:
Input File 1: The Adventures of Sherlock Holmes, by Arthur Conan Doyle.
Input File 2: The Return of Sherlock Holmes, by Arthur Conan Doyle.
Input File 3: A Lesson to Fathers, by F. Anstey.
4. Solution
Using core MapReduce
We will use the following three Java files to explain the Hadoop MapReduce example:
WordCountDriver.java
WordCountMapper.java
WordCountReducer.java

(Figure: Hadoop MapReduce WordCount example using Java)
4.1 Build File: build.gradle
apply plugin: 'java'

description """Hadoop MapReduce WordCount Example Using Java"""

sourceCompatibility = 1.7
targetCompatibility = 1.7

/* In this section you declare where to find the dependencies of your project */
repositories {
    jcenter()
    mavenCentral()
}

dependencies {
    compile group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.2.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.2.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.2.0'
}
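Note that the run command in section 5 calls the jar without naming the driver class, which only works if the jar's manifest declares a Main-Class. The build file above does not show that configuration; a minimal sketch of the extra Gradle block (assuming the package and class names used in this tutorial) would be:

jar {
    manifest {
        // Lets `hadoop jar HadoopWordCount.jar <in> <out>` locate the driver
        attributes 'Main-Class': 'com.javadeveloperzone.bigdata.hadoop.WordCountDriver'
    }
}

Alternatively, pass the fully qualified class name on the command line: hadoop jar HadoopWordCount.jar com.javadeveloperzone.bigdata.hadoop.WordCountDriver /input /output/HadoopWordCount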
4.2 Driver Code: WordCountDriver.java
package com.javadeveloperzone.bigdata.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration configuration = getConf();
        Job job = Job.getInstance(configuration, "WordCountJob");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] = input path on HDFS, args[1] = output path (must not exist yet)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
        if (result == 0) {
            System.out.println("Job Completed successfully...");
        } else {
            System.out.println("Job Execution Failed with status::" + result);
        }
    }
}
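Because the driver extends Configured and is launched through ToolRunner, generic Hadoop options can be supplied at launch time without touching the code. For example, a hypothetical invocation that asks for two reducers:

hadoop jar HadoopWordCount.jar -D mapreduce.job.reduces=2 /input /output/HadoopWordCount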
4.3 Mapper Code: WordCountMapper.java
package com.javadeveloperzone.bigdata.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable countOne = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Note: splitting on a single space keeps punctuation attached to words
        // and produces empty tokens wherever consecutive spaces occur (visible
        // as the blank key in the output); split on "\\s+" for stricter tokenization.
        String[] words = value.toString().split(" ");
        for (String string : words) {
            word.set(string);
            context.write(word, countOne); // emit (word, 1) for every token
        }
    }
}
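For example, given the input line "how do you know it is Holmes", the mapper emits (how, 1), (do, 1), (you, 1), (know, 1), (it, 1), (is, 1), (Holmes, 1); the framework then groups these pairs by key, so the reducer receives each distinct word together with the list of its 1s.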
4.4 Reducer Code: WordCountReducer.java
package com.javadeveloperzone.bigdata.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        // Sum the counts (each is 1 from the mapper) rather than merely
        // counting the values, so the same class could also act as a combiner.
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}
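Because the reducer sums its input values, the same class could also be registered as a combiner to pre-aggregate counts on the map side and shrink the shuffle. The driver above does not do this (the output log in section 6 shows Combine input records=0); the optional one-line addition to run() would be:

job.setCombinerClass(WordCountReducer.class); // optional: combine on the map side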
4.5 Copy files from local file system to HDFS
Make sure your Hadoop cluster is up and running, then copy the input files from the local file system to HDFS.
I used the following commands to copy the files from the local file system to HDFS:
hdfs dfs -copyFromLocal The-Adventures-of-Sherlock-Holmes.txt /input
hdfs dfs -copyFromLocal The-Return-of-Sherlock-Holmes.txt /input
hdfs dfs -copyFromLocal A-Lesson-to-Fathers.txt /input
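These commands assume the /input directory already exists on HDFS. If it does not, create it first:

hdfs dfs -mkdir -p /input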
5. Build & Run Application
Now build the jar file that we will submit to the Hadoop cluster.
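With the Gradle build file from section 4.1, a plain build is enough (a sketch; by default the java plugin writes the jar to build/libs, and the archive name follows the project name rather than the HadoopWordCount.jar used below):

gradle clean build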
Once the jar file is built, we can use the following command to run the word count job on the Hadoop cluster:
hadoop jar HadoopWordCount.jar /input /output/HadoopWordCount
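MapReduce will refuse to start if the output directory already exists, so delete /output/HadoopWordCount (or choose a new path) between runs. Once the job finishes, the result can be inspected straight from HDFS (the part file name below is the usual default and may differ):

hdfs dfs -cat /output/HadoopWordCount/part-r-00000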
6. Output Log
18/02/03 12:07:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/03 12:07:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/03 12:07:51 INFO input.FileInputFormat: Total input paths to process : 3
18/02/03 12:07:52 INFO mapreduce.JobSubmitter: number of splits:3
18/02/03 12:07:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1517639785506_0001
18/02/03 12:07:53 INFO impl.YarnClientImpl: Submitted application application_1517639785506_0001
18/02/03 12:07:53 INFO mapreduce.Job: The url to track the job: http://javadeveloperzone:8088/proxy/application_1517639785506_0001/
18/02/03 12:07:53 INFO mapreduce.Job: Running job: job_1517639785506_0001
18/02/03 12:08:00 INFO mapreduce.Job: Job job_1517639785506_0001 running in uber mode : false
18/02/03 12:08:00 INFO mapreduce.Job:  map 0% reduce 0%
18/02/03 12:08:08 INFO mapreduce.Job:  map 100% reduce 0%
18/02/03 12:08:16 INFO mapreduce.Job:  map 100% reduce 100%
18/02/03 12:08:19 INFO mapreduce.Job: Job job_1517639785506_0001 completed successfully
18/02/03 12:08:19 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=3969779
		FILE: Number of bytes written=8363679
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1907611
		HDFS: Number of bytes written=260998
		HDFS: Number of read operations=12
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=3
		Launched reduce tasks=1
		Data-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=18008
		Total time spent by all reduces in occupied slots (ms)=5491
		Total time spent by all map tasks (ms)=18008
		Total time spent by all reduce tasks (ms)=5491
		Total vcore-seconds taken by all map tasks=18008
		Total vcore-seconds taken by all reduce tasks=5491
		Total megabyte-seconds taken by all map tasks=18440192
		Total megabyte-seconds taken by all reduce tasks=5622784
	Map-Reduce Framework
		Map input records=39226
		Map output records=350295
		Map output bytes=3269183
		Map output materialized bytes=3969791
		Input split bytes=372
		Combine input records=0
		Combine output records=0
		Reduce input groups=23987
		Reduce shuffle bytes=3969791
		Reduce input records=350295
		Reduce output records=23987
		Spilled Records=700590
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=256
		CPU time spent (ms)=8710
		Physical memory (bytes) snapshot=987660288
		Virtual memory (bytes) snapshot=2785767424
		Total committed heap usage (bytes)=757596160
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=1907239
	File Output Format Counters
		Bytes Written=260998
Job Completed successfully...
7. Output (Portion)
Once the job completes successfully, the output looks like the excerpt below. Each line is a word followed by its count, separated by a tab. Note the very first entry: its key is the empty string (counted 11824 times), an artifact of the mapper splitting lines on a single space.
11824 "'A 1 "'About 1 "'Absolute 1 "'Ah!' 2 "'Ah, 2 "'Ample.' 1 "'And 10 "'Arthur!' 1 "'As 1 "'At 1 "'Because 1 "'Breckinridge, 1 "'But 1 "'But, 1 "'But,' 1 "'Certainly 2 "'Certainly,' 1 "'Come! 1 "'Come, 1 "'DEAR 1 "'Dear 2 "'Dearest 1 "'Death,' 1 "'December 1 "'Do 3 "'Don't 1 "'Entirely.' 1 "'For 1 "'Fritz! 1 "'From 1 "'Gone 1 "'Hampshire. 1 "'Have 1 "'Here 1 "'How 2 "'I 22 "'If 2 "'In 2 "'Is 3 "'It 7 "'It's 1 "'Jephro,' 1 "'Keep 1 "'Ku 1 "'L'homme 1 "'Look 2 "'Lord 1 "'MY 2 "'May 1 "'Most 1 "'Mr. 2 "'My 4 "'Never 1 "'Never,' 1
8. Source Code
You can download the source code of the Hadoop MapReduce WordCount example from our git repository; it can serve as boilerplate code for writing more complex Hadoop MapReduce programs in Java.
9. References
Apache Hadoop Word Count Tutorial