1. Overview

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. A Hadoop job is normally submitted to a cluster for execution. Sometimes, as Big Data developers, we need to debug our logic. There are several ways to do this, such as adding job counters to track the required pieces of information, or printing error messages to the console or logs to see where things go wrong.

What if you could debug your Hadoop MapReduce job like ordinary code in your code editor? It is easier and more productive than the other approaches.

In this article, we will discuss how to debug Hadoop MapReduce code in a local environment and write the output to the local file system. We use IntelliJ IDEA as the debugger here.

2. Development Environment

Hadoop: 3.1.1

Java: Oracle JDK 1.8

IDE: IntelliJ IDEA 2018.3

3. Steps To Debug Code locally

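3.1 Add hadoop-mapreduce-client-jobclient Maven dependency

The very first step to debug Hadoop MapReduce code locally is to add the hadoop-mapreduce-client-jobclient Maven dependency. This artifact brings the local job runner onto the classpath (transitively), so the job can run inside the same JVM that the IDE starts.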

3.2 Set local file system

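Set either local or file:/// as the value of the fs.defaultFS job configuration parameter, so that the job reads and writes the local file system instead of HDFS. Use one of the two lines below, not both.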

conf.set("fs.defaultFS", "local");
conf.set("fs.defaultFS", "file:///");
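To see why both values lead to the same place, here is a small sketch that uses only the JDK. The helper normalizeDefaultFs is hypothetical and is not a Hadoop API. It only imitates what Hadoop does, as far as we know, when it reads this property: the name local is treated as an older alias of file:///. Treat it as an illustration under that assumption, not as Hadoop's actual implementation.

```java
import java.net.URI;

class DefaultFsSketch {

    // Hypothetical helper (not a Hadoop API): maps the legacy name "local"
    // to the file:/// URI and leaves any other value untouched.
    static URI normalizeDefaultFs(String name) {
        if ("local".equals(name)) {
            return URI.create("file:///");
        }
        return URI.create(name);
    }

    public static void main(String[] args) {
        URI fromAlias = normalizeDefaultFs("local");
        URI fromUri = normalizeDefaultFs("file:///");

        // Both settings end up with the "file" scheme, i.e. the local disk.
        System.out.println(fromAlias.getScheme());      // prints "file"
        System.out.println(fromAlias.equals(fromUri));  // prints "true"

        // The path part is the root of the local file system.
        System.out.println(fromAlias.getPath());        // prints "/"
    }
}
```

If you are unsure which value to pick, file:/// is the explicit form and avoids relying on the alias.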

3.3 Set Number of mappers and reducers

The final step is to set the number of mappers and reducers to 1, so that only a single mapper and a single reducer are launched for our job. With one task of each kind, it is much easier to follow the execution in the debugger.

conf.set("mapreduce.job.maps","1");
conf.set("mapreduce.job.reduces","1");
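The settings from the steps above can be kept together in one place, so they are easy to switch off again before the job is submitted to a real cluster. The following is a minimal sketch using only the JDK; the class name LocalDebugSettings and the method localDebugProperties are hypothetical names chosen for illustration, while the property names and values are the ones used in this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LocalDebugSettings {

    // Collects the local-debug properties from sections 3.2 and 3.3.
    // LinkedHashMap keeps them in the order in which they were described.
    static Map<String, String> localDebugProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.defaultFS", "file:///");      // use the local file system
        props.put("mapreduce.job.maps", "1");       // ask for a single mapper
        props.put("mapreduce.job.reduces", "1");    // ask for a single reducer
        return props;
    }

    public static void main(String[] args) {
        // Print the settings; in a driver each entry would go to conf.set(name, value).
        localDebugProperties().forEach(
                (name, value) -> System.out.println(name + " = " + value));
    }
}
```

In the driver, localDebugProperties().forEach(conf::set) would apply the entries to the Hadoop Configuration object, assuming conf is the Configuration that is later passed to Job.getInstance.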

4. Example

Here is the complete Multiple Outputs example with local debugging enabled.

4.1 pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>HadoopMapReduceDebugExample</groupId>
    <artifactId>HadoopMapReduceDebugExample</artifactId>
    <version>1.0-SNAPSHOT</version>
    <description>Hadoop MapReduce Debug Example</description>
    <build>
        <finalName>HadoopMapReduceDebugExample</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <configuration>
                    <useSystemClassLoader>false</useSystemClassLoader>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>3.1.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>3.1.1</version>
        </dependency>
    </dependencies>
</project>

4.2 MultipleOutputsDebugDriver.java

package com.javadeveloperzone;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MultipleOutputsDebugDriver extends Configured implements Tool {

    // Names of the outputs registered with MultipleOutputs below.
    public static final String OTHER = "OTHER";
    public static final String MUMBAI = "MUMBAI";
    public static final String DELHI = "DELHI";
    public static final String AHMEDABAD = "AHMEDABAD";

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic Hadoop options and passes
        // the remaining arguments to run().
        int exitCode = ToolRunner.run(new Configuration(),
                new MultipleOutputsDebugDriver(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Please provide two arguments:");
            System.out.println("[ 1 ] Input dir path");
            System.out.println("[ 2 ] Output dir path");
            return -1;
        }
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);

        // Reuse the configuration that ToolRunner has already prepared.
        Configuration conf = getConf();
        // Read and write the local file system instead of HDFS.
        conf.set("fs.defaultFS", "local");
//        conf.set("fs.defaultFS", "file:///");
        // Ask for a single mapper and a single reducer.
        conf.set("mapreduce.job.maps", "1");
        conf.set("mapreduce.job.reduces", "1");

        Job job = Job.getInstance(conf, "Debug Hadoop MapReduce Code Example");
        job.setJarByClass(MultipleOutputsDebugDriver.class);
        job.setMapperClass(MultipleOutputsMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Map-only job: this overrides mapreduce.job.reduces set above.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        // Register one named output per city, plus a catch-all.
        MultipleOutputs.addNamedOutput(job, AHMEDABAD, TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, DELHI, TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, MUMBAI, TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, OTHER, TextOutputFormat.class, Text.class, Text.class);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }
}

4.3 MultipleOutputsMapper.java

Refer to our previous MultipleOutputsMapper example.

5. Build & Debug

Our sample code is ready. Set the Hadoop job input path and output path as command line arguments (in IntelliJ IDEA, use the Program arguments field of the run configuration).

"sample_input.txt" "HDFS/output"

Refer to our previous article for the sample input.
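One practical point when debugging repeatedly: Hadoop refuses to start a job whose output directory already exists, so the second debug run fails unless the directory from the first run is removed. Because the output now lives on the local file system, it can be cleaned up with plain JDK calls. The sketch below is one possible way to do that; the class OutputDirCleaner and its method deleteIfExists are hypothetical helpers, not part of the original example, and the default path is simply the output argument shown above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OutputDirCleaner {

    // Hypothetical helper: recursively deletes a local directory if it exists.
    // Returns true when something was deleted, false when there was nothing to do.
    static boolean deleteIfExists(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directory.
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the output argument used in this article.
        Path output = Paths.get(args.length > 0 ? args[0] : "HDFS/output");
        System.out.println(deleteIfExists(output)
                ? "Removed " + output
                : "Nothing to remove");
    }
}
```

Run it before each debug session, or call deleteIfExists from your own test setup. Be careful to point it only at a directory you are willing to lose.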

It’s time to debug our Hadoop MapReduce code. Stepping through complex logic in the debugger helps us improve productivity.

Set breakpoints on the lines where you want to check the logic.

Click the Debug icon in your IntelliJ IDEA project to start debugging. If everything goes well, IntelliJ IDEA will pause the Hadoop MapReduce code at your first breakpoint.

6. Output

Here I have set two breakpoints in my project: one in the Driver class and one in the Mapper class. Refer to the screens below.

6.1 Debugger Screen of Driver class

[Screenshot: Debug Map Reduce Driver]

6.2 Debugger Screen of Mapper class

[Screenshot: Debug Map Reduce Mapper]

6.3 Job Output locally

Once the Hadoop MapReduce job has completed, we get the output in our local file system under the job output directory.
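Since the output is an ordinary local directory, it can also be inspected from code, for example in a quick check after a debug run. The following sketch lists the files directly under the output directory using only the JDK. The class OutputLister and its method listOutputFiles are hypothetical names; the exact file names you will see depend on the mapper, which is not shown in this article, so none are assumed here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OutputLister {

    // Returns the names of the regular files directly under outputDir, sorted.
    static List<String> listOutputFiles(Path outputDir) throws IOException {
        try (Stream<Path> entries = Files.list(outputDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the output argument used in this article.
        Path output = Paths.get(args.length > 0 ? args[0] : "HDFS/output");
        if (!Files.isDirectory(output)) {
            System.out.println("No output directory at " + output);
            return;
        }
        for (String name : listOutputFiles(output)) {
            System.out.println(name);
        }
    }
}
```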

[Screenshot: Debug Map Reduce Output]

7. References

8. Source Code

Hadoop-Map-Reduce-Debug-Example

You can also check our Git repository for Debug Hadoop Map Reduce Code and other useful examples.

 
