1. Overview

Nowadays, with the advancement of technology, millions of devices are generating data at massive speed.
Organizations across the globe are digging deeper to extract valuable information from this data, which is why data is often called the “new oil”.
Apache Spark is a fast and general engine for large-scale data processing.
So in this blog, we walk through the program most commonly used to introduce distributed computing frameworks:
the Spark WordCount example.
For a big data developer, the Spark WordCount example is the first step in the Spark development journey.

2. Development environment

Java       : Oracle JDK 1.8
Spark      : Apache Spark 2.0.0-bin-hadoop2.6
IDE        : Eclipse
Build Tool : Gradle 4.4.1

3. Sample Input

In order to experience the power of Spark, the input data size should be massive. In our case, however, we are using a small input file for learning.
For this tutorial, we use the following text file (UTF-8 format) as input:

Input File 1: 3sonnets.vrs.txt (which contains THE CLOUD SCULPTORS, written by Staeorra Rokraven)

4. Solution

We are solving the Spark WordCount example problem using two prominent programming languages: Java and Scala.

Using Spark Core

We are using the following two files for the Spark WordCount example:

Java source file
WordCount.java

Scala Source file
WordCount.scala


4.1 Build File : build.gradle

apply plugin: 'java'
apply plugin: 'scala'

description """Spark WordCount example"""

sourceCompatibility = 1.8
targetCompatibility = 1.8

/* In this section you declare where to find the dependencies of your project */
repositories {
    jcenter()
    mavenCentral()
}

dependencies {
    compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
}

4.2 Java Code: WordCount.java

package com.javadeveloperzone.spark.java;

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {
  
  /*For simplicity, we define a custom split function by implementing
   *the FlatMapFunction interface; this keeps the code easier to read.*/
  static class SplitFunction implements FlatMapFunction<String, String>
  {

    private static final long serialVersionUID = 1L;

    @Override
    public Iterator<String> call(String s) {
      return Arrays.asList(s.split(" ")).iterator();
    }
    
  }
  
  public static void main(String[] args)
  {

    SparkConf sparkConf = new SparkConf();
        
    sparkConf.setAppName("Spark WordCount example using Java");
    
    //Setting master to local[2] so the example can run directly from the IDE.
    sparkConf.setMaster("local[2]");

    JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

    /*Reading input file whose path was specified as args[0]*/
    JavaRDD<String> textFile = sparkContext.textFile(args[0]);
    
    /*Creating RDD of words from each line of input file*/
    JavaRDD<String> words = textFile.flatMap(new SplitFunction());
    
    /*The code below pairs each word with a count of one,
     *similar to the Mapper in Hadoop MapReduce*/
    JavaPairRDD<String, Integer> pairs = words
        .mapToPair(new PairFunction<String, String, Integer>() {
          public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, 1);
          }
        });
    
    /*The code below aggregates the counts of identical words,
     *similar to the Reducer in Hadoop MapReduce
     */
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer a, Integer b) {
            return a + b;
          }
        });

    /*Saving the result file to the location that we have specified as args[1]*/
    counts.saveAsTextFile(args[1]);
    //close() also stops the SparkContext, so a separate stop() call is not needed
    sparkContext.close();
  }
}
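
Because our development environment uses JDK 1.8, the same pipeline can also be written far more compactly with lambda expressions. Below is a minimal sketch of such a variant; the class name WordCountLambda is our own choice for illustration, everything else follows the same Spark 2.0 Java API used above.

package com.javadeveloperzone.spark.java;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountLambda {

  public static void main(String[] args) {

    SparkConf sparkConf = new SparkConf()
        .setAppName("Spark WordCount example using Java lambdas")
        .setMaster("local[2]");

    //JavaSparkContext implements Closeable, so try-with-resources closes it for us
    try (JavaSparkContext sparkContext = new JavaSparkContext(sparkConf)) {

      //split each line of the input file into words
      JavaRDD<String> words = sparkContext.textFile(args[0])
          .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

      //pair each word with a count of one, then sum the counts per word
      JavaPairRDD<String, Integer> counts = words
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey((a, b) -> a + b);

      counts.saveAsTextFile(args[1]);
    }
  }
}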

4.3 Scala Code: WordCount.scala

package com.javadeveloperzone.spark.scala

import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
  
   def main(args: Array[String]) 
   {
     
     /*By default we are setting local so it will run locally with one thread 
      *Specify: "local[2]" to run locally with 2 cores, OR 
      *        "spark://master:7077" to run on a Spark standalone cluster */
     
      val sparkContext = new SparkContext("local","Spark WordCount example using Scala",
          System.getenv("SPARK_HOME"))
      
      /*Reading input file whose path was specified as args(0)*/
      val input = sparkContext.textFile(args(0))
      
      /*Creating an RDD of words from each line of the input file*/
      val words = input.flatMap(line => line.split(" "))
      
      /*Performing the map and reduce operations*/
      val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
      
      /*Saving the result to the location that we have specified as args(1)*/
      counts.saveAsTextFile(args(1))
      
      sparkContext.stop()
    }
}

5. Build & Run Spark WordCount Example

We need to pass two arguments to run the program(s).
The first argument is the input file path and the second is the output path.
The output path (folder) must not already exist at that location; Spark will create it for us.
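
For example, after building the jar with Gradle, the Java version can be submitted from the project directory as shown below. This is a sketch: the exact jar name under build/libs depends on your Gradle project name, and the input/output paths are placeholders you must adapt.

gradle build

spark-submit \
  --class com.javadeveloperzone.spark.java.WordCount \
  --master local[2] \
  build/libs/spark-wordcount-example.jar \
  /path/to/3sonnets.vrs.txt \
  /path/to/output

Alternatively, since both programs already set the master inside the code, you can simply run the class from Eclipse with the two paths as program arguments.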

6. Output(Portion)

Once the job completes successfully, you will get output that looks like the following:

(voices,1)
(wonder,1)
(call,2)
(For,2)
(The,3)
(soaring,1)
(Moon,1)
(spark,1)
(under,1)
(glow,1)
(have,2)
(filled,1)
(old,1)
(intertwine,1)
(shadowed,1)
(express,1)
(All,1)
(This,4)
(arms,1)
(unseen,1)
(wish,1)
(clasp,1)
(thing,1)
(place,1)
(lofty,1)
(With,2)
(To,3)
(edge,1)
(Our,1)
(along,1)
(Those,1)
(walk,1)
(CLOUD,1)
(enchant,1)
(these,2)
(realms,1)
(SCULPTORS,1)
(the,12)
(dance,1)
(not,1)
(part,1)
(process,1)
(shape,1)
(love,1)
(all,3)
(soft,2)
(our,2)
(psychic,2)
(falling,1)
(Then,3)
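
Note that the pairs are not sorted; reduceByKey only aggregates the counts. If you would like the output ordered alphabetically by word, one option (not part of the original example) is to sort the pair RDD before saving, e.g. in the Java version:

    //optional: sort the pairs alphabetically by word before saving
    JavaPairRDD<String, Integer> sorted = counts.sortByKey();
    sorted.saveAsTextFile(args[1]);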

7. Source Code

You can download the source code of the Spark WordCount example from our Git repository; it will help you understand how to write Spark programs in Java or Scala.
