Spark RDD Example

December 23, 2018 | Spark

Table of Contents

1. Overview
2. Development environment
3. Input Files
4. Project Structure
5. Solution
- 5.1 Loading the external dataset
- 5.2 Parallelizing a collection
6. API & References

1. Overview

As technology advances, data is growing faster than processing speeds, so one practical way to process massive amounts of data is to parallelize the work across large clusters.
Apache Spark is a unified processing framework, and the RDD (Resilient Distributed Dataset) is the fundamental building block of Spark processing.
In this article, we walk through a Spark RDD example that shows how to create an RDD in Apache Spark.

Make sure that you have installed Apache Spark. If you have not installed it yet, you may follow our step-by-step article on installing Apache Spark on Ubuntu.

2. Development environment

Java : Oracle JDK 1.8
Scala : 2.11.7
Spark : Apache Spark 2.0.0-bin-hadoop2.6
IDE : Eclipse
Build Tool: Gradle 4.4.1

3. Input Files

To explain the Spark RDD example, we use the Project Gutenberg eBook of A Christmas Carol, by Charles Dickens, as the input file.

4. Project Structure

5. Solution

Spark provides two ways to create an RDD: loading an external dataset and parallelizing an existing collection.

5.1 Loading the external dataset

We can create an RDD by loading data from an external source such as HDFS, S3, or the local file system.
To demonstrate RDD creation, we use a data file stored on the local file system.

The following snippet shows how to create an RDD by loading an external dataset.

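import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// The statements below belong inside main(String[] args).
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("Spark RDD Example using Java");

// Set the master to local mode with two threads so the job can run from the IDE.
sparkConf.setMaster("local[2]");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

// First way to create an RDD: read data from an external data source.
// Here args[0] is the path of the input file; each RDD element is one line.
JavaRDD<String> textFile = sparkContext.textFile(args[0]);

// Create an RDD of words by splitting each line of the input file.
JavaRDD<String> words = textFile.flatMap(new SplitFunction());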

The snippet above uses a custom function, SplitFunction, which implements Spark's FlatMapFunction interface.
Our SplitFunction looks like this:

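import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.api.java.function.FlatMapFunction;

// Nested inside the driver class; turns one input line into many words.
static class SplitFunction implements FlatMapFunction<String, String> {
  private static final long serialVersionUID = 1L;

  @Override
  public Iterator<String> call(String s) {
    // Split on single spaces and return an iterator over the resulting words.
    return Arrays.asList(s.split(" ")).iterator();
  }
}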
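To see what this flatMap step produces without starting Spark, the same splitting logic can be tried locally. The following is only a plain-Java analogy built on java.util.stream, not Spark code; the class name SplitDemo, the method splitWords, and the two sample lines are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java analogy (no Spark required). SplitDemo and splitWords are
// hypothetical names used only for this illustration.
class SplitDemo {

    // Mirrors the body of SplitFunction.call: split each line on a single
    // space, then flatten the per-line word arrays into one list.
    static List<String> splitWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample input; the second line contains a double space on purpose.
        List<String> lines = Arrays.asList("Marley was dead", "to  begin with");
        // prints [Marley, was, dead, to, , begin, with]
        System.out.println(splitWords(lines));
    }
}
```

Two lines go in and seven elements come out, which is the one-to-many behaviour flatMap provides. Note the empty string produced by the double space: splitting on a single space keeps empty tokens between consecutive spaces. If that matters for your data, splitting on the regular expression "\\s+" collapses runs of whitespace between words.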

5.2 Parallelizing a collection

Another method for creating an RDD is to parallelize a collection.
This method is relatively simple and ideal for testing and learning, but it is generally not suitable for production
because it requires the entire dataset to fit in memory on one machine.

The following snippet shows a simple way to create an RDD from an in-memory collection:

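// Distribute an in-memory list as an RDD of strings.
JavaRDD<String> likes = sparkContext.parallelize(Arrays.asList("spark", "I like spark"));
// Write the RDD as text to the output directory given by args[2].
likes.saveAsTextFile(args[2]);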
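The memory limitation comes from the fact that the whole collection must exist on one machine before it can be divided among partitions. The sketch below illustrates that idea in plain Java, assuming the list is cut into contiguous, near-equal slices. It is a conceptual model only: SliceDemo and slice are hypothetical names, this is not Spark's API, and Spark's actual partitioning logic may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only. SliceDemo and slice are hypothetical names and
// are not part of Spark.
class SliceDemo {

    // Cuts an in-memory list into numSlices contiguous, near-equal chunks.
    // The complete list must already be in memory before slicing starts.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> slices = new ArrayList<>();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * data.size() / numSlices);
            int end = (int) ((long) (i + 1) * data.size() / numSlices);
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> likes = Arrays.asList("spark", "I like spark");
        // prints [[spark], [I like spark]]
        System.out.println(slice(likes, 2));
    }
}
```

With two elements and two slices, each slice holds one element. In the sketch, as in a parallelized collection, nothing is distributed until the full list has been built in one place, which is why this approach suits small test data better than large datasets.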

6. API & References

This article uses the Spark API for Java. You can download the complete example from our Git repository.


Tags: Apache Spark, Spark RDD Example

Prashant Khunt

http://javadeveloperzone.com

Prashant Khunt is a happily married father and a Java/Big Data developer. He currently lives in Rajkot City and works in the IT industry as a Senior Big Data developer, with 10 years of experience in software development and business management. You can hire him through Hire Hadoop BigData Developer from the ProminentPixel team. He is also a member of the development team at ProminentPixel for their Java web development services. In his free time, he loves spending time with his family and reading books.
