1. Overview

With advances in technology, data is growing faster than processing speeds, and the practical way to process massive amounts of data is to parallelize the work across large clusters.
Apache Spark is a unified processing framework, and the RDD (Resilient Distributed Dataset) is the fundamental building block of Spark processing.
In this article, we explain Spark RDD examples and the two ways of creating an RDD in Apache Spark.

Make sure that you have installed Apache Spark. If you have not installed it yet, you can follow our step-by-step article on installing Apache Spark on Ubuntu.

2. Development environment

Java : Oracle JDK 1.8
Scala : 2.11.7
Spark : Apache Spark 2.0.0-bin-hadoop2.6
IDE : Eclipse
Build Tool : Gradle 4.4.1

3. Input Files

To illustrate the Spark RDD examples, we use the Project Gutenberg EBook of A Christmas Carol by Charles Dickens as the input file.

4. Project Structure


5. Solution

Spark provides two ways to create an RDD.

5.1 Loading the external dataset

We can create an RDD by loading data from an external source such as HDFS, S3, or the local file system.
To illustrate RDD creation, we use a data file available on the local file system.

The following snippet shows how to create an RDD by loading an external dataset.

SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("Spark RDD Example using Java");
// Setting master to local for running it from the IDE.
sparkConf.setMaster("local[*]");

JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

/* The first way to create an RDD is to read data from an external
 * data source; here we read input from a file. */
JavaRDD<String> textFile = sparkContext.textFile(args[0]);
/* Creating an RDD of words from each line of the input file. */
JavaRDD<String> words = textFile.flatMap(new SplitFunction());

In the above snippet, we use a custom function that implements FlatMapFunction,
so our SplitFunction looks like this:

static class SplitFunction implements FlatMapFunction<String, String> {
  private static final long serialVersionUID = 1L;

  @Override
  public Iterator<String> call(String s) {
    return Arrays.asList(s.split(" ")).iterator();
  }
}
5.2 Parallelizing a collection

Another way to create an RDD is to parallelize an in-memory collection.
This method is simple and ideal for testing and learning, but it is rarely useful in production
because it requires the entire dataset to fit in memory on a single machine.

The following code snippet is a simple way to create an RDD in Apache Spark:

JavaRDD<String> likes = sparkContext.parallelize(Arrays.asList("spark","I like spark"));
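RDD creation is lazy, so nothing is computed until an action runs. A quick way to verify the RDD is to call an action such as count() or collect(), which trigger evaluation and return results to the driver. A minimal sketch, assuming the `likes` RDD from the snippet above:

```java
import java.util.List;

// Actions trigger evaluation: count() returns the number of elements,
// collect() brings the whole RDD back to the driver as a List.
long total = likes.count();               // 2
List<String> items = likes.collect();
System.out.println(total + " elements: " + items);
```

Note that collect() copies the entire RDD to the driver, so it should only be used on small datasets like this one.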

6. API & References

We used the Spark API for Java to write this article; you can download the complete example from our Git repository.

