Note: This Apache Spark blog post initially appeared here on my alvinalexander.com website.
This is an excerpt from the Scala Cookbook, 2nd Edition. This is Recipe 20.2, Reading a File Into an Apache Spark RDD.
Problem
You want to start reading data files into a Spark RDD.
Solution
The canonical example for showing how to read a data file into an RDD is a “word count” application. So, not to disappoint, this recipe shows how to read the text of the Gettysburg Address by Abraham Lincoln and count how many times each word in the text is used.
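Before bringing Spark into the picture, it may help to see the word-count idea on its own. The following is a minimal sketch using only plain Scala collections, not the recipe's Spark solution. The names WordCountSketch and wordCounts are hypothetical, the sample lines are just a short phrase from the address, and splitting on non-word characters is an assumption made for illustration.

```scala
object WordCountSketch {
  // Hypothetical helper: lowercase each line, split it on runs of
  // non-word characters, drop empty tokens, then count each word.
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
}

// Sample input: a short phrase from the address, split over two lines.
val sample = Seq(
  "we can not dedicate, we can not consecrate,",
  "we can not hallow this ground"
)

val counts = WordCountSketch.wordCounts(sample)
println(counts("we"))      // 3
println(counts("ground"))  // 1
```

The Spark version follows the same general shape (split lines into words, then count per word), with the difference that the data lives in an RDD rather than a local Seq.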
After starting the Spark shell, the first step in the process is to read a file named Gettysburg-Address.txt using the textFile method of the SparkContext variable sc that was introduced in the previous recipe: