Apache Spark

Apache Spark: the basics

RDD: Resilient Distributed Dataset

It’s encapsulation on a collection of data. Distributed in clusters automatically. RDDs are immutable.

You can apply transformations (return a new RDD with, for example, filtered data), and Actions (like First(), to return the first item in them

RDDs are resilient, they can lose nodes, and be able to recreate them automagically

Transformations in RDD are lazy loaded. For example, if we have lines that open a file, and then filter it, the opening of the file won’t happen right away. Spark reads the entire data set first, and determine to only save the filtered data, for example.


It is a connection to a computer cluster, used to build RDDs

Example of loading an RDD from external storage:

sc = SparkContext(“local”, “texfile”) # builds the context first
lines = sc.textFile(“in/uppercase.text”) # creates the RDD


They do not mutate the current RDD, they return a new one.

filter() # returns elements passing the filter function

map() # applies the map function to each element on the original RDD and return the results in a new one

RDD actions

collect: puts the entire RDD in the driver program, usually to persist to disk. Memory intensive, make sure it is use in filtered, small datasets

count / countByValue: count number of rows, or number of unique values

take: return a subset of the RDD

saveAsTextFile: outputs to a storage in text mode

reduce: apply a lambda function to all the elements, two at a time, until we get a single result in return

persist: it will keep a copy of a produced RDD in memory, available fast for all nodes. You can pass the prefer storage level you like (DISK_ONLY, MEMORY_ONLY, etc) unpersist removes from the cache.

Leave a Reply

Your email address will not be published. Required fields are marked *