RDD: Resilient Distributed Dataset
An RDD is an encapsulation of a collection of data, distributed across the cluster automatically. RDDs are immutable.
You can apply transformations (which return a new RDD, for example with filtered data) and actions (like first(), which returns the first item in the RDD).
RDDs are resilient: if a node is lost, Spark can rebuild the lost partitions automatically from the RDD's lineage.
Transformations on RDDs are lazily evaluated. For example, if our code opens a file and then filters it, the file is not read right away. Spark waits until an action is called, so it can plan the whole computation and keep only the filtered data rather than materializing the entire dataset.
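A minimal sketch of lazy evaluation, assuming sc is an existing SparkContext (created as in the next section) and the file path is just a placeholder:
lines = sc.textFile("in/sample.txt")  # transformation: no file is read yet
errors = lines.filter(lambda line: "ERROR" in line)  # transformation: still nothing runs
print(errors.count())  # action: only now does Spark read the file and apply the filter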
SparkContext
It is a connection to a computer cluster, used to build RDDs
Example of loading an RDD from external storage:
from pyspark import SparkContext
sc = SparkContext("local", "textfile")  # builds the context first
lines = sc.textFile("in/uppercase.text")  # creates the RDD (one element per line)
Transformations
They do not mutate the current RDD; they return a new one.
filter() # returns elements passing the filter function
map() # applies the given function to each element of the original RDD and returns the results in a new one
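A short sketch of both transformations, assuming an existing SparkContext sc and a made-up in-memory dataset:
nums = sc.parallelize([1, 2, 3, 4, 5])  # build an RDD from a Python list
evens = nums.filter(lambda n: n % 2 == 0)  # new RDD: [2, 4]
squares = nums.map(lambda n: n * n)  # new RDD: [1, 4, 9, 16, 25]
# nums itself is unchanged; each transformation returned a new RDD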
RDD actions
collect: brings the entire RDD into the driver program, usually to persist it to disk. It is memory intensive, so make sure it is used on small, filtered datasets.
count / countByValue: count returns the number of rows; countByValue returns how many times each unique value appears.
take: returns the first n elements of the RDD.
saveAsTextFile: writes the RDD out to storage as text files.
reduce: applies a function to the elements, two at a time, until a single result is produced.
persist: keeps a copy of a computed RDD in memory, quickly available to all nodes. You can pass the storage level you prefer (DISK_ONLY, MEMORY_ONLY, etc.); unpersist removes it from the cache.
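A quick sketch of these actions, assuming an existing SparkContext sc and small made-up datasets (the output path is a placeholder):
from pyspark import StorageLevel
words = sc.parallelize(["a", "b", "a", "c", "b", "a"])
words.persist(StorageLevel.MEMORY_ONLY)  # cache the RDD in memory for reuse
print(words.collect())  # ['a', 'b', 'a', 'c', 'b', 'a'] -- whole RDD in the driver
print(words.count())  # 6
print(words.countByValue())  # counts per value: a -> 3, b -> 2, c -> 1
print(words.take(2))  # ['a', 'b']
nums = sc.parallelize([1, 2, 3, 4])
print(nums.reduce(lambda x, y: x + y))  # 10
words.saveAsTextFile("out/words")  # writes the elements as text files under out/words
words.unpersist()  # remove the RDD from the cache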