Apache Spark

Apache Spark: the basics

RDD: Resilient Distributed Dataset

It’s encapsulation on a collection of data. Distributed in clusters automatically. RDDs are immutable.

You can apply transformations (return a new RDD with, for example, filtered data), and Actions (like First(), to return the first item in them

RDDs are resilient, they can lose nodes, and be able to recreate them automagically

Transformations in RDD are lazy loaded. For example, if we have lines that open a file, and then filter it, the opening of the file won’t happen right away. Spark reads the entire data set first, and determine to only save the filtered data, for example.


It is a connection to a computer cluster, used to build RDDs

Example of loading an RDD from external storage:

sc = SparkContext(“local”, “texfile”) # builds the context first
lines = sc.textFile(“in/uppercase.text”) # creates the RDD


They do not mutate the current RDD, they return a new one.

filter() # returns elements passing the filter function

map() # applies the map function to each element on the original RDD and return the results in a new one

RDD actions

collect: puts the entire RDD in the driver program, usually to persist to disk. Memory intensive, make sure it is use in filtered, small datasets

count / countByValue: count number of rows, or number of unique values

take: return a subset of the RDD

saveAsTextFile: outputs to a storage in text mode

reduce: apply a lambda function to all the elements, two at a time, until we get a single result in return

persist: it will keep a copy of a produced RDD in memory, available fast for all nodes. You can pass the prefer storage level you like (DISK_ONLY, MEMORY_ONLY, etc) unpersist removes from the cache.


Java: the basics

Use int, long (primitives), instead of their objects (Integer, Long)

primitives are the atomic, basic, data types, unless you know what you are doing, stick to those.

They (primitives) are passed by value. Long and Integer are the object form of the primitives, to be avoided unless you need to pass by reference, or pass and make the code infer the type.

sample of using inline filters: -> s.getId() == COMPARE_ID)

where s is a particular member of somelist, with getId() as a method, and we are just picking the one where id match COMPARE_ID in this case

Spring / autowire

  • when you see it on a class, it pretty much means “you are going to need one of these, and I am going to wire it for you”. Example:
public class SomeClass  {
private ChallengeManager challengeManager;
public void setChallengeManager(@Qualifier(SpringConstants.COMPONENT_CHALLENGE_MANAGER) ChallengeManager challengeManager) {
    this.challengeManager = challengeManager;

so when SomeClass gets spawned, it will automagically include the class marked by @Autowired

the @Qualifier(SOMECONSTANT) is to ensure it is the class you want to autowire

in complex systems, there may be more than one ChallengeManager, so that qualifier and constant will make sure we are auto wiring the right one

Throwing and catching

if an interface is marked as “throws”, there should be a try catch that deal with the specified throwable exception in there

but if you want to make the parent calls of the class / method to deal with it instead, you can just add “throws ExceptionName” in the signature

AWS Glue

AWS Glue: the basics

  1. Crawl your data source first
    1. to create a catalog
    2. and table definitions
  2. Add a job to process your crawled data

That’s all!