RDD: Resilient Distributed Dataset. An RDD is an encapsulation of a collection of data that is distributed across the cluster automatically. RDDs are immutable. You can apply transformations (which return a new RDD with, for example, filtered data) and actions (like First(), which returns the first item in the RDD). RDDs are resilient: they can lose nodes and still be able to recreate them […]
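The split between transformations and actions can be sketched without a cluster. The block below is not the Spark API; it is a plain-Java analogy built on `java.util.stream`, and the class and method names (`TransformationVsAction`, `evens`, `firstEven`) are made up for illustration. The filter step only describes a new pipeline and leaves the source untouched, while the terminal step forces evaluation and returns a value.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

// Analogy only: Java streams standing in for RDDs. Names are hypothetical.
final class TransformationVsAction {

    // "Transformation": returns a new, lazily evaluated pipeline.
    // The source list is never modified.
    static Stream<Integer> evens(List<Integer> source) {
        return source.stream().filter(n -> n % 2 == 0);
    }

    // "Action": triggers the computation and returns a concrete result.
    static Optional<Integer> firstEven(List<Integer> source) {
        return evens(source).findFirst();
    }

    public static void main(String[] args) {
        List<Integer> numbers = List.of(3, 8, 5, 12); // immutable source
        System.out.println(firstEven(numbers)); // Optional[8]
        System.out.println(numbers);            // [3, 8, 5, 12]
    }
}
```

The analogy stops at laziness and immutability: a stream lives in one JVM and has no notion of partitions or of recomputing data after a node is lost.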
Use int, long (primitives) instead of their objects (Integer, Long). Primitives are the atomic, basic data types; unless you know what you are doing, stick to those. Primitives are passed by value. Long and Integer are the object forms of the primitives, to be avoided unless you need to pass an object reference, or pass […]
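A minimal sketch of the difference, with hypothetical names (`PrimitiveVsWrapper`, `sumPrimitive`, `sumBoxed`, `tryToIncrement`) chosen only for this example. It shows the same loop with a primitive and with a wrapper accumulator, and it shows that handing an `Integer` to a method does not let the method change the caller's variable: Java passes the reference itself by value, and `Integer` is immutable.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; all names here are hypothetical.
final class PrimitiveVsWrapper {

    // Primitive accumulator: plain arithmetic, no objects involved.
    static long sumPrimitive(int n) {
        long total = 0L;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    // Wrapper accumulator: every += unboxes, adds, and boxes the result again.
    static Long sumBoxed(int n) {
        Long total = 0L;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    // Rebinds the local parameter only; the caller's Integer is unaffected.
    static void tryToIncrement(Integer value) {
        value = value + 1;
    }

    public static void main(String[] args) {
        System.out.println(sumPrimitive(1_000)); // 499500
        System.out.println(sumBoxed(1_000));     // 499500

        Integer boxed = 41;
        tryToIncrement(boxed);
        System.out.println(boxed);               // 41

        // One case where the wrapper is required: generic collections.
        List<Integer> ids = new ArrayList<>();
        ids.add(7); // the int literal is autoboxed to Integer
        System.out.println(ids);                 // [7]
    }
}
```

Both sums give the same result; the boxed version simply does extra work per iteration, which is one reason to default to `int` and `long` when no object is needed.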
Crawl your data source first to create a catalog and table definitions. Add a job to process your crawled data. That’s all!