Apache Spark: the basics

RDD: Resilient Distributed Dataset

An RDD is an encapsulation of a collection of data, distributed across the cluster automatically. RDDs are immutable.

You can apply transformations (which return a new RDD with, for example, filtered data) and actions (like first(), which returns the first item in the RDD).

RDDs are resilient: the cluster can lose nodes, and Spark will recreate the lost partitions automagically from the RDD's lineage.

Transformations on RDDs are lazily evaluated. For example, if we have a line that opens a file and another that filters it, the file is not opened right away; Spark waits until an action needs a result, then runs only as much of the chain as required, so it can, for example, keep just the filtered data instead of the entire data set.
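A minimal PySpark sketch of that laziness (assuming the sc context built in the SparkContext section below; the "ERROR" filter is just an illustration):

lines = sc.textFile("in/uppercase.text")             # nothing is read yet
errors = lines.filter(lambda line: "ERROR" in line)  # still nothing happens

print(errors.first())  # the action finally triggers the read, and only as much as needed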

SparkContext

It is a connection to a computer cluster, used to build RDDs

Example of loading an RDD from external storage:

sc = SparkContext("local", "textfile") # builds the context first
lines = sc.textFile("in/uppercase.text") # creates the RDD

Transformations

They do not mutate the current RDD; they return a new one.

filter() # returns elements passing the filter function

map() # applies the map function to each element of the original RDD and returns the results in a new one
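A quick sketch of both (continuing with the lines RDD from above; the lambdas are just examples):

lowered = lines.map(lambda line: line.lower())       # one output element per input element
short = lowered.filter(lambda line: len(line) < 80)  # keeps only the elements passing the predicate

Note that lines itself is untouched; each call hands back a brand new RDD.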

RDD actions

collect: brings the entire RDD into the driver program, usually to persist it to disk. Memory intensive; make sure it is used on filtered, small datasets.

count / countByValue: count the number of rows, or the number of occurrences of each unique value

take(n): returns the first n elements of the RDD

saveAsTextFile: writes the RDD out to storage as text files

reduce: applies a lambda function to the elements, two at a time, until a single result remains

persist: keeps a copy of a computed RDD cached, available fast for all nodes. You can pass the storage level you prefer (DISK_ONLY, MEMORY_ONLY, etc.); unpersist() removes it from the cache.
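A short sketch tying these actions together (continuing with the lines RDD; the filter and the numbers are just for illustration):

from pyspark import StorageLevel

errors = lines.filter(lambda line: "ERROR" in line)
errors.persist(StorageLevel.MEMORY_ONLY)  # cache it, since we run several actions on it

print(errors.count())  # number of rows
print(errors.take(5))  # first five elements
total_chars = errors.map(lambda line: len(line)).reduce(lambda a, b: a + b)
print(total_chars)     # a single value, folded two elements at a time

errors.unpersist()     # drop it from the cache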

Java: the basics

Use int and long (primitives) instead of their wrapper objects (Integer, Long).

Primitives are the atomic, basic data types; unless you know what you are doing, stick to those.

They (primitives) are passed by value. Long and Integer are the object forms of the primitives, to be avoided unless you need to pass a reference, or need an actual object (e.g. for collections or generics).

sample of using inline filters:
somelist.stream().anyMatch(s -> s.getId() == COMPARE_ID)

where s is an element of somelist, getId() is a method on it, and anyMatch() returns true if any element's id matches COMPARE_ID

React.js: the basics

Basic element creation:

ReactDOM.render(React.createElement('h1', null, 'Hello world!'), document.getElementById('content'))

The first argument: the element type to create
The second: the props (the data) to feed to that element
The third: the children, i.e. the innerHTML inside that element

ReactDOM.render does the actual appending to the page



React Hooks
Example (look ma' no classes!):

import React, { useEffect } from 'react';

const GeneralStats = () => {
    useEffect(() => {
        // fetch your data here, or whatever you did on React before hooks;
        // with an empty dependency array, useEffect behaves like componentDidMount
    }, []);

    return (
        <div className="Home">
            Please wait, loading ...
        </div>
    );
};

export default GeneralStats;

Redshift: alter table column TYPE is not allowed

Changing a column's TYPE is only allowed for varchar columns (to increase their length).

The trick to get it done:

ALTER TABLE sometable ADD COLUMN some_new_column BIGINT; -- or whatever new definition you want
UPDATE sometable SET some_new_column = old_column;
ALTER TABLE sometable DROP COLUMN old_column;
ALTER TABLE sometable RENAME COLUMN some_new_column TO old_column;

The catch: the column order will be altered (the new column will be last now).

If you use COPY to fill out that table, this is a problem: COPY maps input fields to columns by position by default, and you can't reorder the table's columns to make it fit again.

If that is your setup, instead of creating a new column, create a new table with the right TYPE (and the original column order), copy the data over, then drop and rename as above.

Firebase: setup

Basic startup commands

npm install -g firebase-tools

firebase login

firebase init (make sure you press the space bar to select an option, otherwise your firebase.json file will be empty)

create an index.html page in that dir

In the Firebase console, click the "Add Firebase to your web app" button, and put the JavaScript snippet it gives you into the index.html page

Alternatively, if you choose the "Hosting" option and keep the default options, you will get a public/index.html file with the needed boilerplate to start

App structure

Add the following to index.html:

<script defer src="app.js"></script>

The defer attribute is just so it executes in order with all the other deferred scripts, after the page is parsed

Inside app.js, the code to include if you want to sign in via the built-in OAuth functionality:

var provider = new firebase.auth.GoogleAuthProvider();
var user = null;

const btnLogin = document.getElementById('btnLogin');
const btnLogout = document.getElementById('btnLogout');

// add logout event
btnLogout.addEventListener('click', e => {
        firebase.auth().signOut().then(function(){
                user = null;
                // log user out:
                console.log('User logged out');
        }).catch(function(error){
                var errorCode = error.code;
                var errorMessage = error.message;
                console.log('Error: ' + errorCode + ' -- ' + errorMessage);
        });
});

// add login event
btnLogin.addEventListener('click', e => {
        firebase.auth().signInWithPopup(provider).then(function(result){
                user = result.user;
                // log the user we got back:
                console.log('our logged in user: ' + JSON.stringify(user));
        }).catch(function(error){
                var errorCode = error.code;
                var errorMessage = error.message;
                console.log('Error: ' + errorCode + ' -- ' + errorMessage);
        });
});

Testing your code locally

To start the server locally, run the following command:

firebase serve --host 0.0.0.0 --port 8080

If you try the login button at this point, you will get an error in the console log: you need to add the domain shown in that error to the list of authorized domains, in the "Authentication / Sign-in Method" section of the dashboard.

AWS lambda: the basics

For an API Gateway / Lambda combo, there is a bit of a gotcha when setting those two services up together and following their hello world example.

Instead of the default in their example:

# print("value2 = " + event['key2'])

use:

event['params']['querystring']['key1']

I wish their documentation made it more evident what "event" contains, but basically, after you set the above, you also need to spell out what the query params will be, under: Amazon API Gateway / Resources / Method Execution

Also, in that section, Integration Request, specify:

"When there are no templates defined (recommended)"

and add a new template for: “application/json”

In the "Generate Template" section, choose: "Method Request Passthrough"

Leave the default code in there, and now, when you pass your parameters as:

your-api-gateway-url?yourparam=yourvalue

you will see those values in your python script as:

event['params']['querystring']['yourparam']
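
A minimal sketch of the matching Python handler (yourparam is just the placeholder from the URL above; this assumes the passthrough template set up earlier):

def lambda_handler(event, context):
    # with the Method Request Passthrough template, query string
    # parameters arrive under event['params']['querystring']
    value = event['params']['querystring'].get('yourparam')
    print('yourparam = ' + str(value))
    return {'yourparam': value}  # Lambda serializes the dict to JSON for the response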