Apache Spark: the basics

RDD: Resilient Distributed Dataset

It’s an encapsulation of a collection of data, automatically distributed across the nodes of a cluster. RDDs are immutable.

You can apply transformations (which return a new RDD with, for example, filtered data) and actions (like first(), which returns the first item in the RDD).

RDDs are resilient: if nodes are lost, the affected partitions can be recreated automatically from the lineage of transformations.

Transformations on an RDD are lazily evaluated. For example, if we have lines that open a file and then filter it, the file is not opened right away: Spark first builds up the whole chain of transformations and only executes it when an action is called, which lets it keep just the filtered data instead of materializing the entire data set.
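A minimal sketch of this in PySpark, assuming the SparkContext sc built in the next section and a hypothetical in/errors.txt input file:

lines = sc.textFile("in/errors.txt")              # nothing is read yet
errors = lines.filter(lambda l: "ERROR" in l)     # still nothing executed
print(errors.count())                             # the action triggers reading and filtering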

SparkContext

It is a connection to a computer cluster, used to build RDDs

Example of loading an RDD from external storage:

sc = SparkContext("local", "textfile") # builds the context first
lines = sc.textFile("in/uppercase.text") # creates the RDD

Transformations

They do not mutate the current RDD, they return a new one.

filter() # returns a new RDD with the elements that pass the filter function

map() # applies the map function to each element of the original RDD and returns the results in a new one
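A minimal sketch, assuming the lines RDD loaded in the SparkContext example above:

upper = lines.map(lambda l: l.upper())            # new RDD; lines itself is untouched
longOnes = upper.filter(lambda l: len(l) > 20)    # another new RDD, chained lazily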

RDD actions

collect: brings the entire RDD into the driver program, usually to persist it to disk. Memory intensive; make sure it is used on filtered, small datasets.

count / countByValue: count the number of rows, or the number of occurrences of each unique value.

take: returns the first n elements of the RDD.

saveAsTextFile: writes the RDD out to storage as text files.

reduce: applies a function to the elements, two at a time, until a single result is returned.

persist: keeps a copy of a computed RDD in memory, so it is quickly available to all nodes. You can pass the storage level you prefer (DISK_ONLY, MEMORY_ONLY, etc.); unpersist removes it from the cache.
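A minimal sketch of some of these actions, again assuming the lines RDD from above (StorageLevel is imported from pyspark for the persist levels):

from pyspark import StorageLevel

print(lines.count())                                                   # number of rows
print(lines.take(3))                                                   # first 3 elements
total_chars = lines.map(lambda l: len(l)).reduce(lambda a, b: a + b)   # single value
lines.persist(StorageLevel.MEMORY_ONLY)                                # cache it for reuse across actions
lines.unpersist()                                                      # drop it from the cache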

Java: the basics

Use int, long (primitives) instead of their wrapper objects (Integer, Long).

Primitives are the atomic, basic data types; unless you know what you are doing, stick to those.

They (primitives) are passed by value. Long and Integer are the object forms of the primitives; avoid them unless you actually need an object reference, for example to store values in collections or to allow null.

sample of using inline filters:
somelist.stream().anyMatch(s -> s.getId() == COMPARE_ID)

where s is each element of somelist, getId() is a method on it, and anyMatch returns true if any element’s id equals COMPARE_ID.

Redshift: alter table column TYPE is not allowed

It is only allowed for VARCHAR columns (and only to increase their size).

The trick to get it done:

ALTER TABLE sometable ADD COLUMN some_new_column new_data_type; -- with the new definition you want
UPDATE sometable SET some_new_column = old_column;
ALTER TABLE sometable DROP COLUMN old_column;
ALTER TABLE sometable RENAME COLUMN some_new_column TO old_column;

The catch: the column order will change (the new column will now be the last one).

If you use COPY to fill that table, the incoming data may no longer line up with the new column order.

If that is your setup, instead of creating a new column, create a new table with the right TYPE and apply the same fill/drop/rename approach as above.

Firebase setup

Basic startup command

npm install -g firebase-tools

firebase login

firebase init (make sure you press the space bar to select an option, otherwise your firebase.json file will be empty)

create an index.html page in that dir

in the Firebase console, click the “add Firebase to your web app” button, and put the JavaScript snippet it gives you into the index.html page

alternatively, if you choose the “Hosting” option and accept the defaults, you will get a public/index.html file with the boilerplate needed to start

App structure

Add the following to index.html:

<script defer src="app.js"></script>

the defer attribute makes the script execute only after the document has been parsed, keeping it in order with the other deferred scripts

Inside app.js, the code to include if you want to sign in via the built-in OAuth authorization:

var provider = new firebase.auth.GoogleAuthProvider();

var user = null;

// assumes buttons with ids 'btnLogin' and 'btnLogout' exist in index.html

const btnLogin = document.getElementById('btnLogin');

const btnLogout = document.getElementById('btnLogout');

// add logout event

btnLogout.addEventListener('click', e => {

        firebase.auth().signOut().then(function(){

                user = null;

                // log user out:

                console.log('User logged out');

        }).catch(function(error){

                var errorCode = error.code;

                var errorMessage = error.message;

                console.log('Error: ' + errorCode + ' -- ' + errorMessage);

        });

});

btnLogin.addEventListener('click', e => {

        firebase.auth().signInWithPopup(provider).then(function(result){

                user = result.user;

                // log user in:

                console.log('our logging user: ' + JSON.stringify(user));

        }).catch(function(error){

                var errorCode = error.code;

                var errorMessage = error.message;

                console.log('Error: ' + errorCode + ' -- ' + errorMessage);

        });

});

Testing your code locally

To start the server locally, run the following command:

firebase serve --host 0.0.0.0 --port 8080

if you try the login button at this point, you will get an error in the console; you need to add the domain shown in that error to the list of authorized domains, in the “Authentication / Sign-in Method” section of the dashboard.

AWS Lambda: the basics

For an API Gateway / Lambda combo, there is a bit of a gotcha when setting the two services up together and following their hello world example.

Instead of the default in their example:

# print("value2 = " + event['key2'])

use:

event['params']['querystring']['key1']

I wish it was more evident in their documentation what “event” means, but basically, after you set the above, you need to also set the query params (spell out what they will be) under: Amazon API Gateway / Resources / Method execution

Also, in that section, Integration Request, specify:

“When there are no templates defined (recommended)”

and add a new template for: “application/json”

In the “Generate template” section, choose: “Method Request passthrough”

Leave the default code in there, and now, when you pass your parameters as:

your-api-gateway-url?yourparam=yourvalue

you will see those values in your python script as:

event['params']['querystring']['yourparam']
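A minimal sketch of the Python handler side, with yourparam as the hypothetical query parameter from above:

def lambda_handler(event, context):
    # with the passthrough template, query string params show up under event['params']['querystring']
    value = event['params']['querystring'].get('yourparam', '')
    return {'yourparam': value}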

Docker: the basics

Dockerfile: a series of commands used to build an image. Below is an explanation of some of the basic commands you can use inside it:

# usually an operating system base image, on top of which we build our own image:

FROM some_img_name

MAINTAINER your name <your@email.com>

# any commands you need to run as part of your image build (each RUN creates a layer that can be cached):

RUN apt-get update

RUN some other command

# notice how we are passing -y to avoid the y/n question at install time:

RUN apt-get install -y some_package

# example of creating a config file via echo; this particular line makes mongodb listen on all interfaces, so the container accepts external connections:

RUN echo "bind_ip = 0.0.0.0" >> /etc/mongodb.conf

# Including files from our local host into the image:

ADD some_local_file_path some_path_inside_docker_img

# important! this is how you expose ports from inside the image:

EXPOSE 27017

# this command runs when the container starts (it is the default, and can be overridden on the command line):

CMD some_command_here

ENTRYPOINT can be used instead of CMD; the difference is that ENTRYPOINT always executes, whereas CMD is overridden by any arguments passed on the command line.

Once your Dockerfile is ready, you are ready to build your image:

docker build -t your_docker_namespace/some_tag:latest .

The . indicates you want to use the local folder to run your build. This will execute each command in that Dockerfile at build time.

Once you build successfully, you are ready to push to the docker hub repo, so you can download and use this new image from anywhere:

docker tag your_new_img_tag your_dockerhub_namespace/your_img_name

docker push your_dockerhub_namespace/your_img_name

# pulling images from the docker hub to your local env:

docker pull postgres:latest

you will notice that several “things” are downloaded. This is because images are composed of several layers, some of them shared between images; the idea is better caching and reuse.

By default, you are pulling from the dockerhub repo.

# running docker images:

docker run docker_img_name /path/to/command command_args

Example:

docker run --name dockerimgname -it -v /src:/somedirinsideimg/src -p 9000:9000 docker_img_name

# running the ubuntu image locally, and then interact with it (-it) by opening a bash session to it:

docker run -it ubuntu /bin/bash

# exposing ports in a running docker container:

docker run -d -p 8000:80 --name some_name atbaker/nginx-example

Notes: the -d option runs the container in detached mode (in the background). For the ports, it takes port 80 inside the container and makes it accessible on port 8000 on the host machine. The --name option avoids the default name docker gives to running containers (you can pass any string to it). To get the actual IP address you need to hit on your machine (in the browser, for example), run:

docker-machine ls

So the actual url you will be looking at (for the example above) would be something like:

192.168.99.100:8000

# tailing logs on a running docker container:

docker logs -f some_name

# see what has changed on a docker container since we started it:

docker diff some_name

# check the history of commands run to produce a docker image:

docker history docker_img_name

# inspect low level information about our container:

docker inspect some_name

# live resource-usage stats (the docker equivalent of top) for our container:

docker stats some_name

# remove all docker containers (running or stopped):

docker rm --force `docker ps -qa`

# creating new docker images:

pull and run a base docker image as instructed above, then open a shell inside the container:

docker run -it image_name_here bash

inside the container, make whatever modifications you need on top of the base image, then commit your changes as follows:

docker commit -m "Some description of the changes here" docker_id_here docker_tag_here

the docker tag at the end is just any descriptor of your new image version. To push the changes to dockerhub you need to login first, and then push:

docker login

docker tag docker_tag_here your_dockerhub_namespace/name_of_docker_repository

docker push your_dockerhub_namespace/name_of_docker_repository

Mounting external volumes inside docker images

-v [hostpath]:[containerpath]

Example:

docker run \
    -ti -v `pwd`/testdir:/root \
    ubuntu /bin/bash

we are running an image and mounting the folder we are in at the moment (via pwd), plus /testdir, onto the /root folder inside the container

so whatever files we create under /root inside the container will also show up in that testdir folder on the host (and vice versa).

Example of how to persist data between stops and starts of a docker image:

docker run -it --name somedockerimgname -v /root ubuntu /bin/bash

so now, when you stop and restart somedockerimgname, the files you created inside the /root folder will still be there. Destroying the container will still remove the data though!

Differences between docker and Vagrant

Vagrant is meant to spawn and manage entire virtual machines. A Docker image is more a set of files and executables packed together, so that when programs run, they are directed to that set of files; when a container starts, we are not booting a full-fledged VM, just the set of files needed to behave like one.

Docker’s goal is to run as few services as possible per container, so you may need several containers to run your app.

The advantage of docker is that it gives you more flexibility, since you can swap services in and out as modules more easily. It also requires fewer resources than running full-blown virtual machines.

Docker also has its own internal network service. You can control the ports that the outside world uses to communicate with your image.