
Apache Spark: the basics

RDD: Resilient Distributed Dataset

It’s an encapsulation of a collection of data, distributed across the cluster automatically. RDDs are immutable.

You can apply transformations (which return a new RDD with, for example, filtered data) and actions (like first(), which returns the first item in the RDD).

RDDs are resilient: they can lose nodes and recreate the lost data automagically.

Transformations on RDDs are lazily evaluated. For example, if we have lines that open a file and then filter it, the opening of the file won’t happen right away. Spark first looks at the whole chain of transformations, and can then decide, for example, to keep only the filtered data instead of the entire data set.
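A minimal sketch of that laziness, assuming an sc like the one built in the SparkContext example below (the filter condition is made up):

# nothing is read from disk yet; these are just recorded transformations
lines = sc.textFile("in/uppercase.text")
errors = lines.filter(lambda line: "ERROR" in line)

# only when an action (count) runs does Spark actually scan the file
print(errors.count())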

SparkContext

It is a connection to a computer cluster, used to build RDDs

Example of loading an RDD from external storage:

sc = SparkContext("local", "textfile") # builds the context first
lines = sc.textFile("in/uppercase.text") # creates the RDD

Transformations

They do not mutate the current RDD, they return a new one.

filter() # returns elements passing the filter function

map() # applies the map function to each element of the original RDD and returns the results in a new one
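A quick sketch of both, reusing the sc and lines RDD from the SparkContext example above (the predicates are made up):

# filter(): keep only the lines that contain the word "spark"
spark_lines = lines.filter(lambda line: "spark" in line.lower())

# map(): transform each element, here into its length
line_lengths = lines.map(lambda line: len(line))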

RDD actions

collect: brings the entire RDD into the driver program, usually to persist it to disk. Memory intensive, so make sure it is used on filtered, small datasets

count / countByValue: count the number of rows, or how many times each unique value appears

take: return the first n elements of the RDD

saveAsTextFile: writes the RDD out to storage in text mode

reduce: apply a lambda function to the elements, two at a time, until we get a single result in return

persist: keeps a copy of a produced RDD in memory, so it is quickly available to all nodes. You can pass the storage level you prefer (DISK_ONLY, MEMORY_ONLY, etc.); unpersist removes it from the cache.
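A small sketch exercising those actions on a toy RDD (the numbers and the output path are made up; assumes the sc from above):

from pyspark import StorageLevel

nums = sc.parallelize([1, 2, 2, 3, 4])

nums.collect()                    # [1, 2, 2, 3, 4] back in the driver
nums.count()                      # 5
nums.countByValue()               # {1: 1, 2: 2, 3: 1, 4: 1}
nums.take(2)                      # [1, 2]
nums.reduce(lambda a, b: a + b)   # 12
nums.persist(StorageLevel.MEMORY_ONLY)   # cache it for fast reuse
nums.unpersist()                  # drop it from the cache again
nums.saveAsTextFile("out/nums")   # writes part files under out/nums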


Java: the basics

Use int, long (primitives), instead of their objects (Integer, Long)

Primitives are the atomic, basic data types; unless you know what you are doing, stick to those.

They (primitives) are passed by value. Long and Integer are the object (boxed) forms of the primitives; avoid them unless you actually need an object, for example to store values in collections/generics or to allow nulls.

sample of using inline filters:
somelist.stream().anyMatch(s -> s.getId() == COMPARE_ID)

where s is a particular member of somelist, with getId() as a method; here we are just checking whether any element’s id matches COMPARE_ID

Spring / autowire

  • when you see it on a class, it pretty much means “you are going to need one of these, and I am going to wire it for you”. Example:
public class SomeClass {
...
    private ChallengeManager challengeManager;

    @Autowired
    public void setChallengeManager(@Qualifier(SpringConstants.COMPONENT_CHALLENGE_MANAGER) ChallengeManager challengeManager) {
        this.challengeManager = challengeManager;
    }
}

so when SomeClass gets spawned, Spring will automagically inject an instance of the dependency marked by @Autowired

the @Qualifier(SOMECONSTANT) is to ensure it is the class you want to autowire

in complex systems, there may be more than one ChallengeManager, so that qualifier and constant will make sure we are auto wiring the right one

Throwing and catching

if a method (for example on an interface) is declared with “throws”, there should be a try/catch somewhere that deals with the specified exception

but if you want the callers of the class / method to deal with it instead, you can just add “throws ExceptionName” to the signature


AWS Glue: the basics

  1. Crawl your data source first
    1. to create a catalog
    2. and table definitions
  2. Add a job to process your crawled data

That’s all!
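A minimal sketch of what such a job script could look like (PySpark-based; the database, table, and bucket names here are placeholders, not something Glue generates for you):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# read the table the crawler created in the data catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="my_crawled_table")

# write it back out to S3 as parquet
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet")

job.commit()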


React.js: the basics

Boilerplating:
npx create-react-app my-app

Basic element creation:

ReactDOM.render(React.createElement('h1', null, 'Hello world!'), document.getElementById('content'))

The first argument: the element type
The second: the data (props) to be fed to that element
The third: the children (the innerHTML inside that element)

ReactDOM.render does the actual appending to the page



React Hooks
Example (look ma' no classes!):

import React, { useEffect } from 'react';

const GeneralStats = () => {

    useEffect(() => {
        // fetch your data or whatever you did in React before hooks here;
        // with the empty dependency array, useEffect behaves like componentDidMount
    }, []);

    return (
        <div className="Home">
            Please wait, loading ...
        </div>
    );
};

export default GeneralStats;


Redshift: alter table column TYPE is not allowed

It is only allowed for VARCHAR columns (to increase their size).

The trick to get it done:

ALTER TABLE sometable ADD COLUMN some_new_column (with the new definition you want)
UPDATE sometable SET some_new_column = old_column;
ALTER TABLE sometable DROP COLUMN old_column;
ALTER TABLE sometable RENAME COLUMN some_new_column TO old_column;

The catch: column order will be altered (the new column will now be the last one)

If you use COPY to fill that table, you can’t reorder the columns to make it fit, either

If that is your setup, instead of creating a new column, create a new table with the right TYPE, and follow the same steps as above


firebase setup

Basic startup command

npm install -g firebase-tools

firebase login

firebase init (make sure you press the space bar to select an option, otherwise your firebase.json file will be empty)

create an index.html page in that dir

in the firebase console, click on the “add firebase to your web app” button, and put your javascript code into the index.html page

alternatively, if you choose the “Hosting” option and keep the default options, you will have a public/index.html file with the needed boilerplate to start

App structure

Add the following to index.html:

<script defer src="app.js"></script>

the defer attribute is there so the script runs after the page is parsed, in order with the other deferred scripts

Inside app.js, the code to include if you want to sign in via the built-in OAuth authorization functionality:

var provider = new firebase.auth.GoogleAuthProvider();
var user = null;

// assumes buttons with ids 'btnLogin' and 'btnLogout' exist in index.html
const btnLogin = document.getElementById('btnLogin');
const btnLogout = document.getElementById('btnLogout');

// add logout event
btnLogout.addEventListener('click', e => {
        firebase.auth().signOut().then(function(){
                user = null;
                // log the user out:
                console.log('User logged out');
        }).catch(function(error){
                var errorCode = error.code;
                var errorMessage = error.message;
                console.log('Error: ' + errorCode + ' -- ' + errorMessage);
        });
});

// add login event
btnLogin.addEventListener('click', e => {
        firebase.auth().signInWithPopup(provider).then(function(result){
                user = result.user;
                // log the user in:
                console.log('our logged in user: ' + JSON.stringify(user));
        }).catch(function(error){
                var errorCode = error.code;
                var errorMessage = error.message;
                console.log('Error: ' + errorCode + ' -- ' + errorMessage);
        });
});

Testing your code locally

To start the server locally, run the following command:

firebase serve --host 0.0.0.0 --port 8080

if you try the login button at this point, you will get an error in the console; you need to add the domain shown in that error to the list of authorized domains, in the “Authentication / Sign-in method” section of the dashboard.

AWS lambda: the basics

For an API Gateway / Lambda combo, there is a bit of a gotcha when setting those two services up together and following their hello world example.

Instead of the default in their example:

# print("value2 = " + event['key2'])

use:

event['params']['querystring']['key1']

I wish it was more evident in their documentation what “event” means, but basically, after you set the above, you need to also set the query params (spell out what they will be) under: Amazon API Gateway / Resources / Method execution

Also, in that section, Integration Request, specify:

“When there are no templates defined (recommended)”

and add a new template for: “application/json”

In the “Generate template” section, choose: “Method request passthrough”

Leave the default code in there, and now, when you pass your parameters as:

your-api-gateway-url?yourparam=yourvalue

you will see those values in your python script as:

event['params']['querystring']['yourparam']
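For reference, a minimal sketch of the Python handler under that setup (the parameter name and the return shape are just examples):

def lambda_handler(event, context):
    # with the passthrough template, query string parameters show up
    # nested under event['params']['querystring']
    yourvalue = event['params']['querystring'].get('yourparam', '')
    return {'message': 'yourparam was: ' + yourvalue}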


Atom: the basics

cmd-shift-P opens the command palette


boot2docker error “Error response from daemon: client and server don’t have same version”

The fix:

boot2docker stop

boot2docker download

boot2docker up


Docker: the basics

Dockerfile: series of commands used to create an image. Below is an explanation of some of the basic commands you can use inside:

# an existing docker image that contains some of the functionality you need (like node.js, or java drop wizard, etc):

FROM some_img_name

MAINTAINER your name <your@email.com>

# any commands you need to run as part of your image build (each run will create an image that is cacheable)

RUN apt-get update

RUN some other command

# notice how we are passing -y to avoid the y/n question at install time:

RUN apt-get install -y some_package

# example of creating a config file via echo, also, this command will make your docker image available to any external connections:

RUN echo "bind_ip = 0.0.0.0" >> /etc/mongodb.conf

# Including files from our local host into the image:

ADD some_local_file_path some_path_inside_docker_img

# important! this is how you expose ports from inside the image:

EXPOSE 27017

# this command runs after the container starts (it is the default, and can be overridden on the command line)

CMD some_command_here

ENTRYPOINT could be used instead of CMD; the difference is that ENTRYPOINT will always execute, whereas CMD can be overridden by the arguments passed on the command line

Once your Dockerfile is ready, you are ready to build your image:

docker build -t your_docker_namespace/some_tag:latest .

The . indicates you want to use the local folder to run your build. This will execute each command in that Dockerfile at build time.

What build does is run the instructions in the local Dockerfile to produce an image, tagged with the name provided (some_tag:latest), so you can run it locally afterwards.

Once you build successfully, you are ready to push to the docker hub repo, so you can download and use this new image from anywhere:

docker tag your_new_img_tag your_dockerhub_namespace/your_img_name

docker push your_dockerhub_namespace/your_img_name

# pulling images from the docker hub to your local env:

docker pull postgres:latest

you will notice how several “things” are downloaded. This is because images are composed of several layers, some of them shareable between images. The idea is to be able to cache and reuse better.

By default, you are pulling from the dockerhub repo.

# running docker images:

(remember to build first, if you want to run your own version locally)

docker run docker_img_name /path/to/command command_args

Example:

docker run --name dockerimgname -it -v /src:/somedirinsideimg/src -p 9000:9000 docker_img_name

# running the ubuntu image locally, and then interact with it (-it) by opening a bash session to it:

docker run -it ubuntu /bin/bash

# exposing ports in a running docker container:

docker run -d -p 8000:80 --name some_name atbaker/nginx-example

Notes: the -d option is so we run in detached mode (in the background). For the ports, it takes port 80 inside the docker container and makes it accessible on port 8000 on the host machine. The --name option is to avoid the default name docker gives to running containers (you can pass any string to it). To get the actual IP address you need to hit on your machine (in the browser, for example), you need to run:

docker-machine ls

So the actual url you will be looking at (for the example above) would be something like:

192.168.99.100:8000

# tailing logs on a running docker container:

docker logs -f some_name

# see what has changed on a docker container since we started it:

docker diff some_name

# check the history of commands run to produce a docker image:

docker history docker_img_name

# inspect low level information about our container:

docker inspect some_name

# show live resource usage (similar to top) for our running container:

docker stats some_name

# remove all docker containers:

docker rm --force `docker ps -qa`

# creating new docker images:

pull and run a base docker image as instructed above, and then go inside the image:

docker run -it image_name_here bash

inside the image, do whatever modifications you need to do for the base image, then you can commit your changes as follows:

docker commit -m "Some description of the changes here" docker_id_here docker_tag_here

the docker tag at the end is just any descriptor of your new image version. To push the changes to dockerhub you need to login first, and then push:

docker login

docker tag docker_tag_here your_dockerhub_namespace/name_of_docker_repository

docker push your_dockerhub_namespace/name_of_docker_repository

Mounting external volumes inside docker images

-v [hostpath]:[containerpath]

Example:

docker run \
    -ti -v `pwd`/testdir:/root \
    ubuntu /bin/bash

we are running an image and mounting `pwd`/testdir (the testdir folder under whatever directory we are in at the moment) onto the /root folder inside the docker container

so whatever files we create inside /root in the container will also be created in that testdir folder on the host (and vice versa).

Example of how to persist data between stops and starts of a docker image:

docker run -it --name somedockerimgname -v /root ubuntu /bin/bash

so now, when you stop and restart somedockerimgname, the files you created inside the /root folder will still be there. Destroying the container will still remove the data though!

A very simple example of running a local node.js based app via a docker file:

FROM node:latest
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
COPY package.json /usr/src/app/
RUN npm install
COPY . /usr/src/app
EXPOSE 3000
CMD [ "npm", "start" ]

FROM takes the latest node image

RUN makes a dir inside that image

WORKDIR makes that dir the current directory

COPY copies your local files into the image

EXPOSE tells what ports will be open

CMD tells what will run once the container starts

docker-compose.yml

a wrapper for docker commands, and also a way to keep all the services docker runs in one place / file

Sample of a simple docker-compose.yml setup:

version: "2"
services:
  app:
    container_name: some_app_name_here
    build: .
    ports:
      - "3000:3000"
    links:
      - mongodb
  mongodb:
    container_name: mongo
    image: mongo
    volumes:
      - ./data:/data/db
    ports:
      - "27017:27017"

links will tell where to connect to the other related services

volumes are the local mounted data on the services

notice the dot in the “build:” part, that tells docker where to look for that Dockerfile with instructions on what to do for this service (in this case the local directory)

note: in developer mode, you may want to mount your node app directory, instead of copy / add, so when you make changes, you don’t have to restart manually:

volumes:
  - .:/usr/src/app

Differences between docker and Vagrant

Vagrant is meant to spawn and manage entire Virtual Machines. Docker is more a series of files and executables packed into an image, so when programs run, they are directed to that set of files. When initialized, we are not booting a full-fledged VM, just the set of files needed to run as one.

Docker’s goal is to run the fewest services per image, so you may need multiple to run your app.

The advantage of docker is that it gives you more flexibility, as you can swap services in and out as modules more easily. Also, it requires fewer resources than running full-blown Virtual Machines.

Docker also has its own internal network service. You can control the ports that the outside world uses to communicate with your image.