ETL Transforms: some basic techniques

Some basic principles:

  • Select your indexes and sortable columns wisely, it can be the difference between hours and seconds runs
  • Try to process your data in chunks, create a staging table where you do your transforms with smaller amounts of data, instead of transforming all the data at all times.
  • Try to batch the changes by timestamp, pick a timestamp that will make all the data consistent across all your tables

Some techniques to transform data (note that they are tailored made for redshift, may be different in other databases):

Replace characters at the same time you are selecting

SELECT 
 select id,
 user_id,
 value,
 replace(replace(metadata, 'n', ' '), 't', '') as clean_metadata
FROM some_table;

Augmenting data in one table, by adding columns from another table

 SELECT
 du.user_id,
 du.site_id,
 du.user_name,
 up.earned_credits as current_point_balance,
 up.balance as lifetime_point_balance
 FROM base_table du
 LEFT JOIN enhancer_table up
 ON (du.user_id = up.user_id)
 WHERE up.debits is not null OR up.balance is not null;

 

Query using the IN keyword, to see if the values exist in another subquery

 SELECT * 
 FROM table_name WHERE (user_id, updated) IN 
 ( SELECT user_id, MAX(updated)
 FROM add_points
 GROUP BY user_id
 )

 

Convert a timestamp into a day id

date_trunc('day', create_time)


datediff(day, '2005-01-01', your_timestamp_column) + 1


Breaking your query into more manageable subqueries

WITH add_day_id as (
 SELECT
 site_id,
 user_id,
 someothercolum
 FROM sometable
),

add_something_else as (
 SELECT
 (some other transform here)
 FROM add_day_id
)

SELECT * FROM add_something_else;

Using CASE and if statements to select column values

SELECT
 some_column,
 CASE
 WHEN day_id = 10000
 THEN -1
 ELSE
 coalesce(
 lead(day_id) 
 OVER (
 PARTITION BY
 site_id,
 user_id,
 point_category_id,
 day_id
 ORDER BY
 site_id, user_id, day_id
 ),-1)
 END AS last_record_in_day
 FROM some_table

Check if a record exists in some other table, and putting its existance as part of the select condition

SELECT something FROM somewhere
WHERE some_condition = 'something'
AND exists (
   SELECT 1
   FROM some_other_table ot
   WHERE ot.site_id = full.reference.to.this.table.site_id
   AND (same as above, as many times as you need)
);

 Creating (or replacing) temp views, also, a way to mark row numbers by a certain order, by creating an artificial column for that

CREATE OR REPLACE VIEW temp_some_view_name AS (
WITH
mark_rows as (
 SELECT
 site_id,
 user_id,
 start_ts,
 ROW_NUMBER() OVER(
 PARTITION BY
 site_id,
 user_id
 ORDER BY start_ts) AS row_num
FROM some_table
),
pick_only_first_recs_on_groups as (
   SELECT
      site_id,
      user_id,
      start_ts
FROM mark_rows WHERE row_num = 1
)
SELECT * from pick_only_first_recs_on_groups);

least, to choose the lessen of all values

select some_column, least((SELECT day_id from dim_day where actual_date  = date_trunc('day', start_ts)), some_default_value) as some_other_column, yet_another_column FROM * sometable;

Coalesce, to avoid nulls and set defaults

select some_column, coalesce((SELECT day_id from dim_day where actual_date  = date_trunc('day', start_ts)), 1) as some_other_column, yet_another_column FROM * sometable;