Some basic principles:
- Select your indexes and sortable columns wisely, it can be the difference between hours and seconds runs
- Try to process your data in chunks, create a staging table where you do your transforms with smaller amounts of data, instead of transforming all the data at all times.
- Try to batch the changes by timestamp, pick a timestamp that will make all the data consistent across all your tables
Some techniques to transform data (note that they are tailored made for redshift, may be different in other databases):
Replace characters at the same time you are selecting
SELECT select id, user_id, value, replace(replace(metadata, 'n', ' '), 't', '') as clean_metadata FROM some_table;
Augmenting data in one table, by adding columns from another table
SELECT du.user_id, du.site_id, du.user_name, up.earned_credits as current_point_balance, up.balance as lifetime_point_balance FROM base_table du LEFT JOIN enhancer_table up ON (du.user_id = up.user_id) WHERE up.debits is not null OR up.balance is not null;
Query using the IN keyword, to see if the values exist in another subquery
SELECT * FROM table_name WHERE (user_id, updated) IN ( SELECT user_id, MAX(updated) FROM add_points GROUP BY user_id )
Convert a timestamp into a day id
date_trunc('day', create_time) datediff(day, '2005-01-01', your_timestamp_column) + 1
Breaking your query into more manageable subqueries
WITH add_day_id as ( SELECT site_id, user_id, someothercolum FROM sometable ), add_something_else as ( SELECT (some other transform here) FROM add_day_id ) SELECT * FROM add_something_else;
Using CASE and if statements to select column values
SELECT some_column, CASE WHEN day_id = 10000 THEN -1 ELSE coalesce( lead(day_id) OVER ( PARTITION BY site_id, user_id, point_category_id, day_id ORDER BY site_id, user_id, day_id ),-1) END AS last_record_in_day FROM some_table
Check if a record exists in some other table, and putting its existance as part of the select condition
SELECT something FROM somewhere WHERE some_condition = 'something' AND exists ( SELECT 1 FROM some_other_table ot WHERE ot.site_id = full.reference.to.this.table.site_id AND (same as above, as many times as you need) );
Creating (or replacing) temp views, also, a way to mark row numbers by a certain order, by creating an artificial column for that
CREATE OR REPLACE VIEW temp_some_view_name AS ( WITH mark_rows as ( SELECT site_id, user_id, start_ts, ROW_NUMBER() OVER( PARTITION BY site_id, user_id ORDER BY start_ts) AS row_num FROM some_table ), pick_only_first_recs_on_groups as ( SELECT site_id, user_id, start_ts FROM mark_rows WHERE row_num = 1 ) SELECT * from pick_only_first_recs_on_groups);
least, to choose the lessen of all values
select some_column, least((SELECT day_id from dim_day where actual_date = date_trunc('day', start_ts)), some_default_value) as some_other_column, yet_another_column FROM * sometable;
Coalesce, to avoid nulls and set defaults
select some_column, coalesce((SELECT day_id from dim_day where actual_date = date_trunc('day', start_ts)), 1) as some_other_column, yet_another_column FROM * sometable;