Most Popular Posts

AWS Glue First Impressions

I wrote some thoughts on the cost, speed, and usefulness of AWS Glue when it was first released. Readers have commented their opinions as Glue has evolved and progressed.

Upgrade Your Redshift Cluster from DC1 to DC2 Nodes

A guide on what to expect when upgrading your cluster to the second generation node type. Upgrading improves performance by up to 45% with no cost increase.

Do You Really Need a Data Warehouse? RDS vs. Redshift on AWS

Some basic considerations if you are thinking about setting up a data warehouse. You might be able to get by on RDS and save some $$$ until you really need a data warehouse.


Do You Really Need a Data Warehouse? Redshift vs. RDS on AWS

A question I wish I had asked a year ago is whether my company really needed a data warehouse like Redshift, or whether we could have gotten by on an RDS instance. If you’re setting up your analytics infrastructure, I recommend you ask yourself the same question. A data warehouse like Redshift, Snowflake, or BigQuery has obvious benefits – size, scalability, and performance on queries that require large scans. But for many startups and early stage companies, I’d consider saving a bit of money by getting by on an RDS instance until you reach its limits.

For the comparable dc2.large and db.t2.large instances, Redshift costs $0.25/hour (or $180/month) and Postgres on RDS costs $0.145/hour (or $104.40/month). RDS is 42% cheaper than Redshift. Smaller instances are available on RDS, allowing you to start at just $13/month for a db.t2.micro instance and scaling up from there, whereas Redshift’s smallest instance is the dc2.large at $180/month. 

The key advantage of Redshift that I’ve experienced is the SORT KEY feature. For a table with millions of rows, you can define a SORT KEY column to help query performance improve. For example, if you have a table of all messages sent on your product, and there are 1 million messages per day, you’ll be adding 30 million rows to your Messages table each month. Eventually, on an unoptimized database, query performance will slow dramatically as the database has to scan the entire table to find the messages you are querying for. But with Redshift, the SORT KEY column tells the database to store rows in an organized fashion on blocks of memory. The database stores meta data about what is stored on each block. Then, when you query and specify a range for your SORT KEY column, the query optimizer will know to only scan the relevant blocks of memory, rather than all of them.

Careful schema design on a normal RDS database will take you pretty far with improving query performance. And for smaller datasets, you can be sure that query performance will likely be faster than on Redshift. Redshift is built for intake and storage of large data sets, but less performant for queries. RDS is more of a “jack of all trades” database, but will underperform with massive tables. I’d consider that threshold to be on the order of 100 million to 1 billion rows in a table. When a table reaches that size, it’s likely time to consider moving to Redshift.

The data intake performance of RDS vs. Redshift is likely going to be about the same if you do it right. Both have my coveted COPY command available for loading CSV files directly into the database (rather than running a massive INSERT). COPY, for me and others, is extremely fast for loading lots of data. Redshift might be better if you’re loading <1 million rows / day into a given table, while RDS might be the better option if you’re loading single rows at a time.

To summarize: Unless you have a table of 100M rows or more, you’re probably fine starting out with RDS and then migrating to the more expensive Redshift when you start to hit the limits of RDS’s performance capabilities. 


A Solution to Detecting Deleted Objects in Incrementally Replicated ETL Data Sources

We use Stitch as our primary ETL tool. It copies the data from our individual tools, such as SalesForce and HubSpot, into our Redshift data warehouse. Last week, I ran into an issue with Stitch’s replication of QuickBooks data, and needed to create a second ETL pipeline to ensure data validity. The problem?

Stitch uses incremental replication for QuickBooks, using a “last modified” field to know which rows to insert or update. However, in QuickBooks, objects are “hard” deleted, meaning the objects are deleted entirely rather than just being marked deleted. Since the objects no longer appear in API calls (and thus don’t have a “last modified” time), Stitch can’t detect that those rows were deleted.

This became a pressing issue when I tried to calculate revenue using the data in Redshift, only to find our numbers were off due to records that were deleted in QuickBooks still showing up in our query results. So I faced the problem: how can I detect deleted data in incrementally replicated data sources?

The answer, although somewhat “hacky,” was to create a second ETL pipeline of our own.  (more…)