Do You Really Need a Data Warehouse? Redshift vs. RDS on AWS

A question I wish I had asked a year ago is whether my company really needed a data warehouse like Redshift, or whether we could have gotten by on an RDS instance. If you’re setting up your analytics infrastructure, I recommend you ask yourself the same question. A data warehouse like Redshift, Snowflake, or BigQuery has obvious benefits – size, scalability, and performance on queries that require large scans. But for many startups and early stage companies, I’d consider saving a bit of money by getting by on an RDS instance until you reach its limits.

For the comparable dc2.large and db.t2.large instances, Redshift costs $0.25/hour (or $180/month) and Postgres on RDS costs $0.145/hour (or $104.40/month). RDS is 42% cheaper than Redshift. Smaller instances are available on RDS, allowing you to start at just $13/month for a db.t2.micro instance and scaling up from there, whereas Redshift’s smallest instance is the dc2.large at $180/month. 

The key advantage of Redshift that I’ve experienced is the SORT KEY feature. For a table with millions of rows, you can define a SORT KEY column to help query performance improve. For example, if you have a table of all messages sent on your product, and there are 1 million messages per day, you’ll be adding 30 million rows to your Messages table each month. Eventually, on an unoptimized database, query performance will slow dramatically as the database has to scan the entire table to find the messages you are querying for. But with Redshift, the SORT KEY column tells the database to store rows in an organized fashion on blocks of memory. The database stores meta data about what is stored on each block. Then, when you query and specify a range for your SORT KEY column, the query optimizer will know to only scan the relevant blocks of memory, rather than all of them.

Careful schema design on a normal RDS database will take you pretty far with improving query performance. And for smaller datasets, you can be sure that query performance will likely be faster than on Redshift. Redshift is built for intake and storage of large data sets, but less performant for queries. RDS is more of a “jack of all trades” database, but will underperform with massive tables. I’d consider that threshold to be on the order of 100 million to 1 billion rows in a table. When a table reaches that size, it’s likely time to consider moving to Redshift.

The data intake performance of RDS vs. Redshift is likely going to be about the same if you do it right. Both have my coveted COPY command available for loading CSV files directly into the database (rather than running a massive INSERT). COPY, for me and others, is extremely fast for loading lots of data. Redshift might be better if you’re loading <1 million rows / day into a given table, while RDS might be the better option if you’re loading single rows at a time.

To summarize: Unless you have a table of 100M rows or more, you’re probably fine starting out with RDS and then migrating to the more expensive Redshift when you start to hit the limits of RDS’s performance capabilities. 

I'm the Analytics Therapist at Redox, a quickly growing technology platform that enables organizations to send healthcare data back and forth. Here, I write about our journey to become a data-driven organization, and the technical challenges I've faced along the way. All views and opinions are my own and do not represent those of my employer.

View Comments