Uncategorized

Most Popular Posts

AWS Glue First Impressions

I wrote some thoughts on the cost, speed, and usefulness of AWS Glue when it was first released. Readers have commented their opinions as Glue has evolved and progressed.

Upgrade Your Redshift Cluster from DC1 to DC2 Nodes

A guide on what to expect when upgrading your cluster to the second generation node type. Upgrading improves performance by up to 45% with no cost increase.

Do You Really Need a Data Warehouse? RDS vs. Redshift on AWS

Some basic considerations if you are thinking about setting up a data warehouse. You might be able to get by on RDS and save some $$$ until you really need a data warehouse.


Quickbooks

Simplifying Stitch ETL’s Quickbooks Data in Redshift: Revenue Table

This is the first post of a six-part series of posts explaining how we make analyzing Quickbooks data easier by creating cached data sets from complex queries, resulting in simplified tables that anyone at our company is able to easily understand.

This is the final post in this series of queries for simplifying Quickbooks data. Note that Credit Memo’s and Refund Receipts count against Revenue, so they are negated by multiply the Line Amount field by -1 in the queries.

If you made it through this series, thank you for reading! I am aware that this is a highly specific guide, but it took me almost two weeks to create these queries, so I hope I can save someone else that time.

(more…)


Quickbooks

Simplifying Stitch ETL’s Quickbooks Data in Redshift: Expenses Table

This is the first post of a six-part series of posts explaining how we make analyzing Quickbooks data easier by creating cached data sets from complex queries, resulting in simplified tables that anyone at our company is able to easily understand.

This table summarizes all outgoing financial transactions that would fall under the “Expenses” category in a Quickbooks report.

(more…)


Quickbooks

Simplifying Stitch ETL’s Quickbooks Data in Redshift: Cash Movements Table

This is the first post of a six-part series of posts explaining how we make analyzing Quickbooks data easier by creating cached data sets from complex queries, resulting in simplified tables that anyone at our company is able to easily understand.

This is the Cash Movements table, and the first of the complex tables in this series. It joins many tables together into a single table using JOIN’s and UNION’s.

Notice the differentiation between “In” and “Out” transactions – one will multiply the amount by -1 to show that it is negative cashflow.

(more…)


Quickbooks

Simplifying Stitch ETL’s Quickbooks Data in Redshift: Budgets Table

This is the third post of a six-part series of posts explaining how we make analyzing Quickbooks data easier by creating cached data sets from complex queries, resulting in simplified tables that anyone at our company is able to easily understand.

This is the Budgets table, which is a rather simple query that just creates a cleaner table summarizing the budgets for each account in Quickbooks.

(more…)


Quickbooks

Simplifying Stitch ETL’s Quickbooks Data in Redshift: The Shadow Schema

This is the second post of a six-part series of posts explaining how we make analyzing Quickbooks data easier by creating cached data sets from complex queries, resulting in simplified tables that anyone at our company is able to easily understand.

List of tables in the quickbooks_shadow schema.

List of tables in the quickbooks_shadow schema.

This is the schema creation script for the shadow schema. I’ll outline how I load the data in a separate post and link it here when it is available.

create schema quickbooks_shadow;

create table quickbooks_shadow.accounts (
	name varchar(1000),
	subaccount boolean,
	parentref_value varchar(200),
	fullyqualifiedname varchar(200),
	active boolean,
	classification varchar(200),
	accounttype varchar(200),
	accountsubtype varchar(200),
	currentbalance float,
	currentbalancewithsubaccounts float,
	currencyref_value varchar(200),
	currencyref_name varchar(200),
	domain varchar(200),
	sparse boolean,
	id int,
	metadata_createtime varchar(200),
	metadata_lastupdatedtime varchar(200)
);

create table quickbooks_shadow.creditmemos (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.invoices (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.journalentries (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.refundreceipts (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.salesreceipts (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.deposits (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.creditmemolines (
	parent_id varchar(200),
	id varchar(200),
	linenum int,
	description varchar(1000),
	amount float,
	salesitemlinedetail_itemref_value varchar(200),
	salesitemlinedetail_itemref_name varchar(200),
	salesitemlinedetail_unitprice float,
	salesitemlinedetail_qty float
);

create table quickbooks_shadow.refundreceiptlines (
	parent_id varchar(200),
	id varchar(200),
	linenum int,
	description varchar(1000),
	amount float,
	salesitemlinedetail_itemref_value varchar(200),
	salesitemlinedetail_itemref_name varchar(200),
	salesitemlinedetail_unitprice float,
	salesitemlinedetail_qty float
);

create table quickbooks_shadow.invoicelines (
	parent_id varchar(200),
	id varchar(200),
	linenum int,
	description varchar(1000),
	amount float,
	salesitemlinedetail_itemref_value varchar(200),
	salesitemlinedetail_itemref_name varchar(200),
	salesitemlinedetail_unitprice float,
	salesitemlinedetail_qty float
);

create table quickbooks_shadow.salesreceiptlines (
	parent_id varchar(200),
	id varchar(200),
	linenum int,
	description varchar(1000),
	amount float,
	salesitemlinedetail_itemref_value varchar(200),
	salesitemlinedetail_itemref_name varchar(200),
	salesitemlinedetail_unitprice float,
	salesitemlinedetail_qty float
);

create table quickbooks_shadow.journalentrylines (
	parent_id varchar(200),
	id varchar(200),
	linenum int,
	description varchar(1000),
	amount float,
	journalentrylinedetail_accountref_value varchar(200),
	journalentrylinedetail_accountref_name varchar(200),
	journalentrylinedetail_entity_entityref_name varchar(200),
	journalentrylinedetail_entity_entityref_value varchar(200),
	journalentrylinedetail_entity_type varchar(200),
	journalentrylinedetail_postingtype varchar(200)
);

create table quickbooks_shadow.depositlines (
	parent_id varchar(200),
	id varchar(200),
	linenum int,
	description varchar(1000),
	amount float,
	depositlinedetail_accountref_value varchar(200),
	depositlinedetail_accountref_name varchar(200),
	depositlinedetail_entity_entityref_name varchar(200),
	depositlinedetail_entity_entityref_value varchar(200),
	depositlinedetail_entity_type varchar(200),
	depositlinedetail_postingtype varchar(200)
);


create table quickbooks_shadow.payments (
	id varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.purchases (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.bills (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.billpayments (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.vendorcredits (
	id varchar(200),
	docnumber varchar(200),
	txndate varchar(200)
);

create table quickbooks_shadow.paymentlines (
	parent_id varchar(200),
	id varchar(200),
	amount float
);

create table quickbooks_shadow.purchaselines (
	parent_id varchar(200),
	id varchar(200),
	amount float
);

create table quickbooks_shadow.billlines (
	parent_id varchar(200),
	id varchar(200),
	amount float
);

create table quickbooks_shadow.billpaymentlines (
	parent_id varchar(200),
	id varchar(200),
	amount float
);

create table quickbooks_shadow.vendorcreditlines (
	parent_id varchar(200),
	id varchar(200),
	amount float
);

grant all on schema quickbooks_shadow to <USERNAME>;
grant all on all tables in schema quickbooks_shadow to <USERNAME>;

Up Next:

Simplifying Stitch ETL’s Quickbooks Data in Redshift: Budgets Table

Previous Post:

Simplifying Stitch ETL’s Quickbooks Data in Redshift: Introduction


Quickbooks

Simplifying Stitch ETL’s Quickbooks Data in Redshift: Introduction

This is the first post of a six-part series of posts explaining how we make analyzing Quickbooks data easier by creating cached data sets from complex queries, resulting in simplified tables that anyone at our company is able to easily understand.

Stitch exports Quickbooks Online data in the way that Quickbooks structures its API – each type of record is its own separate object, and each object has child objects with transaction details. Journal Entries, Invoices, Payments, Receipts, Vendor Credits, Customers, Credit Memos, Bills, Accounts, Deposits, Employees, Items, and Refunds are all broken down into multiple tables each when replicated into your data warehouse using the Stitch connector. Unfortunately, that leaves it up to an analyst to reconstruct the datasets needed to analyze things like revenue, expenses, cash flows, etc.

I’ve spent the countless hours necessary to reconstruct the data into cached data stores that we use to actually analyze and visualize our Quickbooks data. This post demonstrates each of those queries.

(more…)



AWS

Do You Really Need a Data Warehouse? Redshift vs. RDS on AWS

A question I wish I had asked a year ago is whether my company really needed a data warehouse like Redshift, or whether we could have gotten by on an RDS instance. If you’re setting up your analytics infrastructure, I recommend you ask yourself the same question. A data warehouse like Redshift, Snowflake, or BigQuery has obvious benefits – size, scalability, and performance on queries that require large scans. But for many startups and early stage companies, I’d consider saving a bit of money by getting by on an RDS instance until you reach its limits.

For the comparable dc2.large and db.t2.large instances, Redshift costs $0.25/hour (or $180/month) and Postgres on RDS costs $0.145/hour (or $104.40/month). RDS is 42% cheaper than Redshift. Smaller instances are available on RDS, allowing you to start at just $13/month for a db.t2.micro instance and scaling up from there, whereas Redshift’s smallest instance is the dc2.large at $180/month. 

The key advantage of Redshift that I’ve experienced is the SORT KEY feature. For a table with millions of rows, you can define a SORT KEY column to help query performance improve. For example, if you have a table of all messages sent on your product, and there are 1 million messages per day, you’ll be adding 30 million rows to your Messages table each month. Eventually, on an unoptimized database, query performance will slow dramatically as the database has to scan the entire table to find the messages you are querying for. But with Redshift, the SORT KEY column tells the database to store rows in an organized fashion on blocks of memory. The database stores meta data about what is stored on each block. Then, when you query and specify a range for your SORT KEY column, the query optimizer will know to only scan the relevant blocks of memory, rather than all of them.

Careful schema design on a normal RDS database will take you pretty far with improving query performance. And for smaller datasets, you can be sure that query performance will likely be faster than on Redshift. Redshift is built for intake and storage of large data sets, but less performant for queries. RDS is more of a “jack of all trades” database, but will underperform with massive tables. I’d consider that threshold to be on the order of 100 million to 1 billion rows in a table. When a table reaches that size, it’s likely time to consider moving to Redshift.

The data intake performance of RDS vs. Redshift is likely going to be about the same if you do it right. Both have my coveted COPY command available for loading CSV files directly into the database (rather than running a massive INSERT). COPY, for me and others, is extremely fast for loading lots of data. Redshift might be better if you’re loading <1 million rows / day into a given table, while RDS might be the better option if you’re loading single rows at a time.

To summarize: Unless you have a table of 100M rows or more, you’re probably fine starting out with RDS and then migrating to the more expensive Redshift when you start to hit the limits of RDS’s performance capabilities. 


Uncategorized

A Solution to Detecting Deleted Objects in Incrementally Replicated ETL Data Sources

We use Stitch as our primary ETL tool. It copies the data from our individual tools, such as SalesForce and HubSpot, into our Redshift data warehouse. Last week, I ran into an issue with Stitch’s replication of QuickBooks data, and needed to create a second ETL pipeline to ensure data validity. The problem?

Stitch uses incremental replication for QuickBooks, using a “last modified” field to know which rows to insert or update. However, in QuickBooks, objects are “hard” deleted, meaning the objects are deleted entirely rather than just being marked deleted. Since the objects no longer appear in API calls (and thus don’t have a “last modified” time), Stitch can’t detect that those rows were deleted.

This became a pressing issue when I tried to calculate revenue using the data in Redshift, only to find our numbers were off due to records that were deleted in QuickBooks still showing up in our query results. So I faced the problem: how can I detect deleted data in incrementally replicated data sources?

The answer, although somewhat “hacky,” was to create a second ETL pipeline of our own.  (more…)