A Solution to Detecting Deleted Objects in Incrementally Replicated ETL Data Sources
We use Stitch as our primary ETL tool. It copies the data from our individual tools, such as SalesForce and HubSpot, into our Redshift data warehouse. Last week, I ran into an issue with Stitch’s replication of QuickBooks data, and needed to create a second ETL pipeline to ensure data validity. The problem?
Stitch uses incremental replication for QuickBooks, using a “last modified” field to know which rows to insert or update. However, in QuickBooks, objects are “hard” deleted, meaning the objects are deleted entirely rather than just being marked deleted. Since the objects no longer appear in API calls (and thus don’t have a “last modified” time), Stitch can’t detect that those rows were deleted.
This became a pressing issue when I tried to calculate revenue using the data in Redshift, only to find our numbers were off due to records that were deleted in QuickBooks still showing up in our query results. So I faced the problem: how can I detect deleted data in incrementally replicated data sources?
The answer, although somewhat “hacky,” was to create a second ETL pipeline of our own.
I set up a second schema for QuickBooks, calling it “quickbooks_shadow.” I then created tables for the objects that I knew I needed to omit deleted data from.
I built a quick and dirty NodeJS application that uses OAuth2 to connect to our QuickBooks account, and then queries the QuickBooks API for lists of those objects. It transforms the select fields that are relevant here (mainly the id field), transforms the JSON list into CSV format and stores the files to AWS S3. Then it runs a series of commands to TRUNCATE the tables in the shadow schema and COPY the CSV files from S3 into the tables in the shadow schema in Redshift.
Now, when I query for QuickBooks data in Redshift, I simply need to add an INNER JOIN to the sister table in the shadow schema, joining on the id columns. This omits any deleted records because the Node app + S3 CSV + TRUNCATE/COPY method I outlined is a “full” replication of just the identifiers, rather than incremental. And since this full replication gets a fresh start each time it runs, no deleted records should get loaded into Redshift.
Stay tuned: A full breakdown of the Node app is coming soon!