AWS Glue First Impressions


AWS Glue is a managed ETL (Extract, Transform, Load) service for moving data between AWS products such as S3, RDS, and Redshift. I tested it by moving S3 data into Redshift and by transforming JSON data in S3 to CSV format.

How It Works

Although the split concepts of Crawlers, Databases, Tables, Jobs, and Triggers are confusing at first, the product is fairly easy to use once you get past the initial learning curve. Here’s how I would break it down.

Crawlers look at your data source and try to understand the schema of the data it contains. They store that schema as a…

Table, which is part of a Database. A Table is essentially a virtual object definition, created by a Crawler. It can be the schema of CSVs in S3 (or of many other data formats in S3), or an actual database table from Redshift, RDS, or any database that is accessible via JDBC. Once you have at least one Table defined, you can move the data using a Job.

A Job is a task that copies data from one place to another. You can move S3 data to Redshift, copy a Redshift table to S3, copy an RDS table to Redshift…the sky’s the limit.

A Trigger is a configuration for when to run a Job or set of Jobs. This can be on a schedule, such as daily, or fired on Job Events, such as when a Job succeeds or fails.
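To make the two kinds of Trigger concrete, here is a minimal sketch of the parameters one might pass to `create_trigger` on the boto3 Glue client. The helper names, the trigger names, and the job names are all hypothetical; nothing here is taken from my actual setup, and the final function needs boto3 and AWS credentials to do anything.

```python
def scheduled_trigger(name, job_names, cron):
    """Build arguments for a Trigger that starts Jobs on a schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,  # Glue expects a six-field expression wrapped in cron(...)
        "Actions": [{"JobName": job} for job in job_names],
        "StartOnCreation": True,
    }


def on_success_trigger(name, watched_job, job_names):
    """Build arguments for a Trigger that fires when another Job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": watched_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job} for job in job_names],
        "StartOnCreation": True,
    }


def create_trigger(params):
    """Send the parameters to Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    return boto3.client("glue").create_trigger(**params)


# Hypothetical names: a daily run at 12:00 UTC, then a follow-up Job on success.
daily = scheduled_trigger("daily-flatten", ["json-to-csv"], "cron(0 12 * * ? *)")
chained = on_success_trigger("after-flatten", "json-to-csv", ["load-redshift"])
```

Keeping the parameter dictionaries separate from the API call makes it easy to review what a Trigger will do before anything is created in the account.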

How It Performed In My Limited Testing

I set up several Crawlers, Tables, and Jobs to test out Glue. One was simply taking a set of JSON files in S3 (approximately 90,000 files of under 1MB each) and flattening them to CSV files, storing the new files in S3 as well. All told, this is about 7GB of total data to be transformed.
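For a sense of what "flattening" means here, the following is a rough standard-library sketch of the transformation, not the script Glue generated for me. The dotted column names, the choice to keep lists as JSON strings, and the sample record are all assumptions made for illustration.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="."):
    """Collapse nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            # Assumption: lists are kept as a JSON string in a single cell.
            flat[full_key] = json.dumps(value)
        else:
            flat[full_key] = value
    return flat


def records_to_csv(records):
    """Flatten each record and write all of them as one CSV string."""
    rows = [flatten(record) for record in records]
    # Use the union of keys so records with missing fields still fit.
    columns = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample record, standing in for one of the small JSON files.
sample = json.loads('{"id": 1, "patient": {"name": "A", "tags": ["x"]}}')
csv_text = records_to_csv([sample])
```

A record missing a column is written with an empty cell for that column, which is the default behavior of `csv.DictWriter`.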

With 10 DPUs (Data Processing Units) allocated to the Job, it took 2 hours to run. With 20 DPUs allocated, it took 90 minutes. Why not 60 minutes, since the compute power was doubled? AWS Glue is serverless, meaning the Job needs to perform a cold start every time it runs (assuming it’s not being run constantly). So a daily Job would perform a cold start on every run. The cold start took 25 minutes on each run.
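A back-of-the-envelope model helps show how much of the gap the cold start accounts for. It assumes the cold start is a fixed overhead and that the remaining work scales perfectly with the number of DPUs; both are simplifying assumptions, and the function name is just for illustration.

```python
COLD_START_MINUTES = 25


def predicted_runtime(baseline_total, cold_start, dpu_ratio):
    """Estimate total minutes after scaling DPUs by dpu_ratio.

    Only the work after the cold start is divided by the ratio;
    the cold start itself stays fixed.
    """
    work = baseline_total - cold_start
    return cold_start + work / dpu_ratio


# Baseline: 120 minutes at 10 DPUs. Doubling to 20 DPUs gives a ratio of 2.
best_case = predicted_runtime(120, COLD_START_MINUTES, 2)
print(best_case)  # 72.5
```

Under these assumptions, the best case for the 20-DPU run is 72.5 minutes rather than 60. The measured 90 minutes is still above that, so the fixed cold start explains part of the gap but probably not all of it.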

The other Jobs I tested took a similarly long time, ranging from 45 minutes to an hour to do simple operations. The time performance alone is unacceptable, but let’s focus on the cost.

At $0.44/DPU/hour, the 25-minute cold start on the 20-DPU run cost about $3.70. But imagine you have 30 Jobs running daily. The cold starts alone will run on the order of $100/day, or roughly $3,000/month, in waste. And that’s not even the cost of the actual compute power to finish the Jobs. The price of AWS Glue is exorbitant.
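The arithmetic behind those figures can be sketched as follows. It assumes every one of the 30 Jobs uses 20 DPUs and a 30-day month, which is a simplification; the function name is made up for this example.

```python
PRICE_PER_DPU_HOUR = 0.44


def cold_start_cost(dpus, cold_start_minutes, price=PRICE_PER_DPU_HOUR):
    """Dollars spent on a cold start: DPUs x hourly price x hours idle."""
    return dpus * price * cold_start_minutes / 60


per_run = cold_start_cost(dpus=20, cold_start_minutes=25)
per_day = per_run * 30    # 30 Jobs, each run once a day
per_month = per_day * 30  # assuming a 30-day month

print(f"per run:   ${per_run:.2f}")
print(f"per day:   ${per_day:.0f}")
print(f"per month: ${per_month:.0f}")
```

With these inputs the cold start comes to about $3.67 per run, $110 per day, and $3,300 per month, in the same ballpark as the rounded figures above.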

Conclusion

I wanted to like AWS Glue, I really did. I had high hopes after reading their intros and watching their product announcement videos. I gave them a solid $100 worth of test runs to see how it works. But the fact that I barely scratched the surface and had racked up $100 in fees? That’s crazy.

ETL shouldn’t cost this much, especially for such slow performance. For now, I’ll be looking at using Lambdas to achieve the same results at a fraction of the cost. Maybe someday Glue will live up to expectations.


I'm the Analytics Therapist at Redox, a quickly growing technology platform that enables organizations to send healthcare data back and forth. Here, I write about our journey to become a data-driven organization, and the technical challenges I've faced along the way. All views and opinions are my own and do not represent those of my employer.