AWS Glue First Impressions
AWS Glue is a managed ETL (Extract, Transform, Load) service for moving data between AWS products such as S3, RDS, and Redshift. I tested it out by moving S3 data into Redshift and by transforming JSON data to CSV format within S3.
How It Works
Although the split concepts of Crawlers, Databases, Tables, Jobs, and Triggers are confusing at first, the product is fairly easy to use after enduring the initial learning curve. Here’s how I would break it down.
A Crawler looks at your data source and tries to understand the schema of the data it contains. It stores that schema as a…
Table, which is part of a Database. A Table is essentially a virtual object definition, created by a Crawler. It can be the schema of CSVs in S3 (or many other data formats in S3), or an actual database table from Redshift, RDS, or any database that is accessible via JDBC. Once you have at least one Table defined, you can move the data using a Job.
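To make that concrete, here’s a rough sketch of setting up a Crawler with boto3. The names, role ARN, and bucket paths are made up; the point is just that a Crawler targets an S3 prefix and writes the inferred schema into a Database as a Table.

```python
import boto3

glue = boto3.client("glue")

# Point a Crawler at an S3 prefix; it infers the schema of the files there
# and records it as a Table inside the named Database in the Data Catalog.
glue.create_crawler(
    Name="raw-json-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="my_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw-json/"}]},
)

# Run it once; you can also put it on a schedule.
glue.start_crawler(Name="raw-json-crawler")
```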
A Job is a task that copies data from one place to another. You can move S3 data to Redshift, copy a Redshift table to S3, copy an RDS table to Redshift…the sky’s the limit.
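Defining a Job is mostly a matter of pointing Glue at an ETL script in S3 and telling it how much capacity to use. A minimal sketch (job name, role, and script location are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# A Job is a pointer to a PySpark ETL script plus the resources to run it.
glue.create_job(
    Name="json-to-csv",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # hypothetical role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/json_to_csv.py",
    },
    AllocatedCapacity=10,  # number of DPUs allocated to the Job
)

# Kick off a run on demand; Triggers can also start it for you.
glue.start_job_run(JobName="json-to-csv")
```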
A Trigger is a configuration for when to run a Job or set of Jobs. This can be on a schedule, such as daily, or fired on Job Events, such as when a Job succeeds or fails.
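Both flavors of Trigger can be created with a couple of API calls. A sketch with made-up job names and schedule:

```python
import boto3

glue = boto3.client("glue")

# Schedule-based Trigger: run the Job every day at 06:00 UTC.
glue.create_trigger(
    Name="daily-json-to-csv",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "json-to-csv"}],
    StartOnCreation=True,
)

# Event-based Trigger: run a follow-up Job only after the first one succeeds.
glue.create_trigger(
    Name="after-json-to-csv",
    Type="CONDITIONAL",
    Predicate={"Conditions": [
        {"LogicalOperator": "EQUALS", "JobName": "json-to-csv", "State": "SUCCEEDED"}
    ]},
    Actions=[{"JobName": "load-into-redshift"}],
)
```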
How It Performed In My Limited Testing
I set up several Crawlers, Tables, and Jobs to test out Glue. One was simply taking a set of JSON files in S3 (approximately 90,000 files of under 1MB each) and flattening them to CSV files, storing the new files in S3 as well. All told, this is about 7GB of total data to be transformed.
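For reference, a flattening Job along those lines would look roughly like the script below (this is what the Job’s ScriptLocation would point at). The database, table, and bucket names are hypothetical, and Relationalize is one built-in way to flatten nested JSON, not necessarily the only approach.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Relationalize
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the JSON files through the Table the Crawler created.
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="raw_json")

# Relationalize flattens nested JSON; "root" is the top-level flattened frame.
flattened = Relationalize.apply(
    frame=source, staging_path="s3://my-bucket/tmp/", name="root").select("root")

# Write the flattened records back to S3 as CSV.
glue_context.write_dynamic_frame.from_options(
    frame=flattened,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/flattened-csv/"},
    format="csv",
)

job.commit()
```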
With 10 DPUs (Data Processing Units) allocated to the Job, it took 2 hours to run. With 20 DPUs allocated, it took 90 minutes. Why not 60 minutes, since the compute power was doubled? AWS Glue is serverless, meaning the Job needs to perform a cold start every time it runs (assuming it’s not being run constantly). So a daily Job would perform a cold start on every run. The cold start took 25 minutes on each run.
The other Jobs I tested took a similarly long time, ranging from 45 minutes to an hour to do simple operations. The time performance alone is unacceptable, but let’s focus on the cost.
At $0.44 per DPU-hour, that cold start on the second run cost about $3.70. But imagine you have 30 Jobs running daily. The cold starts alone will run roughly $100/day, or $3,000/month, in waste. And that doesn’t even include the cost of the actual compute power to finish the Jobs. The price of AWS Glue is exorbitant.
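To make the back-of-the-envelope math explicit, here’s the calculation, under my assumption that the 25-minute cold start is billed at the full DPU rate:

```python
DPU_RATE_PER_HOUR = 0.44   # USD per DPU-hour
DPUS = 20                  # capacity allocated on the second run
COLD_START_HOURS = 25 / 60 # 25-minute cold start

# One cold start, assuming those idle minutes are billed at the DPU rate.
cold_start_cost = DPUS * DPU_RATE_PER_HOUR * COLD_START_HOURS  # ≈ $3.67

daily_waste = 30 * cold_start_cost   # 30 daily Jobs ≈ $110/day
monthly_waste = 30 * daily_waste     # ≈ $3,300/month

print(cold_start_cost, daily_waste, monthly_waste)
```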
Conclusion
I wanted to like AWS Glue, I really did. I had high hopes after reading their intros and watching their product announcement videos. I gave them a solid $100 worth of test runs to see how it works. But the fact that I barely scratched the surface and had racked up $100 in fees? That’s crazy.
ETL shouldn’t cost this much, especially for such slow performance. For now, I’ll be looking at using Lambdas to achieve the same results at a fraction of the cost. Maybe someday Glue will live up to expectations.
Bryan
Thanks for the post and insight. When working with crawlers, do you remember how to determine (if at all) the number of DPUs that were used?
Kelvis da Gama
Hi Tom,
It is such a great post, congrats!
I found your post while searching for why the cold start takes so much time.
Just to complement your text, here is what I found in my searches and in my tests:
– There is no cold start if you run your job again within 1 hour (you mentioned that, I’m just specifying the time)
– On the AWS Glue pricing page, I found that there are no charges during the cold start. If you go to the History panel, you’ll see that the Execution time is different from the elapsed (Start time – End time) window; you pay just for the time your job takes to finish after the cold start.
But I agree with you, it is slower than my expectations.
Kk
I agree with Kelvis, you don’t pay for “startup time”. You only pay for actual execution time.
Based on my findings, this is what aws glue is good for:
1. simple etl
2. time-insensitive etl.
3. for infrequent time-insensitive etl, glue is great. you save on ops effort (no emr to maintain) and on cost (only pay what you use, no idle time charge)
it’s not great for time-sensitive etl. it’s not great for really large datasets, because you cannot optimize your spark config params when it comes to glue. it’s great at extracting and dumping data.