With the announcement of Amazon Athena at this year's AWS re:Invent conference, I couldn't help but notice its striking similarity to a rival cloud offering: Google's BigQuery. Athena is a managed service that lets customers query objects stored in an S3 bucket. Unlike other AWS offerings such as Redshift, you only pay for the queries you run; there is no need to manage, or pay for, infrastructure that you may not be using all the time. All you need to do is define your table schema and reference your files in S3. This works in a similar way to BigQuery's federated sources, which reference files in Google Cloud Storage.
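For example, exposing a CSV file sitting in S3 as a queryable table is a single Hive-style DDL statement in Athena. A minimal sketch (the table name, columns and bucket below are placeholders, not data used in this post):

CREATE EXTERNAL TABLE my_events (
  event_time string,
  user_id    string,
  action     string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/events/';

Once the table is declared you can query the underlying files with plain SQL, and dropping the table leaves the files in S3 untouched.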
Given this, I thought it would be interesting to compare the two platforms and see how they stack up against each other. I wanted to find out which one is faster, which is more feature rich and which is more reliable.
Let's Get Started
I thought I would start by comparing the two services with a simple aggregation. For all my tests I will be using the publicly available GitHub datasets. I started with the licenses table, which contains the names of all open source GitHub repos along with their license names: 3.1 million records in a 90MB file. For the first comparison I decided to use the following query:
SELECT license, COUNT(*) AS licenses FROM `bigquery-public-data.github_repos.licenses` GROUP BY license ORDER BY licenses DESC
I ran it in BigQuery first and, as I expected, it was really fast: it completed the query in 2.8 seconds. Next up was Athena. I wasn't really sure what to expect here; unlike BigQuery, Athena does not have the concept of native tables, and it's backed by an S3 bucket, which I was expecting to slow the query down. So I ran the query and, to my surprise, the results were returned in 5.32 seconds. That's 5.32 seconds to load 3.1 million records from a file in S3, group the records and then sort them. I have to say I was really impressed. However, it's still 2.52 seconds slower than BigQuery.
External Tables
BigQuery has native tables, which are highly optimised for reads, whereas Athena is backed by an S3 bucket, which provides a convenient way of querying existing data without needing to convert it into a table. How might BigQuery stack up under the same circumstances? Well, let's find out. BigQuery has a feature called federated sources which allows users to query files stored in Google Cloud Storage (GCS). I exported the licenses table to a CSV file in GCS (the same one I used for the Athena test), created a federated source table in BigQuery referencing the licenses CSV file, and ran the same query. To my surprise, it took 13.7 seconds. I knew that a federated source table would be slower than a native table, but I never expected it to be almost 10 seconds slower.
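For reference, BigQuery can also declare a federated (external) table over GCS with DDL. A minimal sketch, with placeholder dataset and bucket names rather than the ones I actually used:

CREATE EXTERNAL TABLE my_dataset.licenses_external (
  repo_name STRING,
  license   STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/licenses.csv']
);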
Can we get even faster results in Athena?
Athena allows you to partition your data to get even better performance. Similar to partitioned tables in BigQuery, you will only be charged for the data in the partitions used by your query. Unlike BigQuery, though, you can partition your data on any column of your choosing. In the case of the GitHub licenses table, it makes sense to partition on the license column, which makes any grouping by license name extremely fast.
If your data is already split into separate files, it's just a matter of creating a partitioned table which references those files in S3. However, if your data is not partitioned, as in my case, you will need to partition it yourself. I found myself writing a simple script to read the licenses file and write it out into separate files partitioned by license name. In effect, I transformed a single CSV file with the schema repo_name, license into multiple files containing only the repo name, with the file name being the license name. This was hardly ideal, but let's see if it was worth it. I ran the above query again and… boom! 2.66 seconds. That's faster than BigQuery's native table, from an Athena table that is still just querying CSV files.
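For reference, a partitioned table in Athena is declared with a PARTITIONED BY clause, and each partition's S3 location has to be registered before it can be queried. A rough sketch, with an illustrative bucket layout rather than my exact one:

CREATE EXTERNAL TABLE licenses_partitioned (
  repo_name string
)
PARTITIONED BY (license string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/licenses/';

-- register a partition explicitly...
ALTER TABLE licenses_partitioned
  ADD PARTITION (license = 'mit')
  LOCATION 's3://my-bucket/licenses/license=mit/';

-- ...or, if the folders follow the key=value naming convention,
-- load them all in one go:
MSCK REPAIR TABLE licenses_partitioned;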
Let’s try some more data!
In the first query we were only using the licenses table, which was not all that interesting. This time I wanted to also join to the commits table, so we can see which licenses are the most popular by number of commits. Since the commits table is over 608GB in size, I opted for a trimmed-down version: a table mapping every repo to the number of commits recorded against it. This meant joining two tables of 3.1 million records each.
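The query was along these lines (a sketch; the trimmed-down commits table and its column names are named illustratively here):

SELECT l.license, SUM(c.commit_count) AS commits
FROM licenses l
JOIN repo_commit_counts c ON l.repo_name = c.repo_name
GROUP BY l.license
ORDER BY commits DESC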
Not a bad result: 3.05 seconds for Athena to do the join and the group by. Now let's see how the same query performs in BigQuery.
BigQuery was slower again, as we saw with the previous query. Athena's performance here is clearly helped by the licenses table being partitioned by license name. I was seeing some really good performance on relatively small datasets, but what about something much, much larger? What if we try querying a table with more than 2 billion records? The files table lists all files publicly stored on GitHub, and I thought it would be interesting to see which file type is the most popular, based on the file extension.
I was only interested in the path column, as it contains the file name, so I ran the following query:
SELECT REGEXP_EXTRACT(path, r'.+\.(\w+)$') AS file_ext, COUNT(*) AS files_with_extension FROM `bigquery-public-data.github_repos.files` GROUP BY file_ext ORDER BY files_with_extension DESC
I first ran the query in BigQuery and the result was returned in 17.4 seconds. That's 17.4 seconds to group together more than 2 billion records, which is really impressive! Athena couldn't possibly compete with this, could it? Well, let's find out! I put the query in, hit run and waited. A few minutes later, the query failed.
Athena couldn’t handle 2 billion records? That’s really annoying! Maybe I could try something a little simpler. How about, which repos have the most files?
SELECT repo_name, COUNT(*) AS num_files FROM files GROUP BY repo_name ORDER BY num_files DESC LIMIT 10
Athena was able to run this query in 28.51 seconds, on the first go and without any errors. The same query in BigQuery took only 15.9 seconds. Could I get better performance by partitioning my table and converting it into a columnar format? I probably could; however, doing that for a table with more than 2 billion records would take quite a lot of time, and unfortunately I didn't get the chance.
Reliability
I have been working with BigQuery quite a lot over the past few years. For the most part, I have found it to be reliable. While some days we do record several hundred failed operations, retrying the operation generally succeeds. I can't say the same for Athena.
One query, while simple, refused to run. I retried several times without success, and the same thing happened with several other queries; I just couldn't get them working. Running the same query again several minutes later did eventually work. Given that Athena is a new product I would expect some issues, and I have since been informed that this is a known problem they are working on, so it would be good to see it fixed.
User Defined Functions (UDF)
What is a UDF? In BigQuery, it's a JavaScript function that can be called as part of your query. The function takes a table row as input and returns zero or more rows as output. This is quite powerful functionality that allows you to mix SQL and code; you can read more about it here. Athena, however, does not yet support UDFs. They are on the roadmap, which is great to hear, but there is no information on how they will work. I would assume they will work in a similar way to Apache Hive, where UDFs are written in Java, but we will just have to wait and see.
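To give a feel for the BigQuery side, here is a minimal sketch of a JavaScript UDF using the standard SQL syntax, where UDFs are scalar functions rather than row-in/rows-out; the function itself is just an illustration:

CREATE TEMPORARY FUNCTION file_ext(path STRING)
RETURNS STRING
LANGUAGE js AS r"""
  var m = /\.(\w+)$/.exec(path);
  return m ? m[1] : null;
""";

SELECT file_ext(path) AS ext, COUNT(*) AS total
FROM `bigquery-public-data.github_repos.files`
GROUP BY ext
ORDER BY total DESC;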
What Athena does better than BigQuery
- Can be faster than BigQuery, especially when querying federated/external tables
- Ability to use a regex to define a schema, allowing you to query log files without needing to change the format of the file (see the sketch after this list)
- Can be faster and cheaper than BigQuery by running queries against tables that make use of a Partitioned/Columnar format
- Tables can be partitioned on any column
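As an example of the regex-based schemas mentioned above, Athena lets you lay a table over raw log lines using a regex SerDe, with each capture group mapping to a column. A minimal sketch, with a made-up log format, regex and bucket:

CREATE EXTERNAL TABLE app_logs (
  log_time  string,
  log_level string,
  message   string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(\\S+) (\\S+) (.*)$"
)
LOCATION 's3://my-bucket/logs/';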
Issues with Athena
- It’s not easy to convert data between formats
- No ability to modify any data, insert/update/delete/etc
- I was randomly getting HIVE_UNKNOWN_ERROR errors, an issue with partitioned tables
- No support for native tables
- No streaming support
- Struggles with really large datasets
- Quoted CSV files are not fully supported
What BigQuery does better than Athena
- It has native table support giving it better performance and more features than Athena
- It’s easy to manipulate data, insert/update records, truncate tables, write query results back to tables etc…
- Querying native tables is really fast
- Easy to convert non-columnar formats into a native table for columnar queries
- Supports UDFs (they are only on the roadmap for Athena)
- Supports nested tables
- Works well for petabyte scale queries
Final Thoughts
Did Athena slay BigQuery? Athena is a great option for querying data you already have stored in S3. It offers great performance, and if you are already running your infrastructure on the AWS stack you probably have logs and other files sitting in S3, so getting up and running is as simple as creating an Athena table and pointing it at your data. You can get better performance by partitioning your data and converting it into a columnar format, but this is something you need to take care of yourself. It can be a really tedious process, and Athena provides no mechanism for converting or otherwise manipulating your data. It would be nice to have a feature in the console/SDK to convert tables into partitioned and columnar formats, and to be able to write the results of a query to a new table or append them to an existing one. Athena also has some work to do to fix its broken features and improve reliability.
BigQuery is a mature product that has been around for many years now (since 2010). It is feature rich and extremely fast. I think that allowing users to partition tables on columns of their choice would offer a huge improvement in the performance and cost of using BigQuery. All said and done, I think BigQuery is the better product at this stage; however, it has had a big head start over Athena, which has a lot of catching up to do.
