How save costs on AWS SQL Athena? Cost of using AWS SQL Athena is killing you?
consider the below
- did you switch to columnar? if not try the this link as reference: convert to columnar from raw based data.
- did you use parquet or orc? one of them take less space.
- did u use partitioning? did you use the correct partitioning for your query?
- if using ORC consider using bucketing on top of partitioning, not sure if Athena supports this. confirming this. TBD.
- if using highly nested data , perhaps AVRO will be your space saver, need to test this.
- did you compress via gzip? or something else? there more compressions supported, each have there own storage put print and scan put print….
- was your data spliced into chunks? if so try to change chunk size. more complicated but doable, again, could go either way – need to test this will your data.
- apply hints on the table may help on data scan in some cases. not sure if Athena supports this. confirming this. TBD.
- if using multiple tables join, order of joins, may impact scanned data
- consider pre aggregating data if possible as part of your transformation/cleansing process. even if it is on each (using window table, each row will hold aggregation tables. )
- consider pre calculating table with heavy group by on raw-data. i.e have the data already calculated on s3, and have your production user/ end user query that table.
- if your application permits this – using caching layer like elastic cache. remember Athena has it own caching as well (results are saved for 24 hours)
- have a data engineer review each query, to make sure data scan is minimised. for example
- Minimise the columns in the results set… a results set of longs strings maybe be very costly.
- where possible switch strings to ints, this will minimise footprint on storage greatly.
- if possible switch from bigint to tinyint. this will save some disk space as well. notice the list of supported data types: https://prestodb.io/docs/current/language/types.html
- Consider EMR with presto and task group with spots + auto scaling – you could have a tighter control on max budget that will be used, and in some cases it may be faster running on AWS EMR.
15.Benchmarking different types of Hadoop / spark / hive / Presto storage and compressions types.
Using this blog on convert raw based data to columnar on tpch data , I created new destinations tables , and converted to different type storage and compression. For completeness I used popular ORC and parquet the following compressions types GZIP, GZIP, Deflate, LZO, Zlib. Afterwards I ran a simple preview query on each of the 5 compressions below:
SELECT * FROM “tpch_data”.”lineitem_compresation_name” limit 10
Results on data read via Athena on cold queries (data scanned only once, after 72 hours):
- Parquet, GZIP: (Run time: 4.8 seconds, Data scanned: 84MB)
- Parquet, BZIP: (Run time: 6.0 seconds, Data scanned: 242MB)
- Parquet, Deflate: (Run time: 5.81 seconds, Data scanned: 242MB)
- Parquet, LZO: (Run time: 9 seconds, Data scanned: 15GB)
- ORC, Zlib : (Run time: 10.1 seconds, Data scanned: 11GB)
- Text,GZIP: (Run time: 5.9 seconds, Data scanned: 49MB)
Results on data read via Athena on hot queries (data scanned several times):
- Parquet, GZIP: (Run time: 6 seconds, Data scanned: 2.5GB)
- Parquet, BZIP: (Run time: 6.39 seconds, Data scanned: 4.84GB)
- Parquet, Deflate: (Run time: 4.88 seconds, Data scanned: 5.31GB)
- Parquet, LZO: (Run time: 5.77 seconds, Data scanned: 8.97GB)
- ORC, Zlib : (Run time: 6.5 seconds, Data scanned: 5.07GB)
The results are not surprising as there are many compressions, and each can behave differently on different data types (strings, ints, floats, rows, columns) . notice the scan time is more or less the same.
I ran a columnar test on a specific int column instead of all columns in this queries, and the results were the same, as all compressions read the same amount of giga scanning this specific column. again notice the different running time.
SELECT avg( l_orderkey) FROM “tpch_data”.”lineitem_parquet”
- Parquet, GZIP: (Run time: 6.84 seconds, Data scanned: 3.77GB)
- Parquet, BZIP: (Run time: 9.03 seconds, Data scanned: 3.77GB)
- Parquet, Deflate: (Run time: 9 seconds, Data scanned: 3.77GB)
- Parquet, LZO: (Run time: 9.94 seconds, Data scanned: 3.77GB)
- ORC, Zlib : (Run time: 11.27 seconds, Data scanned: 3.77GB)
Again, This test should not convince you to prefer one storage type over the other, nor to prefer one compression type over the other. Test it on you data, and understand the differences in your use case.
and don’t forget monthly AWS s3 Storage costs….
- Old data can be reduced to reduced availability.
- how much data is read in your queries. if your invoice of Athena is 500$, this means you are reading 100 TB of compressed data. so… can you minimised the amount of data read in your query – or can your reduce the history from 2 years to 1 year ?
Conclusion on cost reduction using AWS SQL Athena
- As you can see, you could be saving a 50% or more. easily on your AWS SQL Athena costs simply by changing to the correct compression.
- Check the running time, be sure it is a non issues for your use case.
- regarding the text vs parquet, be sure to understand the use-case, not always you need to extract all the rows, thus column based storage, will be more use-full.
Need to learn more about aws big data (demystified)?
- Contact me via linked in Omid Vahdaty
- website: https://amazon-aws-big-data-demystified.ninja/
- Join our meetup, FB group and youtube channel
- Join our meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our facebook group https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- subscribe to our youtube channel https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber