AWS S3 caching while working with Hive, Spark SQL and external tables | LLAP

If you are looking for ways to speed up queries that repeatedly read subsets of the same data, and you would like to know whether there is an AWS solution for caching frequently accessed data, here are a few options.

If you are using Hive, you may use LLAP (if you are not already). LLAP is effectively a daemon that caches metadata as well as the data itself. There is an AWS blog post on enabling LLAP using a bootstrap action and then executing your queries; see [1], and let me know if you have any questions about it. LLAP daemons are launched under YARN management so that the nodes do not get overloaded by the compute resources these daemons consume. You may specify the number of daemon instances, the memory allocation, the number of executors per instance and so forth. Each option also has a default value.
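To get a feel for what the defaults mean on a concrete node, here is a minimal Python sketch that applies the percentages from the option list in this post. It is only arithmetic, not the bootstrap script itself: the function name, the input parameters, the megabyte units and the rounding (integer floor) are all my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlapDefaults:
    instances: int   # --instances
    cache_mb: int    # --cache
    executors: int   # --executors
    iothreads: int   # --iothreads
    size_mb: int     # --size
    xmx_mb: int      # --xmx


def llap_defaults(worker_nodes, cpu_cores, physical_mem_mb, available_mem_mb):
    """Apply the default rules quoted in this post to one node's specs.

    Hypothetical helper: rounding down to whole megabytes is an assumption,
    the real bootstrap script may round differently.
    """
    size_mb = available_mem_mb * 50 // 100            # 50% of available memory
    return LlapDefaults(
        instances=worker_nodes,                       # one daemon per worker node
        cache_mb=physical_mem_mb * 20 // 100,         # 20% of physical memory
        executors=cpu_cores,                          # one executor per CPU core
        iothreads=cpu_cores,                          # one IO thread per CPU core
        size_mb=size_mb,
        xmx_mb=size_mb * 50 // 100,                   # 50% of container memory
    )


# Hypothetical cluster: 4 worker nodes, 8 cores, 32 GB RAM, 24 GB available to YARN.
example = llap_defaults(worker_nodes=4, cpu_cores=8,
                        physical_mem_mb=32768, available_mem_mb=24576)
print(example)
```

For this made-up node the YARN container comes out at 12288 MB and the daemon heap at 6144 MB, so the heap is only a quarter of the memory available on the node. That is worth keeping in mind before overriding any single option on its own.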

# --instances - number of LLAP daemon instances, defaults to the number of slave nodes

# --cache - LLAP cache for each daemon, defaults to 20% of physical memory

# --executors - number of executors per daemon, defaults to the number of CPU cores

# --iothreads - number of IO threads, defaults to the number of CPU cores

# --size - YARN container memory, defaults to 50% of available memory on a node

# --xmx - LLAP daemon memory, defaults to 50% of container memory

# --log-level - log level, defaults to INFO

If you are using Spark, RDD persistence is one of the mechanisms you may use to cache data in memory across operations. There are multiple storage levels to choose from: memory only, or memory and disk, among others listed in [2]. You can mark an RDD to be persisted using the persist() or cache() methods on it.
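As a rough illustration of the persist() call described above, here is a minimal PySpark sketch. The helper names (sum_and_count, run_local_demo), the local[2] master and the toy data are assumptions of mine, not something taken from the Spark guide; only persist(), unpersist(), sum(), count() and StorageLevel.MEMORY_AND_DISK are the real PySpark API.

```python
def sum_and_count(rdd, storage_level):
    """Persist an RDD, run two actions against it, then release the cache.

    Without persist(), the second action would recompute the RDD's whole
    lineage (for example, re-read the source data) instead of reusing it.
    """
    rdd.persist(storage_level)
    try:
        total = rdd.sum()     # first action: computes and caches the partitions
        count = rdd.count()   # second action: served from the cached partitions
        return total, count
    finally:
        rdd.unpersist()       # free executor memory and disk once done


def run_local_demo():
    """Hypothetical local run; requires pyspark and a Java runtime."""
    # Imported lazily so the helper above can be read and exercised
    # on a machine without a Spark installation.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")
             .appName("persist-sketch")
             .getOrCreate())
    try:
        doubled = spark.sparkContext.parallelize(range(1000)).map(lambda x: x * 2)
        # MEMORY_AND_DISK spills partitions that do not fit in memory to disk.
        return sum_and_count(doubled, StorageLevel.MEMORY_AND_DISK)
    finally:
        spark.stop()
```

Calling run_local_demo() on a machine with PySpark installed runs both actions against the cached RDD. Note that cache() is simply persist() with the default storage level, so the explicit level only matters when you want something other than the default.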

Tachyon (now called Alluxio) is similar in spirit. It sits between HDFS and Spark and provides an in-memory file system, like a virtual distributed storage layer. Integration of Alluxio in EMR is currently in the development stages [3].

I have not personally tested the above solutions, but I am planning to, and will update this post in the future. Have you tested this yourself? Please contact me with your feedback.

References

[1] AWS Blog, LLAP: https://aws.amazon.com/blogs/big-data/turbocharge-your-apache-hive-queries-on-amazon-emr-using-llap/

[2] RDD Persistence: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence

[3] Alluxio Docs: http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html#class-alluxiohadoopfilesystem-not-found-issues-with-sparksql-and-hive-metastore

[4] LLAP Wiki: https://cwiki.apache.org/confluence/display/Hive/LLAP#LLAP-Caching

[5] LLAP benchmark: https://www.slideshare.net/Hadoop_Summit/hadoop-query-performance-smackdown

[6] Hive LLAP benchmark vs Impala: https://dzone.com/articles/3x-faster-interactive-query-with-apache-hive-llap

 

Need to learn more about aws big data (demystified)?

 

