JDBC , in memory, sometimes faster than athena.
External table in s3
insert into external table?
Syntax limitations compared with hive
- INSERT OVERWRITE Statements are NOT Supported
Presto does not currently support INSERT OVERWRITE Statements. Please delete table before INSERT INTO. See the detail here.
2. Presto does not currently support cost-based JOIN optimizations, meaning JOINs are not automatically reordered based on table size. Please make sure that smaller tables are on the right hand size of JOIN, and they must fit in memory. Otherwise out of memory exceptions will cause the query to fail.
Dynamic partitions? Not supported in presto. 😦
How to use presto with emr:
1. Presto with Airpal– Airpal has many helplful features like highlighting syntax, export results to CSV for download etc. Airpal provides the ability to find tables, see metadata, browse sample rows, write and edit queries, then submit queries all in a web interface. Please note that running an extra Airpal server will lead to extra EC2 costs.
2. Presto with Hue– You can use Presto with hue(hue-4.0.1) on EMR(version 5.9.0 or later). Hue provides Sql editor for running your presto queries in a web interface similar to Airpal(there may be a difference in features provided by hue as compared to Airpal). Hue is a better option than using Airpal as per my understanding, as you can install hue as a part of EMR installation.
3. Presto on EMR CLI– You can run presto using command line interface and monitor your queries using presto web UI. You can open “MASTER_NODE_IP:8889“(default) to monitor your cluster details. To enable web interfaces for EMR cluster, Kindly refer(https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-ui-console.html)
4. Use Athena instead of Presto on EMR– You can also use AWS Athena(https://aws.amazon.com/athena/) if you want to process data present in S3. Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3. Athena internally uses Presto as sql query engine.
5. Use presto, when you want to reduce costs on your AWS Athena service.
Hive partitions including dynamic partitioning with Hive
Presto has full support for Hive partitions including dynamic partitioning with Hive [1,2].
On EMR, when you install Presto on your cluster, EMR installs Hive as well. Presto uses the Hive metastore to map database tables to their underlying files.
The INSERT query into an external table on S3 is also supported by the service. To query data from Amazon S3, you will need to use the Hive connector that ships with the Presto installation.
Scheduling job in presto
As per my understanding, you can use one of the following methods:
1. You can create a shell script and submit it as a step to the cluster. For example, you can create a script. Kindly refer  for more details on submitting step to a cluster.
presto-cli –catalog hive –schema default –execute “select count(*) from TABLE_NAME;”
2. Use a shell action to schedule an oozie workflow on EMR cluster(oozie needs to be installed as part of EMR cluster). Kindly refer this blog , which explains on how to use oozie workflows.
3. You can save your queries in hue and then run those saved queries in hue console.
I hope that above information is helpful to you. Kindly let me know if I missed something.
. Submit step to an EMR cluster: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-work-with-steps.html#emr-add-steps
Working example with Hive and presto:
- create table via hive
- select via presto
- apache phoenix
- AWS Redshift
Need to learn more about AWS big data (demystified)?
- Contact me via linked in Omid Vahdaty
- website: https://amazon-aws-big-data-demystified.ninja/
- Join our meetup, FB group and youtube channel
- Join our meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
- Join our facebook group https://www.facebook.com/groups/amazon.aws.big.data.demystified/
- subscribe to our youtube channel https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber