AWS EMR Presto Demystified | Everything you wanted to know about Presto

Presto: JDBC access, in-memory processing, sometimes faster than Athena.

https://aws.amazon.com/blogs/big-data/analyze-data-with-presto-and-airpal-on-amazon-emr/

External table in s3

http://www.jiayul.me/tutorial/2016/07/19/querying-s3-data-using-hive-and-presto.html

insert into external table?

Syntax limitations compared with hive

  1. INSERT OVERWRITE statements are NOT supported

Presto does not currently support INSERT OVERWRITE statements. Please delete the table before INSERT INTO. See the details here:

https://docs.treasuredata.com/articles/presto-known-limitations

  2. JOINs are NOT automatically reordered

Presto does not currently support cost-based JOIN optimizations, meaning JOINs are not automatically reordered based on table size. Make sure that the smaller table is on the right-hand side of the JOIN, and that it fits in memory. Otherwise, out-of-memory exceptions will cause the query to fail.
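Both limitations can be handled in the query text itself. The sketch below is illustrative only: the table names (`daily_summary`, `events`, `countries`) and their columns are hypothetical, and dropping and rebuilding a managed table with CREATE TABLE AS is just one possible stand-in for INSERT OVERWRITE, not something the linked page prescribes. The script only prints the SQL; the comment at the end shows how it might be handed to presto-cli.

```shell
#!/bin/bash

# Stand-in for INSERT OVERWRITE: drop the (hypothetical) managed table,
# then rebuild it from the source query with CREATE TABLE AS.
overwrite_sql() {
  cat <<'SQL'
DROP TABLE IF EXISTS daily_summary;
CREATE TABLE daily_summary AS
SELECT country_code, count(*) AS hits
FROM events
GROUP BY country_code;
SQL
}

# Join ordering: large table on the left, small table on the right,
# because the right-hand side is the one that has to fit in memory.
join_sql() {
  cat <<'SQL'
SELECT c.name, count(*) AS hits
FROM events e
JOIN countries c
  ON e.country_code = c.code
GROUP BY c.name;
SQL
}

# Print both statements so they can be reviewed before running them.
overwrite_sql
join_sql

# To run on a cluster (assumes presto-cli is on the PATH):
#   presto-cli --catalog hive --schema default --execute "$(join_sql)"
```

Note that dropping an external table removes only its metadata, not the files in S3, so this stand-in is aimed at managed tables.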

Best practices

http://docs.qubole.com/en/latest/user-guide/presto/best-practices.html

Partitions

https://stackoverflow.com/questions/20185271/is-presto-hive-partition-aware

Dynamic partitions? Not supported in presto. 😦

Tuning

https://qubole.zendesk.com/hc/en-us/articles/210266303-How-To-Presto-Tuning

https://docs.treasuredata.com/articles/presto-performance-tuning

How to use Presto with EMR:

1. Presto with Airpal: Airpal has many helpful features, such as syntax highlighting and exporting results to CSV for download. Airpal lets you find tables, see metadata, browse sample rows, write and edit queries, and then submit them, all in a web interface. Please note that running an extra Airpal server will lead to extra EC2 costs.

2. Presto with Hue: You can use Presto with Hue (hue-4.0.1) on EMR (version 5.9.0 or later). Hue provides a SQL editor for running your Presto queries in a web interface similar to Airpal (the features provided by Hue may differ from those of Airpal). As far as I understand, Hue is a better option than Airpal, because you can install Hue as part of the EMR installation.

3. Presto on EMR CLI: You can run Presto from the command line interface and monitor your queries using the Presto web UI. Open "MASTER_NODE_IP:8889" (the default) to monitor your cluster details. To enable web interfaces for an EMR cluster, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-ui-console.html

4. Use Athena instead of Presto on EMR: You can also use AWS Athena (https://aws.amazon.com/athena/) if you want to process data stored in S3. Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3. Athena internally uses Presto as its SQL query engine.

5. Use Presto on EMR when you want to reduce the costs of your AWS Athena service.

Hive partitions including dynamic partitioning with Hive

Presto has full support for Hive partitions including dynamic partitioning with Hive [1,2].

On EMR, when you install Presto on your cluster, EMR installs Hive as well. Presto uses the Hive metastore to map database tables to their underlying files.

The INSERT query into an external table on S3 is also supported by the service. To query data from Amazon S3, you will need to use the Hive connector that ships with the Presto installation.

[1] https://github.com/prestodb/presto/blob/1e49d9b125b6897d5014b64f38355605dfe9318d/presto-hive/src/main/java/com/facebook/presto/hive/HivePartitionKey.java

[2] https://github.com/prestodb/presto/blob/886cdf90f4e5b331afcebdde91eae5cfe2a2834d/presto-hive/src/main/java/com/facebook/presto/hive/HiveWriterFactory.java
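As a rough illustration of writing into a partitioned Hive table from Presto, the sketch below prints an INSERT statement. The table and column names (`logs`, `staging_logs`, `id`, `msg`, `dt`) are hypothetical, and it assumes `logs` was created in Hive with `PARTITIONED BY (dt STRING)`. With the Hive connector the partition column comes last in the SELECT, and a partition is written for each distinct `dt` value. Whether writes to an external table are allowed may depend on the connector configuration of your Presto version, so treat this as a starting point rather than a guaranteed recipe.

```shell
#!/bin/bash

# Build an INSERT that writes into a partitioned table (hypothetical names).
# The partition column `dt` is listed last, matching the table layout.
insert_partitioned_sql() {
  cat <<'SQL'
INSERT INTO hive.default.logs
SELECT id, msg, dt
FROM hive.default.staging_logs;
SQL
}

# Print the statement for review.
insert_partitioned_sql

# To run on a cluster (assumes presto-cli is on the PATH):
#   presto-cli --execute "$(insert_partitioned_sql)"
```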

Scheduling job in presto

As per my understanding, you can use one of the following methods:

1. You can create a shell script and submit it as a step to the cluster. For example, you can create a script like the one below. See [1] for more details on submitting a step to a cluster.

=====
#!/bin/bash
# Count the rows of TABLE_NAME in the default schema of the hive catalog.
presto-cli --catalog hive --schema default --execute "select count(*) from TABLE_NAME;"
=====

2. Use a shell action to schedule an Oozie workflow on the EMR cluster (Oozie needs to be installed as part of the EMR cluster). See the blog post [2], which explains how to use Oozie workflows.

3. You can save your queries in Hue and then run those saved queries from the Hue console.
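For the first method, one way to submit a script as a step is through the AWS CLI. The sketch below only builds and prints the `aws emr add-steps` command so it can be reviewed first. The cluster ID, region, bucket, and step name are placeholders, and it assumes the script has already been uploaded to S3 and that the region's `script-runner.jar` is used to launch it; check the step documentation for the options that apply to your EMR release.

```shell
#!/bin/bash

# Placeholder values (assumptions): replace with your own cluster and bucket.
CLUSTER_ID="j-XXXXXXXXXXXXX"
REGION="us-east-1"
SCRIPT_S3_PATH="s3://my-bucket/scripts/presto-count.sh"

# Build the add-steps command that runs the uploaded shell script
# through script-runner.jar as a custom JAR step.
build_add_steps_cmd() {
  local jar="s3://${REGION}.elasticmapreduce/libs/script-runner/script-runner.jar"
  local step="Type=CUSTOM_JAR,Name=PrestoQuery,ActionOnFailure=CONTINUE,Jar=${jar},Args=[${SCRIPT_S3_PATH}]"
  printf 'aws emr add-steps --cluster-id %s --steps %s\n' "$CLUSTER_ID" "$step"
}

# Print the command; copy it to a terminal (or pipe it to bash) to submit the step.
build_add_steps_cmd
```

Printing the command instead of running it keeps the sketch safe to try on a machine without AWS credentials.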

I hope that above information is helpful to you. Kindly let me know if I missed something.

References:

[1]. Submit step to an EMR cluster: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-work-with-steps.html#emr-add-steps

[2]. Oozie workflows: https://aws.amazon.com/blogs/big-data/use-apache-oozie-workflows-to-automate-apache-spark-jobs-and-more-on-amazon-emr/

Working example with Hive and presto:

  1. create table via hive
  2. select via presto

http://www.jiayul.me/tutorial/2016/07/19/querying-s3-data-using-hive-and-presto.html
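The two steps can be sketched as follows. This is a minimal illustration, not the linked tutorial's code: the table `page_views`, its columns, and the bucket path are hypothetical, and the data is assumed to be comma-delimited text files in S3. The script prints the Hive DDL and the Presto query; the comments show how each might be run on the master node.

```shell
#!/bin/bash

# Step 1: Hive DDL that maps an external table onto files in S3
# (hypothetical table, columns, and bucket).
hive_ddl() {
  cat <<'SQL'
CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
  user_id STRING,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/page_views/';
SQL
}

# Step 2: Presto query against the same table through the Hive metastore.
presto_query() {
  echo 'SELECT url, count(*) AS views FROM page_views GROUP BY url'
}

# Print both so they can be reviewed before running them.
hive_ddl
presto_query

# On the master node (assumes hive and presto-cli are on the PATH):
#   hive -e "$(hive_ddl)"
#   presto-cli --catalog hive --schema default --execute "$(presto_query)"
```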

 

Presto Connectors

  1. Kudu
  2. Elasticsearch
  3. Apache Phoenix
  4. Amazon Redshift
  5. Thrift
  6. Cassandra

New features:

 

 
