AWS Big Data Demystified #2 | AWS Athena, Spectrum, EMR, Hive


Omid Vahdaty, Big Data Ninja

Lecture slides:

 

The video

 

Some follow-up questions from the meetup attendees via email (answers may be inlined):

1. I did all the steps as you showed in the EMR lecture (#2), including installing Presto and enabling the Glue checkbox for all of Presto, Spark, and Hive. Then I created the schema, table, and Parquet table (in Athena) and ran "insert overwrite" in Hive, exactly the way you showed in the lecture.
Now when I open Hue, I see the new DB (tpch_data) in Hive, but I don't see this DB in Presto (see the attachment).
Moreover, Presto SQL queries do not finish when I run Presto from Hue.
Might the problem be that I created the DB from Athena?
If you create the table in Hive, you have to prefix the table name with the database; otherwise it is created in Hive's local (default) database.

I have not used Presto except for testing (when I built the architecture); I believe this is a common mistake we make on my team as well.

If you have a schema created in Athena called sampleDB, then in Hive (or wherever you create the table) you must use:

create table sampleDB.tableName (...)

This should do the trick. Notice the list of databases on the left in Hue (default, tpch_data); open them to see where the table was created.

If not, let me know and we can resolve this via screen sharing.
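The fix above can be sketched as follows; the database name, columns, and S3 location are placeholder assumptions, not taken from the lecture:

```shell
# Write the DDL to a file; the "sampledb." prefix is what makes the table
# land in the shared Glue database instead of Hive's session default.
cat > create_table.hql <<'EOF'
-- Placeholder columns and location; adapt to your data.
CREATE EXTERNAL TABLE sampledb.orders (
  order_id BIGINT,
  total    DOUBLE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/orders/';
EOF

# Run it from the master node (or paste the DDL into the Hue Hive editor):
# hive -f create_table.hql
echo "DDL written to create_table.hql"
```

Because the Glue Data Catalog is shared, a table created this way should be visible to Hive, Presto, and Athena alike.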
2. Zeppelin is not loading. When I click the Zeppelin link on the AWS EMR page, the page does not load…
All the other apps load fine; only Zeppelin is stuck.
Did you open the security group for port 8890?
Are you using an SSH tunnel?
Did you use FoxyProxy to enable access in your browser?
Did you configure anything special for Zeppelin?
Could it be a firewall issue on your office side?
Try restarting Zeppelin: sudo stop zeppelin, then sudo start zeppelin.
If all of the above fails, screen sharing will help 🙂
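The checklist above can be sketched as a few shell commands; the key path, master-node DNS, and SOCKS port are placeholders:

```shell
# Zeppelin's web UI on EMR listens on port 8890.
ZEPPELIN_PORT=8890

# Option A: open an SSH tunnel (SOCKS proxy) to the master node and point
# FoxyProxy at localhost:8157 — placeholders below, substitute your own:
# ssh -i ~/mykey.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# Option B: open the cluster's security group for port 8890 (less secure).

# If the UI is reachable but stuck, restart Zeppelin on the master node:
# sudo stop zeppelin
# sudo start zeppelin

echo "Expecting Zeppelin on port ${ZEPPELIN_PORT}"
```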
3. ARCHITECTURE
1. Can you share your retention policies for the data?
I mean, according to your architecture you:
1. gzip text files and load them into dedicated S3 raw-data buckets (gzipped)
2. clean, transform to Parquet (via Hive), and save the Parquet files in another S3 bucket hierarchy
3. model, enrich, flatten, etc., and I guess here again you have another S3 bucket hierarchy
So in the bottom line you have 3 logical layers of data:
a. original text
b. original, cleaned Parquet, uncompressed
c. modeled Parquet, gzipped
[Same same but different: I keep one bucket per data source, with 3 "folders" inside, one folder per layer you mentioned above. This is due to different restrictions I have from the business side on each data source, for example different encryption / retention / access-management requirements.]
The questions are:
3.1. Do you apply the same retention policy to all layers?
Yes and no.
No: because I treat raw data differently. I delete it after I finish the initial transformation to Parquet and gzip (no cleansing or changes), and I apply different encryption policies for each data source.
Yes: once the data is transformed and modeled, I keep it for 120 days (a cookie's life cycle is 90 days).
3.2. How long do you keep it?
120 days, depending on the use case.
3.3. Do you use / plan to use AWS Glacier for old data?
No need for cold data in my use case 🙂 Check the costs…
3.4. How do you handle GDPR? If someone asks to delete data that is 6 months old, you cannot delete from Parquet, so the only way to do this is to delete the entire file and regenerate it. For this you need some kind of index to find the files targeted for deletion (for all 3 layers).
Do you have something like that?
Keep the data partitioned per day, set an S3 lifecycle policy on the bucket for 180 days (per specific path), run MSCK daily, and use Hive external tables plus dynamic partitioning for inserts. This way, when you delete data at the bucket level and run MSCK, the table is "updated" with the removed partitions, and the jobs do not fail.
 
Note: there is no insert overwrite or dynamic partitioning in Presto…
 
As for GDPR…
We started working on obfuscating the data: deleting the raw data, keeping the obfuscated data anonymized, adding opt-in and opt-out procedures, and checking whether the citizen is an EU citizen. We have much work to do. [See lecture 4…]
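A minimal sketch of the daily-partition retention flow described above, assuming day=YYYY-MM-DD partition paths and placeholder bucket/table names:

```shell
# Partitions older than the retention window are what the S3 lifecycle
# policy (or a manual sweep) removes.
RETENTION_DAYS=180
CUTOFF=$(date -d "${RETENTION_DAYS} days ago" +%Y-%m-%d)   # GNU date
echo "Objects under s3://my-bucket/modeled/day=${CUTOFF}/ and older expire"

# After S3 expires the objects, refresh Hive's view of the partitions so
# downstream jobs don't fail on missing paths:
# hive -e "MSCK REPAIR TABLE sampledb.events;"
```

Running MSCK after the lifecycle sweep is what keeps the external table's partition list in sync with what actually exists in S3.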
 
4. Presto + Zeppelin vs. Athena
I understand all the advantages of a managed service, but in this specific case I don't see many.
Install a dedicated EMR cluster with Presto, Zeppelin, and Ganglia, working with external tables and data in S3, no Hadoop, plus autoscaling.
Why do I need Athena? What am I missing?
This is a good question. Yes, both are applicable; it is a matter of cost/performance ratio and more. If you have many queries per day, the cost of Athena will be high, as its cost is mainly per TB read per day. However, if the same query runs over and over within 24 hours, the results are cached, so the cost will be lower.
So at the end of the day it is a matter of use case, expected concurrency, budget, performance targets, and simplicity vs. flexibility (managed notebooks are not a feature of Athena, but for ad hoc queries or a minimal BI backend it should do the trick).

Which will be cheaper? I would need to know how many TBs are read to answer that properly, as Athena uses on-demand resources, while EMR can use spot instances and auto scaling.
Which will be faster? It could go either way depending on the cluster, as EMR resources are dedicated.
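A back-of-the-envelope sketch of the Athena side of that comparison, assuming Athena's published per-TB-scanned pricing of $5 and made-up workload numbers:

```shell
# Assumed workload — replace with your own numbers.
TB_SCANNED_PER_QUERY=1
QUERIES_PER_DAY=100
PRICE_PER_TB=5          # USD per TB scanned (Athena's published rate)
DAYS=30

MONTHLY_USD=$(( TB_SCANNED_PER_QUERY * QUERIES_PER_DAY * PRICE_PER_TB * DAYS ))
echo "Estimated Athena cost: \$${MONTHLY_USD}/month"
# Compare against your EMR cluster's monthly instance cost; spot instances
# and auto scaling can pull the EMR number well below on-demand pricing.
```

With these assumed numbers the estimate comes to $15,000/month, which illustrates why partitioning and Parquet (which shrink the TB-scanned term) matter so much for Athena's bill.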
 
Here is the link (you are probably familiar with it) on how to integrate Presto with Zeppelin.
The crazy stuff is that with Presto you can query Kafka and Elasticsearch. So theoretically you can build a more holistic solution without Athena, with less cloud lock-in.
Sounds great, assuming you need to query Kafka and Elasticsearch!
It will take a bit longer to implement, as there will probably be some learning curve, but other than that it sounds great. I would not worry about vendor lock-in 🙂

Technical comments from the audience sent via email:

Hi Omid, in your #2 presentation, in the slides about Hive, you use 'org.openx.data.jsonserde.JsonSerDe'.
This is a very old SerDe, and indeed you mentioned that in the slide. But the problem is that if you create a table in Hive with this SerDe, you cannot work with that table in Presto. When you run any query in Presto on this table, you get the exception:
deserializer does not exist: org.openx.data.jsonserde.JsonSerDe at com.facebook.presto.jdbc.PrestoResultSet.resultsException(PrestoResultSet.java
The correct SerDe to use is 'org.apache.hive.hcatalog.data.JsonSerDe'. It is on Presto's classpath, so Presto works properly when you switch to it.
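The switch above can be sketched as the following DDL; the table name, columns, and S3 location are placeholders:

```shell
cat > json_table.hql <<'EOF'
-- Use the Hive-bundled SerDe, which Presto also ships on its classpath,
-- instead of org.openx.data.jsonserde.JsonSerDe.
CREATE EXTERNAL TABLE sampledb.events_json (
  event_id STRING,
  ts       STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-bucket/events-json/';
EOF

# Run on the master node, or paste the DDL into Hue:
# hive -f json_table.hql
echo "DDL written to json_table.hql"
```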

————————————————-

Special thanks to vlady for the feedback 🙂

 

Need to learn more about AWS Big Data (Demystified)?
