When should we use EMR and when should we use Redshift? EMR vs. Redshift

Use Redshift when:

  1. You need a traditional data warehouse
  2. You need the data relatively hot for analytics such as BI
  3. There is no data engineering team
  4. You require joins
  5. You need a cluster running 24x7
  6. Your data types are simple
  7. There is no nested JSON
  8. You need a petabyte-scale database
  9. You want to analyze massive amounts of data (Redshift Spectrum)
  10. You need UPDATE/DELETE
  11. You require an ACID DBMS
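
Items 6 and 7 above treat flat records with simple types as the best fit for Redshift. As a rough illustration (not from the original list), here is a minimal Python sketch of flattening a nested JSON record into simple columns before loading it into a flat table. The function name `flatten_record`, the underscore naming scheme, and the sample event are all hypothetical.

```python
import json
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into a single-level dict with joined key names.

    Lists are serialized to JSON strings, since a flat row has no array column.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into structs, prefixing child keys with the parent name.
            flat.update(flatten_record(value, column, sep))
        elif isinstance(value, list):
            # Keep arrays as an opaque JSON string in one column.
            flat[column] = json.dumps(value)
        else:
            flat[column] = value
    return flat


# Hypothetical sample event with a struct ("user") and an array ("tags").
event = {"id": 1, "user": {"name": "dana", "geo": {"country": "IL"}}, "tags": ["a", "b"]}
print(flatten_record(event))
```

The more structs and arrays a record has, the more of this preprocessing is needed, which is one way to read the advice to keep nested data on the EMR side.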

Use EMR (SparkSQL, Presto, Hive) when:

  1. You need a transient cluster
  2. Elasticity is important (auto scaling on task nodes)
  3. Cost is important (spot instances)
  4. The data volume is up to a few hundred TB
  5. You want to separate compute and storage (external tables + task nodes + auto scaling)
  6. You require more flexibility:
    1. Complex partitions + dynamic partitioning + INSERT OVERWRITE
    2. Complex data types:
      1. Structs
      2. Arrays <-> nested JSON
    3. Built-in orchestration
    4. Built-in notebooks: mix code with SQL
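
The two checklists above can be folded into a rough decision helper. This is only a sketch under stated assumptions: the `Workload` fields, the one-point-per-criterion scoring, and the `suggest_engine` name are illustrative inventions, not an AWS tool, and only a handful of the criteria are included.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Each flag mirrors one criterion from the lists above (field names are hypothetical).
    needs_always_on_cluster: bool = False   # Redshift item 5
    needs_update_delete: bool = False       # Redshift item 10
    needs_acid: bool = False                # Redshift item 11
    has_data_engineering_team: bool = True  # Redshift item 3 (inverted)
    needs_transient_cluster: bool = False   # EMR item 1
    wants_spot_pricing: bool = False        # EMR item 3
    has_nested_json: bool = False           # EMR item 6.2 / Redshift item 7


def suggest_engine(w: Workload) -> str:
    """Count matching criteria per side; one point each, ties are left undecided."""
    redshift_score = sum([
        w.needs_always_on_cluster,
        w.needs_update_delete,
        w.needs_acid,
        not w.has_data_engineering_team,
    ])
    emr_score = sum([
        w.needs_transient_cluster,
        w.wants_spot_pricing,
        w.has_nested_json,
    ])
    if redshift_score > emr_score:
        return "Redshift"
    if emr_score > redshift_score:
        return "EMR"
    return "undecided"


print(suggest_engine(Workload(needs_transient_cluster=True, has_nested_json=True)))
print(suggest_engine(Workload(needs_acid=True, needs_update_delete=True)))
```

Equal weighting is a simplification; in practice a single hard requirement, such as ACID guarantees, may outweigh several soft preferences.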

 

See the following Redshift-specific FAQ entries:

Q: When would I use Amazon Redshift vs. Amazon EMR?
Q: Can Redshift Spectrum replace Amazon EMR?
Q: Can I use Redshift Spectrum to query data that I process using Amazon EMR?

Reference: Redshift FAQ
https://aws.amazon.com/redshift/faqs/

See the following EMR-specific FAQ entries:

Q: What can I do with Amazon EMR?
Q: Who can use Amazon EMR?
Q: What can I do with Amazon EMR that I could not do before?
Q: What is the data processing engine behind Amazon EMR?
Q: What is Apache Spark?
Q: What is Presto?

Reference: EMR FAQ
https://aws.amazon.com/emr/faqs/

** Point 2. Here are other resources that can help in understanding Redshift and EMR use cases better.

Reference: Amazon Redshift case studies (look for the case study section):
https://aws.amazon.com/redshift/getting-started/
https://pages.awscloud.com/redshift-proof-of-concept-request.html

Reference: Amazon EMR case studies (look for the case study section):
https://aws.amazon.com/emr/
https://pages.awscloud.com/GLOBAL_OT_emr-poc_20170530.html

** Point 3. Here are some AWS blog posts that show how EMR and Redshift can be used together in specific use cases.

— How I built a data warehouse using Amazon Redshift and AWS services in record time
https://aws.amazon.com/blogs/big-data/how-i-built-a-data-warehouse-using-amazon-redshift-and-aws-services-in-record-time/

— Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP
https://aws.amazon.com/blogs/big-data/build-a-healthcare-data-warehouse-using-amazon-emr-amazon-redshift-aws-lambda-and-omop/

— Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/

Hope this information helps in understanding EMR and Redshift use cases better.

