Bootstrapping EMR from 09:00 to 17:00 on each workday with AWS CloudFormation and AWS Data Pipeline

This article covers some options for bootstrapping a daily AWS EMR cluster that runs continuously from 09:00 to 17:00, i.e. office hours.

I will be showing two options:

  1. using AWS Data Pipeline and the AWS CLI
  2. using AWS CloudFormation and EMR

The Cluster will consist of:

  1. One master node, on demand.
  2. One data node, on demand, no auto scaling.
  3. A task group with auto scaling, using Spot instances.
  4. Applications: Spark, Hive, Presto, Ganglia and more.
  5. One step action for my custom steps. This includes all my custom configuration, such as Glue connectors, maximizeResourceAllocation, etc.
  6. One DNS record, xxx.myDomain.com, that forwards to the master node's public DNS. This is useful if you have actual employees using this cluster from 09:00 to 17:00: you get the look and feel of an "always on" cluster by letting them query xxx.myDomain.com instead of the AWS EMR master DNS.

 

Important note: use https://jsonformatter.curiousconcept.com/ to reformat the JSONs below easily.

Option 1: using AWS Data Pipeline to bootstrap AWS EMR

  1. Use Data Pipeline to launch an EMR cluster with a task group, auto scaling, Glue connectors, and the maximize-resources configuration for Spark. You will need a command that looks like:

aws emr create-cluster --auto-scaling-role EMR_AutoScaling_DefaultRole --applications Name=Ganglia Name=Spark Name=Hive Name=Tez Name=Zeppelin Name=Oozie Name=Hue Name=Presto Name=Livy --ec2-attributes '{"KeyName":"walla_omid","AdditionalSlaveSecurityGroups":["sg-a22c"],"InstanceProfile":"sampleOmid53-EMRClusterinstanceProfile-U080RX3ACCZT","SubnetId":"subnet-222","EmrManagedSlaveSecurityGroup":"sg-222","EmrManagedMasterSecurityGroup":"sg-22","AdditionalMasterSecurityGroups":["sg-22"]}' --service-role sampleOmid53-EMRClusterServiceRole-KWO13FMZNHF2 --release-label emr-5.13.0 --log-uri 's3n://aws-logs-12344-eu-west-1/elasticmapreduce/' --steps '[{"Args":["s3://emr-bootstrap/MyBbootstrap-emr.sh"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"Custom JAR"}]' --name 'myEmrCluster' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"r4.xlarge","Name":"Core"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"r4.xlarge","Name":"Master"},{"InstanceCount":0,"BidPrice":"15","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":50,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"TASK","InstanceType":"r4.xlarge","Name":"TaskSpotsNinja"}]' --configurations '[{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"},"Configurations":[]},{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"},"Configurations":[]},{"Classification":"presto-connector-hive","Properties":{"hive.metastore.glue.datacatalog.enabled":"true"},"Configurations":[]},{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"},"Configurations":[]}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region eu-west-1

2. If your cluster already exists, use the AWS EMR CLI to list clusters:

aws emr list-clusters --active --output text | grep CLUSTERS
3. Given the cluster ID from the previous step, use the AWS EMR CLI to describe the cluster and confirm its tags are the tags of the resources you need.

aws emr describe-cluster --cluster-id j-1124HDDG47D1 --output text | grep TAGS

4. From here, once you have all the IDs you need, you can proceed on your own and create a script.
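The glue between those steps can be sketched in a few lines. Below is a minimal sketch of extracting cluster IDs from the `--output text` form shown above; the sample string is illustrative, not real output:

```python
import re

def extract_cluster_ids(list_clusters_text):
    """Pull EMR cluster IDs (j-XXXXXXXXXX) out of `aws emr list-clusters --output text`."""
    return re.findall(r"\bj-[0-9A-Z]+\b", list_clusters_text)

# Illustrative sample of the text output piped through `grep CLUSTERS`:
sample = "CLUSTERS\tj-1124HDDG47D1\tmyEmrCluster\nCLUSTERS\tj-2ABCDEF12345\totherCluster"
print(extract_cluster_ids(sample))
```

From there, each extracted ID can be fed into `aws emr describe-cluster` (or boto3's `describe_cluster`) in your workflow script.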

Option 1 summary:

  1. A fairly quick and simple way to manage clusters.
  2. You need to manage the cluster and instance IDs on your own as input/output for each step of your workflow (create EMR, task resources, assigning an LB, assigning DNS, etc.).

Option 2: Using CloudFormation to bootstrap EMR

Another option is to use CloudFormation: work hard to create the configuration JSON that tells the CF stack what to do, and the stack will take care of starting and stopping the correct resources.

Once you have the JSON, you can schedule a Lambda (via a CloudWatch trigger) to start the stack at the required time, such as 09:00. Example of a Lambda function that starts a CF stack:

import boto3

def lambda_handler(event, context):
    client = boto3.client('cloudformation')
    # TemplateURL (the S3 HTTPS URL of your template) is required, and
    # CAPABILITY_IAM is needed because the template creates IAM roles.
    # The bucket/key below are placeholders.
    response = client.create_stack(
        StackName='DevEMR',
        TemplateURL='https://s3-eu-west-1.amazonaws.com/my-bucket/my-emr-template.json',
        Capabilities=['CAPABILITY_IAM']
    )
    return response

And an example of a Lambda that terminates the stack:

import boto3

def lambda_handler(event, context):
    client = boto3.client('cloudformation')
    response = client.delete_stack(StackName='DevEMR')
    return response
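To drive the two Lambdas at 09:00 and 17:00 on weekdays, you can attach CloudWatch Events schedule rules. A minimal sketch of the start rule as a CloudFormation fragment follows; the rule name and Lambda ARN are placeholders, these rules must live in a separate always-on stack (not in the EMR stack that gets deleted each evening), and note that cron expressions are evaluated in UTC:

```json
"StartEmrScheduleRule": {
  "Type": "AWS::Events::Rule",
  "Properties": {
    "Description": "Start the DevEMR stack at 09:00 UTC, Monday to Friday",
    "ScheduleExpression": "cron(0 9 ? * MON-FRI *)",
    "State": "ENABLED",
    "Targets": [
      {
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:StartDevEmrStack",
        "Id": "StartDevEmrLambda"
      }
    ]
  }
}
```

You would also need an AWS::Lambda::Permission resource so the rule is allowed to invoke the function, plus a mirror rule with `cron(0 17 ? * MON-FRI *)` targeting the delete Lambda.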

Here is an example of a stack template. It gives you a good sense of things; it is a bit too specific, but a good place to get started:

https://github.com/awslabs/aws-cloudformation-templates/blob/master/aws/services/EMR/EMRCLusterGangliaWithSparkOrS3backedHbase.json

 

Since it is very easy to make mistakes in CloudFormation, I have attached several example clusters; each example adds something new to the cluster. This way, you can take the basic example below, start adding to it, and compare against what I did.

The first working example is an EMR cluster with many apps selected, but with no task instance group and no auto scaling.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "myEmrCluster",
  "Parameters": {
    "EMRClusterName": { "Description": "Name of the cluster", "Type": "String", "Default": "myEmrCluster" },
    "KeyName": { "Description": "Must be an existing Keyname", "Type": "String", "Default": "walla_omid" },
    "MasterInstacneType": { "Description": "Instance type to be used for the master instance.", "Type": "String", "Default": "r4.xlarge" },
    "CoreInstanceType": { "Description": "Instance type to be used for core instances.", "Type": "String", "Default": "r4.xlarge" },
    "NumberOfCoreInstances": { "Description": "Must be a valid number", "Type": "Number", "Default": 1 },
    "SubnetID": { "Description": "Must be Valid public subnet ID", "Default": "subnet-012344e", "Type": "String" },
    "LogUri": { "Description": "Must be a valid S3 URL", "Default": "s3://aws-logs-12313231eu-west-1/elasticmapreduce/", "Type": "String" },
    "S3DataUri": { "Description": "Must be a valid S3 bucket URL", "Default": "s3://aws-logs-12131212-eu-west-1/elasticmapreduce/", "Type": "String" },
    "ReleaseLabel": { "Description": "Must be a valid EMR release version", "Default": "emr-5.13.0", "Type": "String" },
    "Applications": { "Description": "Cluster setup:", "Type": "String", "AllowedValues": [ "Spark", "TBD" ] }
  },
  "Mappings": {},
  "Conditions": {
    "Spark": { "Fn::Equals": [ { "Ref": "Applications" }, "Spark" ] },
    "Hbase": { "Fn::Equals": [ { "Ref": "Applications" }, "Hbase" ] }
  },
  "Resources": {
    "EMRCluster": {
      "DependsOn": [ "EMRClusterServiceRole", "EMRClusterinstanceProfileRole", "EMRClusterinstanceProfile" ],
      "Type": "AWS::EMR::Cluster",
      "Properties": {
        "Applications": [
          { "Name": "Ganglia" }, { "Name": "Spark" }, { "Name": "Hive" },
          { "Name": "Tez" }, { "Name": "Zeppelin" }, { "Name": "Oozie" },
          { "Name": "Hue" }, { "Name": "Presto" }, { "Name": "Livy" }
        ],
        "Configurations": [
          { "Classification": "hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } },
          { "Classification": "spark", "ConfigurationProperties": { "maximizeResourceAllocation": "true" } },
          { "Classification": "presto-connector-hive", "ConfigurationProperties": { "hive.metastore.glue.datacatalog.enabled": "true" } },
          { "Classification": "spark-hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } }
        ],
        "Instances": {
          "Ec2KeyName": { "Ref": "KeyName" },
          "Ec2SubnetId": { "Ref": "SubnetID" },
          "AdditionalMasterSecurityGroups": [ "sg-ad1234" ],
          "AdditionalSlaveSecurityGroups": [ "sg-aa41234" ],
          "MasterInstanceGroup": { "InstanceCount": 1, "InstanceType": { "Ref": "MasterInstacneType" }, "Market": "ON_DEMAND", "Name": "Master" },
          "CoreInstanceGroup": { "InstanceCount": { "Ref": "NumberOfCoreInstances" }, "InstanceType": { "Ref": "CoreInstanceType" }, "Market": "ON_DEMAND", "Name": "Core" },
          "TerminationProtected": false
        },
        "VisibleToAllUsers": true,
        "JobFlowRole": { "Ref": "EMRClusterinstanceProfile" },
        "ReleaseLabel": { "Ref": "ReleaseLabel" },
        "LogUri": { "Ref": "LogUri" },
        "Name": { "Ref": "EMRClusterName" },
        "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
        "ServiceRole": { "Ref": "EMRClusterServiceRole" }
      }
    },
    "EMRClusterServiceRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "elasticmapreduce.amazonaws.com" ] }, "Action": [ "sts:AssumeRole" ] } ]
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole" ],
        "Path": "/"
      }
    },
    "EMRClusterinstanceProfileRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "ec2.amazonaws.com" ] }, "Action": [ "sts:AssumeRole" ] } ]
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role" ],
        "Path": "/"
      }
    },
    "EMRClusterinstanceProfile": {
      "Type": "AWS::IAM::InstanceProfile",
      "Properties": { "Path": "/", "Roles": [ { "Ref": "EMRClusterinstanceProfileRole" } ] }
    }
  },
  "Outputs": {}
}
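Curly "smart" quotes introduced by copy/paste from a blog or word processor are the most common reason a template like this fails to load. A quick sanity check before creating the stack can save a round trip (a minimal sketch; feed it the template text however you load it):

```python
import json

def check_template(text):
    """Return the parsed template, or exit with the offending position."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # Typical culprit: curly quotes instead of straight double quotes
        raise SystemExit(
            f"Template is not valid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
        )

good = '{"AWSTemplateFormatVersion": "2010-09-09"}'
print(check_template(good))
```

The AWS CLI's `aws cloudformation validate-template` does a deeper check, but a plain JSON parse already catches every quote-encoding problem.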

 

Quick note: CloudFormer and EMR (not recommended)

Another way to get the JSON for CloudFormation is to use CloudFormer, which takes an existing setup and reverse engineers a JSON template for it. The documentation is linked below. But before you jump right in: EMR is not supported in CloudFormer :(. You can use it for:

1) VPCs
2) VPC Network (VPC Subnets, Internet Gateways, Customer Gateways, DHCP options)
3) VPC Security (Network ACLs, Route Tables)
4) Network (ELB, Elastic IPs, Network Interfaces)
5) Compute (Auto Scaling Groups, EC2 Instances)
6) Storage (EBS Volumes, RDS Instances, DynamoDB Tables, S3 Buckets)
7) Services (SQS, SNS Topics, SimpleDB Domains)
8) Config (Auto Scaling Launch Configurations, RDS Subnet Groups, RDS Parameter Groups)
9) Security (EC2 Security Groups, RDS Security Groups, SQS Queue Policies, SNS Topic Policies, S3 Bucket Policies)
10) Optional Resources (AutoScaling Policies, CloudWatch Alarms)

However, EMR is not yet supported in CloudFormer, and I have created a feature request with the internal team to see if they can implement it. This service has been in beta since 2015, though, so it might be a while before EMR support comes out.

Documentation for CloudFormer:

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-using-cloudformer.html

Or use this YouTube video blog 🙂

 

Advanced EMR cluster bootstrapping using CloudFormation: example JSON

As there are no good comprehensive examples for AWS EMR bootstrapping with all the different options, and it takes a lot of time to debug each attempt, I am contributing this JSON we use internally to AWS Support so they can publish it in their online resources.

This AWS EMR cluster will contain:

  1. 1 master node (on demand)
  2. 1 data node (on demand)
  3. 1 task node (Spot)
  4. Auto scaling (scale in/out)
  5. Apps: Spark, Hive, Presto and more
  6. Config: maximizeResourceAllocation, Glue for Spark/Hive/Presto

 

Again, use https://jsonformatter.curiousconcept.com/ to reformat the JSON below easily.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Conditions": {
    "Hbase": { "Fn::Equals": [ { "Ref": "Applications" }, "Hbase" ] },
    "Spark": { "Fn::Equals": [ { "Ref": "Applications" }, "Spark" ] }
  },
  "Description": "myEmrCluster",
  "Mappings": {},
  "Outputs": {},
  "Parameters": {
    "Applications": { "AllowedValues": [ "Spark", "TBD" ], "Description": "Cluster setup:", "Type": "String" },
    "CoreInstanceType": { "Default": "r4.xlarge", "Description": "Instance type to be used for core instances.", "Type": "String" },
    "EMRClusterName": { "Default": "myEmrCluster", "Description": "Name of the cluster", "Type": "String" },
    "KeyName": { "Default": "walla_omid", "Description": "Must be an existing Keyname", "Type": "String" },
    "LogUri": { "Default": "s3://aws-logs-111111111-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 URL", "Type": "String" },
    "MasterInstacneType": { "Default": "r4.xlarge", "Description": "Instance type to be used for the master instance.", "Type": "String" },
    "NumberOfCoreInstances": { "Default": 1, "Description": "Must be a valid number", "Type": "Number" },
    "ReleaseLabel": { "Default": "emr-5.13.0", "Description": "Must be a valid EMR release version", "Type": "String" },
    "S3DataUri": { "Default": "s3://aws-logs-1111111-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 bucket URL", "Type": "String" },
    "SubnetID": { "Default": "subnet-123456e", "Description": "Must be Valid public subnet ID", "Type": "String" }
  },
  "Resources": {
    "EMRCluster": {
      "DependsOn": [ "EMRClusterServiceRole", "EMRClusterinstanceProfileRole", "EMRClusterinstanceProfile" ],
      "Properties": {
        "Applications": [
          { "Name": "Ganglia" }, { "Name": "Spark" }, { "Name": "Hive" },
          { "Name": "Tez" }, { "Name": "Zeppelin" }, { "Name": "Oozie" },
          { "Name": "Hue" }, { "Name": "Presto" }, { "Name": "Livy" }
        ],
        "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
        "Configurations": [
          { "Classification": "hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } },
          { "Classification": "spark", "ConfigurationProperties": { "maximizeResourceAllocation": "true" } },
          { "Classification": "presto-connector-hive", "ConfigurationProperties": { "hive.metastore.glue.datacatalog.enabled": "true" } },
          { "Classification": "spark-hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } }
        ],
        "Instances": {
          "AdditionalMasterSecurityGroups": [ "sg-1234" ],
          "AdditionalSlaveSecurityGroups": [ "sg-1234" ],
          "CoreInstanceGroup": { "InstanceCount": { "Ref": "NumberOfCoreInstances" }, "InstanceType": { "Ref": "CoreInstanceType" }, "Market": "ON_DEMAND", "Name": "Core" },
          "Ec2KeyName": { "Ref": "KeyName" },
          "Ec2SubnetId": { "Ref": "SubnetID" },
          "MasterInstanceGroup": { "InstanceCount": 1, "InstanceType": { "Ref": "MasterInstacneType" }, "Market": "ON_DEMAND", "Name": "Master" },
          "TerminationProtected": false
        },
        "JobFlowRole": { "Ref": "EMRClusterinstanceProfile" },
        "LogUri": { "Ref": "LogUri" },
        "Name": { "Ref": "EMRClusterName" },
        "ReleaseLabel": { "Ref": "ReleaseLabel" },
        "ServiceRole": { "Ref": "EMRClusterServiceRole" },
        "VisibleToAllUsers": true
      },
      "Type": "AWS::EMR::Cluster"
    },
    "EMRClusterInstanceGroupConfig": {
      "Properties": {
        "Market": "SPOT",
        "AutoScalingPolicy": {
          "Constraints": { "MaxCapacity": 4, "MinCapacity": 0 },
          "Rules": [
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "yarn-scale-out2",
              "Name": "yarn-scale-out2",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "LESS_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 20 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": -1 } },
              "Description": "yarn-scale-in1",
              "Name": "yarn-scale-in1",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 80 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "con-scale-out",
              "Name": "con-scale-out",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 12, "MetricName": "ContainerPendingRatio", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 0.75 } }
            }
          ]
        },
        "BidPrice": "15",
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [ { "VolumeSpecification": { "SizeInGB": "50", "VolumeType": "gp2" }, "VolumesPerInstance": "1" } ],
          "EbsOptimized": "true"
        },
        "InstanceCount": 1,
        "InstanceRole": "TASK",
        "InstanceType": "r4.xlarge",
        "JobFlowId": { "Ref": "EMRCluster" },
        "Name": "TaskSpotsNinja"
      },
      "Type": "AWS::EMR::InstanceGroupConfig"
    },
    "EMRClusterServiceRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "elasticmapreduce.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    },
    "EMRClusterinstanceProfile": {
      "Properties": { "Path": "/", "Roles": [ { "Ref": "EMRClusterinstanceProfileRole" } ] },
      "Type": "AWS::IAM::InstanceProfile"
    },
    "EMRClusterinstanceProfileRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "ec2.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    }
  }
}

 

Once the cluster is up, you need to run steps to automate your cluster needs.

Documentation for creating a step that runs a bash script:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html

You can also do it via:

  1. the AWS CLI
  2. the console UI:
    1. go to Steps and add a new step
    2. JAR location: s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar
    3. arguments: s3://emr-bootstrap/my-bootstrap-emr.sh
    4. click "Add", and wait 🙂
  3. CloudFormation
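For the CLI/boto3 route, the step definition mirrors the console fields above. A minimal sketch (the bucket and script names are the article's examples, not guaranteed paths):

```python
# Build the same script-runner step the console flow above creates.
step = {
    "Name": "CustomBootstrap",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar",
        "Args": ["s3://emr-bootstrap/my-bootstrap-emr.sh"],
    },
}

# With boto3 this would be submitted to a running cluster as:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-1124HDDG47D1", Steps=[step])
print(step["Name"])
```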

 

Adding the step to the same cluster from above changes the JSON as follows:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Conditions": {
    "Hbase": { "Fn::Equals": [ { "Ref": "Applications" }, "Hbase" ] },
    "Spark": { "Fn::Equals": [ { "Ref": "Applications" }, "Spark" ] }
  },
  "Description": "myEmrCluster",
  "Mappings": {},
  "Outputs": {},
  "Parameters": {
    "Applications": { "AllowedValues": [ "Spark", "TBD" ], "Description": "Cluster setup:", "Type": "String" },
    "CoreInstanceType": { "Default": "r4.xlarge", "Description": "Instance type to be used for core instances.", "Type": "String" },
    "EMRClusterName": { "Default": "myEmrCluster", "Description": "Name of the cluster", "Type": "String" },
    "KeyName": { "Default": "walla_omid", "Description": "Must be an existing Keyname", "Type": "String" },
    "LogUri": { "Default": "s3://aws-logs-1234-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 URL", "Type": "String" },
    "MasterInstacneType": { "Default": "r4.xlarge", "Description": "Instance type to be used for the master instance.", "Type": "String" },
    "NumberOfCoreInstances": { "Default": 1, "Description": "Must be a valid number", "Type": "Number" },
    "ReleaseLabel": { "Default": "emr-5.13.0", "Description": "Must be a valid EMR release version", "Type": "String" },
    "S3DataUri": { "Default": "s3://aws-logs-23rt-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 bucket URL", "Type": "String" },
    "SubnetID": { "Default": "subnet-12345", "Description": "Must be Valid public subnet ID", "Type": "String" }
  },
  "Resources": {
    "EMRCluster": {
      "DependsOn": [ "EMRClusterServiceRole", "EMRClusterinstanceProfileRole", "EMRClusterinstanceProfile" ],
      "Properties": {
        "Applications": [
          { "Name": "Ganglia" }, { "Name": "Spark" }, { "Name": "Hive" },
          { "Name": "Tez" }, { "Name": "Zeppelin" }, { "Name": "Oozie" },
          { "Name": "Hue" }, { "Name": "Presto" }, { "Name": "Livy" }
        ],
        "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
        "Configurations": [
          { "Classification": "hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } },
          { "Classification": "spark", "ConfigurationProperties": { "maximizeResourceAllocation": "true" } },
          { "Classification": "presto-connector-hive", "ConfigurationProperties": { "hive.metastore.glue.datacatalog.enabled": "true" } },
          { "Classification": "spark-hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } }
        ],
        "Instances": {
          "AdditionalMasterSecurityGroups": [ "sg-1234" ],
          "AdditionalSlaveSecurityGroups": [ "sg-12345" ],
          "CoreInstanceGroup": { "InstanceCount": { "Ref": "NumberOfCoreInstances" }, "InstanceType": { "Ref": "CoreInstanceType" }, "Market": "ON_DEMAND", "Name": "Core" },
          "Ec2KeyName": { "Ref": "KeyName" },
          "Ec2SubnetId": { "Ref": "SubnetID" },
          "MasterInstanceGroup": { "InstanceCount": 1, "InstanceType": { "Ref": "MasterInstacneType" }, "Market": "ON_DEMAND", "Name": "Master" },
          "TerminationProtected": false
        },
        "JobFlowRole": { "Ref": "EMRClusterinstanceProfile" },
        "LogUri": { "Ref": "LogUri" },
        "Name": { "Ref": "EMRClusterName" },
        "ReleaseLabel": { "Ref": "ReleaseLabel" },
        "ServiceRole": { "Ref": "EMRClusterServiceRole" },
        "VisibleToAllUsers": true
      },
      "Type": "AWS::EMR::Cluster"
    },
    "EMRClusterInstanceGroupConfig": {
      "DependsOn": "EMRCluster",
      "Properties": {
        "Market": "SPOT",
        "AutoScalingPolicy": {
          "Constraints": { "MaxCapacity": 4, "MinCapacity": 0 },
          "Rules": [
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "yarn-scale-out2",
              "Name": "yarn-scale-out2",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "LESS_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 20 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": -1 } },
              "Description": "yarn-scale-in1",
              "Name": "yarn-scale-in1",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 80 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "con-scale-out",
              "Name": "con-scale-out",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 12, "MetricName": "ContainerPendingRatio", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 0.75 } }
            }
          ]
        },
        "BidPrice": "15",
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [ { "VolumeSpecification": { "SizeInGB": "50", "VolumeType": "gp2" }, "VolumesPerInstance": "1" } ],
          "EbsOptimized": "true"
        },
        "InstanceCount": 1,
        "InstanceRole": "TASK",
        "InstanceType": "r4.xlarge",
        "JobFlowId": { "Ref": "EMRCluster" },
        "Name": "TaskSpotsNinja"
      },
      "Type": "AWS::EMR::InstanceGroupConfig"
    },
    "EMRClusterServiceRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "elasticmapreduce.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    },
    "EMRClusterinstanceProfile": {
      "Properties": { "Path": "/", "Roles": [ { "Ref": "EMRClusterinstanceProfileRole" } ] },
      "Type": "AWS::IAM::InstanceProfile"
    },
    "EMRClusterinstanceProfileRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "ec2.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    },
    "TestStep": {
      "Type": "AWS::EMR::Step",
      "Properties": {
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
          "Args": [ "s3://emr-bootstrap/Mybootstrap-emr.sh" ],
          "Jar": "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar"
        },
        "Name": "CustomBootstrap",
        "JobFlowId": { "Ref": "EMRCluster" }
      }
    }
  }
}

 

Another upgrade to the above cluster is adding a DNS CNAME for the master node. This is useful when you have a team of analysts connecting to this 09:00-to-17:00 cluster every day and you don't want them to change their JDBC settings every day 🙂 so just create a DNS CNAME record for the EMR master node.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Conditions": {
    "Hbase": { "Fn::Equals": [ { "Ref": "Applications" }, "Hbase" ] },
    "Spark": { "Fn::Equals": [ { "Ref": "Applications" }, "Spark" ] }
  },
  "Description": "myEmrCluster",
  "Mappings": {},
  "Outputs": {},
  "Parameters": {
    "Applications": { "AllowedValues": [ "Spark", "TBD" ], "Description": "Cluster setup:", "Type": "String" },
    "CoreInstanceType": { "Default": "r4.xlarge", "Description": "Instance type to be used for core instances.", "Type": "String" },
    "EMRClusterName": { "Default": "myEmrCluster", "Description": "Name of the cluster", "Type": "String" },
    "KeyName": { "Default": "walla_omid", "Description": "Must be an existing Keyname", "Type": "String" },
    "LogUri": { "Default": "s3://aws-logs-506754145427-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 URL", "Type": "String" },
    "MasterInstacneType": { "Default": "r4.xlarge", "Description": "Instance type to be used for the master instance.", "Type": "String" },
    "NumberOfCoreInstances": { "Default": 1, "Description": "Must be a valid number", "Type": "Number" },
    "ReleaseLabel": { "Default": "emr-5.13.0", "Description": "Must be a valid EMR release version", "Type": "String" },
    "S3DataUri": { "Default": "s3://aws-logs-506754145427-eu-west-1/elasticmapreduce/", "Description": "Must be a valid S3 bucket URL", "Type": "String" },
    "SubnetID": { "Default": "subnet-0647325e", "Description": "Must be Valid public subnet ID", "Type": "String" }
  },
  "Resources": {
    "EMRCluster": {
      "DependsOn": [ "EMRClusterServiceRole", "EMRClusterinstanceProfileRole", "EMRClusterinstanceProfile" ],
      "Properties": {
        "Applications": [
          { "Name": "Ganglia" }, { "Name": "Spark" }, { "Name": "Hive" },
          { "Name": "Tez" }, { "Name": "Zeppelin" }, { "Name": "Oozie" },
          { "Name": "Hue" }, { "Name": "Presto" }, { "Name": "Livy" }
        ],
        "AutoScalingRole": "EMR_AutoScaling_DefaultRole",
        "Configurations": [
          { "Classification": "hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } },
          { "Classification": "spark", "ConfigurationProperties": { "maximizeResourceAllocation": "true" } },
          { "Classification": "presto-connector-hive", "ConfigurationProperties": { "hive.metastore.glue.datacatalog.enabled": "true" } },
          { "Classification": "spark-hive-site", "ConfigurationProperties": { "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" } }
        ],
        "Instances": {
          "AdditionalMasterSecurityGroups": [ "sg-ad4e13cb" ],
          "AdditionalSlaveSecurityGroups": [ "sg-aa4e13cc" ],
          "CoreInstanceGroup": { "InstanceCount": { "Ref": "NumberOfCoreInstances" }, "InstanceType": { "Ref": "CoreInstanceType" }, "Market": "ON_DEMAND", "Name": "Core" },
          "Ec2KeyName": { "Ref": "KeyName" },
          "Ec2SubnetId": { "Ref": "SubnetID" },
          "MasterInstanceGroup": { "InstanceCount": 1, "InstanceType": { "Ref": "MasterInstacneType" }, "Market": "ON_DEMAND", "Name": "Master" },
          "TerminationProtected": false
        },
        "JobFlowRole": { "Ref": "EMRClusterinstanceProfile" },
        "LogUri": { "Ref": "LogUri" },
        "Name": { "Ref": "EMRClusterName" },
        "ReleaseLabel": { "Ref": "ReleaseLabel" },
        "ServiceRole": { "Ref": "EMRClusterServiceRole" },
        "VisibleToAllUsers": true
      },
      "Type": "AWS::EMR::Cluster"
    },
    "EMRClusterInstanceGroupConfig": {
      "DependsOn": "EMRCluster",
      "Properties": {
        "Market": "SPOT",
        "AutoScalingPolicy": {
          "Constraints": { "MaxCapacity": 4, "MinCapacity": 0 },
          "Rules": [
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "yarn-scale-out2",
              "Name": "yarn-scale-out2",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "LESS_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 20 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": -1 } },
              "Description": "yarn-scale-in1",
              "Name": "yarn-scale-in1",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 1, "MetricName": "YARNMemoryAvailablePercentage", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 80 } }
            },
            {
              "Action": { "SimpleScalingPolicyConfiguration": { "AdjustmentType": "CHANGE_IN_CAPACITY", "CoolDown": 100, "ScalingAdjustment": 2 } },
              "Description": "con-scale-out",
              "Name": "con-scale-out",
              "Trigger": { "CloudWatchAlarmDefinition": { "ComparisonOperator": "GREATER_THAN_OR_EQUAL", "EvaluationPeriods": 12, "MetricName": "ContainerPendingRatio", "Namespace": "AWS/ElasticMapReduce", "Period": 300, "Threshold": 0.75 } }
            }
          ]
        },
        "BidPrice": "15",
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [ { "VolumeSpecification": { "SizeInGB": "50", "VolumeType": "gp2" }, "VolumesPerInstance": "1" } ],
          "EbsOptimized": "true"
        },
        "InstanceCount": 1,
        "InstanceRole": "TASK",
        "InstanceType": "r4.xlarge",
        "JobFlowId": { "Ref": "EMRCluster" },
        "Name": "TaskSpotsNinja"
      },
      "Type": "AWS::EMR::InstanceGroupConfig"
    },
    "EMRClusterServiceRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "elasticmapreduce.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    },
    "EMRClusterinstanceProfile": {
      "Properties": { "Path": "/", "Roles": [ { "Ref": "EMRClusterinstanceProfileRole" } ] },
      "Type": "AWS::IAM::InstanceProfile"
    },
    "EMRClusterinstanceProfileRole": {
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [ { "Action": [ "sts:AssumeRole" ], "Effect": "Allow", "Principal": { "Service": [ "ec2.amazonaws.com" ] } } ],
          "Version": "2012-10-17"
        },
        "ManagedPolicyArns": [ "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role" ],
        "Path": "/"
      },
      "Type": "AWS::IAM::Role"
    },
    "TestStep": {
      "Type": "AWS::EMR::Step",
      "Properties": {
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
          "Args": [ "s3://byoo-emr-bootstrap/bootstrap-emr.sh" ],
          "Jar": "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar"
        },
        "Name": "CustomBootstrap",
        "JobFlowId": { "Ref": "EMRCluster" }
      }
    },
    "myDNSRecord": {
      "Type": "AWS::Route53::RecordSet",
      "Properties": {
        "HostedZoneName": "b-yoo.net.",
        "Comment": "DNS name for my EMR master instance, managed by CloudFormation",
        "Name": "xxx.myDomain.com",
        "Type": "CNAME",
        "TTL": "600",
        "ResourceRecords": [ { "Fn::GetAtt": [ "EMRCluster", "MasterPublicDNS" ] } ]
      }
    }
  }
}

You may have considered putting an ALB on top of EMR, but currently CloudFormation does not return the instance ID of the master node. You only get the master node's public DNS, so you can only create a CNAME for it using Route 53. You could involve some Lambda code to get things moving, but there is an open feature request to resolve this issue, so you may want to hold on. 🙂
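If you do decide to go the Lambda route before that feature request lands, the master node's EC2 instance ID is recoverable from the EMR API itself, even though CloudFormation does not expose it. This is only a sketch: it assumes you already know the cluster ID (for example from a stack output via Ref on EMRCluster), and emr_client is an ordinary boto3 EMR client passed in by the caller.

```python
def master_instance_id(emr_client, cluster_id):
    """Return the EC2 instance ID of the EMR master node.

    CloudFormation only exposes MasterPublicDNS, but the EMR
    list_instances API can filter on the MASTER instance group.
    """
    resp = emr_client.list_instances(
        ClusterId=cluster_id,
        InstanceGroupTypes=['MASTER'],
    )
    instances = resp.get('Instances', [])
    if not instances:
        raise RuntimeError('no master instance found for ' + cluster_id)
    return instances[0]['Ec2InstanceId']

# Usage sketch (cluster ID and target group ARN are placeholders):
# import boto3
# emr = boto3.client('emr')
# instance_id = master_instance_id(emr, 'j-XXXXXXXXXXXX')
# boto3.client('elbv2').register_targets(
#     TargetGroupArn='arn:aws:elasticloadbalancing:...',
#     Targets=[{'Id': instance_id}])
```

With the instance ID in hand you could register the master behind an ALB target group, but until CloudFormation surfaces it natively, the CNAME approach above remains the simpler option.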

Quick note about deleting the stack

You may want to add a DependsOn attribute to help CloudFormation properly delete the resources on rollback or on stack deletion. I am attaching the same JSON with DependsOn added. This JSON also fixes a small mistake from the previous JSONs in the autoscaling rules: I forgot to add the “Unit” attribute, which makes sure the threshold value is interpreted as a percentage.

{
“AWSTemplateFormatVersion”: “2010-09-09”,
“Conditions”: {
“Hbase”: {
“Fn::Equals”: [
{
“Ref”: “Applications”
},
“Hbase”
]
},
“Spark”: {
“Fn::Equals”: [
{
“Ref”: “Applications”
},
“Spark”
]
}
},
“Description”: “myEmrCluster”,
“Mappings”: {},
“Outputs”: {},
“Parameters”: {
“Applications”: {
“AllowedValues”: [
“Spark”,
“TBD”
],
“Description”: “Cluster setup:”,
“Type”: “String”
},
“CoreInstanceType”: {
“Default”: “r4.xlarge”,
“Description”: “Instance type to be used for core instances.”,
“Type”: “String”
},
“EMRClusterName”: {
“Default”: “myEmrCluster”,
“Description”: “Name of the cluster”,
“Type”: “String”
},
“KeyName”: {
“Default”: “aws_big_data_demystified”,
“Description”: “Must be an existing Keyname”,
“Type”: “String”
},
“LogUri”: {
“Default”: “s3://aws-logs-123-eu-west-1/elasticmapreduce/”,
“Description”: “Must be a valid S3 URL”,
“Type”: “String”
},
“MasterInstacneType”: {
“Default”: “r4.xlarge”,
“Description”: “Instance type to be used for the master instance.”,
“Type”: “String”
},
“NumberOfCoreInstances”: {
“Default”: 1,
“Description”: “Must be a valid number”,
“Type”: “Number”
},
“ReleaseLabel”: {
“Default”: “emr-5.13.0”,
“Description”: “Must be a valid EMR release version”,
“Type”: “String”
},
“S3DataUri”: {
“Default”: “s3://aws-logs-1234-eu-west-1/elasticmapreduce/”,
“Description”: “Must be a valid S3 bucket URL “,
“Type”: “String”
},
“SubnetID”: {
“Default”: “subnet-1234e”,
“Description”: “Must be Valid public subnet ID”,
“Type”: “String”
}
},
“Resources”: {
“EMRCluster”: {
“DependsOn”: [
“EMRClusterServiceRole”,
“EMRClusterinstanceProfileRole”,
“EMRClusterinstanceProfile”
],
“Properties”: {
“Applications”: [
{
“Name”: “Ganglia”
},
{
“Name”: “Spark”
},
{
“Name”: “Hive”
},
{
“Name”: “Tez”
},
{
“Name”: “Zeppelin”
},
{
“Name”: “Oozie”
},
{
“Name”: “Hue”
},
{
“Name”: “Presto”
},
{
“Name”: “Livy”
}
],
“AutoScalingRole”: “EMR_AutoScaling_DefaultRole”,
“Configurations”: [
{
“Classification”: “hive-site”,
“ConfigurationProperties”: {
“hive.metastore.client.factory.class”: “com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory”
}
},
{
“Classification”: “spark”,
“ConfigurationProperties”: {
“maximizeResourceAllocation”: “true”
}
},
{
“Classification”: “presto-connector-hive”,
“ConfigurationProperties”: {
“hive.metastore.glue.datacatalog.enabled”: “true”
}
},
{
“Classification”: “spark-hive-site”,
“ConfigurationProperties”: {
“hive.metastore.client.factory.class”: “com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory”
}
}
],
“Instances”: {
“AdditionalMasterSecurityGroups”: [
“sg-1234”
],
“AdditionalSlaveSecurityGroups”: [
“sg-1234”
],
“CoreInstanceGroup”: {
“InstanceCount”: {
“Ref”: “NumberOfCoreInstances”
},
“InstanceType”: {
“Ref”: “CoreInstanceType”
},
“Market”: “ON_DEMAND”,
“Name”: “Core”
},
“Ec2KeyName”: {
“Ref”: “KeyName”
},
“Ec2SubnetId”: {
“Ref”: “SubnetID”
},
“MasterInstanceGroup”: {
“InstanceCount”: 1,
“InstanceType”: {
“Ref”: “MasterInstacneType”
},
“Market”: “ON_DEMAND”,
“Name”: “Master”
},
“TerminationProtected”: false
},
“JobFlowRole”: {
“Ref”: “EMRClusterinstanceProfile”
},
“LogUri”: {
“Ref”: “LogUri”
},
“Name”: {
“Ref”: “EMRClusterName”
},
“ReleaseLabel”: {
“Ref”: “ReleaseLabel”
},
“ServiceRole”: {
“Ref”: “EMRClusterServiceRole”
},
“VisibleToAllUsers”: true
},
“Type”: “AWS::EMR::Cluster”
},
“EMRClusterInstanceGroupConfig”: {
“DependsOn“: “EMRCluster”,
“Properties”: {
“Market”: “SPOT”,
“AutoScalingPolicy”: {
“Constraints”: {
“MaxCapacity”: 40,
“MinCapacity”: 0
},
“Rules”: [
{
“Action”: {
“SimpleScalingPolicyConfiguration”: {
“AdjustmentType”: “CHANGE_IN_CAPACITY”,
“CoolDown”: 100,
“ScalingAdjustment”: 4
}
},
“Description”: “yarn-scale-out2”,
“Name”: “yarn-scale-out2”,
“Trigger”: {
“CloudWatchAlarmDefinition”: {
“ComparisonOperator”: “LESS_THAN_OR_EQUAL”,
“EvaluationPeriods”: 1,
“MetricName”: “YARNMemoryAvailablePercentage”,
“Namespace”: “AWS/ElasticMapReduce”,
“Period”: 300,
“Threshold”: 20,
“Unit”: “PERCENT”
}
}
},
{
“Action”: {
“SimpleScalingPolicyConfiguration”: {
“AdjustmentType”: “CHANGE_IN_CAPACITY”,
“CoolDown”: 100,
“ScalingAdjustment”: -1
}
},
“Description”: “yarn-scale-in1”,
“Name”: “yarn-scale-in1”,
“Trigger”: {
“CloudWatchAlarmDefinition”: {
“ComparisonOperator”: “GREATER_THAN_OR_EQUAL”,
“EvaluationPeriods”: 1,
“MetricName”: “YARNMemoryAvailablePercentage”,
“Namespace”: “AWS/ElasticMapReduce”,
“Period”: 300,
“Threshold”: 80,
“Unit”: “PERCENT”
}
}
},
{
“Action”: {
“SimpleScalingPolicyConfiguration”: {
“AdjustmentType”: “CHANGE_IN_CAPACITY”,
“CoolDown”: 100,
“ScalingAdjustment”: 4
}
},
“Description”: “con-scale-out”,
“Name”: “con-scale-out”,
“Trigger”: {
“CloudWatchAlarmDefinition”: {
“ComparisonOperator”: “GREATER_THAN_OR_EQUAL”,
“EvaluationPeriods”: 1,
“MetricName”: “ContainerPendingRatio”,
“Namespace”: “AWS/ElasticMapReduce”,
“Period”: 300,
“Threshold”: 0.75,
“Unit”: “COUNT”
}
}
}
]
},
“BidPrice”: “15”,
“EbsConfiguration”: {
“EbsBlockDeviceConfigs”: [
{
“VolumeSpecification”: {
“SizeInGB”: “50”,
“VolumeType”: “gp2”
},
“VolumesPerInstance”: “1”
}
],
“EbsOptimized”: “true”
},
“InstanceCount”: 1,
“InstanceRole”: “TASK”,
“InstanceType”: “r4.xlarge”,
“JobFlowId”: {
“Ref”: “EMRCluster”
},
“Name”: “TaskSpotsNinja”
},
“Type”: “AWS::EMR::InstanceGroupConfig”
},
“EMRClusterServiceRole”: {
“Properties”: {
“AssumeRolePolicyDocument”: {
“Statement”: [
{
“Action”: [
“sts:AssumeRole”
],
“Effect”: “Allow”,
“Principal”: {
“Service”: [
“elasticmapreduce.amazonaws.com”
]
}
}
],
“Version”: “2012-10-17”
},
“ManagedPolicyArns”: [
“arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole”
],
“Path”: “/”,
“Policies”: [
{
“PolicyName”: “s3fullaccess”,
“PolicyDocument”: {
“Version”: “2012-10-17”,
“Statement”: [
{
“Effect”: “Allow”,
“Action”: “s3:*”,
“Resource”: “*”
}
]
}
}
]
},
“Type”: “AWS::IAM::Role”
},
“EMRClusterinstanceProfile”: {
“Properties”: {
“Path”: “/”,
“Roles”: [
{
“Ref”: “EMRClusterinstanceProfileRole”
}
]
},
“Type”: “AWS::IAM::InstanceProfile”
},
“EMRClusterinstanceProfileRole”: {
“Properties”: {
“AssumeRolePolicyDocument”: {
“Statement”: [
{
“Action”: [
“sts:AssumeRole”
],
“Effect”: “Allow”,
“Principal”: {
“Service”: [
“ec2.amazonaws.com”
]
}
}
],
“Version”: “2012-10-17”
},
“ManagedPolicyArns”: [
“arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role”
],
“Path”: “/”
},
“Type”: “AWS::IAM::Role”
},
“TestStep”: {
“Type”: “AWS::EMR::Step”,
“DependsOn”: “EMRClusterInstanceGroupConfig”,
“Properties”: {
“ActionOnFailure”: “CONTINUE”,
“HadoopJarStep”: {
“Args”: [
“s3://byoo-emr-bootstrap/bootstrap-emr.sh”
],
“Jar”: “s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar”
},
“Name”: “CustomBootstrap”,
“JobFlowId”: {
“Ref”: “EMRCluster”
}
}
},
“myDNSRecord”: {
“Type”: “AWS::Route53::RecordSet”,
“DependsOn”: [“EMRCluster”],
“Properties”: {
“HostedZoneName”: “b-yoo.net.”,
“Comment”: “DNS name for my instance. for emr. cloud formation”,
“Name”: “xxx.b-yoo.net”,
“Type”: “CNAME”,
“TTL”: “600”,
“ResourceRecords”: [
{
“Fn::GetAtt”: [
“EMRCluster”,
“MasterPublicDNS”
]
}
]
}
}
}
}

Back to the scheduling… via Lambda

Once you have settled on your CloudFormation stack, you will want to trigger it at 09:00 and tear it down at 17:00.

Here is a Lambda boto3 code snippet for the launch. Notice this is a more advanced example than the one at the beginning of the blog: I added the option to select an application via a JSON parameter, and I added the “CAPABILITY_IAM” explicit acknowledgement required by CloudFormation:

import boto3

def lambda_handler(event, context):
    client = boto3.client('cloudformation')
    response = client.create_stack(
        StackName='StgEMR',
        Parameters=[
            {
                'ParameterKey': 'Applications',
                'ParameterValue': 'Spark'
            },
        ],
        Capabilities=[
            'CAPABILITY_IAM',
        ],
        TemplateURL='https://myBucket/emrClusterCloudFormation.json')
    return response

with a CloudWatch Events trigger using a schedule expression:

Schedule expression: cron(0 9 ? * SUN-THU *)

and another Lambda code snippet for the destroy:

import boto3

def lambda_handler(event, context):
    client = boto3.client('cloudformation')
    response = client.delete_stack(StackName='StgEMR')
    return response

with a CloudWatch Events trigger using a schedule expression:

Schedule expression: cron(0 17 ? * SUN-THU *)
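Taken together, the two schedule expressions define the cluster's uptime window: Sunday through Thursday, from 09:00 to 17:00. Note that CloudWatch Events cron expressions run in UTC, so shift the hours for your own time zone. Just as an illustration (this is not part of the deployment), the window can be expressed as a small predicate:

```python
from datetime import datetime

def cluster_should_be_up(now):
    """True when `now` falls inside the window the two cron rules
    define: Sunday-Thursday, 09:00 (inclusive) to 17:00 (exclusive)."""
    # Python weekday(): Monday=0 ... Sunday=6; SUN-THU maps to {6, 0, 1, 2, 3}.
    working_day = now.weekday() in (6, 0, 1, 2, 3)
    office_hours = 9 <= now.hour < 17
    return working_day and office_hours
```

A predicate like this is also handy in a third "reconciler" Lambda that periodically checks whether the stack state matches the expected window, in case one of the scheduled invocations fails.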

 

Important note about this Lambda:

You will need to create a role for the Lambda that includes permissions for:

  1. a CloudFormation policy to create and delete stacks
  2. Route 53, for managing the DNS record
  3. IAM policies, to create and pass the EMR roles
  4. EMR policies, to launch clusters
  5. S3 read-only, to read the JSON template file 🙂

I highly recommend applying the least-privilege principle to minimise the permissions given to the Lambdas, and running the Lambdas inside a VPC.
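As a concrete starting point for such a least-privilege role, here is an illustrative policy document written as a Python dict, the kind you would serialise and pass to IAM's put_role_policy. The action lists and resource ARNs are assumptions; narrow them to your own stack name, hosted zone ID, and bucket:

```python
import json

# Illustrative, not exhaustive: tighten Resource ARNs to your own account.
LAMBDA_SCHEDULER_POLICY = {
    'Version': '2012-10-17',
    'Statement': [
        {   # 1. create/delete the EMR stack
            'Effect': 'Allow',
            'Action': ['cloudformation:CreateStack',
                       'cloudformation:DeleteStack',
                       'cloudformation:DescribeStacks'],
            'Resource': 'arn:aws:cloudformation:*:*:stack/StgEMR/*'},
        {   # 2. manage the CNAME record (zone ID is a placeholder)
            'Effect': 'Allow',
            'Action': ['route53:ChangeResourceRecordSets'],
            'Resource': 'arn:aws:route53:::hostedzone/ZONEID'},
        {   # 3. let CloudFormation create and pass the EMR roles
            'Effect': 'Allow',
            'Action': ['iam:CreateRole', 'iam:DeleteRole',
                       'iam:AttachRolePolicy', 'iam:DetachRolePolicy',
                       'iam:CreateInstanceProfile', 'iam:DeleteInstanceProfile',
                       'iam:AddRoleToInstanceProfile',
                       'iam:RemoveRoleFromInstanceProfile',
                       'iam:PassRole', 'iam:GetRole'],
            'Resource': '*'},
        {   # 4. launch and terminate the cluster
            'Effect': 'Allow',
            'Action': ['elasticmapreduce:RunJobFlow',
                       'elasticmapreduce:TerminateJobFlows',
                       'elasticmapreduce:DescribeCluster',
                       'elasticmapreduce:AddJobFlowSteps',
                       'elasticmapreduce:AddInstanceGroups'],
            'Resource': '*'},
        {   # 5. read the template from S3
            'Effect': 'Allow',
            'Action': ['s3:GetObject'],
            'Resource': 'arn:aws:s3:::myBucket/*'},
    ],
}

# put_role_policy expects the document as a JSON string:
# boto3.client('iam').put_role_policy(
#     RoleName='emr-scheduler-lambda',
#     PolicyName='emr-scheduler',
#     PolicyDocument=json.dumps(LAMBDA_SCHEDULER_POLICY))
```

The IAM statement is the broadest one here; depending on your template you may be able to scope it down to the role and instance-profile name prefixes your stack actually creates.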

 

Option 2 Summary

  • We learned several ways to launch an EMR cluster from 09:00 to 17:00.
  • Just for perspective: it took me 5 full working days to create the EMR JSON, and about 1 hour to wire up the Lambda. 🙂
  • CloudFormation can achieve many things with EMR. Although the JSON creation process was not trivial, and the documentation was a bit lacking on EMR with CloudFormation, I was able to provide a comprehensive working CloudFormation example for you to play with and customise to your needs.
  • Once the CloudFormation JSON is ready, triggering it from Lambda will take you about an hour, including the learning curve.
  • The beauty of using CloudFormation is that the stack takes care of the resources: you do not need to handle instance IDs or public DNS names; you simply work with dynamic parameters.
  • The downside of working with EMR and CloudFormation is that, as of today, there is no easy way to add a load balancer to the JSON, since EMR won't return the instance ID of the master node, only its public DNS.

 

Thanks, and have fun!
