
Load CSV from AWS S3 into AWS RDS Aurora

I understand that you would like to connect to your Aurora instance from your EC2 instance.

To achieve this, please first make sure that the MySQL service is running, using the following command:

sudo service mysqld status

Alternatively, install the MySQL client directly from the MySQL repositories (https://dev.mysql.com/downloads/). Please note that these are MySQL community repositories and are not maintained or supported by AWS.
https://dev.mysql.com/doc/refman/5.7/en/linux-installation-yum-repo.html
https://dev.mysql.com/doc/refman/5.7/en/linux-installation-yum-repo.html

Once the service is running you can use the following connection string:

mysql -h <cluster-endpoint> -P 3306 -u <your-user> -p

For further information, please follow the link about Connecting to an Amazon Aurora MySQL DB Cluster:
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Connecting.html

Shortcut to install the MySQL CLI:

sudo yum install mysql

Connect to RDS after confirming that your security group allows access from your EC2 instance to your RDS instance, for example:

mysql -u MyUser -pMyPassword -h <cluster-endpoint>

Create a role to allow Aurora access to S3:

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.Authorizing.IAM.CreateRole.html

Set your cluster to use the above role (the one with access to S3).
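
If you prefer to script the role creation and the cluster association instead of using the console, here is a minimal boto3 sketch. The role name (aurora-s3-load-role), bucket name, and cluster identifier are placeholders you would replace with your own:

import json
import boto3

iam = boto3.client("iam")
rds = boto3.client("rds")

# Trust policy that lets the RDS service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "rds.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="aurora-s3-load-role",  # placeholder role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Inline policy granting read access to the bucket that holds the CSV files.
s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::bucket", "arn:aws:s3:::bucket/*"],  # placeholder bucket
    }],
}
iam.put_role_policy(
    RoleName="aurora-s3-load-role",
    PolicyName="aurora-s3-read",
    PolicyDocument=json.dumps(s3_read_policy),
)

# Associate the role with the Aurora cluster.
rds.add_role_to_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",  # placeholder cluster identifier
    RoleArn=role["Role"]["Arn"],
)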

Assuming your DB is on a private subnet, set up a VPC endpoint to S3 (it is always good practice to set up an endpoint anyway):

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.Authorizing.Network.html
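
If you want to create that S3 gateway endpoint programmatically as well, a minimal boto3 sketch follows; the region, VPC ID, and route table ID are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# Gateway endpoint so instances in private subnets can reach S3 without a NAT.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",             # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",   # S3 service name in the chosen region
    RouteTableIds=["rtb-0123456789abcdef0"],    # route table(s) used by the DB subnets
)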

In the Aurora cluster parameter group, set aurora_load_from_s3_role to the ARN of the role with S3 access, and reboot the Aurora DB.
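
The same parameter change and reboot can be scripted; a rough boto3 sketch, assuming a custom cluster parameter group and the role ARN created above (the group name, ARN, and instance identifier are placeholders):

import boto3

rds = boto3.client("rds")

# Point aurora_load_from_s3_role at the role that has S3 access.
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-cluster-params",  # placeholder parameter group name
    Parameters=[{
        "ParameterName": "aurora_load_from_s3_role",
        "ParameterValue": "arn:aws:iam::123456789012:role/aurora-s3-load-role",  # placeholder ARN
        "ApplyMethod": "pending-reboot",
    }],
)

# Reboot the writer instance so the new parameter takes effect.
rds.reboot_db_instance(DBInstanceIdentifier="my-aurora-writer")  # placeholder instance identifier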

Grant your database user permission to load from S3:

GRANT LOAD FROM S3 ON *.* TO 'user'@'domain-or-ip-address';

Create a table in MySQL, for example:

use myDatabase;

create table engine_triggers(
today_date date,
event_prob double
);

LOAD DATA FROM S3 examples (ignoring the header or using quoted fields):

LOAD DATA FROM S3 's3://bucket/sample_triggers.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;

LOAD DATA FROM S3 's3://bucket/sample_triggers.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '"';

Ignoring the header and using quoted fields:

LOAD DATA FROM S3 's3://bucket/sample_triggers.csv'
INTO TABLE engine_triggers
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

You can automate it with an hourly crontab run as follows (notice the extra escape character before \n and around the quote in ENCLOSED BY):

0 * * * * mysql -u User -pPassword -hClusterDomainName -e "use myDatabase; truncate myDatabase.engine_triggers; LOAD DATA FROM S3 's3://bucket/file.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\\n' IGNORE 1 LINES"

Detailed manual:

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html

And here is some documentation about MySQL data types:

https://www.w3schools.com/sql/sql_datatypes.asp

Running queries on RDS Aurora from AWS Lambda

You can find the relevant set of steps for accessing your Amazon Aurora instance using Lambda in the following documentation:

Tutorial: Configuring a Lambda Function to Access Amazon RDS in an Amazon VPC – https://docs.aws.amazon.com/lambda/latest/dg/vpc-rds.html

I also carried out a test connecting to my Aurora instance from Lambda. Here are the steps I took to achieve this:

Create an Aurora cluster and connect to the writer instance using the cluster endpoint. Create a sample database and table. (Make sure the security group of the instance allows connections from the correct source IP addresses.)

Now, to create a Lambda function that accesses the Aurora instance:


Creating Role

To start with, we first need to create an execution role that gives your Lambda function permission to access AWS resources.

Please follow these steps to create an execution role:

1. Open the roles page in the IAM console: https://console.aws.amazon.com/iam/home#/role
2. Choose Roles from the left dashboard and select Create role.
3. Under the tab “Choose the service that will use this role” select Lambda and then Next:Permissions
4. Search for “AWSLambdaVPCAccessExecutionRole”. Select this and then Next:Tags
5. Provide a Tag and then a Role Name (ex. lambda-vpc-role) and then Create Role.

The AWSLambdaVPCAccessExecutionRole has the permissions that the function needs to manage network connections to a VPC. 
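
For reference, the same role and policy attachment can also be created from code. The sketch below is only an illustration using boto3, assuming the role name from the steps above (lambda-vpc-role):

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Lambda service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="lambda-vpc-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policy that lets the function manage VPC network interfaces.
iam.attach_role_policy(
    RoleName="lambda-vpc-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole",
)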


Creating Lambda Function

Please follow the below steps to create a Lambda function:

1. Open the Lambda Management Console : https://console.aws.amazon.com/lambda
2. Choose Create a function
3. Choose Author from scratch, and then do the following: 
    * In Name*, specify your Lambda function name.
    * In Runtime*, choose Python 2.7.
    * In Execution Role*, choose “Use an existing role”.
    * In Role name*, enter a name for your role which was previously created “lambda-vpc-role”.
4. Choose create function.
5. Once you have created the Lambda function, navigate to the function page.
6. On the function page, under the Network section, do the following:
    * In VPC, choose default VPC
    * In Subnets*, choose any two subnets
    * In Security Groups*, choose the default security group
7. Click on Save

Setting up Lambda Deployment Environment

Next you will need to set up a deployment environment to deploy the Python code that connects to the RDS database.
To connect to Aurora using Python you will need to import the pymysql module, so we need to install the dependencies with pip and create a deployment package. Please execute the following commands in your local environment.

1. Creating a local directory which will be the deployment package:
$ mkdir rds_lambda;

$ cd rds_lambda/

$ pwd
/Users/user/rds_lambda

2. Install pymysql module 
$ pip install pymysql -t /Users/user/rds_lambda

By executing the above command you will install the pymysql module into your current directory.

3. Next, create a Python file which contains the code to connect to the RDS instance:
$ sudo nano connectdb.py

I have attached the file "connectdb.py", which has the Python code to connect to the RDS instance.
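
Since the attachment is not included in this post, here is a minimal sketch of what connectdb.py could look like. The endpoint, credentials, database name, and the engine_triggers table are assumptions based on the earlier examples; adjust them to your own setup.

# connectdb.py -- minimal connection test for the Aurora cluster (placeholder values)
import sys
import logging

import pymysql

logger = logging.getLogger()
logger.setLevel(logging.INFO)

RDS_HOST = "my-cluster.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com"  # cluster endpoint
DB_USER = "MyUser"
DB_PASSWORD = "myPassword"
DB_NAME = "myDatabase"

# Open the connection outside the handler so it is reused across warm invocations.
try:
    conn = pymysql.connect(host=RDS_HOST, user=DB_USER, passwd=DB_PASSWORD,
                           db=DB_NAME, connect_timeout=5)
except pymysql.MySQLError as e:
    logger.error("Could not connect to the Aurora instance: %s", e)
    sys.exit()

def main(event, context):
    """Lambda handler: run a trivial query to confirm the connection works."""
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM engine_triggers")
        row_count = cur.fetchone()[0]
    return "Connection successful, engine_triggers has %d rows" % row_count

The handler name here (connectdb.main) matches the handler configured in the console steps further down.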

4. Next, we need to zip the current directory and upload it to the Lambda function.
$ zip -r rds_lambda.zip `ls` 

The above command creates a zip file "rds_lambda.zip" which we will need to upload to the Lambda function.
Navigate to the newly created Lambda function's console page:

1. In the Function code section, under Code entry type, select Upload a .zip file from the drop-down.
2. Browse to the zip file in your local directory.
3. Next, in the Function code section, change the Handler to pythonfilename.function (e.g. connectdb.main).
4. Click Save. (A scripted alternative to steps 1-4 is sketched below.)
5. Next, you will need to add the security group of the Lambda function to your RDS security group.
6. After that, test the connection by creating a test event.

If the execution is successful, then the connection has been made.
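
As an aside, steps 1-4 above (uploading the package and setting the handler) can also be scripted; a rough boto3 sketch, where the function name is a placeholder:

import boto3

lambda_client = boto3.client("lambda")

# Upload the deployment package built earlier.
with open("rds_lambda.zip", "rb") as f:
    lambda_client.update_function_code(
        FunctionName="my-aurora-function",  # placeholder function name
        ZipFile=f.read(),
    )

# Point the handler at connectdb.main, matching step 3 above.
lambda_client.update_function_configuration(
    FunctionName="my-aurora-function",
    Handler="connectdb.main",
)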

You may also go through the video link below, which gives a detailed explanation of how to connect to an RDS instance using a Lambda function:
https://www.youtube.com/watch?v=-CoL5oN1RzQ&vl=en

After successfully establishing the connection, you can modify the Python file to query databases inside the Aurora instance.

AWS Big Data Demystified #1.2 | Big Data architecture lessons learned

A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…

Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of the cloud. I use AWS, GCP, and data center infrastructure to answer the basic questions of anyone starting out in the big data world.

How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC, or AVRO? Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? GCS? BigQuery? Dataflow? Datalab? TensorFlow? How to handle streaming? How to manage costs? Performance tips? Security tips? Cloud best practices?

In this meetup we present lecturers working with several cloud vendors, various big data platforms such as Hadoop, data warehouses, and startups working on big data products. Basically, if it is related to big data, this is THE meetup.

Some of our online materials (mixed content from several cloud vendors):

Website:

https://big-data-demystified.ninja (under construction)

Meetups:

Big Data Demystified (Tel Aviv-Yafo, IL)
Next meetup: Big Data Demystified | From Redshift to SnowFlake, Sunday, May 12, 2019, 6:00 PM

AWS Big Data Demystified (Tel Aviv-Yafo, IL)

YouTube channels:

https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber

https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

Audience:

Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTOs
VPs of R&D

Alluxio Demystified | Unify Data Analytics Any Stack Any Cloud

Personally, I have been waiting for over a year to host this lecture at our meetup. Back when I was at Walla News, I wanted to test drive their solution to accelerate Hive and Spark SQL over S3 and external tables. If you are into caching, performance, and unifying your multiple storage solutions (GCS, S3, etc.), you might want to hear the wonderful lecturer Bin Fan, PhD, Founding Engineer and VP of Open Source at Alluxio.

This post will be updated with more soon! Stay tuned. For now, you are welcome to join our meetup.

Unify Data Analytics: Any Stack Any Cloud | Webinar | Big Data Demystified
Tuesday, Mar 19, 2019, 7:00 PM
Kikar Halehem, Kdoshei HaShoa St 63, Herzliya, IL
The webinar was broadcast via YouTube: https://www.youtube.com/watch?v=5g89Wn6qgc0

Serverless Data Pipelines | Big Data Demystified

We had the pleasure of hosting Michael Haberman, Founder at Topsight:

Serverless is the new kid in town, but let's not forget data, which is also critical for your organisation. In this talk we will look at the benefits of going serverless with your data pipeline, but also the challenges it raises. This talk will be heavily loaded with demos, so watch out!

AWS Big Data Demystified | Serverless data pipeline
Sunday, Mar 3, 2019, 6:00 PM
Investing.com, Ha-Shlosha St 2, Tel Aviv-Yafo, IL
Agenda: 18:00 networking and gathering; 18:30 "A Polylog about Redis", Itamar Haber (Technology Evangelist, Redis Labs); 19:15 "Serverless data pipeline", Michael Haberman