What is the cheapest way to parse GA_sessions in BigQuery? What is the fastest way? What is the simplest way?

Option 1: Subquery

The simplest solution is a correlated subquery per custom dimension; the query below processed 3.18 GB. The problem is that if you put a view on top of it, the cost does not behave as you would expect, so be careful and test everything. For example:

SELECT
  fullVisitorId,
  (SELECT value FROM h.customDimensions WHERE index = 1) AS UID,
  (SELECT DISTINCT value FROM h.customDimensions WHERE index = 36) AS App_Edition
FROM `1234.ga_sessions_20180501` t, t.hits h

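One way to test what a query (or a view built on top of it) will actually bill is a dry run, which validates the query and reports the estimated bytes scanned without charging anything. Below is a minimal sketch using the google-cloud-bigquery Python client; the project name my-project is a hypothetical placeholder, and the query string is the subquery example above.

# Estimate bytes processed with a dry run (no data is scanned, nothing is billed).
# "my-project" is a hypothetical placeholder for your GCP project.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
SELECT
  fullVisitorId,
  (SELECT value FROM h.customDimensions WHERE index = 1) AS UID,
  (SELECT DISTINCT value FROM h.customDimensions WHERE index = 36) AS App_Edition
FROM `1234.ga_sessions_20180501` t, t.hits h
"""

# dry_run validates the query and fills in total_bytes_processed without running it
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=job_config)
print("Estimated bytes processed: {:.2f} GB".format(job.total_bytes_processed / 1e9))

Run the same dry run against the view as well, to see whether the estimated bytes match your expectation.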

Option 2: User-Defined Function

This option uses a user-defined function (UDF) with SQL; it involves basic JavaScript coding. It is the slowest solution, but it is unlikely to hit a resource problem.

The query below processed 3.18 GB

CREATE TEMPORARY FUNCTION getCustomDimension(cd ARRAY<STRUCT<index INT64, value STRING>>, index INT64)
RETURNS STRING
LANGUAGE js AS """
  for (var i = 0; i < cd.length; i++) {
    var item = cd[i];
    if (item.index == index) {
      return item.value;
    }
  }
  return '';
""";

SELECT
  gs_sessions.fullVisitorId,
  getCustomDimension(hits.customDimensions, 1) AS customDimension1_uid,
  getCustomDimension(hits.customDimensions, 36) AS customDimension36_edition
FROM `1234.ga_sessions_20180501` gs_sessions,
  UNNEST(gs_sessions.hits) hits

Option 3: WITH clause (common table expression)

This option is the cheapest: only 570 MB was processed. However, with a large number of custom dimensions you might hit an error such as "Error: Query exceeded resource limits. 86676.81298774751 CPU seconds were used, this query must use less than 55000.0 CPU seconds." and there is nothing you can do about it. Furthermore, the solution is not elegant in terms of maintenance, and it will get slower the more custom dimensions you have.

WITH CD AS (
  SELECT
    with_ga_sessions.fullVisitorId,
    with_ga_sessions.visitId,
    with_hits.hitNumber,
    with_cd.index,
    with_cd.value
  FROM `1234.ga_sessions_20180501` with_ga_sessions,
    UNNEST(with_ga_sessions.hits) with_hits,
    UNNEST(with_ga_sessions.customDimensions) with_cd
  WHERE with_cd.index IN (1, 36)
)
SELECT
  ga_sessions.visitId,
  CD_1.value AS customDimension1_uid,
  CD_36.value AS customDimension36_edition
FROM `1234.ga_sessions_20180501` ga_sessions,
  UNNEST(ga_sessions.hits) hits
LEFT JOIN CD CD_1
  ON ga_sessions.fullVisitorId = CD_1.fullVisitorId
  AND ga_sessions.visitId = CD_1.visitId
  AND hits.hitNumber = CD_1.hitNumber
  AND CD_1.index = 1
LEFT JOIN CD CD_36
  ON ga_sessions.fullVisitorId = CD_36.fullVisitorId
  AND ga_sessions.visitId = CD_36.visitId
  AND hits.hitNumber = CD_36.hitNumber
  AND CD_36.index = 36

Parsing the last 4 days of the Google Analytics GA_sessions table in Google BigQuery

The problems with querying these tables in Google BigQuery:

  1. The table is sharded, not partitioned. This means each daily table has its own name, e.g. ga_sessions_20190202.
  2. The GA export also writes temporary intraday data to a table such as ga_sessions_intraday_20190505.
  3. So when you SELECT * FROM ga_sessions_*, you will also get some unexpected ga_sessions_intraday_* shards.

Below is an example of querying the last 4 days of any GA_sessions table:

SELECT * FROM `myProject.MyDataset.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY))
                        AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
   OR _TABLE_SUFFIX = CONCAT('intraday_', FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 0 DAY)))
   OR _TABLE_SUFFIX = CONCAT('intraday_', FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)))

Load CSV from AWS S3 into AWS RDS Aurora

I understand that you would like to connect to your Aurora instance from your EC2 instance.

In order to achieve this, please make sure that the MySQL service is running, using the following command:

sudo service mysqld status

Alternatively, install the MySQL client directly from the MySQL repositories (https://dev.mysql.com/downloads/). Please note that these are MySQL community repositories and are not maintained/supported by AWS.
https://dev.mysql.com/doc/refman/5.7/en/linux-installation-yum-repo.html

Once the service is running you can use the following connection string:

mysql -h <cluster-endpoint> -P 3306 -u <your-user> -p

Please follow the link for further information on connecting to an Amazon Aurora MySQL DB cluster:
[+]https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Connecting.html

Shortcut to install the MySQL CLI:

sudo yum install mysql

Connect to RDS after confirming your security group allows access from your EC2 instance to your RDS instance, for example:

mysql -u MyUser -pmyPassword -h MyClusterEndpoint

Create a role to allow Aurora access to S3:

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.Authorizing.IAM.CreateRole.html

Set your cluster to use the above role (with access to S3).

Assuming your DB is on a private subnet, set up a VPC endpoint to S3 (always a good practice to set up an endpoint anyway):

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.Authorizing.Network.html

In the Aurora cluster parameter group, set aurora_load_from_s3_role to the ARN of the role with S3 access, and reboot the Aurora DB.
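If you prefer to script this instead of using the console, the same parameter can be set with boto3. This is only a sketch: the parameter group name, region and role ARN below are hypothetical placeholders.

# Set aurora_load_from_s3_role on the Aurora cluster parameter group.
# The parameter group name, region and role ARN are hypothetical placeholders.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-aurora-cluster-params",
    Parameters=[
        {
            "ParameterName": "aurora_load_from_s3_role",
            "ParameterValue": "arn:aws:iam::123456789012:role/my-aurora-s3-role",
            "ApplyMethod": "pending-reboot",  # picked up after the cluster reboot
        }
    ],
)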

Grant your user S3 load access via:

GRANT LOAD FROM S3 ON *.* TO 'user'@'domain-or-ip-address';

Create a table in MySQL, for example:

use myDatabase;

create table engine_triggers(
today_date date,
event_prob double
);

LOAD DATA examples (ignoring the header, or using quoted fields):

LOAD DATA FROM S3 's3://bucket/sample_triggers.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES;

LOAD DATA FROM S3 's3://bucket/sample_triggers.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '"';

Ignoring the header and using quoted fields:

LOAD DATA FROM S3 's3://bucket/sample_triggers.csv'
INTO TABLE engine_triggers
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

You can automate it via hourly crontab runs as follows (notice the extra escape character before \n and before the double quote inside ENCLOSED BY):

0 * * * * mysql -u User -pPassword -hClusterDomainName -e "use myDatabase; truncate myDatabase.engine_triggers; LOAD DATA FROM S3 's3://bucket/file.csv' INTO TABLE engine_triggers FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\\n' IGNORE 1 LINES"
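If the quoting in the crontab line becomes hard to maintain, the same truncate-and-load can also be run from a short Python script with pymysql, which avoids the shell escaping entirely. This is only a sketch: the host, credentials, database, bucket and file names below are hypothetical placeholders.

# Truncate the table and reload it from S3, equivalent to the crontab one-liner above.
# Host, user, password, database, bucket and file names are hypothetical placeholders.
import pymysql

conn = pymysql.connect(
    host="my-cluster.cluster-xxxxxxxx.eu-west-1.rds.amazonaws.com",
    user="MyUser",
    password="myPassword",
    database="myDatabase",
)

LOAD_SQL = """
LOAD DATA FROM S3 's3://bucket/file.csv'
INTO TABLE engine_triggers
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
"""

try:
    with conn.cursor() as cur:
        cur.execute("TRUNCATE TABLE engine_triggers")
        cur.execute(LOAD_SQL)
    conn.commit()
finally:
    conn.close()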

Detailed manual:

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html

And some documentation about MySQL data types:

https://www.w3schools.com/sql/sql_datatypes.asp

Running queries on RDS Aurora from AWS Lambda

You can find the relevant set of steps for accessing your Amazon Aurora instance using Lambda in the following documentation:

[+] Tutorial: Configuring a Lambda Function to Access Amazon RDS in an Amazon VPC – https://docs.aws.amazon.com/lambda/latest/dg/vpc-rds.html

I also carried out a test connecting to my Aurora instance from Lambda. The steps I followed are below:

Create an Aurora cluster and connect to the writer instance using the cluster endpoint. Create a sample database and table. (Make sure the correct source IP addresses are allowed in the instance's security group so that the connection succeeds.)

Now, to create a Lambda function that accesses the Aurora instance:


Creating Role

To start with, we first need to create an execution role that gives your Lambda function permission to access AWS resources.

Please follow these steps to create an execution role:

1. Open the roles page in the IAM console: https://console.aws.amazon.com/iam/home#/role
2. Choose Roles from the left dashboard and select Create role.
3. Under the tab “Choose the service that will use this role” select Lambda and then Next:Permissions
4. Search for “AWSLambdaVPCAccessExecutionRole”. Select this and then Next:Tags
5. Provide a Tag and then a Role Name (ex. lambda-vpc-role) and then Create Role.

The AWSLambdaVPCAccessExecutionRole policy has the permissions that the function needs to manage network connections to a VPC.
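If you prefer to script the role creation, a rough boto3 equivalent of the console steps above is sketched below; it assumes the role name lambda-vpc-role from step 5 and credentials that are allowed to manage IAM.

# Create the Lambda execution role and attach the AWSLambdaVPCAccessExecutionRole policy.
# Equivalent to the console steps above; run with credentials allowed to manage IAM.
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Lambda service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="lambda-vpc-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

iam.attach_role_policy(
    RoleName="lambda-vpc-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole",
)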


Creating Lambda Function

Please follow the below steps to create a Lambda function:

1. Open the Lambda Management Console : https://console.aws.amazon.com/lambda
2. Choose Create a function
3. Choose Author from scratch, and then do the following: 
    * In Name*, specify your Lambda function name.
    * In Runtime*, choose Python 2.7.
    * In Execution Role*, choose “Use an existing role”.
    * In Role name*, enter a name for your role which was previously created “lambda-vpc-role”.
4. Choose create function.
5. Once you have created the Lambda function, navigate to the function page.
6. On the function page, under the Network section, do the following:
    * In VPC, choose default VPC
    * In Subnets*, choose any two subnets
    * In Security Groups*, choose the default security group
7. Click on Save

Setting up Lambda Deployment Environment

Next you will need to set up a deployment environment for the Python code that connects to the RDS database.
To connect to Aurora using Python you will need the pymysql module, so we need to install dependencies with pip and create a deployment package. Please execute these commands in your local environment.

1. Create a local directory which will be the deployment package:
$ mkdir rds_lambda;

$ cd rds_lambda/

$ pwd
/Users/user/rds_lambda

2. Install the pymysql module:
$ pip install pymysql -t /Users/user/rds_lambda

Executing the above command installs the pymysql module into your current directory.

3. Next, create a Python file which contains the code to connect to the RDS instance:
$sudo nano connectdb.py

I have attached the file "connectdb.py", which has the Python code to connect to the RDS instance.
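The attached file is not reproduced here, but a minimal connectdb.py along these lines would do the job; the endpoint, credentials and database name are hypothetical placeholders, and the handler function name main matches the Handler setting used later (connectdb.main).

# connectdb.py - minimal sketch of a Lambda handler that connects to Aurora MySQL.
# Endpoint, user, password and database name are hypothetical placeholders.
import pymysql

RDS_HOST = "my-cluster.cluster-xxxxxxxx.eu-west-1.rds.amazonaws.com"
DB_USER = "MyUser"
DB_PASSWORD = "myPassword"
DB_NAME = "myDatabase"


def main(event, context):
    # connect_timeout keeps a misconfigured network from hanging for the whole Lambda timeout
    conn = pymysql.connect(
        host=RDS_HOST,
        user=DB_USER,
        password=DB_PASSWORD,
        database=DB_NAME,
        connect_timeout=5,
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT VERSION()")
            version = cur.fetchone()[0]
        return {"status": "connected", "mysql_version": version}
    finally:
        conn.close()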

4. Next, we need to zip the current directory and upload it to the Lambda function.
$ zip -r rds_lambda.zip `ls` 

The above command creates a zip file, rds_lambda.zip, which we will need to upload to the Lambda function.
Navigate to the newly created Lambda function's console page:

1. In the Function code section, under Code entry type, select "Upload a .zip file" from the drop-down.
2. Browse to the zip file in your local directory.
3. Next, in the Function code section, change the Handler to pythonfilename.function (e.g. connectdb.main).
4. Click Save. (Steps 1-4 can also be scripted; see the boto3 sketch after this list.)
5. Next, you will need to allow the security group of the Lambda function in your RDS security group.
6. After that, test the connection by creating a test event.
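For reference, here is a rough boto3 sketch of steps 1-4; the function name my-rds-lambda is a hypothetical placeholder, and the zip is the rds_lambda.zip built earlier.

# Upload the deployment package and set the handler without the console.
# The function name is a hypothetical placeholder; the zip is the one built above.
import boto3

lambda_client = boto3.client("lambda")

with open("rds_lambda.zip", "rb") as f:
    zip_bytes = f.read()

# Replace the function code with the local deployment package
lambda_client.update_function_code(
    FunctionName="my-rds-lambda",
    ZipFile=zip_bytes,
)

# Point the handler at connectdb.main (file connectdb.py, function main)
lambda_client.update_function_configuration(
    FunctionName="my-rds-lambda",
    Handler="connectdb.main",
)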

If the execution is successful, then the connection has been made.

You may also go through the video link below, which gives a detailed explanation of how to connect to an RDS instance using a Lambda function:
[+]https://www.youtube.com/watch?v=-CoL5oN1RzQ&vl=en

After successfully establishing the connection, you can modify the Python file to query databases inside the Aurora instance.

AWS Big Data Demystified #1.2 | Big Data architecture lessons learned

A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…

Consequently, I created a meetup detailing all the challenges of Big Data, especially in the cloud. I use AWS, GCP and data-center infrastructure to answer the basic questions of anyone starting out in the big data world.

How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC, AVRO? Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? GCS? BigQuery? Dataflow? Datalab? TensorFlow? How to handle streaming? How to manage costs? Performance tips? Security tips? Cloud best practice tips?

In this meetup we present lecturers working with several cloud vendors, various big data platforms such as Hadoop and data warehouses, and startups working on big data products. Basically, if it is related to big data, this is THE meetup.

Some of our online materials (mixed content from several cloud vendors):

Website:

https://big-data-demystified.ninja (under construction)

Meetups:

Big Data Demystified (Tel Aviv-Yafo, IL)

AWS Big Data Demystified (Tel Aviv-Yafo, IL)

YouTube channels:

https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber

https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

Audience:

Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTOs
VPs of R&D