How to export data from Google Big Query into AWS S3 + EMR hive or AWS Athena

Perquisites

  1. GCP basics – create machine, or use cloud shell.
  2. big query basics – export data
  3. a machine @@ GCP
  4. gsutil – google cloud storage utility – copy/ls etc.
  5. s3 bucket + with user (access key + secret key)
  6. avro tools
  7. java

The motivation

export the data with the schema, as the structure is highly nested and includes complex data types.

Steps to reconstruct export data from google big query into aws s3 + emr hive or athena:

  1. from GCP machine run export from big query to google cloud storage (notice the destination format):

bq –location=US extract –destination_format AVRO –compression SNAPPY ​​​   `gap:123.ga_sessions_20190129` gs://your-gs-bucket/file*.avro

  1. from the GCP machine edit ~/.boto file to add your aws credentials to gsutil as follows: https://www.quora.com/How-can-I-migrate-data-from-Google-cloud-storage-into-AWS-S3-buckets or try the following:
    1. gcloud auth login

    2. edit ~/.boto file, add the AWS access key , secret key
  2. from gcp machine run the gsutil to copy the data from google cloud storage to aws s3:

gsutil rsync -d -r gs://your-gs-bucket s3://your-aws-bucket

  1. from aws run the following command to get the inlined schema file :

java -jar ~/avro-tools-1.7.7.jar getschema myAvro_20180526.avro > myAvro_20180526.avsc

5. copy  the myAvro_20180526.avsc file in a bucket on aws

6. in hive @ aws  create table as follows:

CREATE EXTERNAL TABLE g

STORED AS AVRO

LOCATION ‘s3://your-aws-bucket/‘ TBLPROPERTIES (‘avro.schema.url’=’s3://your-aws-bucket/avsc/g.avsc‘);

7. run in hive @aws query to confirm data

select * from g limit 1; 

  1. run in hive @ aws the below to see to see table attributes.

show create table g;

Alternate method to create table on top of the AVRO:

  1. make sure the AWS bucket name is in the example name of your table
  2. use AWS Glue wizard to crawl the data and infer the schema on the fly via Athena.

 

Why you should not try to create the table your self (manually):

here the original spec of the schema: https://support.google.com/analytics/answer/3437719?hl=en

as you can see there multiple duplicate columns names, making extremely hard to parse the data. see hits level data and user level data. both will have similar column names

Traversing the data, assuming complex array data structure (non trivial):

https://prestodb.github.io/docs/current/functions/array.html

https://docs.aws.amazon.com/athena/latest/ug/rows-and-structs.html

 

Example of traversing google analytics ga_sessions table in AWS Athena:

Athena has a ROW data type which the equivalent of struct in BigQuery

select hits_page.hostname from
(
SELECT
visitnumber,
–TRY(hits[2]) AS hit_row2,
–TRY(hits[3]) AS hit_row3,
hits[1].page as page,
hits[1].transaction as transaction,
CAST(hits[1].page AS ROW(pagePath VARCHAR,
hostname VARCHAR,
pageTitle VARCHAR,
searchKeyword VARCHAR,
searchcategory VARCHAR,
pagePathLevel1 VARCHAR,
pagePathLevel2 VARCHAR,
pagePathLevel3 VARCHAR,
pagePathLevel4 VARCHAR
)) AS hits_page

FROM ga_session_table)
where hits_page.hostname !=”
limit 100

 

Another way to query nested data in Athena:

SELECT hit.page.pagepathlevel1 FROM omid_ga_session_test CROSS JOIN UNNEST(hits) AS t (hit);

 

 

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s