How to export data from Google BigQuery into AWS S3 + EMR Hive

Prerequisites

  1. GCP basics – creating a machine
  2. BigQuery basics – exporting data
  3. a machine on GCP
  4. gsutil – the Google Cloud Storage utility – copy/ls etc.
  5. an S3 bucket plus a user (access key + secret key)
  6. avro-tools (see the download sketch after this list)
  7. Java
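avro-tools ships as a single jar. A minimal way to fetch version 1.7.7 (the version used in step 4 below; the URL assumes the standard Apache archive layout) is:

curl -O https://archive.apache.org/dist/avro/avro-1.7.7/java/avro-tools-1.7.7.jar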

Steps to export data from Google BigQuery into AWS S3 + EMR Hive:

  1. From the GCP machine, run an export from BigQuery to Google Cloud Storage (notice the destination format; the trailing wildcard lets BigQuery shard tables larger than 1 GB across multiple files):

bq --location=US extract --destination_format AVRO myProject-169xxx:132673xxx.ga_sessions_20180526 gs://your-gs-bucket/myAvro_20180526*
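A quick way to confirm that the export produced files, using the same bucket and prefix as above:

gsutil ls gs://your-gs-bucket/myAvro_20180526*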

  2. From the GCP machine, edit the ~/.boto file to add your AWS credentials so gsutil can write to S3 (a sample snippet follows the rsync command below): https://www.quora.com/How-can-I-migrate-data-from-Google-cloud-storage-into-AWS-S3-buckets
  3. From the GCP machine, run gsutil to copy the data from Google Cloud Storage to AWS S3:

gsutil rsync -d -r gs://your-gs-bucket s3://your-aws-bucket
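Note that -d makes rsync delete objects in the destination that are missing from the source, so point it at a dedicated bucket. For reference, the ~/.boto addition from step 2 looks like this (placeholder values, not real keys):

[Credentials]
aws_access_key_id = <your-aws-access-key>
aws_secret_access_key = <your-aws-secret-key>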

  4. From an AWS machine (e.g. the EMR master node), run the following command to extract the schema embedded in one of the Avro files:

java -jar ~/avro-tools-1.7.7.jar getschema myAvro_20180526.avro > myAvro_20180526.avsc
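Note that avro-tools reads a local file. If the exported files exist only in S3 at this point, fetch one first (a sketch, assuming the AWS CLI is configured and the object keeps the name used above):

aws s3 cp s3://your-aws-bucket/myAvro_20180526.avro .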

  5. Copy the myAvro_20180526.avsc file to a bucket on AWS, renamed to match the table definition in step 6:
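For example, with the AWS CLI (the avsc/g.avsc key is the one referenced by avro.schema.url below):

aws s3 cp myAvro_20180526.avsc s3://your-aws-bucket/avsc/g.avsc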

  6. In Hive on the AWS EMR cluster, create an external table as follows:

CREATE EXTERNAL TABLE g
STORED AS AVRO
LOCATION 's3://your-aws-bucket/'
TBLPROPERTIES ('avro.schema.url'='s3://your-aws-bucket/avsc/g.avsc');
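If hosting the .avsc file is inconvenient, Hive's Avro SerDe also accepts the schema inline via 'avro.schema.literal' instead of 'avro.schema.url', with the JSON schema pasted directly as the property value:

TBLPROPERTIES ('avro.schema.literal'='<contents of the .avsc file>');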

  7. Run a query in Hive on AWS to confirm the data:

select * from g limit 1;
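A row count is a stronger sanity check; it should match the source table in BigQuery:

select count(*) from g;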

  8. Run the following in Hive on AWS to see the table attributes:

show create table g;

 
