Working with Avro

Reasons to work with  AVRO

  1. good for nested data structures even in analytics use cases
  2. good for Google Big Query, for import
  3. good for Google Big Query as an external source which is not CSV.

Reasons NOT to work with AVRO

  1. Row based , generally, not good for analytics.
  2. Doesn’t not support GZIP.

 

Several ways to create AVRO table

CREATE TABLE doctors
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    {
      "name":"number",
      "type":"int",
      "doc":"Order of playing the role"
    },
    {
      "name":"first_name",
      "type":"string",
      "doc":"first name of actor playing role"
    },
    {
      "name":"last_name",
      "type":"string",
      "doc":"last name of actor playing role"
    },
    {
      "name":"extra_field",
      "type":"string",
      "doc:":"an extra field not in the original file",
      "default":"fishfingers and custard"
    }
  ]
}');

Another way

CREATE TABLE my_avro_table(notused INT)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
    'avro.schema.url'='file:///tmp/schema.avsc')
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

 

AVRO with compression and partitions example:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

be sure to test the size of the output to see if matches your needs 🙂 snappy is faster, deflate is slightly more compact. 🙂

CREATE external TABLE dfp_test_partitions_avro
(
`DATE` date ,
`ADVERTISER_NAME` STRING ,
`ORDER_ID` int ,
`ORDER_NAME` STRING ,
`AD_UNIT_NAME` STRING ,
`COUNTRY_NAME` STRING ,
`LINE_ITEM_NAME` STRING ,
`CREATIVE_SIZE` STRING ,
`CREATIVE_SIZE_DELIVERED` STRING ,
`LINE_ITEM_TYPE` STRING ,
`DAY` int ,
`ADVERTISER_ID` int ,
`AD_UNIT_ID` int ,
`COUNTRY_CRITERIA_ID` int ,
`LINE_ITEM_ID` int ,
`LINE_ITEM_GOAL_QUANTITY` STRING ,
`LINE_ITEM_LIFETIME_IMPRESSIONS` STRING ,
`TOTAL_LINE_ITEM_LEVEL_IMPRESSIONS` int ,
`TOTAL_LINE_ITEM_LEVEL_CLICKS` int ,
`TOTAL_LINE_ITEM_LEVEL_CPM_AND_CPC_REVENUE` decimal )

STORED AS AVRO
PARTITIONED BY (dt string)
LOCATION ‘gs://omid-eu-west-1/dfp_gzip_avro_partition/’

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s