Working with Avro

Reasons to work withΒ  AVRO

  1. good for nested data structures even in analytics use cases
  2. good for Google Big Query, for import
  3. good for Google Big Query as an external source which is not CSV.

Reasons NOT to work with AVRO

  1. Row based , generally, not good for analytics.
  2. Doesn’t not support GZIP.


Several ways to create AVRO table

SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
      "doc":"Order of playing the role"
      "doc":"first name of actor playing role"
      "doc":"last name of actor playing role"
      "doc:":"an extra field not in the original file",
      "default":"fishfingers and custard"

Another way

CREATE TABLE my_avro_table(notused INT)


AVRO with compression and partitions example:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

be sure to test the size of the output to see if matches your needs πŸ™‚ snappy is faster, deflate is slightly more compact. πŸ™‚

CREATE external TABLE dfp_test_partitions_avro
`DATE` date ,
`ORDER_ID` int ,
`DAY` int ,
`AD_UNIT_ID` int ,
`LINE_ITEM_ID` int ,

PARTITIONED BY (dt string)
LOCATION ‘gs://omid-eu-west-1/dfp_gzip_avro_partition/’

Big Data and Hadoop options over Microsoft Azure Cloud summery

Azure HDInsights features and advantages:

  1. HiveΒ with LLAP
  2. Spark , SparlSQL, ML,Steaming
  3. Pig
  4. Hbase,
  5. Storm
  6. U-sql – c# and SQL
  7. federated query across several data sources
  8. Kafka! (with rack awareness to azure), replication with mirror maker
  9. Microsoft R server!
  10. Zeppelin and jupiter integration.
  11. Apache Ambari View.
  12. Sqoop
  13. Oozie
  14. Zookeeper, for leader election of head nodes (master node)
  15. Mahut,Β discontinued in v4.0
  16. phoenix (SQL over Hbase)
  17. mono – open source C# .net implementation.
  18. Apache Slider – like yarn.Β, not in the new version. discontinued in v4.0
  19. Apache Livy
  20. Security – kerberos, and active directory, apache ranger
  21. External Hive metastore
  22. very rich documentations:Β
  23. Rich Developer plugins
    1. Zeppelin
    2. intellij
    3. Eclipse
    4. R
    5. Visual studio
    6. Jupiter


  1. data lake analytics
  2. machine learning
  3. Power BI!!
  4. Azure Cosmos DB – extensions of Azure documentdB, basically noSQL
  5. Azure data factory – orchestration
  6. Azure Event Hub
  7. ISV data science
    1. H2o
    2. data iku

More advantages

  1. each worker can be configure for different sizes.
  2. Hive ODBC
  3. Hive add on for excel
  4. Auto scaling.


  1. Gateway nodes – management and security.
  2. Head nodes – like name node, in High availability
  3. Edge nodes – not for data processing, it is for developer and data scientist job testing.
  4. worker nodes – like data nodes.
  5. zoo keeper nodes – for leader election of head nodes.
  6. nimbus nodes – with storm.
  7. Hive meta store – Azure SQL
  8. Azure Data lake storeΒ  and Azure blob


  1. Azure cli to create clusters
  2. Airflow – open source.
  3. TBD.



AWS Big Data Demystified #1.1 | Big Data Architecture Lessons LearnedΒ 

spoken Language: English.

Lecture Video:

Aws big-data-demystified #1.1 | Big Data Architecture Lessons Learned | lecture Keywords:

1. AWS Big Data Demystified #1.1 Big Data Architecture Lessons Learned Omid Vahdaty, Big Data Ninja

2.Β Disclaimer

  • I am not trying to sell anything.
  • This is purely knowledge transfer session.
  • You are more than welcome to challenge each slide, during the lecture and afterwards πŸ™‚
  • This lecture was released at 2018, take into account things change over time.

3.Β When the data outgrows your ability to process

  • Volume (100TB processing per day)
  • Velocity (4GB/s)
  • Variety (JSON, CSV, Veracity (how much of the data is accurate?)

4.Β In the past (web,api, ops db, data warehouse) API DW




8.Β Challenges creating big data architecture?

  • What is the business use case ?
  • How fast do u need the insights?
  • 15 min – 24 hours delay and above ?
  • Less than 15 min?
  • Streaming?Β  Sub seconds delay?Β  Β Sub minute delay? Streaming with in flight analytics ?Β  How complex is the compute jobs? Aggregations? joins?

9.Β Challenges creating big data architecture?

  • What is the velocity?
  • Under 100K events per second? Not a problem
  • Over 1M events per second? Costly. But doable.
  • Over 1B events per seconds? Not trivial at all.
  • Volume ?Β Β Well…. It depends.
  • Veracity (how are you going to handle different data sources?)
  • Structured (CSV)
  • Semi structured (JSON,XML)
  • Unstructured (pictures, movies etc)

10.Β Challenges creating big data architecture?

  • Performance targets?
  • Costs targets?
  • Security restrictions?
  • Regulation restriction?
  • privacy?
  • Which technology to choose?
  • Datacenter or cloud?
  • Latency?
  • Throughput?
  • Concurrency?
  • Security Access patterns?
  • Pass? Max 7 technologies
  • Iaas? Max 4 technologies

11.Β Cloud Architecture rules of thumb…

  • Decouple : β—‹ Store β—‹ Process β—‹ Store β—‹ Process β—‹ insight…
  • Rule of thumb: max 3 technologies in dc, 7 tech max in cloud
  • Don’t use more b/c: maintenance β—‹ Training time β—‹ complexity/simplicity

12.Β Use Case 1: Analyzing browsing history

  • Data Collection: browsing history from an ISP
  • Product – derives user intent and interest for marketing purposes.
  • Challenges β—‹ Velocity: 1 TB per day β—‹ History of: 3M β—‹ Remote DC β—‹ Enterprise grade security β—‹ Privacy

13.Β Use Case 2: Insights from location based data

  • Data collection: from a Mobile operator
  • Products: β—‹ derives user intent and interest for marketing purposes. β—‹ derive location based intent for marketing purposes.
  • Challenges β—‹ Velocity: 4GB/s … β—‹ Scalability: Rate was expected double every year… β—‹ Remote DC β—‹ Enterprise grade security β—‹ Privacy β—‹ Edge analytics

14.Β Use Case 3: Analyzing location based events.

  • Data collection: streaming
  • Product: building location based audiences
  • Challenges: minimizing DevOps work on maintenance of a DIY streaming system

15.Β So what is the product?

Big data platform that β—‹ collects data from multiple sources β—‹ Analyzes the data β—‹ Generates insights : β–  Smart Segments (online marketing) β–  Smart reports (for marketer) β–  Audience analysis (for agencies) ● Customers? β—‹ Marketers β—‹ Publishers β—‹ Agencies

16.Β My Big Data platform is about:

  • Data Collection β—‹ Online β–  messaging β–  Streaming β—‹ Offline β–  Batch β–  Performance aspects
  • Data Transformation (Hive) β—‹ JSON, CSV, TXT, PARQUET, Binary
  • Data Modeling – (R, ML, AI, DEEP, SPARK)
  • Data Visualization (choose your poison) ● PII regulation + GPDR regulation
  • And: Performance… Cost… Security… Simple… Cloud best practices…

17.Β Big Data Generic Architecture Data Collection (file based ETL from remote DC) Data Transformation ( row to colunar + cleansing) Data Modeling ( joins/agg/ML/R) Data Visualization Text, RAW

18.Β Big Data Generic Architecture | Data Collection Data Collection Data Transformation Data Modeling Data Visualization

19.Β Batch Data collection considerations

  • Every hour , about 30GB compressed CSV file
  • Why s3 β—‹ Multi part upload β—‹ S3 CLI β—‹ S3 SDK β—‹ (tip : gzip! )
  • Why Client – needs to run at remote DC
  • Why NOT your own client β—‹ Involves code β†’ β–  Bugs? β–  maintenance β—‹ Don’t analyze data at Edge , since you cant go back in time.
  • Why Not Streaming? β—‹ less accurate β—‹ Expensive

20.Β S3 Considerations

  • Security β—‹ at rest: server side S3-Managed Keys (SSE-S3) β—‹ at transit: SSL / VPN β—‹ Hardening: user, IP ACL, write permission only.
  • Upload β—‹ AWS s3 cli β—‹ Multi part upload β—‹ Aborting Incomplete Multipart Uploads Using a Bucket Lifecycle Policy β—‹ Consider S3 CLI Sync command instead of CP

21.Β Sqoop – ETL

  • Open source , part of EMR
  • HDFS to RDMS and back. Via JDBC.
  • E.g BiDirectional ETL from RDS to HDFS
  • Unlikely use case: ETL from customer source DB.

22.Β Flume & Kafka

  • Opens source project for streaming & messaging
  • Popular ● Generic ● Good practice for many use cases. (a meetup by it self) ● Highly durable, scalable, extension etc.
  • Downside : DIY, Non trivial to get started

23.Β Data Transfer Options

  • VPN
  • Direct Connect (4GB/s?)
  • For all other use case β—‹ S3 multipart upload β—‹ Compression β—‹ Security
  • Data at motion
  • Data at rest
  • bandwidth

24.Β Quick intro to Stream collection

  • Kinesis Client Library (code)
  • AWS lambda (code)
  • EMR (managed hadoop)
  • Third party (DIY) β—‹ Spark streaming (latency min =1 sec) , near real time, with lot of libraries. β—‹ Storm – Most real time (sub millisec), java code based. β—‹ Flink (similar to spark)

25.Β Kinesis

  • Stream – collect@source and near real time processing
  • Near real time β—‹ High throughput β—‹ Low cost β—‹ Easy administration – set desired level of capacity β—‹ Delivery to : s3,redshift, Dynamo, … β—‹ Ingress 1mb, egress 2mbs. Upto 1000 Transaction per second. β—‹ Not managed!
  • Analytics – in flight analytics.
  • Firehose – Park you data @ destination.

26.Β Firehose – for Data parking

  • Not for fast lane – no in flight analytics
  • Capture , transform and load. β—‹ Kinesis β—‹ S3 β—‹ Redshift β—‹ elastic search
  • Managed Service

27.Β Comparison of Kinesis product

● Streams β—‹ Sub 1 sec processing latency β—‹ Choice of stream processor (generic) β—‹ For smaller events

● Firehose β—‹ Zero admin β—‹ 4 targets built in (redshift, s3, search, etc) β—‹ Buffering 60 sec minimum. β—‹ For larger β€œevents”

28.Β Big Data Generic Architecture | Data Collection Data Collection S3 Data Transformation Data Modeling Data Visualization

29.Β Big Data Generic Architecture | Transformation Data Collection S3 Data Transformation Data Modeling Data Visualization

30.Β EMR ecosystem

● Hive ● Pig ● Hue ● Spark ● Oozie ● Presto ● Ganglia ● Zookeeper (hbase) ● zeppelin

31.Β EMR Architecture

● Master node

● Core nodes – like data nodes (with storage: HDFS)

● Task nodes – (extends compute)

● Does Not have Standby Master node

● Best for transient cluster (goes up and down every night)

32.Β EMR lesson learned…

● Bigger instance type is good architecture

● Use spot instances – for the tasks.

● Don’t always use TEZ (MR? Spark?)

● Make sure your choose instance with network optimized

● Resize cluster is not recommended

● Bootstrap to automate cluster upon provisioning

● Use Steps to automate steps on running cluster

● Use Glue to share Hive MetaStore

33.Β So use EMR for …

● Most dominant β—‹ Hive β—‹ Spark β—‹ Presto

● And many more….

● Good for: β—‹ Data transformation β—‹ Data modeling β—‹ Batch β—‹ Machine learning

34.Β Hive

● SQL over hadoop.

● Engine: spark, tez, MR


● Not good when need to shuffle.

● Not peta scale.

● SerDe json, parquet,regex,text etc.

● Dynamic partitions

● Insert overwrite

● Data Transformation

● Convert to Columnar

35.Β Presto

● SQL over hadoop

● Not good always for join on 2 large tables.

● Limited by memory

● Not fault tolerant like hive.

● Optimized for ad hoc queries

● No insert overwrite

● No dynamic partitions.

● Has some connectors : redshift and more

● https://amazon-aws-big-data- presto-demystified-everything-you- wanted-to-know-about-presto/

36.Β Pig

● Distributed Shell scripting

● Generating SQL like operations.

● Engine: MR, Tez

● S3, DynamoDB access

● Use Case: for data science who don’t know SQL, for system people, for those who want to avoid java/scala

● Fair fight compared to hive in term of performance only

● Good for unstructured files ETL : file to file , and use sqoop.

37.Β Hue

Hadoop user experience

● Logs in real time and failures.

● Multiple users

● Native access to S3.

● File browser to HDFS.

● Manipulate metascore

● Job Browser

● Query editor

● Hbase browser

● Sqoop editor, oozier editor, Pig Editor

38.Β Orchestration

● EMR Oozie β—‹ Opens source workflow

β–  Workflow: graph of action

β–  Coordinator: scheduler jobs β—‹ Support: hive, sqoop , spark etc. ● Other: AirFlow, Knime, Luigi, Azkaban,AWS Data Pipeline

39.Β Big Data Generic Architecture | Transformation Data Collection S3 Data Transformation Data Modeling Data Visualization

40.Β Big Data Generic Architecture | Modeling Data Collection S3 Data Transformation Data Modeling Data Visualization

41.Β Spark

● In memory

● X10 to X100 times faster

● Good optimizer for distribution

● Rich API

● Spark SQL

● Spark Streaming

● Spark ML (ML lib)

● Spark GraphX (DB graphs)

● SparkR

42.Β Spark Streaming

● Near real time (1 sec latency)

● like batch of 1sec windows

● Streaming jobs with API

● Not relevant to us…

43.Β Spark ML

● Classification

● Regression

● Collaborative filtering

● Clustering

● Decomposition

● Code: java, scala, python, sparkR

44.Β Spark flavours

● Standalone

● With yarn

● With mesos

45.Β Spark Downside

● Compute intensive

● Performance gain over mapreduce is not guaranteed.

● Streaming processing is actually batch with very small window.

● Different behaviour between hive and spark SQL

46.Β Spark SQL

● Same syntax as hive

● Optional JDBC via thrift

● Non trivial learning curve

● Upto X10 faster than hive.

● Works well with Zeppelin (out of the box)

● Does not replaces Hive

● Spark not always faster than hive ● insert overwrite –

47.Β Apache Zeppelin

● Notebook – visualizer

● Built in spark integration

● Interactive data analytics

● Easy collaboration.

● Uses SQL

● work s on top of Hive/ SparkSQL

● Inside EMR.

● Uses in the background: β—‹ Shiro β—‹ Livy

48.Β R + spark R

● Open source package for statistical computing.

● Works with EMR

● β€œMatlab” equivalent

● Works with spark

● Not for developer πŸ™‚ for statistician

● R is single threaded – use spark R to distribute.

● Not everything works perfect.

49.Β Redshift

● OLAP, not OLTPβ†’ analytics , not transaction

● Fully SQL

● Fully ACID

● No indexing

● Fully managed

● Petabyte Scale


● Can create slow queue for queries β—‹ which are long lasting. ● DO NOT USE FOR transformation.

● Good for : DW, Complex Joins.

50.Β Redshift spectrum

● Extension of Redshift, use external table on S3.

● Require redshift cluster.

● Not possible for CTAS to s3, complex data structure, joins.

● Good for β—‹ Read only Queries β—‹ Aggregations on Exabyte.

51.Β EMR vs Redshift

● How much data loaded and unloaded?

● Which operations need to performed?

● Recycling data? β†’ EMR

● History to be analyzed again and again ? β†’ emr

● What the data needs to end up? BI?

●Use spectrum in some use cases. (aggregations)?

● Raw data? s3.

52.Β Hive VS. Redshift

● Amount of concurrency ? low β†’ hive, high β†’ redshift

● Access to customers? Redshift?

● Transformation, Unstructured , batch, ETL β†’ hive.

● Peta scale ? redshift ● Complex joins β†’ Redshift

53.Β Big Data Generic Architecture | Modeling Data Collection S3 Data Transformation Data Modeling Data Visualization

54.Β Big Data Generic Architecture | Visualize Data Collection S3 Data Transformation Data Modeling Data Visualization

55.Β Athena

● Presto SQL

● In memory

● Hive metastore for DDL functionality

β—‹ Complex data types β—‹ Multiple formats β—‹ Partitions

● Good for: β—‹ Read only SQL, β—‹ Ad hoc query, β—‹ low cost, β—‹ managed

56.Β Visualize

● QuickSight

● Managed Visualizer, simple, cheap

57.Β Big Data Generic Architecture | Summary Data Collection S3 Data Transformation Data Modeling Data Visualization

58.Β Summary: Lesson learned

● Productivity of Data Science and Data engineering β—‹ Common language of both teams IS SQL! β—‹ Spark cluster has many bridges: SparkR, Spark ML, SparkSQL , Spark core.

● Minimize the amount DB’s used β—‹ Different syntax (presto/hive/redshift) β—‹ Different data types β—‹ Minimize ETLS via External Tables+Glue!

● Not always Streaming is justified (what is the business use case? PaaS?)

● Spark SQL β—‹ Sometimes faster than redshift β—‹ Sometimes slower than hive β—‹ Learning curve is non trivial

● Smart Big Data Architecture is all about: β—‹ Faster, Cheaper, Simpler, More Secured.

59.Β Stay in touch…

● Omid Vahdaty

● +972-54-2384178


● Join our meetup, FB group and youtube channel

Airflow Installation manual and workflow example

Airflow (ubuntu manual):

I used the below command, it took me several attempts, so i list here the list of CMD’s that worked for me.



sudo apt-get update –fix-missing

Β  Β  Β  sudo apt-get -y install build-essential autoconf libtool pkg-config python-opengl python-imaging python-pyrex python-pyside.qtopengl idle-python2.7 qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 python-dev

Β  Β  Β sudo apt-get -y install build-essential autoconf libtool pkg-config python-opengl python-imaging python-pyrex python-pyside.qtopengl idle-python2.7 qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 python-dev

Β  Β  Β  sudo pip install airflow

Β  Β  Β sudo apt-get -y install python-setuptools python-dev build-essential

Β  Β  Β sudo easy_install pip

Β sudo pip install airflow

pip install pystan
apt install -y libmysqlclient-dev 

sudo -H pip install apache-airflow[all_dbs]

sudo -H pip install apache-airflow[devel]

pipΒ installΒ apache-airflow[all]

export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install apache-airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080


notice SQLlite is the default DB , it is not possible to run parralel on it. it is used to get started
can you mysql

scales out with mesos.

DAG - is just a containr script to connect all task, cant pass data betwean task via the dag, there is s spesific module for it. 

operator - what runs defacto

Airflow provides operators for many common tasks, including:

  • BashOperatorΒ – executes a bash command
  • PythonOperatorΒ – calls an arbitrary Python function
  • EmailOperatorΒ – sends an email
  • HTTPOperatorΒ – sends an HTTP request
  • MySqlOperator,Β SqliteOperator,Β PostgresOperator,Β MsSqlOperator,Β OracleOperator,Β JdbcOperator, etc. – executes a SQL command
  • SensorΒ – waits for a certain time, file, database row, S3 key, etc…

In addition to these basic building blocks, there are many more specific operators:Β DockerOperator,Β HiveOperator,Β S3FileTransferOperator,Β PrestoToMysqlOperator,Β SlackOperator… you get the idea!


Quick summary of the terms used by Airflow

TASK- a running operator…
HooksΒ are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig

Airflow poolsΒ can be used toΒ limit the execution parallelismΒ on arbitrary sets of tasks.Β 

TheΒ connectionΒ information to external systems is stored in the Airflow metadata database and managed in the UI


When using the CeleryExecutor, the celery queues that tasks are sent to can be specified.Β queueΒ is an attribute of BaseOperator,

XComsΒ let tasks exchange messages

VariablesΒ are a generic way to store and retrieve arbitrary content or settings as a simple key value store within Airflow.


Sometimes you need a workflow to branch, or only go down a certain path based on an arbitrary condition which is typically related to something that happened in an upstream task. One way to do this is by using theΒ BranchPythonOperator.


SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.


Service Level Agreements, or time by which a task or DAG should have succeeded, can be set at a task level as aΒ timedelta

Trigger Rules

Though the normal workflow behavior is to trigger tasks when all their directly upstream tasks have succeeded, Airflow allows for more complex dependency settings.

Jinja Templating

Airflow leverages the power ofΒ Jinja TemplatingΒ and this can be a powerful tool to use in combination with macros (see theΒ MacrosΒ section).


My first DAG:

confirm no syntax errors:

python ~/airflow/dags/

# print the list of active DAGs
airflow list_dags

# prints the list of tasks the "tutorial" dag_id
airflow list_tasks tutorial

# prints the hierarchy of tasks in the tutorial DAG
airflow list_tasks tutorial --tree

# command layout: command subcommand dag_id task_id date

# testing print_date
airflow test tutorial print_date 2015-06-01

# testing sleep
airflow test tutorial sleep 2015-06-01

Now remember what we did with templating earlier? See how this template gets rendered and executed by running this command:

# testing templated
airflow test tutorial templated 2015-06-01

setting the default interpreter of zeppelin for bootstrapping

When you bootstrap a new EMR zeppelin, once you open the notebook, you will be asked to save the default interpreter. in transient cluster you may want to set the default interpreter automatically.

To set the default interpreter, checkΒ /etc/zeppelin/conf/interpreter.jsonΒ and look for something like:

  "name": "spark",
  "class": "org.apache.zeppelin.spark.SparkInterpreter",
  "defaultInterpreter": true,
  "editor": {
    "language": "scala",
    "editOnDblClick": false
  "name": "pyspark",
  "class": "org.apache.zeppelin.spark.PySparkInterpreter",
  "defaultInterpreter": false,
  "editor": {
    "language": "python",
    "editOnDblClick": false

Now everything seems trivial. We just need to change theΒ defaultInterpreterΒ ofΒ sparkΒ toΒ false, andΒ defaultInterpreterΒ ofΒ pysparkΒ toΒ true.

And then restart the zeppelin

(sudo stop zeppelin; sudo start zeppelin).