Airflow Installation manual and workflow example

Airflow (Ubuntu manual):

It took me several attempts to get Airflow installed, so I list here the commands that worked for me.

Prerequisites

 

sudo apt-get update --fix-missing

sudo apt-get -y install build-essential autoconf libtool pkg-config python-opengl python-imaging python-pyrex python-pyside.qtopengl idle-python2.7 qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 python-dev

sudo apt-get -y install python-setuptools python-dev build-essential

sudo easy_install pip

sudo pip install airflow


pip install pystan
sudo apt install -y libmysqlclient-dev

sudo -H pip install apache-airflow[all_dbs]

sudo -H pip install apache-airflow[devel]

pip install apache-airflow[all]


export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install apache-airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

----------------------------

Notice: SQLite is the default DB. It is not possible to run tasks in parallel on it; it is only there to get started. You can switch the metadata DB to MySQL (or Postgres) for a real setup.

Airflow scales out with Celery or Mesos executors.

DAG - just a container script that connects all the tasks. You can't pass data between tasks via the DAG itself; there is a specific mechanism for that (XComs, covered below).

Operator - what actually runs, de facto. A minimal DAG sketch follows the operator list below.

Airflow provides operators for many common tasks, including:

  • BashOperator – executes a bash command
  • PythonOperator – calls an arbitrary Python function
  • EmailOperator – sends an email
  • SimpleHttpOperator – sends an HTTP request
  • MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. – executes a SQL command
  • Sensor – waits for a certain time, file, database row, S3 key, etc…

In addition to these basic building blocks, there are many more specific operators: DockerOperator, HiveOperator, S3FileTransferOperator, PrestoToMysqlOperator, SlackOperator… you get the idea!
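
To make the DAG/operator split concrete, here is a minimal sketch of a DAG file you would drop into ~/airflow/dags/. It assumes the Airflow 1.x-era import paths that match the CLI commands used in this post; the dag_id my_first_dag, the schedule and the greet function are placeholders, not anything Airflow requires.

# my_first_dag.py - minimal sketch: a DAG wiring a BashOperator and a PythonOperator
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2015, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('my_first_dag', default_args=default_args, schedule_interval='@daily')

# BashOperator - executes a bash command
print_date = BashOperator(task_id='print_date', bash_command='date', dag=dag)

def greet():
    print('hello from python')

# PythonOperator - calls an arbitrary Python function
greet_task = PythonOperator(task_id='greet', python_callable=greet, dag=dag)

# the DAG only wires tasks together; it does not move data between them (that is what XComs are for)
print_date >> greet_task

The sketches further down reuse this dag object and these two tasks instead of redefining a DAG each time.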

 

Quick summary of the terms used by Airflow

Task - an instantiated operator in a DAG; a specific run of a task is called a task instance.
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig.
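
As a hedged example of what a hook looks like in practice, the sketch below uses MySqlHook inside a Python callable; mysql_default is an assumed connection id that would have to exist under Admin -> Connections.

# sketch: using a hook inside a PythonOperator callable (Airflow 1.x import path)
from airflow.hooks.mysql_hook import MySqlHook

def fetch_rows():
    hook = MySqlHook(mysql_conn_id='mysql_default')   # assumed connection id, configured in the UI
    return hook.get_records('SELECT 1')               # runs the query using the stored credentials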

Airflow pools can be used to limit the execution parallelism on arbitrary sets of tasks. 
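
For illustration, a task is throttled through a pool simply by naming it; db_pool below is an assumed pool that would be created in the UI under Admin -> Pools, and the dag object comes from the DAG sketch above.

# sketch: limit concurrency by assigning the task to a pool (assumed name: db_pool)
throttled_query = BashOperator(
    task_id='throttled_query',
    bash_command='sleep 10',
    pool='db_pool',          # only as many of these run at once as the pool has slots
    dag=dag)                 # dag from the my_first_dag sketch above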

The connection information to external systems is stored in the Airflow metadata database and managed in the UI.

Queues

When using the CeleryExecutor, the Celery queues that tasks are sent to can be specified; queue is an attribute of BaseOperator.
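
A hedged sketch of routing a task to a dedicated Celery queue; gpu_queue is an assumed queue name, a worker has to listen on it (e.g. airflow worker -q gpu_queue), and the dag object is again the one from the sketch above.

# sketch: pin a task to a specific Celery queue (only matters with the CeleryExecutor)
train_model = BashOperator(
    task_id='train_model',
    bash_command='echo training',
    queue='gpu_queue',       # assumed queue name; start a worker with: airflow worker -q gpu_queue
    dag=dag)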

XComs ("cross-communications") let tasks exchange small messages.
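
A minimal sketch of an XCom push/pull pair, continuing the DAG sketch above; the key row_count and its value are made up for illustration.

# sketch: exchanging a small value between two tasks with XComs
def push_value(**context):
    # push a small value under an explicit key (a callable's return value is also auto-pushed as 'return_value')
    context['ti'].xcom_push(key='row_count', value=42)

def pull_value(**context):
    count = context['ti'].xcom_pull(task_ids='push_task', key='row_count')
    print('row_count =', count)

push_task = PythonOperator(task_id='push_task', python_callable=push_value,
                           provide_context=True, dag=dag)
pull_task = PythonOperator(task_id='pull_task', python_callable=pull_value,
                           provide_context=True, dag=dag)
push_task >> pull_task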

Variables are a generic way to store and retrieve arbitrary content or settings as a simple key value store within Airflow.
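
A short sketch of the Variable API; the keys environment and job_config are placeholders, and the same values can also be edited in the UI under Admin -> Variables.

# sketch: Variables as a simple key-value store
from airflow.models import Variable

Variable.set('environment', 'staging')                                        # store a value
env = Variable.get('environment')                                             # read it back
config = Variable.get('job_config', default_var={}, deserialize_json=True)    # JSON-typed variable with a default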

Branching

Sometimes you need a workflow to branch, or only go down a certain path based on an arbitrary condition which is typically related to something that happened in an upstream task. One way to do this is by using the BranchPythonOperator.
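
A hedged sketch of a branch, continuing the DAG above: the callable returns the task_id of the path to follow, and the weekday/weekend condition is purely for illustration.

# sketch: BranchPythonOperator picks exactly one downstream path; the other gets skipped
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator

def choose_path(**context):
    # any condition works; typically it looks at what an upstream task produced
    return 'weekday_path' if context['execution_date'].weekday() < 5 else 'weekend_path'

branch = BranchPythonOperator(task_id='branch', python_callable=choose_path,
                              provide_context=True, dag=dag)
weekday_path = DummyOperator(task_id='weekday_path', dag=dag)
weekend_path = DummyOperator(task_id='weekend_path', dag=dag)
branch >> [weekday_path, weekend_path]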

SubDAGs

SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.
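
A sketch of the factory-function pattern combined with a SubDagOperator; load_subdag and the load_part tasks are invented names, and by convention the subdag's dag_id has to be '<parent_dag_id>.<task_id>'.

# sketch: a function that returns a DAG, used as the body of a SubDagOperator
from airflow.operators.subdag_operator import SubDagOperator

def load_subdag(parent_dag_id, child_task_id, args):
    sub = DAG('%s.%s' % (parent_dag_id, child_task_id),   # required naming convention
              default_args=args, schedule_interval='@daily')
    for i in range(3):
        BashOperator(task_id='load_part_%d' % i, bash_command='echo part %d' % i, dag=sub)
    return sub

load_all = SubDagOperator(
    task_id='load_all',
    subdag=load_subdag('my_first_dag', 'load_all', default_args),
    dag=dag)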

SLAs

Service Level Agreements, or the time by which a task or DAG should have succeeded, can be set at the task level as a timedelta.
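
For example, attaching a one-hour SLA to a task is just a keyword argument (the duration here is arbitrary); misses are then listed in the UI under Browse -> SLA Misses.

# sketch: the task should have finished within 1 hour of the schedule period, otherwise an SLA miss is recorded
slow_step = BashOperator(
    task_id='slow_step',
    bash_command='sleep 30',
    sla=timedelta(hours=1),
    dag=dag)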

Trigger Rules

Though the normal workflow behavior is to trigger tasks when all their directly upstream tasks have succeeded, Airflow allows for more complex dependency settings.
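
As a sketch, a cleanup task that should run no matter how its upstream tasks ended can override the default all_success rule; the file path is made up, and print_date / greet_task come from the DAG sketch above.

# sketch: run cleanup once all upstream tasks are done, whether they succeeded or failed
cleanup = BashOperator(
    task_id='cleanup',
    bash_command='rm -f /tmp/my_first_dag_scratch',
    trigger_rule='all_done',     # default is 'all_success'; others include 'one_failed', 'one_success', 'all_failed'
    dag=dag)
print_date >> cleanup
greet_task >> cleanup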

Jinja Templating

Airflow leverages the power of Jinja Templating and this can be a powerful tool to use in combination with macros (see the Macros section).
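
A sketch of a templated command, close in spirit to the "templated" task from the official tutorial that gets tested further down; {{ ds }} and macros.ds_add are built-in template variables/macros, while params.my_param is just a value passed in for illustration.

# sketch: the bash_command is rendered by Jinja at run time
templated_command = """
echo "processing day {{ ds }}"
echo "7 days ago was {{ macros.ds_add(ds, -7) }}"
echo "my param is {{ params.my_param }}"
"""

templated = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'a value I passed in'},
    dag=dag)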

——

My first DAG:

https://airflow.apache.org/tutorial.html

confirm no syntax errors:

python ~/airflow/dags/tutorial.py

# print the list of active DAGs
airflow list_dags

# prints the list of tasks the "tutorial" dag_id
airflow list_tasks tutorial

# prints the hierarchy of tasks in the tutorial DAG
airflow list_tasks tutorial --tree


# command layout: command subcommand dag_id task_id date

# testing print_date
airflow test tutorial print_date 2015-06-01

# testing sleep
airflow test tutorial sleep 2015-06-01

Now remember what we did with templating earlier? See how this template gets rendered and executed by running this command:

# testing templated
airflow test tutorial templated 2015-06-01
