Airflow – Scale out with Redis and Celery

The setup uses an Amazon Elastic Compute Cloud (EC2) instance to host the Airflow components, an Amazon Relational Database Service (RDS) Postgres instance to host the Airflow metadata database, and an Amazon Simple Storage Service (S3) bucket to store the Amazon SageMaker model artifacts, outputs, and the Airflow DAG with the ML workflow.

Introduction

This post uses Redis and Celery to scale out Airflow. Redis is a simple caching server and scales out quite well. It can be made resilient by deploying it as a cluster.

In my previous post, the Airflow scale-out was done using Celery with RabbitMQ as the message broker. On the whole, I found maintaining RabbitMQ a bit fiddly unless you happen to be a RabbitMQ expert.

Redis seems to be a better fit: it is a lot easier to deploy and maintain than the various steps needed to deploy a RabbitMQ broker. In a nutshell, I like it more than RabbitMQ!

To create an infrastructure like this, we need to do the following steps:

  1. Install & Configure Redis server on a separate host – 1 server
  2. Install & Configure Airflow with Redis and Celery Executor support – 4 servers
  3. Start airflow scheduler & webserver
  4. Start airflow workers
  5. Start flower UI
  6. Run a DAG
  7. Watch it all happen!

Install & Configure Redis Server

We will install the Redis server on a separate machine; for the purposes of this blog entry, I have created an AWS EC2 instance (Red Hat 8) on which Redis is installed.

Note: For production, it is recommended you set up Redis as a cluster to add redundancy and eliminate single points of failure.

Download the source code from the following link
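For example, assuming you want the latest stable Redis tarball (URL is the standard redis.io download location):

# download the redis-stable source tarball
wget http://download.redis.io/redis-stable.tar.gz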

Execute the following command to install the required utilities to build the source code
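A minimal set of build tools on Red Hat 8 (package names assumed) would be:

# compiler and build utilities needed to compile Redis from source
sudo yum install -y gcc make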

Extract the redis source code using the following command
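Assuming the redis-stable.tar.gz tarball downloaded above:

# unpack the source into a redis-stable directory
tar xzf redis-stable.tar.gz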

Navigate to the redis-stable directory and execute the following commands one by one
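A typical build sequence (installing to the Redis default locations) looks like this:

cd redis-stable
# compile the redis binaries
make
# copy redis-server and redis-cli into /usr/local/bin
sudo make install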

Job done! – Redis is now ready to be run. Execute the following command
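For example, to start Redis so the Airflow hosts can reach it (the settings shown here are illustrative only; lock this down properly for anything beyond a test setup):

# run redis in the background and accept connections from other hosts
redis-server --bind 0.0.0.0 --protected-mode no --daemonize yes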

Install & Configure Airflow with Redis and Celery Executor

This part needs to be performed on all the Airflow servers in exactly the same way. You can have as many Airflow servers as you need; just make sure all of them have the same airflow.cfg, since all the configuration for every component is stored in that one file.

Step-2a – Install Airflow with Redis and Celery Support

Execute the following on your EC2 Instance

sudo yum install gcc python2-devel
sudo pip install setuptools -U
sudo pip install apache-airflow[celery,redis,s3,postgres,crypto,jdbc]==1.10.4
sudo pip install psycopg2
sudo pip install kombu==4.5.0
sudo pip uninstall marshmallow-sqlalchemy
sudo pip install marshmallow-sqlalchemy==0.17.1

Step-2b – Configure Airflow – Set Executor

Set the executor in airflow.cfg to CeleryExecutor
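In the [core] section of airflow.cfg this looks like:

executor = CeleryExecutor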

Step-2c – Configure Airflow – Metadata DB

All of the remaining configuration is done in airflow.cfg. Provide the metadata database details; see an example below.
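For example, pointing at the RDS Postgres instance (host, database name, user, and password are placeholders, not values from this post):

# [core] section of airflow.cfg
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@<rds-endpoint>:5432/airflow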

Step-2d – Configure Airflow – Celery configuration

Scroll down airflow.cfg to the section called [celery] and make the following modifications.

Set the Celery broker URL to point to redis-server as below
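For example (Redis host and database number are placeholders):

# [celery] section of airflow.cfg
broker_url = redis://<redis-server-host>:6379/0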

Set the Celery result backend DB. This is the same database which Airflow uses; Celery sends updates on Airflow tasks to it. Note the way the database is configured, which is slightly different from the way mentioned earlier in Step-2c.
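For example, reusing the placeholder credentials from Step-2c but with the db+ prefix that Celery expects:

# [celery] section of airflow.cfg
result_backend = db+postgresql://airflow_user:airflow_pass@<rds-endpoint>:5432/airflow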

The hard part is now out of the way and all the configuration is now done in airflow.cfg.

Start airflow scheduler & webserver

Start airflow webserver using the following command
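Assuming the default port:

airflow webserver -p 8080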

Start the airflow scheduler using the following command
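On the scheduler host:

airflow scheduler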

Start airflow workers

The idea here is that the airflow worker should be executing tasks sent to it by the airflow scheduler. Also, each worker is associated with a queue. The name of the queue is given when the worker is started. See below
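For example, to start a worker servicing the queue cloudwalker_q1 used by the DAG later in this post:

airflow worker -q cloudwalker_q1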

We are in business: our environment is up and running and now we can start our remaining worker nodes. Start the remaining two worker nodes with the following command. Essentially we want two worker nodes servicing the queue cloudwalker_q2, so that just in case one worker node goes down Airflow will still be able to service the queue.

airflow worker -q cloudwalker_q2

IMPORTANT: It is possible to serve a single queue with multiple workers. This introduces resilience in workflow orchestration: if one of the Airflow workers becomes unavailable, the tasks are then executed by the other workers servicing the queue.

Start Flower UI
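Flower can be started from any of the Airflow servers, for example:

airflow flower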

This will start the flower UI for monitoring all the tasks. You can access the flower UI using the following URL – http://<server-name>:5555

Start a DAG

Now let's execute a DAG. Because we are now using workers and queues, additional queue information needs to be added to the DAG code. This is done at the task level. See below.

# Filename: hello_world2.py
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 9, 28),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('hello_world2',
          schedule_interval='0 0 * * *',
          catchup=False,
          default_args=default_args
          )

create_command = 'echo *************************************$(hostname)*************************************'

t1 = BashOperator(
    task_id='task_for_q1',
    bash_command=create_command,
    queue='cloudwalker_q1',
    dag=dag
)

t2 = BashOperator(
    task_id='task_for_q2',
    bash_command=create_command,
    queue='cloudwalker_q2',
    dag=dag
)

t2.set_upstream(t1)

Observe that in the code we have added an additional parameter called queue when calling the BashOperator.

Watch it happen!

Step-7a Go to the Airflow UI & execute the task

Step-7b Check the DAG

Step-7c Check the logs – task_for_q1

Step-7d Check the logs – task_for_q2

You now have a working Airflow which is scaled out with Celery and Redis. This brings us to the end of this blog post. If you have made it till here, well done! Hope this blog entry is useful!

Workshop prerequisites

This section will guide you through the prerequisites for the workshop. Please make sure to install the libraries before the workshop, as the conference WiFi can get quite slow when too many people are downloading and installing things at the same time.

Make sure to follow all the steps as detailed here, especially the 🐍 PyCon attendees section, as there are specific details for the PyCon setup that need to be done in advance.

Python 3.x

3.7 Preferred

We will be using Python. Installing all of Python’s packages individually can be a bit difficult, so we recommend using Anaconda, which provides a variety of useful packages/tools.

To download Anaconda, follow the link https://www.anaconda.com/download/ and select Python 3. Following the download, run the installer as per usual on your machine.

If you prefer not using Anaconda then this tutorial can help you with the installation and setup.

If you already have Python installed but not via Anaconda, do not worry. Make sure to have either venv or pipenv installed. Then follow the instructions to set up your virtual environment further down.

Git

Git is a version control software that records changes to a file or set of files. Git is especially helpful for software developers as it allows changes to be tracked (including who and when) when working on a project.

To download Git, go to the following link and choose the correct version for your operating system: https://git-scm.com/downloads.

Windows

Download the Git for Windows installer. Make sure to select “Use Git from the Windows command prompt”; this will ensure that Git is permanently added to your PATH.

Also make sure “Checkout Windows-style, commit Unix-style line endings” is selected and click on “Next”.

This will provide you with both Git and Git Bash. We will use the command line quite a lot during the workshop, so using Git Bash is a good option.

GitHub

GitHub is a web-based service for version control using Git. You will need to set up an account at https://github.com. Basic GitHub accounts are free and you can now also have private repositories.

Text Editors/IDEs

Text editors are tools with powerful features designed to optimize writing code. There are several text editors that you can choose from. Here are some we recommend:

  • VS code: this is your facilitator’s favourite 💜 and it is worth trying if you have not checked it yet

We suggest trying several editors before settling on one.

If you decide to go for VSCode make sure to also have the Python extension installed. This will make your life so much easier (and it comes with a lot of nifty features 😎).

Microsoft Azure

You will need to get an Azure account as we will be using this to deploy theAirflow instance.

Note

If you are doing this tutorial live at PyCon US then your facilitator will provide you with specific instructions to set up your Azure subscription. If you have not received these please let your facilitator know ASAP.

Follow this link to get an Azure free subscription. This will give you 150 dollars in credit so you can get started setting things up and experimenting with Azure and Airflow.

MySQL

MySQL is one of the most popular databases in use. We need MySQL to follow along with the tutorial, so make sure to install it beforehand.

Mac users

Warning

There are some known issues with MySQL on Mac, so we recommend using this approach to install and set MySQL up: https://gist.github.com/nrollr/3f57fc15ded7dddddcc4e82fe137b58e.

Also, note that you will need to make sure that OpenSSL is on your PATH, so add it accordingly. If using zsh:
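A minimal sketch, assuming OpenSSL was installed via Homebrew under /usr/local/opt/openssl (adjust the path to your install):

# append Homebrew's OpenSSL to PATH for zsh
echo 'export PATH="/usr/local/opt/openssl/bin:$PATH"' >> ~/.zshrc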

If using bash:
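Likewise, assuming the same Homebrew install location:

# append Homebrew's OpenSSL to PATH for bash
echo 'export PATH="/usr/local/opt/openssl/bin:$PATH"' >> ~/.bashrc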

Make sure to reload using source ~/.bashrc or source ~/.zshrc.

Troubleshooting

Later on, during the setup, you will be installing mysqlclient. If during the process you get compilation errors, try the following:

If you want to be safe, before installing the library we recommend you set the following env variables:
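A commonly used set of flags, again assuming a Homebrew OpenSSL install (paths are assumptions):

# tell the compiler and linker where OpenSSL lives so mysqlclient can build
export LDFLAGS="-L/usr/local/opt/openssl/lib"
export CPPFLAGS="-I/usr/local/opt/openssl/include"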

Windows users

Download and install MySQL from the official website https://dev.mysql.com/downloads/installer/ and execute it. For additional configuration and pre-requisites make sure to visit the official MySQL docs.

Linux users

You can install the Python and MySQL headers and libraries like so:

Debian/Ubuntu:
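For example (package names assumed for recent Debian/Ubuntu releases):

# headers and libraries needed to build mysqlclient
sudo apt-get install python3-dev default-libmysqlclient-dev build-essential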

Red Hat / CentOS:
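For example (package names assumed):

# equivalent headers and libraries on Red Hat / CentOS
sudo yum install python3-devel mysql-devel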

After installation you need to start the service with:
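On a systemd-based distribution (the service may be called mysqld or mysql depending on your install):

sudo systemctl start mysqld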

To ensure that the database launches after a reboot:
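Again with systemd (adjust the service name if needed):

sudo systemctl enable mysqld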

You should now be able to start the mysql shell through /usr/bin/mysql -u root -p; you will be asked for the password you set during installation.

Creating a virtual environment

You will need to create a virtual environment to make sure that you have the right packages and setup needed to follow along with the tutorial. Follow the instructions that best suit your installation.

Anaconda

If you are using Anaconda, first you will need to make a directory for the tutorial, for example mkdir airflow-tutorial. Once created, make sure to change into it using cd airflow-tutorial.

Next, make a copy of this environment.yaml and install the dependencies via conda env create -f environment.yml. Once all the dependencies are installed you can activate your environment through the following commands
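A sketch, assuming the environment defined in environment.yaml is named airflow-tutorial:

# newer conda versions
conda activate airflow-tutorial
# older conda versions
source activate airflow-tutorial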

To exit the environment you can use
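On recent conda versions this is:

conda deactivate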

pipenv

Create a directory for the tutorial, for example:
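Matching the directory name used below:

mkdir airflow-tutorial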

and change your working directory to this newly created one: cd airflow-tutorial.

Once there, make a copy of this Pipfile in your new directory and install via pipenv install. This will install the dependencies you need. This might take a while, so you can make yourself a brew in the meantime.

Once all the dependencies are installed you can run pipenv shell, which will start a session with the correct virtual environment activated. To exit the shell session, use exit.

virtualenv

Create a directory for the tutorial, for example:
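Matching the directory name used below:

mkdir airflow-tutorial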

and change directories into it (cd airflow-tutorial). Now you need to run venv
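Based on the env/airflow folder mentioned below:

# create a virtual environment in env/airflow
python3 -m venv env/airflow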

this will create a virtual Python environment in the env/airflow folder. Before installing the required packages you need to activate your virtual environment:
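For example, on Linux/macOS:

source env/airflow/bin/activate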

Make a copy of this requirements file in your new directory. Now you can install the packages via pip: pip install -r requirements.txt

To leave the virtual environment run deactivate

Twitter and Twitter developer account

This tutorial uses the Twitter API for some examples and to build some of the pipelines included.

Please make sure to follow the next steps to get you all set up.

  1. Create an account at https://twitter.com/.

  2. Next, you will need to apply for a developer account, head to https://developer.twitter.com/en/apply.

    You will need to provide detailed information about what you want to use the API for. Make sure to complete all the steps and confirm your email address so that you can be notified about the status of your application.

    Warning

    Before completing the application, read the 🐍 PyCon attendees section below ⬇️ (Twitter developer app).

  3. Once your application has been approved you will need to go to https://developer.twitter.com/en/apps and log in with your details (they should be the same as your Twitter account ones).

  4. On your app dashboard click on the create an app button

    Make sure to give it a descriptive name, something like airflow-tutorial or similar.

  5. Once you complete the details and create your new app you should be able to access it via the main app dashboard. Click on the details button next to the app name and head over to permissions. We only need read permissions for the tutorial, so these should look something like this

  6. Now if you click on Keys and tokens you will be able to see your API key, API secret, Access token, and Access secret

    They are only valid for the permissions you specified before. Keep a record of these in a safe place as we will need them for the Airflow pipelines.

Docker

We are going to use Docker for some bits of the tutorial (this will make it easier to have a local Airflow instance).

Follow the instructions at https://docs.docker.com/v17.12/install/ and make sure to read the pre-requisites quite carefully before starting the installation.

🐍 PyCon attendees

Twitter developer app

The Twitter team will be expediting your applications to make sure you are all set up for the day 😎.

When filling in your application make sure to add the following details (as written here) to make sure this is processed.

In the “What are you planning to use the developer account for?” section:

Azure Pass account

As a PyCon attendee, you will be issued an Azure Pass worth 200 dollars with 90 days validity. You will not need to add credit card details to activate it, but you will need to follow this process to redeem your credits.

1. Send an email to your facilitator at trallard@bitsandchips.me with the subject line AirflowPyCon-AzurePass; they will send you an email with a unique code to redeem. Please do not share it with anyone, as this is a single-use pass and once activated it will be invalid.

2. Go to this site to redeem your pass. We recommend doing this in a private/incognito window. You can then click start and attach your new pass to your existing account.

If you see the following error (see image), you can go to this site to register the email and proceed.

4. Confirm your email address. You will then be asked to add the promo code that you were sent by your instructor. Do not close or refresh the window until you have received a confirmation that this has been successful.

  5. Activate your subscription: click on the activate button and fill in the personal details.

Again once completed, do not refresh the window until you see this image

At this point, your subscription will be ready. Click on Get started to go to your Azure portal.