Apache Airflow is a powerful open source tool to manage and execute workflows, expressed as directed acyclic graphs of tasks. It is both extensible and scalable, making it suitable for many different use cases and workloads.
Airflow provides many plug-and-play operators that are ready to execute your tasks on Google Cloud Platform, Amazon Web Services, Microsoft Azure and many other third-party services. This makes Airflow easy to apply to current infrastructure and extend to next-generation technologies. Note that to install Airflow with pip, you need to either downgrade pip to version 20.2.4 (pip install --upgrade pip==20.2.4) or, if you use pip 20.3 or later, add the --use-deprecated legacy-resolver option to your pip install command.
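As a sketch, the two pip installation paths look like this (assuming a plain pip-based install of the apache-airflow package; not needed for the Helm-based deployment covered below):

```shell
# Option 1: downgrade pip to 20.2.4 before installing Airflow
pip install --upgrade pip==20.2.4
pip install apache-airflow

# Option 2: keep pip 20.3+ but fall back to the legacy dependency resolver
pip install --use-deprecated legacy-resolver apache-airflow
```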
Airflow represents data pipelines as directed acyclic graphs (DAGs) of operations, where an edge represents a logical dependency between operations. Airflow also integrates tightly with Azure through the microsoft.azure provider package. All classes for this provider are in the airflow.providers.microsoft.azure Python package; package information and the changelog are available in the provider documentation. The provider includes an Azure Data Factory hook, which is the easiest way to interact with ADF from your Airflow DAG. This hook builds on the azure-mgmt-datafactory Python package; since that package is used under the hood, resources on interacting with ADF using Python can be helpful for determining parameter names and similar details.
Bitnami's Apache Airflow Helm chart makes it quick and easy to deploy Apache Airflow on Kubernetes. This chart gives you a preconfigured Apache Airflow deployment that is up-to-date and compliant with current security best practices. It is also highly customizable, allowing you to (for example) integrate your Apache Airflow deployment with external services or scale out the solution with more nodes after deployment.
This guide gets you started with Bitnami's Apache Airflow Helm chart on Microsoft Azure, showing you how to deploy Apache Airflow on Azure Kubernetes Service (AKS) and connect it with Azure Database for PostgreSQL and Azure Cache for Redis to create a scalable, cloud-based Apache Airflow deployment.
This guide assumes that:
- You have provisioned an Azure Kubernetes Service cluster.
- You have the kubectl CLI and the Helm v3.x package manager installed and configured to work with your Kubernetes cluster. Learn how to install kubectl and Helm v3.x.
- You have a domain name and the ability to configure a DNS record for that domain name.
- You have access to the psql PostgreSQL client, either installed locally or via a Docker container like the Bitnami PostgreSQL Docker image. Learn more about psql.
The first step is to create an Azure Database for PostgreSQL service, as follows:
- Log in to the Microsoft Azure portal.
- Navigate to the Azure Database for PostgreSQL service page using the left navigation bar or the search field. Click the 'Add' button to create a new service.
- Select the option for a 'single server'. Click 'Create'.
- On the service deployment page, enter a server name, administrator account username and administrator password. Modify the deployment location if required. Select the same deployment resource group as your AKS service. Click 'Review + create'.
- Review the details shown. Click 'Create' to proceed.
A new Azure Database for PostgreSQL service will be created. This process may take a few minutes to complete. Once the service has been created, it will appear within the selected resource group. Select the newly-created service to view its detail page. Note the server host name and administrator username, as you will need these to interact further with the service.
It is now necessary to make some changes to the service's default configuration, to enable easier integration with both the Bitnami Apache Airflow Helm chart and external PostgreSQL client tools. Follow the steps below:
- From the service detail page, navigate to the 'Settings -> Connection security' page.
- In the 'Firewall rules' section:
- Set the 'Allow access to Azure services' field to 'Yes'.
- Create a new firewall rule for the IP address of your psql client host/Docker host. This is a temporary rule only to enable you to connect to the database service and create a database and user account for Apache Airflow.
- Set the 'Enforce SSL connection' field to 'Disabled'.
- Click 'Save' to save the new configuration.
SSL access is disabled because, at the time of writing, the Bitnami Apache Airflow Helm chart does not support SSL connections to external PostgreSQL and Redis services.
You can now connect to the Azure Database for PostgreSQL service using the psql client and create a database and user for Apache Airflow.
Use the command below to initiate the connection, replacing the POSTGRES-HOST and POSTGRES-ADMIN-USER placeholders with the server name and administrator username obtained previously. When prompted for a password, enter the administrator password supplied at deployment-time.
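A minimal sketch of the connection command, assuming a locally installed psql client (note that Azure Database for PostgreSQL single server typically expects the username in the ADMIN-USER@SERVER-NAME form):

```shell
# Connect to the default postgres database as the administrator account
psql -h POSTGRES-HOST -U POSTGRES-ADMIN-USER -d postgres
```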
Use the commands below at the psql client command prompt to create a new database and user account for Apache Airflow. Replace the POSTGRES-AIRFLOW-PASSWORD placeholder with a unique password for the new user account. Note this password as you will need it in Step 3.
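For example, statements along these lines create the database and user account (the GRANT statement is an assumption about the privileges Airflow needs; adjust it to your security requirements):

```sql
CREATE DATABASE airflow;
CREATE USER airflow_user WITH ENCRYPTED PASSWORD 'POSTGRES-AIRFLOW-PASSWORD';
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow_user;
```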
Delete the temporary firewall rule.
Next, create an Azure Cache for Redis service, as follows:
- Log in to the Microsoft Azure portal (if you're not already logged in).
- Navigate to the Azure Cache for Redis service page using the left navigation bar or the search field. Click the 'Add' button to create a new service.
- On the service deployment page, enter a DNS name for the service and check the box to unblock port 6379. Modify the deployment location and pricing tier if required. Select the same deployment resource group as your AKS service and your Azure Database for PostgreSQL service. Click 'Create'.
The Azure Cache for Redis service will be created. This process may take a few minutes to complete. Once the service has been created, it will appear within the selected resource group. Select the newly-created service to view its detail page. Navigate to the 'Settings -> Access keys' section and note the primary access key, as you will need this in Step 3.
You can now go ahead and deploy Apache Airflow on the AKS cluster. Follow these steps:
Add the Bitnami charts repository to Helm:
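For example:

```shell
# Register the Bitnami repository and refresh the local chart index
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
```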
Deploy the Apache Airflow Helm chart using the command below. Replace the placeholders as explained below:
- Replace the DOMAIN placeholder with your domain name.
- Replace the POSTGRES-HOST placeholder with the host name of the Azure Database for PostgreSQL service (obtained in Step 1).
- Replace the POSTGRES-AIRFLOW-PASSWORD placeholder with the password assigned to the airflow_user user account (defined by you when creating the airflow database in Step 1).
- Replace the REDIS-HOST placeholder with the DNS name of the Azure Cache for Redis service (defined by you in Step 2 at deployment time).
- Replace the REDIS-KEY placeholder with the primary key for the Redis service (obtained in Step 2).
- Replace the AIRFLOW-PASSWORD placeholder with a unique password to access the Apache Airflow Web user interface.
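The command might look like the following sketch. The parameter names reflect the Bitnami Apache Airflow chart at the time of writing and may differ in newer chart versions, and the release name airflow is an example; check the chart documentation before running it. On Azure Database for PostgreSQL single server, the database username may also need to be given in the user@server-name form.

```shell
helm install airflow bitnami/airflow \
  --set airflow.auth.password=AIRFLOW-PASSWORD \
  --set airflow.loadExamples=true \
  --set airflow.baseUrl=http://DOMAIN:8080 \
  --set postgresql.enabled=false \
  --set externalDatabase.host=POSTGRES-HOST \
  --set externalDatabase.database=airflow \
  --set externalDatabase.user=airflow_user \
  --set externalDatabase.password=POSTGRES-AIRFLOW-PASSWORD \
  --set redis.enabled=false \
  --set externalRedis.host=REDIS-HOST \
  --set externalRedis.password=REDIS-KEY \
  --set service.type=LoadBalancer
```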
Here is a brief explanation of the parameters supplied to the chart:
- The postgresql.enabled and redis.enabled parameters, when set to false, ensure that the chart does not create its own PostgreSQL and Redis services and instead uses the external ones.
- The service.type parameter makes the Apache Airflow service available at a public load balancer IP address.
- The airflow.auth.password parameter defines the password for the Apache Airflow Web control panel.
- The airflow.loadExamples parameter installs some example DAGs. If you already have custom DAGs, you can set this parameter to false and install your custom DAGs from the file system, a GitHub repository or a ConfigMap, as described in the chart documentation.
Review the complete list of parameters in the chart documentation.
Wait for the deployment to complete and obtain the public load balancer IP address using the command below:
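For example, assuming the chart was released under the name airflow in the default namespace:

```shell
# Print the external IP assigned to the Airflow service's load balancer
kubectl get svc airflow \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```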
Update the DNS record for your domain name to point to the above load balancer IP address.
You should now be able to access the Apache Airflow Web control panel by browsing to http://DOMAIN:8080 and logging in with the username user and the password you set in the AIRFLOW-PASSWORD placeholder.
You can now proceed to use Apache Airflow to manage and execute your workflows.