Dask Executor

airflow.executors.dask_executor.DaskExecutor allows you to run Airflow tasks in a Dask Distributed cluster.

Note: we understand that Dask is important to a number of people, but to keep the Dask Executor in Airflow we need someone who actually uses Dask (and the Dask Executor in Airflow) to keep it in a healthy state. We are asking on our devlist whether someone is willing to maintain it, and we think it is a good idea to ask here as well.

Dask clusters can be run on a single machine or on remote networks. For complete details, consult the Distributed documentation.

To create a cluster, first start a Scheduler:
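For example, on the machine that will host the Scheduler (the default port shown here, 8786, is the standard Dask scheduler port):

```shell
# start the Dask scheduler; it listens on port 8786 by default
dask-scheduler
```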

Next start at least one Worker on any machine that can connect to the host:
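On each worker machine, point a Worker at the Scheduler's address (SCHEDULER_HOST below is a placeholder for the host running dask-scheduler):

```shell
# connect a worker to the scheduler started above
dask-worker tcp://SCHEDULER_HOST:8786
```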


Edit your airflow.cfg to set your executor to airflow.executors.dask_executor.DaskExecutor and provide the Dask Scheduler address in the [dask] section. For more information on setting the configuration, see Setting Configuration Options.
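A minimal airflow.cfg fragment for this setup might look as follows (the scheduler address is an example value; replace it with the address of your own Scheduler):

```ini
[core]
executor = airflow.executors.dask_executor.DaskExecutor

[dask]
# address of the running dask-scheduler process
cluster_address = 127.0.0.1:8786
```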

Please note:

  • Each Dask worker must be able to import Airflow and any dependencies you require.

  • Dask does not support queues. If an Airflow task was created with a queue, a warning will be raised but the task will be submitted to the cluster.

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

See the quickstart to get started.


Motivation¶

Distributed serves to complement the existing PyData analysis stack. In particular it meets the following needs:

  • Low latency: Each task suffers about 1ms of overhead. A small computation and network roundtrip can complete in less than 10ms.
  • Peer-to-peer data sharing: Workers communicate with each other to share data. This removes central bottlenecks for data transfer.
  • Complex Scheduling: Supports complex workflows (not just map/filter/reduce) which are necessary for sophisticated algorithms used in nd-arrays, machine learning, image processing, and statistics.
  • Pure Python: Built in Python using well-known technologies. This eases installation, improves efficiency (for Python users), and simplifies debugging.
  • Data Locality: Scheduling algorithms cleverly execute computations where data lives. This minimizes network traffic and improves efficiency.
  • Familiar APIs: Compatible with the concurrent.futures API in the Python standard library. Compatible with the dask API for parallel algorithms.
  • Easy Setup: As a pure Python package, distributed is pip installable and easy to set up on your own cluster.
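The "Familiar APIs" point can be illustrated with the standard library alone: distributed's Client implements the same submit/result pattern as the executors in concurrent.futures, so code written against the stdlib interface translates almost directly. A stdlib-only sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# The same submit/result pattern is exposed by dask.distributed's Client,
# which is compatible with the concurrent.futures interface.
with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(square, 4)  # returns a Future immediately
    print(future.result())           # blocks until the result is ready: 16
```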

Architecture¶

Dask.distributed is a centrally managed, distributed, dynamic task scheduler. The central dask-scheduler process coordinates the actions of several dask-worker processes spread across multiple machines and the concurrent requests of several clients.

The scheduler is asynchronous and event driven, simultaneously responding to requests for computation from multiple clients and tracking the progress of multiple workers. The event-driven and asynchronous nature makes it flexible to concurrently handle a variety of workloads coming from multiple users at the same time while also handling a fluid worker population with failures and additions. Workers communicate amongst each other for bulk data transfer over TCP.

Internally the scheduler tracks all work as a constantly changing directed acyclic graph of tasks. A task is a Python function operating on Python objects, which can be the results of other tasks. This graph of tasks grows as users submit more computations, fills out as workers complete tasks, and shrinks as users leave or become disinterested in previous results.


Users interact by connecting a local Python session to the scheduler and submitting work, either by individual calls to the simple interface client.submit(function, *args, **kwargs) or by using the large data collections and parallel algorithms of the parent dask library. The collections in the dask library like dask.array and dask.dataframe provide easy access to sophisticated algorithms and familiar APIs like NumPy and Pandas, while the simple client.submit interface provides users with custom control when they want to break out of canned "big data" abstractions and submit fully custom workloads.
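A minimal sketch of the client.submit interface, assuming dask.distributed is installed; the in-process cluster here is only to keep the example self-contained, and in practice you would pass the address of a running dask-scheduler to Client instead:

```python
from dask.distributed import Client

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# start an in-process scheduler and worker for the example
client = Client(processes=False, n_workers=1, threads_per_worker=1)

a = client.submit(inc, 1)         # returns a Future immediately
b = client.submit(inc, 2)
total = client.submit(add, a, b)  # Futures can be passed as arguments,
                                  # extending the task graph on the scheduler
result = total.result()           # blocks until the final task completes: 5
client.close()
```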

Contents¶


Getting Started

Build Understanding


Additional Features


Developer Documentation