Orchestrating Kedro Pipelines with Apache Airflow and Astronomer

By Jo Stichbury, Technical Writer & Yetunde Dada, Product Manager at QuantumBlack

Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. Its focus is on authoring code, not on orchestrating, scheduling, or monitoring pipeline runs. We emphasise infrastructure independence, which is crucial for consultancies such as QuantumBlack, where Kedro was born.

Apache Airflow is an extremely popular open-source workflow management platform. Workflows in Airflow are modelled and organised as DAGs (directed acyclic graphs), making it a suitable engine to orchestrate and execute a pipeline authored with Kedro. Astronomer is a managed Airflow platform that makes it easy to run, monitor, and scale Airflow deployments in its cloud or your own, allowing users to spin up a production Airflow cluster with ease; its source code is made available for the benefit of customers.


Kedro is not an orchestrator. It aims to stay lean and unopinionated about where and how your pipelines are run.

You can deploy your Kedro projects virtually anywhere with minimal effort, as long as you can run Python. Our users have the freedom to choose their own deployment targets, and the future of deploying Kedro pipelines lies in designing the deployment process around a great developer experience.


One of the benefits of being an open-source community is that we can explore partnerships with other, like-minded frameworks and technologies. We are particularly excited to work with the Astronomer team, which helps organisations adopt Apache Airflow, the leading open-source data workflow orchestration platform.

Since Airflow models workflows as DAGs, a Kedro pipeline translates naturally into an Airflow DAG. To keep that workflow seamless, we are pleased to unveil the latest version of the Kedro-Airflow plugin, which simplifies deployment of a Kedro project on Airflow.
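To see what this looks like in practice, the plugin exposes a `kedro airflow create` command that generates an Airflow DAG file from your project, which you can then copy into your Airflow deployment. Below is a simplified, hypothetical sketch of the pattern such a generated DAG follows; the project path and node names are placeholders, and the plugin’s actual output may differ in detail.

```python
# A simplified sketch of the pattern a kedro-airflow generated DAG follows.
# The project path and node names ("split_data", "train_model") are
# hypothetical placeholders for your own Kedro project.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def run_kedro_node(node_name: str) -> None:
    """Run a single Kedro node as its own Airflow task."""
    project_path = "/usr/local/airflow/my-kedro-project"  # hypothetical path
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path) as session:
        session.run(node_names=[node_name])


with DAG(
    dag_id="my_kedro_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    split_data = PythonOperator(
        task_id="split_data",
        python_callable=run_kedro_node,
        op_kwargs={"node_name": "split_data"},
    )
    train_model = PythonOperator(
        task_id="train_model",
        python_callable=run_kedro_node,
        op_kwargs={"node_name": "train_model"},
    )
    split_data >> train_model  # mirror the Kedro pipeline's dependencies
```

Each Kedro node becomes an Airflow task, so Airflow can schedule, retry, and monitor the pipeline at node granularity.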

Our work with Astronomer provides a simple way for our users to deploy their pipelines. We would like to continue this work and make the process even smoother, eventually achieving a “one-click deployment” workflow for Kedro pipelines on Airflow.

We recently spoke to Pete DeJoy of Astronomer about the partnership and the future of Airflow; we have edited the conversation for length and clarity.

Pete DeJoy, you’re a Product Manager at Astronomer. Tell us a little about yourself!


I’m one of the founding team members at Astronomer, where we’ve built a company around the open source orchestration framework Apache Airflow. I’ve done many things here through the years, but have spent most of my energy working on our product as it has developed from an idea on a whiteboard to a high-scale system supporting thousands of users.

What prompted the creation of Airflow 2.0? And what does the success of this version of Airflow look like?


Airflow has evolved quite a lot since its inception in 2014; it now has over 20,000 stars on GitHub, 600,000 downloads per month, and tens of thousands of users worldwide. Airflow 1.x solved a lot of first-order problems for developers, but widespread adoption brought an uptick in enterprise requirements, along with increased pressure to improve the developer experience. Airflow 2.0 meets those needs with a handful of much-anticipated features. These include:

  • A highly available, horizontally scalable scheduler
  • An upgraded, stable REST API (see the sketch after this list)
  • Decoupled workflow integrations (called “providers” in Airflow), shipped as independently versioned and maintained Python packages, and much more
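To make the second item concrete, here is a minimal sketch of triggering a DAG run through the stable REST API introduced in Airflow 2.0. The host, credentials, and DAG id are hypothetical placeholders, and basic authentication is assumed to be enabled in the Airflow configuration.

```python
# A minimal sketch of triggering a DAG run via Airflow 2.0's stable REST API.
# The host, credentials, and dag_id are hypothetical placeholders.
import requests

AIRFLOW_HOST = "http://localhost:8080"  # assumed local Airflow webserver
DAG_ID = "my_kedro_pipeline"

response = requests.post(
    f"{AIRFLOW_HOST}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),  # assumes the basic-auth backend is enabled
    json={"conf": {}},        # optional run-time configuration for the DAG
)
response.raise_for_status()
print(response.json()["dag_run_id"])  # the id assigned to the new run
```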

We see 2.0 as a major milestone for the project; not only does it significantly improve the scalability of Airflow, but it also sets a foundation upon which we can continuously build new features.

How did you find out about Kedro? When did you realise it was compatible with Airflow for users?


I had chatted with a few data scientists who were using Kedro to author their pipelines and looking for a good way to deploy those pipelines to Airflow. Kedro does an outstanding job of allowing data scientists to apply good software engineering principles to their code and make it modular, but Kedro pipelines need a separate scheduling and execution environment to run at scale. Given this need, there was a natural bond between Kedro pipelines and Airflow: we wanted to do everything we could to build a great developer experience at the intersection of the two tools.
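As a brief illustration of that modularity, here is a minimal, hypothetical Kedro pipeline. Each node is a plain Python function with named inputs and outputs, which is exactly the structure that maps cleanly onto an external scheduler like Airflow.

```python
# A minimal, hypothetical Kedro pipeline: each node is a plain Python
# function with named inputs and outputs, so the dependency structure
# is explicit and easy to hand off to an external scheduler.
from kedro.pipeline import Pipeline, node


def split_data(raw_data):
    """Keep the first half of the data as a training set (placeholder logic)."""
    return raw_data[: len(raw_data) // 2]


def train_model(training_data):
    """'Train' a trivial model (placeholder logic)."""
    return {"mean": sum(training_data) / len(training_data)}


def create_pipeline() -> Pipeline:
    # "raw_data", "training_data" and "model" are dataset names that
    # Kedro resolves through its data catalog.
    return Pipeline(
        [
            node(split_data, inputs="raw_data", outputs="training_data"),
            node(train_model, inputs="training_data", outputs="model"),
        ]
    )
```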

Where do you think Kedro-Airflow could go, in terms of future development?


Airflow 2.0 extends and upgrades the Airflow REST API, putting it on a robust footing for the coming years. As the API develops, there will be new opportunities for abstraction layers that assist with DAG authoring and deployment, leading to a richer plugin ecosystem. There will also be further opportunities to integrate the kedro-airflow package with the Airflow API for a great developer experience.

What is the future of Airflow?



As we look towards Airflow 3.0 and beyond, building on developer love and trust remains essential. But it won’t stop there. As data orchestration becomes critical to a growing number of business units, we want Airflow to become a medium for making data engineering more approachable. We seek to democratise access so that product owners and data scientists alike can leverage Airflow’s distributed execution and scheduling power without being experts in Python or Kubernetes. Empowering users to author and deploy data pipelines from a framework of their choice will become increasingly important in that journey.

What is the future of workflow orchestration technologies?


Airflow’s inception kicked off a “data pipelines as code” movement that changed the way enterprises thought about workflow orchestration. For many years, job scheduling was handled by a combination of legacy drag-and-drop frameworks and complex networks of cron jobs. As we transitioned into the “big data” era and companies began building dedicated teams to operationalise their siloed data, the need for additional flexibility, control, and governance became apparent.

When Maxime Beauchemin and the folks at Airbnb built and open sourced Airflow with flexible, codified data pipelines as a first-class feature, they propelled code-driven orchestration into the spotlight. Airflow solved many first-order problems for data engineers, which explains its explosive adoption. But with that early adoption came some pitfalls; since Airflow is highly configurable by design, users began applying it to use cases it was not necessarily designed for. This imposed evolutionary stress on the project, pushing the community to add additional configuration options to “mould” Airflow to various use cases.
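To illustrate what “data pipelines as code” means in practice, compare a crontab entry such as `0 2 * * * /opt/etl/run_etl.sh` with the equivalent Airflow DAG sketched below (task names and commands are hypothetical): the schedule, dependencies, and retry policy all live in version-controlled Python.

```python
# A sketch of "data pipelines as code": a nightly job's schedule,
# dependencies, and retry policy expressed in version-controlled Python
# rather than a crontab entry. Task names and commands are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",  # the old cron schedule, now kept in code
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> transform >> load  # explicit, reviewable dependencies
```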

While the added configuration options helped Airflow accommodate these additional use cases, they introduced a new class of user needs. Data platform owners and administrators now need a way to deliver standard patterns to their pipeline authors to mitigate business risk. Likewise, pipeline authors need additional guardrails to be sure they don’t “use Airflow wrong”. Finally, engineers from a Python background now need to learn how to operationalise big-data infrastructure for stable and reliable orchestration at scale.


We see the future of workflow orchestration technology accommodating these categorical changes in user needs. If the journey thus far has been “The Rise of the Data Engineer”, we see the future as “The Democratisation of Data Engineering”. All users, from data scientists to data platform owners, will have access to powerful, distributed, flexible data pipeline orchestration that integrates with the authoring tools they know and love, with guardrails that accommodate specific usage patterns and keep folks from straying off the happy path.


You can find out more about the Kedro-Airflow plugin in the Kedro documentation and check out the GitHub repository too. This article was edited by Jo Stichbury (Technical Writer) and Yetunde Dada (Product Manager), with input from Ivan Danov (Tech Lead at Kedro) and Lim Hoang (Senior Software Engineer at Kedro).



Related: Astronomer releases a major upgrade to its enterprise-ready Apache Airflow platform

CINCINNATI--(BUSINESS WIRE)--Astronomer has released a major upgrade to its enterprise-ready Apache Airflow platform, making it easier to get Airflow running in minutes on Kubernetes. The latest release allows users to spin up multiple Airflow clusters anywhere Kubernetes runs, including public clouds (AWS, GCP, Azure) and on-prem private clouds. It is designed to simplify the process of running and monitoring data workflows and clusters at scale, with high-availability and maximum security.

While the latest release scales up to organizations with many workflows and complex requirements, Astronomer scales down to serve organizations of all sizes, including technology startups, and can run on a single-node Kubernetes cluster. Airflow is being adopted for data workflow management at an incredible rate, but operationalizing Airflow at scale requires considerable effort to set up, monitor and maintain. Reliability is a constant concern when customers are running their own clusters. Astronomer is designed to greatly reduce the time and effort to get Airflow up and running and keep it running smoothly, even at large scale.

“Airflow is now being used by everyone from individual data scientists at small startups to huge data teams at the largest global companies,” said Ry Walker, CEO of Astronomer. “We are excited to work with organizations all over the world to help them automate their data workflows and refine valuable information from their data into all aspects of their business.”

Specific benefits of the latest Astronomer release include:

  • Multiple clouds and teams. Once Astronomer is installed, users are up and running with the ability to deploy multiple Airflow clusters. Users can create team workspaces that are isolated, with the ability to invite team members to collaborate.
  • High availability. Astronomer relies exclusively on Kubernetes to provide the most reliable Airflow service in the market.
  • Simple scalability. Astronomer’s integration with Kubernetes makes it easy to scale resources up or down. Future releases will target a “scale to zero” auto-scaling strategy, where organizations only pay for compute resources that are actively used, versus the common practice today of reserving a pool of resources for maximum load.
  • Powered by open source. Widely used, widely supported and constantly improving, Astronomer is built on open source, free from the obfuscation of black-box commercial alternatives.

More than just software

Along with its platform subscriptions that include Airflow user support, the company also offers Airflow professional services and training, as well as a growing library of guides, podcast episodes and blog posts. Recently, Astronomer also launched Data Engineering meetups in Denver, San Francisco and Cincinnati.

“Astronomer provides a turn-key, flexible, scalable, and affordable ETL solution to power our batch processing of billions of rows of data per day for our customers. In addition, their expert data engineers have trained our engineering team so we could be self-sufficient; they’ve been a great partner!” - Nic Zangre, VP Product, CaliberMind


About Astronomer


Astronomer is dedicated to helping Apache Airflow win in the marketplace, and is the first company to offer a commercial platform and support for Apache Airflow. The Astronomer platform is open source and fully hackable, and can be deployed to your Kubernetes cluster in minutes for maximum security and control. Astronomer also provides support, services and training to help your organization adopt an agile data engineering culture. Astronomer is trusted by Fortune 100 companies and startups around the world. For more information, visit www.astronomer.io.