Landscape Of Data Orchestration Tools
What are the categories? What tools are available? How do they compare to each other?
Hey reader, welcome to the free edition of my weekly newsletter. I write about ML concepts, how to build ML products, and how to thrive at work. You can learn more about me here. Feel free to send me your questions and I'm happy to offer my thoughts. Subscribe to this newsletter to receive it in your inbox every week.
A data orchestration tool manages the transformation of data from one form to another. It does this by stitching tasks together into a network structured as a Directed Acyclic Graph (DAG). Each task in the DAG consumes an input and produces an output. I discussed this in my last post.
The data orchestration tool controls how the tasks are executed. Its key responsibilities are the following (a toy sketch of the core DAG idea follows the list):
Scheduling the tasks
Starting the tasks
Saving the states
Monitoring the workflows
Generating alerts
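To make the DAG idea concrete, here is a minimal, tool-agnostic sketch in Python. The task names and the dependency mapping are made up for illustration; real orchestration tools add scheduling, retries, state, and monitoring on top of this core idea.

```python
# Toy illustration of a DAG: each key is a task, each value lists its
# upstream dependencies. A task runs only after its dependencies finish.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def extract():
    return [1, 2, 3]

def transform(rows):
    return [r * 10 for r in rows]

def load(rows):
    print(f"loaded {len(rows)} rows")

dag = {"extract": [], "transform": ["extract"], "load": ["transform"]}

results = {}
for task_name in TopologicalSorter(dag).static_order():
    if task_name == "extract":
        results["extract"] = extract()
    elif task_name == "transform":
        results["transform"] = transform(results["extract"])
    elif task_name == "load":
        load(results["transform"])
```

An orchestration tool replaces this hand-rolled loop: it schedules the runs, retries failures, persists task state, and alerts you when something breaks.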
There are many data orchestration tools available. Here are a few key characteristics to keep in mind when comparing them:
Operator-native tools: These tools come with predefined task templates (operators) that you stitch together to build your DAG.
Container-native tools: In this category, each task runs as a container, and the tasks are orchestrated on Kubernetes. This offers a lot of flexibility in what a task can do. Some container-native tools are general purpose, while others are built specifically for data-related tasks.
Open source tools: You install and manage these tools yourself. The strength of a tool's developer community largely determines how widely it gets adopted, and you'll need a good engineering team to run these tools in production. Open source tools exist in both the operator-native and container-native categories.
Managed solutions: A company hosts and operates the tool for you at a price. When a tool becomes popular enough, a company often forms around it to offer a managed version. It's the difference between hosting your own server and using AWS.
General purpose workflow tools: These tools are not specifically built for data-related tasks, although they can certainly do the job. The advantage is that you can orchestrate any kind of task, not just data-related ones.
What does the landscape look like?
Apache Airflow
Open source tool created by Airbnb. It's the most popular tool for data orchestration, and thanks to its vast developer community you can tap into the collective knowledge of everyone working on it. AWS and GCP provide managed instances of Airflow, so it's even easier for their customers to use it.
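Here's a minimal sketch of what an Airflow DAG looks like, assuming a recent Airflow 2.x release and the TaskFlow API; the DAG name, schedule, and task bodies are placeholders for illustration.

```python
# A toy Airflow DAG (TaskFlow API). Airflow picks this file up from its DAGs
# folder, schedules a daily run, and tracks each task's state.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2022, 1, 1), catchup=False)
def daily_etl():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def transform(rows):
        return [r * 10 for r in rows]

    @task
    def load(rows):
        print(f"loading {len(rows)} rows")

    load(transform(extract()))

daily_etl()
```

The same DAG could also be built from prebuilt operators such as BashOperator or PythonOperator, which is the operator-native style described above.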
Dagster
Open source tool created by Elementl. It's relatively new, but its creators aim to address a few important gaps in Airflow, such as:
continuous integration and delivery (CI/CD)
automated testing
scaling up and down with the size of the org
monitoring and observability
You can find a more detailed comparison here. There are many early enthusiasts of Dagster, but it remains to be seen whether it will gain widespread developer adoption.
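For a flavor of the programming model, here is a minimal sketch using Dagster's op/job API; the op names and logic are placeholders.

```python
# A toy Dagster job: ops are the individual tasks, and the job wires them
# into a DAG by composing their calls.
from dagster import job, op

@op
def extract():
    return [1, 2, 3]

@op
def transform(rows):
    return [r * 10 for r in rows]

@op
def load(rows):
    print(f"loading {len(rows)} rows")

@job
def etl_job():
    load(transform(extract()))
```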
Metaflow
Open source tool created by Netflix. It's built to address challenges around scalability and version control. You can design your workflow, run it at scale, and deploy it to production.
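A minimal sketch of a Metaflow flow; the step names and the artifact are placeholders. Each @step runs as a separate task, and Metaflow versions the data (self.rows) passed between steps.

```python
# A toy Metaflow flow. Run locally with `python <this_file>.py run`; the same
# flow can be scaled out to the cloud without changing the code.
from metaflow import FlowSpec, step

class ETLFlow(FlowSpec):

    @step
    def start(self):
        self.rows = [1, 2, 3]
        self.next(self.transform)

    @step
    def transform(self):
        self.rows = [r * 10 for r in self.rows]
        self.next(self.end)

    @step
    def end(self):
        print(f"loaded {len(self.rows)} rows")

if __name__ == "__main__":
    ETLFlow()
```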
Luigi
Open source tool created by Spotify. It was an early pioneer in data orchestration, but development seems to have slowed down; features are not being released as rapidly as they are for Airflow.
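Here's a minimal sketch of Luigi's task model; the file paths are placeholders. Each task declares its dependencies via requires() and its completion marker via output().

```python
# A toy Luigi pipeline: Transform depends on Extract, and each task writes a
# target file that marks it as complete.
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("extract.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("1,2,3")

class Transform(luigi.Task):
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget("transform.txt")

    def run(self):
        with self.input().open() as f:
            rows = [int(x) * 10 for x in f.read().split(",")]
        with self.output().open("w") as f:
            f.write(",".join(str(r) for r in rows))

if __name__ == "__main__":
    luigi.build([Transform()], local_scheduler=True)
```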
Prefect
Prefect is a company that offers an open source tool and runs a freemium business model: the core workflow tool is free, and they also sell a managed cloud solution called Prefect Cloud for data workflow management.
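A minimal sketch, assuming the Prefect 2.x flow/task decorators; the names are placeholders.

```python
# A toy Prefect flow: tasks are decorated functions, and the flow composes
# them into a pipeline that Prefect tracks and can schedule.
from prefect import flow, task

@task
def extract():
    return [1, 2, 3]

@task
def transform(rows):
    return [r * 10 for r in rows]

@task
def load(rows):
    print(f"loading {len(rows)} rows")

@flow
def etl_flow():
    load(transform(extract()))

if __name__ == "__main__":
    etl_flow()
```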
Flyte
Open source tool created by Lyft. It's a container-native, Kubernetes-native tool built for data-related tasks. Each task's code and libraries are packaged into a container, which gives Flyte its most significant advantage: isolation of environments and dependencies.
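A minimal sketch using flytekit; the task names and types are placeholders. Flyte requires typed signatures, and each task can run in its own container image.

```python
# A toy Flyte workflow: typed tasks composed into a workflow. Inside a
# workflow, task inputs are passed as keyword arguments.
from typing import List

from flytekit import task, workflow

@task
def extract() -> List[int]:
    return [1, 2, 3]

@task
def transform(rows: List[int]) -> List[int]:
    return [r * 10 for r in rows]

@workflow
def etl_workflow() -> List[int]:
    return transform(rows=extract())
```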
Union.ai
It's a company providing a fully managed solution powered by Flyte. It's run by the people who are the core developers behind Flyte. They recently came out of stealth.
Astronomer
It's a company that provides a fully managed solution called Astro that's powered by Airflow. The company is run by core developers behind Airflow. If you want to use Airflow but want someone else to manage it, Astronomer is a great option. It's meant for mid-sized to large companies.
Google Cloud Composer
This is a hosted solution for Airflow built by Google. If you're in the GCP ecosystem and want to use Airflow, it could be a great option as opposed to deploying and managing Airflow yourself.
Argo Workflows
It's an open source container-native tool to orchestrate jobs on Kubernetes. It's not specifically built for data related tasks, although it can certainly do the job.
AWS Step Functions
It's a managed workflow service from AWS that can also be used as a data orchestration tool. It's not specifically meant for data tasks, but if you're in the AWS ecosystem, it's a readily available option.
Tekton
It's a Kubernetes-native open source tool for creating CI/CD systems. It's not specifically built for data-related tasks, although you can certainly use it for them.
Kubeflow Pipelines
This tool doesn't exactly fall under the category of data orchestration, but I wanted to mention it since you'll come across it frequently. Kubeflow Pipelines is built for ML workflows and provides features for orchestration, experimentation, and model management. You can use it for data orchestration, but that's not its focus; ML workflow management is a separate topic.
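For a sense of the programming model, here is a minimal sketch assuming the Kubeflow Pipelines v2 Python SDK (kfp); the component and pipeline names are placeholders.

```python
# A toy KFP pipeline: each component runs as a containerized step, and the
# pipeline compiles to a spec that can be submitted to a KFP cluster.
from kfp import compiler, dsl

@dsl.component
def add(a: int, b: int) -> int:
    return a + b

@dsl.pipeline(name="toy-pipeline")
def add_pipeline(x: int = 1, y: int = 2):
    first = add(a=x, b=y)
    add(a=first.output, b=10)

if __name__ == "__main__":
    compiler.Compiler().compile(add_pipeline, "pipeline.yaml")
```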
Kedro
This tool also doesn't fall under the category of data orchestration, but I wanted to mention it since you'll come across it frequently. It's an open source tool for creating modular data science code, and it's great for data scientists to quickly prototype pipelines and experiment with them. Kedro is not a workflow scheduler like Airflow; to deploy to production, you'll have to pair it with a workflow management tool such as Airflow.
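Here's a minimal sketch of a Kedro pipeline; the function and dataset names are placeholders. Nodes are plain Python functions wired together by named datasets, which is what keeps the code modular and easy to test.

```python
# A toy Kedro pipeline: each node maps named input datasets to named output
# datasets, and Kedro resolves the execution order from those names.
# Running it requires a Kedro project with a data catalog.
from kedro.pipeline import Pipeline, node

def extract():
    return [1, 2, 3]

def transform(rows):
    return [r * 10 for r in rows]

etl_pipeline = Pipeline(
    [
        node(extract, inputs=None, outputs="raw_rows"),
        node(transform, inputs="raw_rows", outputs="transformed_rows"),
    ]
)
```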
Where to go from here?
The goal of this post is to give you an overview of the available tools and the key characteristics of data orchestration tools, so you can pick the right tool for the task at hand.
Two new episodes on the Infinite ML pod
You can listen and subscribe to the podcast on:
Apple Podcasts
Spotify
Google Podcasts
We had two amazing guests on the podcast last week:
Richad Nieves-Becker: He talks about doing academic vs business work in data science, why sales skills are important for data scientists, how to define tractable problems, the advantage of modular products, the rise of MLOps, data versioning, and the monetization of machine learning models.
Tushar Gupta: He talks about the step-by-step process of getting a tech book published, identifying topics for books, the process of writing, and trends in machine learning books.
Job Opportunities in AI
Check out this job board for the latest opportunities in AI. It features a list of open roles in Machine Learning, Data Science, Computer Vision, and NLP at startups and big tech.
How would you rate this week's newsletter?
You can rate this newsletter to let me know what you think. Your feedback will help make it better.