Report on the presentation of a PoC result around the orchestration of data workflows/jobs. Two different tools were evaluated: Cadence and Airflow.
The daily mood
It's already been 10 days of part-time ramp-up. I have a first opportunity to attend a peer's presentation on a different topic.
This time it is about the evaluation of two different solutions for workflow orchestration outside Kubernetes: Airflow and Cadence. This DataOps initiative aims to improve our architecture and governance around the scheduling of Data flows.
What is DataOps
DataOps is the application of Continuous Delivery (CD) and DevOps practices to Data analytics. In short, it is about automating all stages of the Data flow, including ETL, Data science, BI and Analytics.
In a world of Microservices, the service orchestrator becomes a major enabler of DataOps.
What is Airflow
Airflow started at Airbnb in 2014 as a solution to manage increasing workflow complexity. It is written in Python and was open-sourced from the beginning on Airbnb's public repository. It was then incubated by the Apache Software Foundation (ASF) in 2016 and reached Top-Level Project status in 2019. Compared to older workflow schedulers and orchestrators implementing Directed Acyclic Graphs (DAG), such as Oozie, Airflow stores each workflow definition in a single, comprehensive Python file instead of multiple complex configuration files. The strengths of the project are its ease of use, active committer base, large industry adoption and multitude of integrations.
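Conceptually, such a single-file DAG definition is just the full set of tasks plus their dependency edges, from which the scheduler resolves an execution order. A minimal sketch of that idea in plain stdlib Python (this is not the actual Airflow API; the task names are invented for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# One file declares every task and its upstream dependencies,
# mirroring how a single Airflow DAG file holds the whole graph.
dag = {
    "extract": set(),           # no upstream task
    "transform": {"extract"},   # runs after extract
    "load": {"transform"},
    "report": {"load"},
}

def run_order(graph):
    """Resolve a valid execution order for the DAG (upstream tasks first)."""
    return list(TopologicalSorter(graph).static_order())

print(run_order(dag))  # ['extract', 'transform', 'load', 'report']
```

In real Airflow the same information is expressed with operators and `>>` dependency arrows, but the underlying model is this graph.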
UPDATE 01.12.2020: In answer to Google Cloud Composer, the managed Airflow service on Google Cloud, AWS recently announced the general availability of Amazon Managed Workflows for Apache Airflow (MWAA).
What is Cadence
Cadence started at Uber as a solution to support their asynchronous microservice architecture. It is written in Go and was open-sourced under the MIT license in 2017. The strength of the project is to offer a stateful code platform, so that Go and Java developers can build fault-tolerant workflows and activities which react to events and return their internal state through queries. On the backend side, activities are orchestrated by a distributed engine and run in a scalable and resilient way.
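To picture what "stateful workflows with queries" means, here is a conceptual sketch in plain Python. This is emphatically not the Cadence API (its clients are Go and Java); every name is invented. It only illustrates the three ideas: durable internal state, activities retried under a policy, and read-only queries against that state.

```python
class WorkflowSketch:
    """Conceptual only, NOT the Cadence API. Illustrates durable state,
    retried activities, and queries exposing internal state."""

    def __init__(self):
        self._state = {"step": "pending", "attempts": 0}

    def run_activity(self, activity, max_attempts=3):
        """Run one activity under a simple retry policy."""
        for attempt in range(1, max_attempts + 1):
            self._state["attempts"] = attempt
            try:
                result = activity()
                self._state["step"] = "done"
                return result
            except Exception:
                if attempt == max_attempts:
                    self._state["step"] = "failed"
                    raise

    def query(self):
        """Cadence-style query: read internal state, no side effects."""
        return dict(self._state)


# Usage: a flaky activity that succeeds on its second attempt.
calls = {"n": 0}

def flaky_activity():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "ok"

wf = WorkflowSketch()
result = wf.run_activity(flaky_activity)
print(result, wf.query())  # ok {'step': 'done', 'attempts': 2}
```

In Cadence itself, the engine additionally persists every state transition, so a workflow survives worker crashes; that durability is what the sketch cannot show.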
UPDATE 01.12.2020: Temporal.io, a promising fork of Cadence, recently published its first production release. The differences between Cadence and Temporal are explained here.
Other interesting solutions
- Dagster.io - My personal favorite and bet for the future, because it offers the best interoperability!
- Argo Workflows - Abstracts native Kubernetes CronJobs via Custom Resource Definitions (CRDs)
- Netflix Conductor
PoC
I used to run lots of software evaluations, but from the other side of the field. In terms of context, this initiative is part of an architecture track driven by our company's roadmap, continuous improvement and strategic innovation. In terms of approach, we've been collecting the needs from different departments, sorting, analyzing and implementing them in order to draw our conclusions. Based on the outcome, product owners may or may not take action.
Content
The foundation of the talk was a general definition of DataOps orchestration and of how its workflows differ from the well-known Enterprise Service Bus (ESB) and Business Process Management (BPM) concepts.
We are basically discussing compute/execution only, nothing more and nothing less. The following core questions are addressed:
- When to execute? e.g. defined by a trigger, an event or a retry policy
- What to execute? e.g. a job, a request or another workflow
- How to execute? e.g. on a distributed processing engine
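The three questions can be made concrete with a small sketch: the "when" side typically boils down to a trigger plus a retry/backoff schedule, while "what" and "how" name the payload and the execution target. All field names and values below are hypothetical, purely to illustrate the split:

```python
def backoff_schedule(retries, base=1.0, factor=2.0, cap=30.0):
    """Delays in seconds before each retry: capped exponential backoff.
    (Parameter names are invented for this sketch.)"""
    return [min(cap, base * factor ** i) for i in range(retries)]

# A workflow spec answering the three core questions (hypothetical fields):
workflow_spec = {
    "when": {"trigger": "cron: 0 2 * * *",
             "retry_delays": backoff_schedule(3)},
    "what": {"job": "etl.daily_load"},   # could also be another workflow
    "how":  {"executor": "distributed", "max_workers": 8},
}

print(backoff_schedule(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Both Airflow and Cadence ultimately encode answers to exactly these three questions, just with very different ergonomics.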
Some more requirements were discussed:
- When: A workflow supervision API and UI for administration, monitoring and auditing.
- What: A Domain-Specific Language (DSL) for external users to implement their own workflows.
- How: A potential distribution of the workload across multiple, auto-scaling workers.
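The DSL requirement can be pictured as a tiny interpreter: external users author a declarative spec, and the platform maps step names onto registered code. A hypothetical Python sketch (every name here is invented; a real DSL would likely be YAML or a restricted API):

```python
REGISTRY = {}  # maps DSL step names to platform callables

def step(name):
    """Register a function under a name that DSL authors can reference."""
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register

@step("extract")
def extract(ctx):
    ctx["rows"] = [1, 2, 3]          # stand-in for a real data source

@step("load")
def load(ctx):
    ctx["loaded"] = len(ctx["rows"])

# What an external user would author (e.g. as YAML), shown here as a dict:
spec = {"workflow": "demo", "steps": ["extract", "load"]}

def execute(spec):
    """Interpret the DSL: run each named step against a shared context."""
    ctx = {}
    for name in spec["steps"]:
        REGISTRY[name](ctx)
    return ctx

print(execute(spec))  # {'rows': [1, 2, 3], 'loaded': 3}
```

The point of such a layer is exactly the "permissiveness" trade-off discussed in the conclusion: the narrower the DSL, the fewer mistakes external users can make.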
In general, this is simple functionality with very complex details once you want to make it usable and reliable. A use-case-driven implementation was built and presented. The PoC outcome does not lead to a decision yet, but it already provides an objective assessment.
Conclusion
We've been evaluating two open-source solutions for workflow orchestration outside Kubernetes. Airflow happens to be very popular as it follows a more user-friendly, though rather permissive (mistakes allowed), approach. So it might be the preferred option for external use. Cadence comes with a more developer-oriented (Java modules) and pattern-oriented approach. It offers strong support for customization, debugging and testing. So it might be the preferred option for internal use.