What is Dagster at a high level, and how is it different from Apache Airflow?

Nikhil Das Nomula

Orchestrators play an important role in data engineering by automating workflows. They have grown from tools that simply run a sequence of steps into systems that we expect to:

  • Manage and coordinate complex workflows and data pipelines
  • Monitor workflows so we can understand the sequence of steps
  • Show what succeeded and what failed
  • Show how long each step took
  • Show how the steps are related

In short, they provide an interface where we can "observe" what is going on with our workflows and data pipelines.

When it comes to orchestrators, Apache Airflow and Kestra are both great options, but their approach is task-based. This means we approach the problem by focusing on the hows: the tasks, or verbs.

Dagster takes a different approach and focuses on the whats, which Dagster calls assets. The Dagster documentation provides a great example of how this makes a difference when it comes to reusability.

For example, if we want to make cookies the task-centric way, we approach the problem like this:

  1. Get the ingredients
  2. Mix the ingredients
  3. Add chocolate chips
  4. Bake
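
The steps above can be sketched in plain Python to make the task-centric shape concrete. This is only an illustration, not Airflow or Kestra code: the function names and the ingredient values are hypothetical placeholders. Notice that each function is a verb, and the pipeline is defined by the order in which the steps run.

```python
# Task-centric sketch: each function is a step (a verb), run in a fixed order.
# All names and values here are hypothetical placeholders.

def get_ingredients() -> list:
    return ["flour", "sugar", "butter", "eggs"]


def mix(ingredients: list) -> str:
    return "dough(" + ", ".join(ingredients) + ")"


def add_chocolate_chips(dough: str) -> str:
    return dough + " + chocolate chips"


def bake(dough: str) -> str:
    return "baked " + dough


def make_cookies() -> str:
    # The pipeline is described by the sequence of steps,
    # not by the things those steps produce.
    ingredients = get_ingredients()
    dough = mix(ingredients)
    dough = add_chocolate_chips(dough)
    return bake(dough)


if __name__ == "__main__":
    print(make_cookies())
```

The intermediate results (the dough, the chocolate chip dough) exist only as local variables inside `make_cookies`, so nothing outside this pipeline can easily build on them.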

Now if we take the asset-centric approach, we approach the problem like this:

  1. Get the ingredients, mix them to make cookie dough
  2. Get chocolate chips and mix with the cookie dough to get chocolate chip cookie dough
  3. Bake the cookie dough to get cookies

What makes the asset-centric approach different is that we can reuse these assets. For example, if we want to make peanut cookies with the asset-based approach, we can take the existing cookie dough asset and add peanuts to it.

We will get into more detail in the series of Dagster articles, but this should give you an idea of what Dagster is and how it is different from Apache Airflow.


© 2024 Yajur LLC . All rights reserved