What is Dagster at a high level, and how is it different from Apache Airflow?

Orchestrators play an important role in data engineering by automating workflows. They have grown from tools that simply run a sequence of steps into systems that we expect to:

  • Manage and coordinate complex workflows and data pipelines.
  • Monitor workflows so we can follow the sequence of steps.
  • Show what succeeded and what failed.
  • Show how long each step took.
  • Show how the steps are related.

In short, an orchestrator provides an interface where we can "observe" what is going on with our workflows and data pipelines.

When it comes to orchestrators, Apache Airflow and Kestra are both great options, but their approach is task based. This means we approach the problem by focusing on the hows: the tasks, or verbs.

Dagster takes a different approach and focuses on the whats, which it calls assets. The Dagster documentation gives a great example of how this makes a difference when it comes to reusability.

For example, if we want to make cookies the task-centric way, we approach the problem as follows:

  1. Get the ingredients
  2. Mix the ingredients
  3. Add chocolate chips
  4. Bake
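
The task-centric steps above can be sketched as plain Python functions run in a fixed order. This is only an illustration of the idea, not Airflow or Kestra code; the function names and the ingredient list are made up for the example.

```python
# Task-centric sketch: each function is a verb, and the pipeline is the order
# in which the verbs run. All names and ingredients here are illustrative.

def get_ingredients():
    return ["flour", "sugar", "butter", "eggs"]

def mix(ingredients):
    return {"mixed": list(ingredients)}

def add_chocolate_chips(mixture):
    return {"mixed": mixture["mixed"] + ["chocolate chips"]}

def bake(mixture):
    return {"baked": True, "contents": mixture["mixed"]}

def run_pipeline():
    """Run the tasks one after another; intermediate results have no name."""
    result = get_ingredients()
    for task in (mix, add_chocolate_chips, bake):
        result = task(result)
    return result

cookies = run_pipeline()
```

Notice that the intermediate results are just values passed from one task to the next. Nothing outside the pipeline can ask for "the dough" by name.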

If we take the asset-centric approach instead, we approach the problem as follows:

  1. Get the ingredients, mix them to make cookie dough
  2. Get chocolate chips and mix with the cookie dough to get chocolate chip cookie dough
  3. Bake the cookie dough to get cookies

What makes the asset-centric approach different is that we can reuse these assets. For example, to make peanut cookies with the asset-based approach, we can reuse the existing cookie dough asset and add peanuts to it.
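
To make the reuse concrete, here is a minimal plain-Python sketch of the asset-centric idea. It is a toy, not Dagster's API: the `asset` decorator, the `materialize` helper, and the `ASSETS` registry below are hypothetical stand-ins written for this example. Dagster does provide its own `@asset` decorator, and the sketch assumes the same convention of naming an upstream asset as a function parameter.

```python
import inspect

ASSETS = {}  # asset name -> function that produces it (toy registry)

def asset(fn):
    """Toy stand-in for an asset decorator: register fn under its own name."""
    ASSETS[fn.__name__] = fn
    return fn

def materialize(name, cache):
    """Build the named asset, building any upstream assets it needs first."""
    if name not in cache:
        fn = ASSETS[name]
        # Parameter names are treated as the names of upstream assets.
        upstream = [materialize(dep, cache) for dep in inspect.signature(fn).parameters]
        cache[name] = fn(*upstream)
    return cache[name]

@asset
def cookie_dough():
    return ["flour", "sugar", "butter", "eggs"]

@asset
def chocolate_chip_cookie_dough(cookie_dough):
    return cookie_dough + ["chocolate chips"]

@asset
def cookies(chocolate_chip_cookie_dough):
    return {"baked": True, "dough": chocolate_chip_cookie_dough}

# Reuse: the peanut variant depends on the same cookie_dough asset.
@asset
def peanut_cookie_dough(cookie_dough):
    return cookie_dough + ["peanuts"]

@asset
def peanut_cookies(peanut_cookie_dough):
    return {"baked": True, "dough": peanut_cookie_dough}

cache = {}  # holds every asset that has been built so far
chocolate = materialize("cookies", cache)
peanut = materialize("peanut_cookies", cache)
```

Because `cookie_dough` is a named asset, the peanut variant only has to declare that it depends on it. In this toy version the shared `cache` means the dough is built once and used by both kinds of cookies.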

We will get into more detail in the series of Dagster articles, but this should give you an idea of what Dagster is and how it is different from Apache Airflow.

Nikhil Das Nomula
