Problem Statement:

Consider we have 3 pipelines and each pipeline have 4 stages in our Data Engineering project and we want create Airflow DAGS for each and every stage for all the pipelines. We end-up with 12 DAGS. However that’s only for development environment. If we want DAGS for the staging and production environment as well, we might end-up creating 36 DAGS i.e. 36 python script files in our git repo. In future, if we add any new stage or pipeline, we need to create DAGS for the same in all environments. Consider we need to accommodate one extra argument in our DAG, we need to update all the 36 scripts, which seems to be a hectic and cumbersome process.

Solution:

How nice it will be, if we generate 36 DAGS dynamically on the fly using 4 scripts? moreover maintaining these 4 scripts will be easy right?

Base idea for this is to create to DAGS templates instead of the DAGS with hard coded data. We will pass the required configuration(metadata) to the template and it will generate the DAGS.

DAG Template:

 

We have template ready with us. Now the questions is how to pass data to this template so that we will generate the DAGS on the fly?

Airflow Variables is the solution. In Airflow Variables we will create the metadata and Template will read the metadata from here. Airflow variables are key-value pairs where key should match with pipeline+”_stage1″, So that Airflow will read this metadata and populate the DAGS dynamically.

Airflow Variables:

Consider if we want to add/change one of the param in default args, then we will update only in these 4 scripts and  it will be reflected automatically. If we want add any new stage/pipeline, we will add that stage/pipeline name in template and it will generate DAG automatically.

Further to this, we can use one script instead of four scripts to generate all the 36 DAGS dynamically.

Code:

Here we are trying to iterate on all the four pipelines and four stages.

References: https://www.astronomer.io/guides/dynamically-generating-dags/