A Simple Guide to Airflow start_date and execution_date


In Airflow, the start_date and execution_date can be very counterintuitive for those who are not familiar with them.

TLDR

  • In this article, I will explain why the execution_date in Airflow is different from what we might expect.

  • How to use start_date, schedule_interval, execution_date, and next_execution_date

Before delving into the definitions of these terms, please keep in mind that the official Airflow documentation advises using UTC (UTC+0) as the time zone.

Why

The reason start_date and execution_date are difficult to understand is that data for a period of time can only be processed after that period has passed. Therefore, they are not what the names literally suggest, the "task start time" and "task execution time", but rather:

  • start_date: the start of the first data interval this DAG will collect

  • execution_date: the start of the data interval collected by this particular run (which is, admittedly, still a kind of start time)

import datetime as dt
from airflow import DAG

dag = DAG(
    dag_id="some_dag",
    schedule_interval="@daily",
    start_date=dt.datetime(2023, 1, 1),
)

Assuming we have a DAG like the one above, what actually happens is:

  • On 2023/01/02, execute this dag and collect data from 2023/01/01 to 2023/01/02.

  • On 2023/01/03, execute this dag and collect data from 2023/01/02 to 2023/01/03.

  • On 2023/01/04, execute this dag and collect data from 2023/01/03 to 2023/01/04.

(P.S. We must wait until the very beginning of 01/02, right after 01/01 is over, to collect the data from 01/01 to 01/02.)

For example, take the DAG run that "collects data from 2023/01/01 to 2023/01/02 on 2023/01/02": its execution_date is 2023/01/01, which represents the start of the data collection interval, not the time the DAG actually runs. The actual run time of this DAG can be obtained through next_execution_date.
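
To see both values from inside a task, you can read them from the task context. Here is a minimal sketch, assuming Airflow 2 (the task name and print statement are just placeholders for this example):

import datetime as dt

from airflow import DAG
from airflow.operators.python import PythonOperator

def collect_data(execution_date, next_execution_date, **_):
    # For the run that fires right after 2023/01/01 ends:
    #   execution_date      -> 2023-01-01 (start of the data interval)
    #   next_execution_date -> 2023-01-02 (when this run actually fires)
    print(f"Collecting data from {execution_date} to {next_execution_date}")

with DAG(
    dag_id="some_dag",
    schedule_interval="@daily",
    start_date=dt.datetime(2023, 1, 1),
) as dag:
    PythonOperator(task_id="collect", python_callable=collect_data)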

Usage

From the example above, we can also see that if you want your daily DAG to start running on 2023/01/01, then you should set:

schedule_interval="@daily",
start_date=dt.datetime(2022, 12, 31) #2023/01/01 - sechedule_interval

Similarly, if this is an hourly DAG that you also want to start running on 2023/01/01:

schedule_interval="@hourly",
start_date=dt.datetime(2022, 12, 31, 23, 0, 0) #2023/01/01 - sechedule_interval
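
Put differently, the start_date is simply the desired first run time minus one schedule_interval. A small sketch of that arithmetic (plain datetime math, not an Airflow API):

import datetime as dt

first_run = dt.datetime(2023, 1, 1)  # when we want the first run to fire

# @daily: one interval is a day, so the first data interval starts a day earlier.
daily_start_date = first_run - dt.timedelta(days=1)    # 2022-12-31 00:00
# @hourly: one interval is an hour.
hourly_start_date = first_run - dt.timedelta(hours=1)  # 2022-12-31 23:00

In real DAG files you would typically write the resulting datetime directly, as in the two snippets above.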

In the two examples above, we can see that the actual execution time is 2023/01/01. However, due to varying schedule_interval settings, the execution_date will differ.

Therefore, to get the actual execution time of this DAG (i.e., 2023/01/01), we need to use next_execution_date.
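
For instance, in a templated field, next_execution_date is exposed via the {{ next_ds }} macro. A minimal sketch, assuming Airflow 2 (the dag_id and task_id are placeholders):

import datetime as dt

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="echo_run_time",
    schedule_interval="@daily",
    start_date=dt.datetime(2022, 12, 31),
) as dag:
    # {{ next_ds }} renders next_execution_date as YYYY-MM-DD, so the run
    # whose execution_date is 2022-12-31 prints 2023-01-01.
    BashOperator(
        task_id="echo",
        bash_command="echo 'actually running on {{ next_ds }}'",
    )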

In the end

  1. In newer versions of Airflow (2.2+), the official recommendation is to use logical_date instead of the easily misunderstood execution_date (see the sketch after this list).

  2. I believe the concept of execution_date can be perplexing partly because we frequently use Airflow for purposes beyond ETL, such as managing various cron jobs. Consequently, we tend to interpret it as the task execution time rather than from a data collection standpoint.
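
For reference, here is what the newer names look like inside a task callable. A minimal sketch, assuming Airflow 2.2+, where the context exposes logical_date, data_interval_start, and data_interval_end:

def collect_data(logical_date, data_interval_start, data_interval_end, **_):
    # Airflow 2.2+ equivalents of the older names:
    #   logical_date        -> the old execution_date
    #   data_interval_start -> start of the data interval (same value as
    #                          logical_date for cron/preset schedules)
    #   data_interval_end   -> the old next_execution_date
    print(f"Collecting data from {data_interval_start} to {data_interval_end}")

This callable can be wired into a PythonOperator exactly as in the earlier sketch.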

Acknowledgments

  1. Photo by Estée Janssens on Unsplash