In Airflow, the start_date and execution_date can be very counterintuitive for those who are not familiar with them.
TLDR
In this article, I will explain why the execution_date in Airflow is different from what we might expect.
The Usage of start_date, schedule_interval, execution_date, next_execution_date
Before delving into the definitions of these terms, please keep in mind that the official Airflow documentation advises using UTC+0 as the time zone.
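For instance, a minimal sketch of a UTC-pinned start_date (using the pendulum library that Airflow itself depends on) could look like this:
import pendulum
# Pin the start_date to UTC so the scheduling math below stays unambiguous.
start_date = pendulum.datetime(2023, 1, 1, tz="UTC")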
Why
The reason start_date and execution_date are difficult to understand is that data for a period of time can only be processed after that period has passed. Therefore, they are not what people would literally think of as the "task start time" and "task execution time", but rather:
- The start time of the data collected by this DAG (start_date)
- The start time of the data collected by this run (execution_date, which is actually a kind of start time as well)
import datetime as dt
from airflow import DAG
dag = DAG(
    dag_id="some_dag",
    schedule_interval="@daily",
    start_date=dt.datetime(2023, 1, 1),
)
Assuming we have a DAG like the code above, what it actually does is:
On 2023/01/02, execute this dag and collect data from 2023/01/01 to 2023/01/02.
On 2023/01/03, execute this dag and collect data from 2023/01/02 to 2023/01/03.
On 2023/01/04, execute this dag and collect data from 2023/01/03 to 2023/01/04.
(P.S. We must wait until the very beginning of the 2nd, which is right after 1/1 is over, to collect data from 01/01 to 01/02.)
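To make this interval visible inside a task, here is a minimal sketch (the task_id and echo command are placeholders, and the import path assumes Airflow 2.x) that uses Airflow's built-in Jinja macros, where {{ ds }} renders the execution_date and {{ next_ds }} the end of the interval:
from airflow.operators.bash import BashOperator
# On the run triggered on 2023/01/02 this prints:
# "collecting data from 2023-01-01 to 2023-01-02"
collect = BashOperator(
    task_id="collect_data",
    bash_command="echo 'collecting data from {{ ds }} to {{ next_ds }}'",
    dag=dag,
)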
For example, in the case of a DAG run that says "collect data from 2023/01/01 to 2023/01/02 on 2023/01/02", its execution_date is 2023/01/01, which represents the start of the data collection interval, not the actual time the DAG is executed. The actual execution time of this DAG can be obtained through next_execution_date.
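As a hedged illustration (the function and task names are hypothetical, and passing context values as keyword arguments assumes Airflow 2.x), a Python task can read both values from its context; for the run above it prints 2023-01-01 as the interval start and 2023-01-02 as the actual run time:
from airflow.operators.python import PythonOperator
def show_dates(execution_date, next_execution_date, **_):
    # execution_date: start of the data interval (2023-01-01 in this example)
    # next_execution_date: end of the interval, i.e. when the run is actually triggered (2023-01-02)
    print(f"interval start: {execution_date}, actual run time: {next_execution_date}")
show_dates_task = PythonOperator(
    task_id="show_dates",
    python_callable=show_dates,
    dag=dag,
)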
Usage
From the example above, we can also see that if you want your daily DAG to start running on 01/01/2023, then you should set:
schedule_interval="@daily",
start_date=dt.datetime(2022, 12, 31)  # 2023/01/01 - schedule_interval
Similarly, if this is an hourly DAG that you also want to start running on 01/01/2023:
schedule_interval="@hourly",
start_date=dt.datetime(2022, 12, 31, 23, 0, 0)  # 2023/01/01 - schedule_interval
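Putting the hourly case together, a minimal sketch (the dag_id is just a placeholder) would be:
import datetime as dt
from airflow import DAG
# The first run is triggered at 2023-01-01 00:00, one schedule_interval after start_date,
# and its execution_date is 2022-12-31 23:00 (the start of that hour's data interval).
hourly_dag = DAG(
    dag_id="some_hourly_dag",
    schedule_interval="@hourly",
    start_date=dt.datetime(2022, 12, 31, 23, 0, 0),
)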
In the two examples above, we can see that the actual execution time is 2023/01/01. However, due to varying schedule_interval settings, the execution_date will differ.
Therefore, to get the actual execution time of this DAG (i.e., 2023/01/01), we need to use next_execution_date.
In the end
In the new version of Airflow, the official recommendation is to use logical_date instead of the easily misunderstood execution_date.
I believe the concept of execution_date can be perplexing, partly because we frequently use Airflow for purposes beyond ETL, such as managing various cronjobs. Consequently, we tend to interpret it as the task execution time rather than from a data collection standpoint.
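For reference, a hedged sketch of the newer style (assuming Airflow 2.2+, where the task context exposes logical_date and the data interval directly) might look like this:
from airflow.decorators import task
from airflow.operators.python import get_current_context
@task
def report():
    ctx = get_current_context()
    # logical_date replaces execution_date; the interval bounds are now explicit.
    print("logical_date:", ctx["logical_date"])
    print("data interval:", ctx["data_interval_start"], "->", ctx["data_interval_end"])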
Acknowledgements
- Photo by Estée Janssens on Unsplash