.. Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Dag File Processing
===================

Dag File Processing refers to the process of reading the Python files that define your Dags and storing them such that the scheduler can schedule them.
There are two primary components involved in Dag file processing. The ``DagFileProcessorManager`` is a process executing an infinite loop that determines which files need
to be processed, and the ``DagFileProcessorProcess`` is a separate process that is started to convert an individual file into one or more Dag objects.
The ``DagFileProcessorManager`` runs user code. As a result, it runs as a standalone process, which you start with the ``airflow dag-processor`` CLI command.
.. image:: /img/dag_file_processing_diagram.png

``DagFileProcessorManager`` has the following steps:
1. Check for new files: If the elapsed time since the Dag was last refreshed is > :ref:`config:dag_processor__refresh_interval` then update the file paths list
2. Exclude recently processed files: Exclude files that have been processed more recently than :ref:`min_file_process_interval <config:dag_processor__min_file_process_interval>` and have not been modified
3. Queue file paths: Add files discovered to the file path queue
4. Process files: Start a new ``DagFileProcessorProcess`` for each file, up to a maximum of :ref:`config:dag_processor__parsing_processes`
5. Collect results: Collect the result from each finished process
6. Log statistics: Print statistics and emit ``dag_processing.total_parse_time``

``DagFileProcessorProcess`` has the following steps:
1. Process file: The entire process must complete within :ref:`dag_file_processor_timeout <config:dag_processor__dag_file_processor_timeout>`
2. The Dag files are loaded as Python modules: Must complete within :ref:`dagbag_import_timeout <config:core__dagbag_import_timeout>`
3. Process modules: Find Dag objects within the Python modules
4. Return DagBag: Provide the ``DagFileProcessorManager`` a list of the discovered Dag objects
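The "exclude recently processed files" step of the manager loop can be sketched in plain Python (a simplified illustration with invented names, not Airflow's actual implementation; only the interval logic is modeled):

```python
from dataclasses import dataclass


@dataclass
class FileState:
    last_parsed: float = 0.0    # when this file was last parsed
    last_modified: float = 0.0  # mtime observed at the last parse


def files_to_parse(all_files, states, now, min_file_process_interval=30.0):
    """Pick files due for parsing: recently parsed *and* unmodified files are skipped."""
    due = []
    for path, mtime in all_files:
        st = states.setdefault(path, FileState())
        recently_parsed = now - st.last_parsed < min_file_process_interval
        unmodified = mtime <= st.last_modified
        if recently_parsed and unmodified:
            continue
        due.append(path)
    return due


states = {}
files = [("a.py", 100.0), ("b.py", 100.0)]
print(files_to_parse(files, states, now=150.0))
# ['a.py', 'b.py'] -- never parsed before, so both are due
states["a.py"] = FileState(last_parsed=149.0, last_modified=100.0)
print(files_to_parse(files, states, now=150.0))
# ['b.py'] -- a.py was parsed recently and has not been modified since
```

Files that pass this filter are added to the queue, and workers are started for them up to the configured maximum of parallel parsing processes.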
What impacts Dag processor's performance
""""""""""""""""""""""""""""""""""""""""

The Dag processor is responsible for continuously parsing Dag files and synchronizing with the Dags in the database. In order to fine-tune your Dag processor, you need to consider a number of factors:
* The kind of deployment you have
* The logic and definition of your Dag structure:

  * whether parsing your Dag files involves importing a lot of libraries or heavy processing at the top level (see :ref:`best_practices/top_level_code`)

* The Dag processor configuration
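To see why top-level code matters: everything at a Dag file's module level runs on every parse loop, not only when tasks execute. A plain-Python illustration (no Airflow imports; all names here are invented for the sketch):

```python
calls = {"n": 0}


def expensive_lookup():
    """Stand-in for a slow external call (API hit, database query)."""
    calls["n"] += 1
    return {"rows": 100}


# Heavy top-level code in a Dag file runs on *every* parse, not per task run.
dag_file_source = "CONFIG = expensive_lookup()"

# The Dag processor re-parses files continuously; simulate three parse loops.
for _ in range(3):
    exec(dag_file_source, {"expensive_lookup": expensive_lookup})

print(calls["n"])  # 3 -- one expensive call per parse loop
```

Moving such work inside a task callable means it runs only when the task executes, not on every parse.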
How to approach Dag processor's fine-tuning
"""""""""""""""""""""""""""""""""""""""""""
Airflow gives you a lot of "knobs" to turn to fine-tune performance, but deciding which knobs to turn is a separate task, depending on your particular deployment, your Dag structure, hardware availability and expectations. Part of the job when managing the deployment is to decide what you are going to optimize for. For example, some users are OK with a 30-second delay in new Dag parsing in exchange for lower CPU usage, whereas other users expect Dags to be parsed almost instantly when they appear in the Dags folder, at the expense of higher CPU usage.
Airflow gives you the flexibility to decide, but you should find out what aspect of performance is most important for you and decide which knobs you want to turn in which direction.
Generally, for fine-tuning, your approach should be the same as for any performance improvement and optimization (we will not recommend any specific tools - just use the tools that you usually use to observe and monitor your systems).
What resources might limit Dag processor's performance
""""""""""""""""""""""""""""""""""""""""""""""""""""""
There are several areas of resource usage that you should pay attention to:
* Database connections and database usage. If your database load is high, consider using
  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
  supports PGBouncer out-of-the-box.
* CPU usage. Parsing Dag files is CPU-intensive, since the files are re-parsed continuously. You can
  mitigate this by increasing :ref:`config:dag_processor__min_file_process_interval`, but this is one
  of the mentioned trade-offs: changes to such files will be picked up more slowly, and you will see
  delays between submitting the files and having them available in the Airflow UI and executed by the
  scheduler. Optimizing the way your Dags are built and avoiding external data sources is your best
  approach to improve CPU usage. If you have more CPUs available, you can increase the number of
  parsing processes via :ref:`config:dag_processor__parsing_processes`.
* Memory usage. When you look at memory usage, pay attention to the kind of memory you are observing.
  Usually you should look at ``working memory`` (names might vary depending on your deployment)
  rather than total memory used.

What can you do, to improve Dag processor's performance
"""""""""""""""""""""""""""""""""""""""""""""""""""""""
When you know what your resource usage is, the improvements that you can consider might be:
* Improve the efficiency and reduce the complexity of your top-level Dag Python code.
  :ref:`best_practices/top_level_code` explains the best practices for writing your top-level
  Python code. The :ref:`best_practices/reducing_dag_complexity` document provides some areas that
  you might look at when you want to reduce the complexity of your code.

Dag processor Configuration options
"""""""""""""""""""""""""""""""""""
The following config settings can be used to control aspects of the Dag processor.
However, you can also look at other non-performance-related Dag processor configuration parameters available at
:doc:`../configurations-ref` in the ``[dag_processor]`` section.
- :ref:`config:dag_processor__file_parsing_sort_mode`
  The Dag processor will list and sort the Dag files to decide the parsing order.
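The possible sort strategies might look like the sketch below (illustrative, with invented helper names; the mode strings mirror the values this setting accepts in recent Airflow versions, which is an assumption worth checking against your version's configuration reference):

```python
import random


def sort_files(paths, mtimes, mode, hostname="worker-1"):
    """Sketch of file-ordering strategies for the parse loop (invented helper)."""
    if mode == "modified_time":
        # Most recently modified first, so fresh changes are picked up sooner.
        return sorted(paths, key=lambda p: mtimes[p], reverse=True)
    if mode == "random_seeded_by_host":
        # Stable order per host, different across hosts: spreads work when
        # several Dag processors share one Dags folder.
        rnd = random.Random(hostname)
        shuffled = list(paths)
        rnd.shuffle(shuffled)
        return shuffled
    return sorted(paths)  # "alphabetical"


mtimes = {"b.py": 200.0, "a.py": 100.0, "c.py": 300.0}
print(sort_files(list(mtimes), mtimes, "modified_time"))  # ['c.py', 'b.py', 'a.py']
print(sort_files(list(mtimes), mtimes, "alphabetical"))   # ['a.py', 'b.py', 'c.py']
```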
- :ref:`config:dag_processor__min_file_process_interval`
  Number of seconds after which a Dag file is re-parsed. The Dag file is parsed every
  ``min_file_process_interval`` seconds. Updates to Dags are reflected after
  this interval. Keeping this number low will increase CPU usage.
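For example, to trade Dag update freshness for lower CPU usage, you might re-parse files at most every five minutes (illustrative snippet; adjust the value to your own latency expectations):

```ini
[dag_processor]
min_file_process_interval = 300
```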
- :ref:`config:dag_processor__parsing_processes`
  The Dag processor can run multiple processes in parallel to parse Dag files. This defines
  how many processes will run.
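Conceptually, the parsing pool behaves like the sketch below (illustrative, with invented names; Airflow starts separate OS processes for isolation, while this portable sketch uses a thread pool):

```python
from concurrent.futures import ThreadPoolExecutor


def parse_dag_file(path):
    """Stand-in for parsing one Dag file into its Dag objects (invented name)."""
    return f"dags from {path}"


def parse_all(paths, parsing_processes=2):
    # At most `parsing_processes` files are parsed concurrently; Airflow
    # uses real child processes rather than threads for this.
    with ThreadPoolExecutor(max_workers=parsing_processes) as pool:
        return list(pool.map(parse_dag_file, paths))


print(parse_all(["a.py", "b.py", "c.py"]))
# ['dags from a.py', 'dags from b.py', 'dags from c.py']
```

Raising the worker count speeds up parsing of many files but multiplies CPU and memory usage accordingly.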