.. Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
When you create new Dag files or modify existing ones, you need to deploy them into the environment. This section describes some basic techniques you can use.
With this approach, you include your Dag files and related code in the Airflow image.
This method requires redeploying the services in the Helm chart with the new Docker image in order to deploy the new Dag code. It works well, particularly if Dag code is not expected to change frequently. Build the image, for example:

.. code-block:: bash

    docker build --pull --tag "my-company/airflow:8a0da78" . -f - <<EOF
    FROM apache/airflow

    COPY ./dags/ \${AIRFLOW_HOME}/dags/

    EOF

Then publish it to a registry that your cluster can access:

.. code-block:: bash

    docker push my-company/airflow:8a0da78

Finally, update the Airflow pods with that image:

.. code-block:: bash

    helm upgrade --install airflow apache-airflow/airflow \
      --set images.airflow.repository=my-company/airflow \
      --set images.airflow.tag=8a0da78

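
The three steps above can be combined into a small shell helper. The following is a minimal sketch, assuming that you build from a Git checkout and want to tag the image with the short commit hash (the ``8a0da78`` tag above looks like one, but that is an assumption). ``deploy_dags_image`` is a hypothetical function name, not something provided by the chart.

.. code-block:: bash

    # Hypothetical helper: build, push and roll out an image tagged with the
    # short hash of the current Git commit. Assumes git, docker and helm are
    # installed and that you run it from the repository root.
    deploy_dags_image() {
      local repo="${1:-my-company/airflow}"
      local tag
      tag="$(git rev-parse --short HEAD)" || return 1

      # Single quotes keep ${AIRFLOW_HOME} literal so that Docker expands it.
      printf 'FROM apache/airflow\nCOPY ./dags/ ${AIRFLOW_HOME}/dags/\n' |
        docker build --pull --tag "${repo}:${tag}" -f - . || return 1

      docker push "${repo}:${tag}" || return 1

      helm upgrade --install airflow apache-airflow/airflow \
        --set images.airflow.repository="${repo}" \
        --set images.airflow.tag="${tag}"
    }

Calling ``deploy_dags_image`` without arguments uses the ``my-company/airflow`` repository from the examples above. Because every commit produces a new tag, the pods are replaced on upgrade without any extra pull policy or annotation.
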
If you are deploying an image with a constant tag, you need to make sure that the image is pulled every time, for example:

.. code-block:: bash

    helm upgrade --install airflow apache-airflow/airflow \
      --set images.airflow.repository=my-company/airflow \
      --set images.airflow.tag=8a0da78 \
      --set images.airflow.pullPolicy=Always \
      --set airflowPodAnnotations.random=r$(uuidgen)

The randomly generated pod annotation will ensure that pods are refreshed on helm upgrade.

.. warning::

    Use a constant tag only for testing or development purposes. Reusing the same tag is a bad practice because you lose the history of your code.

If you are deploying an image from a private repository, you need to create a secret, e.g. ``gitlab-registry-credentials`` (refer to `Pull an Image from a Private Registry <https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/>`_ for details), and specify it using ``--set registry.secretName``:

.. code-block:: bash

    helm upgrade --install airflow apache-airflow/airflow \
      --set images.airflow.repository=my-company/airflow \
      --set images.airflow.tag=8a0da78 \
      --set images.airflow.pullPolicy=Always \
      --set registry.secretName=gitlab-registry-credentials

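
The secret itself is created with ``kubectl``, as described in the linked Kubernetes page. Below is a minimal sketch, assuming a GitLab registry at ``registry.gitlab.com`` and credentials passed as arguments. The registry server, the function name and the argument order are all assumptions made for illustration. Create the secret in the namespace where Airflow is installed.

.. code-block:: bash

    # Hypothetical helper: create the image pull secret referenced by
    # registry.secretName. Server and credentials are placeholders.
    #   $1: namespace, $2: registry user, $3: registry password or token
    create_registry_secret() {
      local namespace="$1" username="$2" password="$3"
      kubectl create secret docker-registry gitlab-registry-credentials \
        --namespace "${namespace}" \
        --docker-server=registry.gitlab.com \
        --docker-username="${username}" \
        --docker-password="${password}"
    }

For example, ``create_registry_secret airflow "$REGISTRY_USER" "$REGISTRY_TOKEN"``, where both variables are names chosen here and would come from your own environment.
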
Mounting Dags using Git-Sync sidecar with persistence enabled ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This option uses a Persistent Volume Claim with the ``ReadWriteMany`` access mode.
The dag-processor pod (or the scheduler pod, if the standalone dag-processor is disabled) syncs Dags from
a git repository onto the PVC every configured number of seconds. The other pods read the synced Dags.
Not all volume plugins support the ``ReadWriteMany`` access mode.
Refer to `Persistent Volume Access Modes <https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes>`__
for details.

.. code-block:: bash

    helm upgrade --install airflow apache-airflow/airflow \
      --set dags.persistence.enabled=true \
      --set dags.gitSync.enabled=true
      # You can also override the other persistence or gitSync values
      # by setting the dags.persistence.* and dags.gitSync.* values
      # Please refer to values.yaml for details

Mounting Dags using Git-Sync sidecar without persistence ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
With this option, a Git-Sync sidecar always runs on every dag-processor, worker and triggerer pod (in Airflow 2.11, if a separate dag-processor is not enabled, the Git-Sync sidecar also runs on the scheduler for Dag parsing).
The Git-Sync sidecar containers sync Dags from a git repository every configured number of
seconds. If you are using the ``KubernetesExecutor``, Git-Sync runs as an init container on your worker pods.

.. code-block:: bash

    helm upgrade --install airflow apache-airflow/airflow \
      --set dags.persistence.enabled=false \
      --set dags.gitSync.enabled=true
      # You can also override the other gitSync values
      # by setting the dags.gitSync.* values
      # Refer to values.yaml for details

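
The ``dags.gitSync.*`` overrides mentioned in the comments can be passed in the same way. Below is a minimal sketch, assuming a repository that Git-Sync can reach without credentials. The function name and the example URL are hypothetical, while ``repo`` and ``branch`` are keys that also appear in the private-repository example on this page.

.. code-block:: bash

    # Hypothetical helper: install the chart with Git-Sync pointed at a repository.
    #   $1: repository URL, $2: branch to sync
    install_with_git_sync() {
      local repo="$1" branch="$2"
      helm upgrade --install airflow apache-airflow/airflow \
        --set dags.persistence.enabled=false \
        --set dags.gitSync.enabled=true \
        --set dags.gitSync.repo="${repo}" \
        --set dags.gitSync.branch="${branch}"
    }

For example, ``install_with_git_sync https://example.com/my-org/my-dags.git main``. Check ``values.yaml`` for the remaining keys and their defaults before relying on them.
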
Notes for combining Git-Sync and persistence ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
While using git-sync and persistence together for Dags is possible, it is generally not recommended unless the Deployment Manager has carefully considered the trade-offs. Git-sync without persistence has trade-offs of its own (for example, delays in synchronization of Dags vs. rate-limiting by Git servers), but these can often be mitigated (for example, by sending signals to git-sync containers via webhooks when new commits are pushed to the repository). There might still be cases where you want to choose git-sync and persistence together.

Git-sync is primarily designed to check out Git repositories onto local, POSIX-compliant volumes. When it synchronizes commits, git-sync checks out the new version of the files into a freshly created folder and, once the checkout is complete, swaps a symbolic link to point to the new folder. This ensures that the whole Dags folder is consistent at all times: Dag parsing always works on a consistent (single-commit-based) set of files in the whole Dag folder.

This approach, however, might have undesirable side effects when the folder that git-sync works on is not a local volume but a persistent volume (so effectively a networked, distributed volume). Depending on the underlying technology, persistent volumes might handle the git-sync approach differently and with non-obvious consequences. There are many persistence solutions available for various Kubernetes installations, and each of them has different characteristics, so you need to carefully test and monitor your filesystem to make sure those undesired side effects do not affect you. Those effects might change over time or depend on parameters such as how often the files are scanned by the Dag Processor, the number and complexity of your Dags, how remote and how distributed your persistent volumes are, how many IOPS you allocate for the filesystem (the number of IOPS is usually a premium, paid feature of such filesystems), and many other factors.

The way git-sync swaps symbolic links generally means that the required throughput, and the potential delays in synchronization, grow linearly. The network traffic from checkouts comes in bursts, and the bursts are linearly proportional to the number and size of files in the repository, which makes the setup vulnerable to sudden and unexpected increases in demand. Most persistence solutions work "good enough" for smaller or shorter bursts of traffic, but once the bursts outgrow certain thresholds, you need to upgrade the networking to much more capable and expensive options. This is difficult to control and impossible to mitigate, so you might suddenly have to pay a lot more for IOPS or a persistence option to keep your Dags sufficiently synchronized and avoid inconsistencies and delays.

The side-effects that you might observe are:
The general recommendation is to use git-sync with local volumes only. If you want to use it with persistence, make sure that the persistence solution you use is POSIX-compliant and monitor the side effects it might have.
Synchronizing multiple Git repositories with Git-Sync ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Airflow's git-sync integration in the Helm Chart does not allow synchronization of multiple repositories at
the same time. The Dag folder must come from a single git repository. However, it is possible
to use `submodules <https://git-scm.com/book/en/v2/Git-Tools-Submodules>`_ to create an "umbrella" repository
that brings a number of git repositories checked out together (with the ``--submodules recursive``
option). There are success stories of Airflow users using this approach with hundreds of repositories put
together as submodules in such an "umbrella" repo. When you choose this solution, you need to work out
how to link the submodules and when to update the umbrella repo after a submodule repository changes,
decide on a versioning approach, and automate it. This might be as simple as always using the latest versions of all
the submodule repositories, or as complex as managing versioning of shared libraries, Dags and code across
multiple teams according to your release process.

An example of such a complex approach can be found in the
`Manage Dags at scale <https://s.apache.org/airflow-manage-dags-at-scale>`_ presentation from the Airflow
Summit.

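
On the Git side, maintaining such an umbrella repository could look like the sketch below. It shows only the simplest strategy mentioned above (always tracking the latest commit of every submodule). The function names, paths and commit messages are hypothetical, and a real setup would add your own versioning rules.

.. code-block:: bash

    # Run these helpers inside the umbrella repository.

    # Register a team repository as a submodule under the given path.
    add_dag_repo() {
      local url="$1" path="$2"
      git submodule add "${url}" "${path}" &&
        git commit -m "Add ${path} submodule"
    }

    # Move every submodule to the latest commit of its tracked branch and
    # record the new revisions in the umbrella repository.
    update_umbrella() {
      git submodule update --init --remote --recursive || return 1
      git add --update || return 1
      # Commit only when a submodule pointer actually changed.
      git diff --cached --quiet || git commit -m "Update submodules to latest"
    }

Running ``update_umbrella`` from a scheduled job or from a webhook, followed by ``git push``, is one way to automate the update. Pinning submodules to reviewed revisions instead would fit the more complex, release-driven approach.
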
In this approach, Airflow reads the Dags from a PVC which has the ``ReadOnlyMany`` or ``ReadWriteMany`` access mode.
You have to ensure that the PVC is populated and updated with the required Dags (this is not handled by the chart).
You can pass the name of the volume claim to the chart by using the ``dags.persistence.existingClaim`` parameter:

.. code-block:: bash

    helm upgrade --install airflow apache-airflow/airflow \
      --set dags.persistence.enabled=true \
      --set dags.persistence.existingClaim=my-volume-claim \
      --set dags.gitSync.enabled=false

To configure mounting Dags from a private GitHub repository, follow the steps below:
Create a private repo on GitHub if you have not created one already.
Then create your ssh keys:

.. code-block:: bash

    ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

Add the public key to your private repo under Settings > Deploy keys.
Convert the private ssh key to a base64 string and save its value.

.. note::

    You can convert the private ssh key file like this:

    .. code-block:: bash

        base64 <my-private-ssh-key> -w 0 > temp.txt

    Then copy the string from the ``temp.txt`` file.

The base64-encoded string will be used in the ``override-values.yaml`` file.
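
The ``-w 0`` flag (disable line wrapping) may not be available in every ``base64`` implementation, for example on macOS. As a more portable sketch, you can strip the line breaks with ``tr`` instead. ``encode_ssh_key`` is a hypothetical helper name, and the key path in the usage note is an assumption.

.. code-block:: bash

    # Print the base64 encoding of a file as a single line,
    # without relying on the GNU-specific -w flag.
    encode_ssh_key() {
      base64 < "$1" | tr -d '\n'
    }

For example, ``encode_ssh_key ~/.ssh/id_rsa > temp.txt`` writes the single-line value to ``temp.txt``.
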
Create a YAML file called ``override-values.yaml`` to override default values, instead of using ``--set``:

.. code-block:: yaml
    :caption: override-values.yaml

    dags:
      gitSync:
        enabled: true
        repo: git@github.com:<username>/<private-repo-name>.git
        branch: <branch-name>
        subPath: ""
        sshKeySecret: airflow-ssh-secret
    extraSecrets:
      airflow-ssh-secret:
        data: |
          gitSshKey: '<base64-converted-ssh-private-key>'

Use the copied base64 string as the value of the ``gitSshKey`` key.
Finally, from the context of your Airflow Helm chart directory, install Airflow:

.. code-block:: bash

    helm upgrade --install airflow apache-airflow/airflow -f override-values.yaml

If you have done everything correctly, Git-Sync will pick up the changes you make to the Dags in your private GitHub repo.
You should take this a step further and set ``dags.gitSync.knownHosts``, so you are not susceptible to man-in-the-middle
attacks. This process is documented in the :ref:`production guide <production-guide:knownhosts>`.
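
As a rough sketch of that process (the production guide remains the authoritative reference): collect the host key with ``ssh-keyscan``, print its fingerprint with ``ssh-keygen`` so that you can compare it with the fingerprints GitHub publishes, and only then pass the file to the chart. Using Helm's ``--set-file`` flag for this is a choice made here for brevity, and the function and file names are hypothetical.

.. code-block:: bash

    # Hypothetical helper: fetch a host key and show its fingerprint for review.
    #   $1: Git host (default github.com), $2: output file (default known_hosts)
    fetch_known_hosts() {
      local host="${1:-github.com}" out="${2:-known_hosts}"
      ssh-keyscan -t rsa "${host}" > "${out}" || return 1
      # Compare this fingerprint with the one published by the Git host
      # before trusting the file.
      ssh-keygen -lf "${out}"
    }

    # Usage, after verifying the fingerprint:
    #   fetch_known_hosts github.com known_hosts
    #   helm upgrade --install airflow apache-airflow/airflow \
    #     -f override-values.yaml \
    #     --set-file dags.gitSync.knownHosts=known_hosts

The manual comparison matters: a key fetched over an untrusted network is only as trustworthy as the check you make against the published fingerprint.
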