 .. Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

 ..   http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
Pull Requests
=============

This document describes how to create Pull Requests (PRs) and the coding standards we use when implementing them.
.. contents:: Table of Contents
   :depth: 2
   :local:
Once a PR is merged, the commit author's (and any co-author's) name and email address become permanently public as part of the commit history and cannot be changed or removed afterward.
GitHub provides a built-in option to anonymize your commit email address
(`Setting your commit email address <https://docs.github.com/en/account-and-profile/how-tos/email-preferences/setting-your-commit-email-address>`__)
by using a GitHub-provided email; however, this setting applies only to future commits and cannot be
applied retrospectively.
On computers used with multiple GitHub accounts (for example, shared or work computers), the local Git configuration may be set to a different name or email than expected. This can result in unintended personal or work-related details being permanently published. If you are using the Git CLI, you can verify your current configuration by running the following commands in your terminal within your local Airflow repository:
.. code-block:: bash

    git config --get user.name
    git config --get user.email
To keep your email address private on GitHub, enable **Keep my email addresses private** in your GitHub
settings and configure git to use your GitHub-provided noreply address (typically
``<id>+<username>@users.noreply.github.com``) as ``user.email`` for future commits.
If you use the Git CLI, you can set the appropriate configuration by running the following commands:
.. code-block:: bash

    git config user.email "<id>+<username>@users.noreply.github.com"
    git config user.name "Your Name"
Notes:
Before opening a pull request, please ensure you are comfortable with the author and co-author details that will be publicly visible. Maintainers might point out unexpected personal information before merging, but they are not obliged to do so.
Before you submit a Pull Request (PR) from your forked repo, check that it meets these guidelines:
Start with Draft: Until you are sure that your PR passes all the quality checks and tests, keep it in Draft status. This signals to maintainers that the PR is not yet ready for review and prevents them from accidentally merging it before it is ready. Once you are sure that your PR is ready for review, you can mark it as "Ready for review" in the GitHub UI. Our regular check will convert all PRs from non-collaborators that do not pass our quality gates to Draft status, so if you see that your PR is in Draft status and you haven't set it to Draft yourself, check the comments to see what needs to be fixed.
Include tests, either as doctests, unit tests, or both, in your pull request.
The Airflow repo uses `GitHub Actions <https://help.github.com/en/actions>`__ to
run the tests and `codecov <https://codecov.io/gh/apache/airflow>`__ to track
coverage. You can set up both for free on your fork. This will help you make sure you do not
break the build with your PR and that you help increase coverage.
We also advise installing the `prek hooks <08_static_code_checks.rst#prek-hooks>`__ locally to
apply various checks, code generation and formatting when you make a local commit. This
gives you near-immediate feedback on things you need to fix before you push your code to the PR, and in
many cases it will even fix the problems for you locally so that you can add and commit the fixes straight away.
Follow our project's `Coding style and best practices`_. We usually attempt to enforce these practices by
having appropriate prek hooks, but some of the checks aren't currently enforced
programmatically (either because they are too hard to automate or just not yet done).
Maintainers will not merge a PR that regresses linting or does not pass CI tests (unless you have a good justification that it is a transient error or something that is being fixed in another PR).
Maintainers will not merge PRs that have unresolved conversations.
We prefer that you rebase your PR (and do it quite often) rather than merge. Rebasing leads to
easier reviews and cleaner changes where you know exactly what changes you've made. You can learn more
about the rebase vs. merge workflow in `Rebase and merge your pull request <https://github.blog/2016-09-26-rebase-and-merge-pull-requests/>`__
and `Rebase your fork <http://stackoverflow.com/a/7244456/1110993>`__. Make sure to resolve all conflicts
during rebase.
When merging PRs, maintainers will use **Squash and Merge**, which means your PR will be merged as one commit, regardless of the number of commits in your PR. During the review cycle, you can keep a commit history for easier review, but if you need to, you can also squash all commits to reduce the maintenance burden during rebase.
Add an `Apache License <http://www.apache.org/legal/src-headers.html>`__ header to all new files. If you
have prek installed, prek will do it automatically for you. If you hesitate to install
prek for your local repository - for example because it takes a few seconds to commit your changes -
this one thing might be a good reason to install it anyway.
If your PR adds functionality, make sure to update the docs as part of the same PR, not only the code and tests. A docstring is often sufficient. Make sure to follow Sphinx-compatible standards.
Make sure your code fulfills all the
`static code checks <08_static_code_checks.rst#static-code-checks>`__ we have in our code. The easiest way
to make sure of that is - again - to install the `prek hooks <08_static_code_checks.rst#prek-hooks>`__.
Make sure your PR is small and focused on one change only - avoid adding unrelated changes or mixing
feature additions with refactoring. Keeping to that rule will make it easier to review your PR and will make
it easier for release managers if they decide that your change should be cherry-picked to release it in a
bug-fix release of Airflow. If you want to add a new feature and refactor the code, it's better to split the
PR into several smaller PRs. It's also quite a good and common idea to keep a big Draft PR if you have
a bigger change that you want to make, and then create smaller PRs from it that are easier to review,
merge and cherry-pick. It takes a long time (and a lot of a reviewer's attention and focus) to review
big PRs, so by splitting your change into smaller PRs you actually speed up the review process and make it easier
for your change to be eventually merged.
To learn more about cherry-picking of PRs, refer to the `Airflow 3 development documentation <../blob/main/dev/README_AIRFLOW3_DEV.md>`__.
Run relevant tests locally before opening a PR. Often tests are placed in files that correspond
to the changed code (for example, for ``airflow/cli/cli_parser.py`` changes you have tests in
``tests/cli/test_cli_parser.py``). However, there are a number of cases where the tests that should run
are placed elsewhere - you can either run tests for the whole ``TEST_TYPE`` that is relevant (see the
``breeze testing core-tests --help`` or ``breeze testing providers-tests --help`` output for
the available test types for each of the testing commands), or you can run all tests, or eventually
you can push your code to the PR and see the results of the tests in the CI.
You can use any supported Python version to run the tests, but it is best to check
whether your change works with the oldest supported version (currently Python 3.10). In rare cases
tests might fail with the oldest version when you use features that are only available in newer Python
versions. For that purpose we have the ``airflow.compat`` package, where we keep back-ported
useful features from newer versions.
Adhere to the guidelines for commit messages described in `this article <https://cbea.ms/git-commit/>`__.
This makes the lives of those who come after you (and your future self) a lot easier.
.. _pull-request-quality-criteria:
Every open PR must meet the following minimum quality criteria before maintainers will review it. PRs that do not meet these criteria may be automatically converted to draft status by project tooling, with a comment explaining what needs to be fixed.
Descriptive title — The PR title must clearly describe the change. Generic titles such as "Fix bug", "Update code", "Changes", single-word titles, or titles that only reference an issue number (e.g. "Fixes #12345") do not meet this bar.
Meaningful description — The PR body must contain a meaningful description of what the PR does and why. An empty body, a body consisting only of the PR template checkboxes/headers with no added text, or a body that merely repeats the title is not sufficient.
Passing static checks — The PR's static checks (pre-commit / ruff / mypy) must pass.
You can run them locally with ``prek run --from-ref main`` before pushing.
Gen-AI disclosure — If the PR was created with the assistance of generative AI tools,
the description must include a disclosure (see `Gen-AI Assisted contributions`_ below).
Coherent changes — The PR should contain related changes only. Completely unrelated changes bundled together will be flagged.
**What happens when a PR is converted to draft?**
Converting a PR to draft is not a rejection — it is an invitation to bring the PR up to the project's standards so that maintainer review time is spent productively.
For details on how maintainers use tooling to triage and review PRs, see
`Maintainer PR Triage and Review <25_maintainer_pr_triage.md>`__.
**What happens when a PR is closed for quality violations?**
If a contributor has more than 3 open PRs that are flagged for quality issues, maintainers may
choose to close the PR instead of converting it to draft. Closed PRs receive the
``closed because of multiple quality violations`` label and a comment listing the violations.
Contributors are welcome to open a new PR that addresses the issues listed in the comment.
**What happens when suspicious changes are detected?**
When maintainers review a PR's diff before approving CI workflow runs and determine that it
contains suspicious changes (e.g. attempts to exfiltrate secrets, modify CI pipelines
maliciously, or inject harmful code), all open PRs by the same author will be closed
and labeled ``suspicious changes detected``. A comment is posted on each PR explaining that
the closure was triggered by suspicious changes found in the flagged PR.
Gen-AI Assisted contributions
-----------------------------

Generally, it's fine to use Gen-AI tools to help you create Pull Requests for Apache Airflow as long as you adhere to the
`Apache Software Foundation (ASF) Generative Tooling guidelines <https://www.apache.org/legal/generative-tooling.html>`__.

When a contributor does not follow these guidelines, maintainers might decide to close the PR (and all the PRs of that contributor) without reviewing them - to avoid extra burden on the maintainers and to protect the project from the potential risk of a tired maintainer merging unvetted code. This should be accompanied by a comment explaining the reason for closing the PR and pointing to this section of the documentation.
If a contributor repeatedly ignores these guidelines, PMC members might decide to block the contributor from making further contributions to the project; this is a last-resort measure to protect the project from the potential risks of unvetted code and to avoid overburdening the maintainers. Such blocking is accompanied by a LAZY CONSENSUS vote amongst the PMC members, to make sure that the decision is agreed upon and not vetoed by any of the PMC members, and it is kept in the records of the Apache Software Foundation. In extreme cases the PMC might request the ASF Infrastructure team to block the contributor at the organization level - for all ASF projects.
If a contributor is evidently spamming the project with content that violates the GitHub terms and conditions, maintainers might decide to report such behaviour to GitHub for further action, which often results in the deletion of the user account by GitHub or in the user being blocked from making further contributions at the GitHub level.
We have decided to enable the requirement to resolve all the open conversations in a PR in order to make it mergeable. You will see in the status of the PR that it needs to have all the conversations resolved before it can be merged.
The goal is to make it easier for everyone involved in the PR - the author, the reviewers, and the maintainers who
try to figure out whether the PR is ready to merge and - eventually - merge it - to see when some conversations are
still unresolved. The goal is also to use conversations more as a "soft" way
to request changes, and to limit the use of the ``Request changes`` status to only those cases when the maintainer
is sure that the PR should not be merged in its current state. This leads to a faster review/merge
cycle and fewer problems with stalled PRs that have ``Request changes`` status even though all the issues are
already solved.
.. _coding_style:
Coding style and best practices
-------------------------------

Most of our coding style rules are enforced programmatically by ruff and mypy, which are run automatically with static checks and on every Pull Request (PR), but there are some rules that are not yet automated and are more Airflow-specific or semantic than stylistic.
Don't Use Asserts Outside Tests
...............................
Our community agreed that for various reasons we do not use ``assert`` in production code of Apache Airflow.
For details check the relevant `mailing list thread <https://lists.apache.org/thread.html/bcf2d23fcd79e21b3aac9f32914e1bf656e05ffbcb8aa282af497a2d%40%3Cdev.airflow.apache.org%3E>`__.
In other words instead of doing:
.. code-block:: python

    assert some_predicate()
you should do:
.. code-block:: python

    if not some_predicate():
        handle_the_case()
The one exception to this is when you need an ``assert`` for type checking (which should be almost a last resort); then you can do this:
.. code-block:: python

    if TYPE_CHECKING:
        assert isinstance(x, MyClass)
Database Session Handling
.........................
Explicit is better than implicit. If a function accepts a session parameter it should not commit the
transaction itself. Session management is up to the caller.
To make this easier, there is the ``create_session`` helper:
.. code-block:: python

    from sqlalchemy.orm import Session

    from airflow.utils.session import create_session


    def my_call(x, y, *, session: Session):
        ...
        # You MUST not commit the session here.


    with create_session() as session:
        my_call(x, y, session=session)
.. warning::

    DO NOT add a default to the ``session`` argument unless ``@provide_session`` is used.
If this function is designed to be called by "end-users" (i.e. Dag authors), then using the ``@provide_session`` wrapper is okay:
.. code-block:: python

    from sqlalchemy.orm import Session

    from airflow.utils.session import NEW_SESSION, provide_session


    @provide_session
    def my_method(arg, *, session: Session = NEW_SESSION):
        ...
        # You SHOULD not commit the session here. The wrapper will take care of commit()/rollback() if exception
In both cases, the ``session`` argument is a `keyword-only argument`_. This is the preferred form
where possible, although there are some exceptions in the code base where this cannot be used, due to backward
compatibility considerations. In most cases, the ``session`` argument should be last in the argument list.
.. _keyword-only argument: https://www.python.org/dev/peps/pep-3102/
Don't use time() for duration calculations
..........................................
If you wish to compute the time difference between two events within the same process, use
``time.monotonic()``, not ``time.time()`` nor ``timezone.utcnow()``.
If you are measuring duration for performance reasons, then ``time.perf_counter()`` should be used. (On many
platforms, this uses the same underlying clock mechanism as ``monotonic``, but ``perf_counter`` is guaranteed to be
the highest-accuracy clock on the system, while ``monotonic`` is simply "guaranteed" to not go backwards.)
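As a minimal illustration of the difference (plain standard-library Python, not Airflow code; the ``time.sleep`` call is just a stand-in for real work):

```python
import time

# Duration between two events in the same process: use the monotonic clock,
# which is immune to system clock adjustments.
start = time.monotonic()
time.sleep(0.05)  # stand-in for the work being measured
elapsed = time.monotonic() - start

# Fine-grained performance measurement: prefer perf_counter, the
# highest-resolution clock available on the platform.
t0 = time.perf_counter()
total = sum(range(10_000))
t1 = time.perf_counter()
print(f"work took {elapsed:.3f}s, micro-benchmark took {t1 - t0:.6f}s")
```

Had we used ``time.time()`` instead, an NTP adjustment or manual clock change between the two readings could yield a wrong (even negative) duration.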
If you wish to time how long a block of code takes, use ``Stats.timer()`` - either with a metric name, which
will be timed and submitted automatically:
.. code-block:: python

    from airflow.sdk.observability.stats import Stats

    ...

    with Stats.timer("my_timer_metric"):
        ...
or to time but not send a metric:
.. code-block:: python

    from airflow.sdk.observability.stats import Stats

    ...

    with Stats.timer() as timer:
        ...

    log.info("Code took %.3f ms", timer.duration)
For full docs on ``timer()``, check out ``shared/observability/src/airflow_shared/observability/metrics/base_stats_logger.py``.
If the ``start_date`` of a duration calculation needs to be stored in a database, then this has to be done using ``datetime`` objects. In all other cases, using ``datetime`` for duration calculations MUST be avoided, as creating and diffing ``datetime`` objects is (comparatively) slow.
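A sketch of that one acceptable case (illustrative variable names, not a real Airflow snippet): a start time that must be persisted has to be a wall-clock ``datetime``, because a monotonic reading is meaningless outside the process that took it.

```python
from datetime import datetime, timezone

# start_date must be a wall-clock datetime because it is stored in the
# database and may be read later by a different process or machine.
start_date = datetime.now(timezone.utc)
# ... the timed work would happen here ...
duration_seconds = (datetime.now(timezone.utc) - start_date).total_seconds()
print(f"duration: {duration_seconds:.6f}s")
```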
Templated fields in Operator's init method
..........................................
Airflow Operators might have some fields added to the list of ``template_fields``. Such fields should be
set in the constructor (``__init__`` method) of the operator, and usually their values should
come from the ``__init__`` method arguments. The reason is that the templated fields
are evaluated at the time of operator execution: when you pass arguments to the operator
in the Dag, the fields that are set on the class just before the ``execute`` method is called
are processed through the templating engine, and the field values are set to the result of applying the
templating engine to them (in case a field is a structure such as a dict or list, the templating
engine is applied to all the values of the structure).
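To make this concrete, here is a minimal, hypothetical sketch (a plain class standing in for a real operator; ``EchoOperator`` and its ``message`` field are made up for illustration) of a templated field assigned verbatim in the constructor and validated only in ``execute``:

```python
class EchoOperator:
    # Simplified stand-in for an Airflow operator, not real Airflow code.
    template_fields = ("message",)

    def __init__(self, *, message: str, **kwargs):
        # Assign the argument as-is: at this point `message` may still be an
        # unrendered template string such as "{{ ds }}", so no validation here.
        self.message = message

    def execute(self, context):
        # By the time execute() runs, the templating engine has replaced
        # self.message with the rendered value, so validation belongs here.
        if not self.message:
            raise ValueError("message must not be empty")
        return self.message
```

Validating ``message`` in ``__init__`` would be wrong: at Dag-parse time the field may hold a template, not the final value.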
That's why we expect two things in case of template fields:

* The templated fields should be assigned in the constructor directly from the ``__init__`` arguments, without modification.
* Any validation or processing of the templated fields should be done in the ``execute`` method, not in the constructor, because in the constructor the field value might be a templated value, not the final value.

The exceptions are cases where we want to assign an empty default value to a mutable field (a list or dict), or when we have a more complex structure which we want to convert into a different format (say a dict or list) but where we want to keep the original strings in the converted structure.
In such cases we can usually do something like this:
.. code-block:: python

    def __init__(self, *, my_field: list[str] | None = None, **kwargs):
        super().__init__(**kwargs)
        my_field = my_field or []
        self.my_field = my_field
The reason for doing it this way is that we are working on cleaning up our code and adding a
`prek hook <../scripts/ci/prek/validate_operators_init.py>`__
that will make sure that all the cases where logic (such as validation and complex conversion)
is done in the constructor are detected in PRs.
Don't raise AirflowException directly
.....................................
Our community has decided to stop adding new ``raise AirflowException`` statements and to adopt the following practices when an exception is necessary. For details check the relevant `mailing list thread <https://lists.apache.org/thread/t8bnhyqy77kq4fk7fj3fmjd5wo9kv6w0>`__.
1. Use more specific built-in Python exceptions where appropriate (e.g. ``ValueError``, ``TypeError``, ``OSError``)
   instead of wrapping everything in ``AirflowException``.
2. For ``airflow-core``, we should define and utilize more specific exception classes under
   ``airflow-core/src/airflow/exceptions.py``.
3. For providers, exceptions should live in ``providers/<provider>/src/airflow/providers/<provider>/exceptions.py``.

The use of points 2 and 3 should only be considered when point 1 is inappropriate, which should be a rare occurrence.
In other words instead of doing:
.. code-block:: python

    if key not in conf:
        raise AirflowException(f"Required key {key} is missing")
you should do:
.. code-block:: python

    if key not in conf:
        raise ValueError(f"Required key {key} is missing")
If you want to learn what the options for your development environment are, head over to the
`Development environments <06_development_environments.rst>`__ document.