doc/developers/tips.rst
.. _developers-tips:
In this section we gather some useful advice and tools that may increase your
quality-of-life when reviewing pull requests, running unit tests, and so forth.
Some of these tricks consist of userscripts that require a browser extension
such as TamperMonkey_ or GreaseMonkey_; to set up userscripts you must have
one of these extensions installed, enabled and running. We provide userscripts
as GitHub gists; to install them, click on the "Raw" button on the gist page.
.. _TamperMonkey: https://tampermonkey.net/ .. _GreaseMonkey: https://www.greasespot.net/
GitHub hides discussions on PRs when the corresponding lines of code have been
changed in the meantime. This userscript <https://raw.githubusercontent.com/lesteve/userscripts/master/github-expand-all.user.js>__
provides a shortcut (Control-Alt-P at the time of writing but look at the code
to be sure) to unfold all such hidden discussions at once, so you can catch up.
In your local fork, add to your .git/config, under the [remote "upstream"] heading, the line::
fetch = +refs/pull//head:refs/remotes/upstream/pr/
You may then use git checkout pr/PR_NUMBER to navigate to the code of the
pull-request with the given number. (Read more in this gist. <https://gist.github.com/piscisaureus/3342247>_)
To overlay the code coverage reports generated by the CodeCov continuous
integration, consider this browser extension <https://github.com/codecov/browser-extension>_. The coverage of each line
will be displayed as a color background behind the line number.
.. _pytest_tips:
The full test suite takes fairly long to run. For faster iterations,
it is possible to select a subset of tests using pytest selectors.
In particular, one can run a single test based on its node ID <https://docs.pytest.org/en/latest/example/markers.html#selecting-tests-based-on-their-node-id>_:
.. prompt:: bash $
pytest -v sklearn/linear_model/tests/test_logistic.py::test_sparsify
or use the -k pytest parameter <https://docs.pytest.org/en/latest/example/markers.html#using-k-expr-to-select-tests-based-on-their-name>_
to select tests based on their name. For instance,:
.. prompt:: bash $
pytest sklearn/tests/test_common.py -v -k LogisticRegression
will run all :term:common tests for the LogisticRegression estimator.
When a unit test fails, the following tricks can make debugging easier:
The command line argument pytest -l instructs pytest to print the local
variables when a failure occurs.
The argument pytest --pdb drops into the Python debugger on failure. To
instead drop into the rich IPython debugger ipdb, you may set up a
shell alias to:
.. prompt:: bash $
pytest --pdbcls=IPython.terminal.debugger:TerminalPdb --capture no
Other pytest options that may become useful include:
-x which exits on the first failed test,--lf to rerun the tests that failed on the previous run,--ff to rerun all previous tests, running the ones that failed first,-s so that pytest does not capture the output of print() statements,--tb=short or --tb=line to control the length of the logs,--runxfail also run tests marked as a known failure (XFAIL) and report errors.Since our continuous integration tests will error if
FutureWarning isn't properly caught,
it is also recommended to run pytest along with the
-Werror::FutureWarning flag.
.. _saved_replies:
It may be helpful to store some of these in GitHub's saved replies <https://github.com/settings/replies/>_ for reviewing:
.. highlight:: none
.. Note that putting this content on a single line in a literal is the easiest way to make it copyable and wrapped on screen.
Issue: Usage questions
::
You are asking a usage question. The issue tracker is for bugs and new features. For usage questions, it is recommended to try [Stack Overflow](https://stackoverflow.com/questions/tagged/scikit-learn) or [the Mailing List](https://mail.python.org/mailman/listinfo/scikit-learn).
Unfortunately, we need to close this issue as this issue tracker is a communication tool used for the development of scikit-learn. The additional activity created by usage questions crowds it too much and impedes this development. The conversation can continue here, however there is no guarantee that it will receive attention from core developers.
Issue: You're welcome to update the docs
::
Please feel free to offer a pull request updating the documentation if you feel it could be improved.
Issue: Self-contained example for bug
::
Please provide [self-contained example code](https://scikit-learn.org/dev/developers/minimal_reproducer.html), including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal.
Issue: Software versions
::
To help diagnose your issue, please paste the output of:
```py
import sklearn; sklearn.show_versions()
```
Thanks.
Issue: Code blocks
::
Readability can be greatly improved if you [format](https://help.github.com/articles/creating-and-highlighting-code-blocks/) your code snippets and complete error messages appropriately. For example:
```python
print(something)
```
generates:
```python
print(something)
```
And:
```pytb
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named 'hello'
```
generates:
```pytb
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named 'hello'
```
You can edit your issue descriptions and comments at any time to improve readability. This helps maintainers a lot. Thanks!
Issue/Comment: Linking to code
::
Friendly advice: for clarity's sake, you can link to code like [this](https://help.github.com/articles/creating-a-permanent-link-to-a-code-snippet/).
Issue/Comment: Linking to comments
::
Please use links to comments, which make it a lot easier to see what you are referring to, rather than just linking to the issue. See [this](https://stackoverflow.com/questions/25163598/how-do-i-reference-a-specific-issue-comment-on-github) for more details.
PR-NEW: Better description and title
::
Thanks for the pull request! Please make the title of the PR more descriptive. The title will become the commit message when this is merged. You should state what issue (or PR) it fixes/resolves in the description using the syntax described [here](https://scikit-learn.org/dev/developers/contributing.html#contributing-pull-requests).
PR-NEW: Fix #
::
Please use "Fix #issueNumber" in your PR description (and you can do it more than once). This way the associated issue gets closed automatically when the PR is merged. For more details, look at [this](https://github.com/blog/1506-closing-issues-via-pull-requests).
PR-NEW or Issue: Maintenance cost
::
Every feature we include has a [maintenance cost](https://scikit-learn.org/dev/faq.html#why-are-you-so-selective-on-what-algorithms-you-include-in-scikit-learn). Our maintainers are mostly volunteers. For a new feature to be included, we need evidence that it is often useful and, ideally, [well-established](https://scikit-learn.org/dev/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) in the literature or in practice. Also, we expect PR authors to take part in the maintenance for the code they submit, at least initially. That doesn't stop you implementing it for yourself and publishing it in a separate repository, or even [scikit-learn-contrib](https://scikit-learn-contrib.github.io).
PR-WIP: What's needed before merge?
::
Please clarify (perhaps as a TODO list in the PR description) what work you believe still needs to be done before it can be reviewed for merge. When it is ready, please prefix the PR title with `[MRG]`.
PR-WIP: Regression test needed
::
Please add a [non-regression test](https://en.wikipedia.org/wiki/Non-regression_testing) that would fail at main but pass in this PR.
PR-MRG: Patience
::
Before merging, we generally require two core developers to agree that your pull request is desirable and ready. [Please be patient](https://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention), as we mostly rely on volunteered time from busy core developers. (You are also welcome to help us out with [reviewing other PRs](https://scikit-learn.org/dev/developers/contributing.html#code-review-guidelines).)
PR-MRG: Add to what's new
::
Please add an entry to the future changelog by adding an RST fragment into the module associated with your change located in `doc/whats_new/upcoming_changes`. Refer to the following [README](https://github.com/scikit-learn/scikit-learn/blob/main/doc/whats_new/upcoming_changes/README.md) for full instructions.
PR: Don't change unrelated
::
Please do not change unrelated lines. It makes your contribution harder to review and may introduce merge conflicts to other pull requests.
.. _debugging_ci_issues:
CI issues may arise for a variety of reasons, so this is by no means a comprehensive guide, but rather a list of useful tips and tricks.
Using a lock-file to get an environment close to the CI +++++++++++++++++++++++++++++++++++++++++++++++++++++++
conda-lock can be used to create a conda environment with the exact same
conda and pip packages as on the CI. For example, the following command will
create a conda environment named scikit-learn-doc that is similar to the CI:
.. prompt:: bash $
conda-lock install -n scikit-learn-doc build_tools/circle/doc_linux-64_conda.lock
.. note::
It only works if you have the same OS as the CI build (check `platform:` in
the lock-file). For example, the previous command will only work if you are
on a Linux machine. Also this may not allow you to reproduce some of the
issues that are more tied to the particularities of the CI environment, for
example CPU architecture reported by OpenBLAS in `sklearn.show_versions()`.
If you don't have the same OS as the CI build you can still create a conda environment from the right environment yaml file, although it won't be as close as the CI environment as using the associated lock-file. For example for the doc build:
.. prompt:: bash $
conda env create -n scikit-learn-doc -f build_tools/circle/doc_environment.yml -y
This may not give you exactly the same package versions as in the CI for a variety of reasons, for example:
main branch and the time you run the conda create
command. You can always try to look at the version in the lock-file and
specify the versions by hand for some specific packages that you think would
help reproducing the issue.Also the problem may be OS specific so the only way to be able to reproduce would be to have the same OS as the CI build.
.. highlight:: default
While python/numpy's built-in memory management is relatively robust, it can lead to performance penalties for some routines. For this reason, much of the high-performance code in scikit-learn is written in cython. This performance gain comes with a tradeoff, however: it is very easy for memory bugs to crop up in cython code, especially in situations where that code relies heavily on pointer arithmetic.
Memory errors can manifest themselves a number of ways. The easiest ones to debug are often segmentation faults and related glibc errors. Uninitialized variables can lead to unexpected behavior that is difficult to track down. A very useful tool when debugging these sorts of errors is valgrind_.
Valgrind is a command-line tool that can trace memory errors in a variety of code. Follow these steps:
Install valgrind_ on your system.
Download the python valgrind suppression file: valgrind-python.supp_.
Follow the directions in the README.valgrind_ file to customize your
python suppressions. If you don't, you will have spurious output coming
related to the python interpreter instead of your own code.
Run valgrind as follows:
.. prompt:: bash $
valgrind -v --suppressions=valgrind-python.supp python my_test_script.py
.. _valgrind: https://valgrind.org
.. _README.valgrind: https://github.com/python/cpython/blob/master/Misc/README.valgrind
.. _valgrind-python.supp: https://github.com/python/cpython/blob/master/Misc/valgrind-python.supp
The result will be a list of all the memory-related errors, which reference lines in the C-code generated by cython from your .pyx file. If you examine the referenced lines in the .c file, you will see comments which indicate the corresponding location in your .pyx source file. Hopefully the output will give you clues as to the source of your memory error.
For more information on valgrind and the array of options it has, see the
tutorials and documentation on the valgrind web site <https://valgrind.org>_.
.. _arm64_dev_env:
ARM-based machines are a popular target for mobile, edge or other low-energy deployments (including in the cloud, for instance on Scaleway or AWS Graviton).
Here are instructions to setup a local dev environment to reproduce ARM-specific bugs or test failures on an x86_64 host laptop or workstation. This is based on QEMU user mode emulation using docker for convenience (see https://github.com/multiarch/qemu-user-static).
.. note::
The following instructions are illustrated for ARM64 but they also apply to
ppc64le, after changing the Docker image and Miniforge paths appropriately.
Prepare a folder on the host filesystem and download the necessary tools and source code:
.. prompt:: bash $
mkdir arm64
pushd arm64
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-aarch64.sh
git clone https://github.com/scikit-learn/scikit-learn.git
Use docker to install QEMU user mode and run an ARM64v8 container with access
to your shared folder under the /io mount point:
.. prompt:: bash $
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
docker run -v `pwd`:/io --rm -it arm64v8/ubuntu /bin/bash
In the container, install miniforge3 for the ARM64 (a.k.a. aarch64) architecture:
.. prompt:: bash $
bash Miniforge3-Linux-aarch64.sh
# Choose to install miniforge3 under: `/io/miniforge3`
Whenever you restart a new container, you will need to reinit the conda env
previously installed under /io/miniforge3:
.. prompt:: bash $
/io/miniforge3/bin/conda init
source /root/.bashrc
as the /root home folder is part of the ephemeral docker container. Every
file or directory stored under /io is persistent on the other hand.
You can then build scikit-learn as usual (you will need to install compiler
tools and dependencies using apt or conda as usual). Building scikit-learn
takes a lot of time because of the emulation layer, however it needs to be
done only once if you put the scikit-learn folder under the /io mount
point.
Then use pytest to run only the tests of the module you are interested in debugging.
.. _meson_build_backend:
Since scikit-learn 1.5.0 we use meson-python as the build tool. Meson is a new tool for scikit-learn and the PyData ecosystem. It is used by several other packages that have written good guides about what it is and how it works.
pandas setup doc <https://pandas.pydata.org/docs/development/contributing_environment.html#step-3-build-and-install-pandas>_:
pandas has a similar setup as ours (no spin or dev.py)scipy Meson doc <https://scipy.github.io/devdocs/building/understanding_meson.html>_ gives
more background about how Meson works behind the scenes