.. |ss| raw:: html

   <strike>

.. |se| raw:: html

   </strike>

.. _roadmap:

This document lists general directions that core contributors are interested to see developed in scikit-learn. The fact that an item is listed here is in no way a promise that it will happen, as resources are limited. Rather, it is an indication that help is welcomed on this topic.
Eleven years after the inception of Scikit-learn, much has changed in the world of machine learning. Key changes include:
A more subtle change over the last decade is that, due to changing interests in ML, PhD students in machine learning are more likely to contribute to PyTorch, Dask, etc. than to Scikit-learn, so our contributor pool is very different to a decade ago.
Scikit-learn remains very popular in practice for trying out canonical machine learning techniques, particularly for applications in experimental science and in data science. A lot of what we provide is now very mature. But it can be costly to maintain, and we cannot therefore include arbitrary new implementations. Yet Scikit-learn is also essential in defining an API framework for the development of interoperable machine learning components external to the core library.
Thus our main goals in this era are to:
Many of the more fine-grained goals can be found under the `API tag
<https://github.com/scikit-learn/scikit-learn/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc+label%3AAPI>`_
on the issue tracker.
The list is numbered not as an indication of the order of priority, but to make referring to specific points easier. Please add new entries only at the bottom. Note that the crossed out entries are already done, and we try to keep the document up to date as we work on these issues.
#. Improved handling of Pandas DataFrames

#. Improved handling of categorical features (:issue:`29437`)

#. Improved handling of missing data (:issue:`6284`)

#. More didactic documentation

#. Passing around information that is not (X, y): Feature properties
   (:issue:`8480`)

#. Passing around information that is not (X, y): Target information
   (:issue:`6231`, :issue:`8100`)

#. Make it easier for external users to write Scikit-learn-compatible
   components

#. Support resampling and sample reduction (:issue:`3855`)

#. Better interfaces for interactive development (``estimator_html_repr``)

#. Improved tools for model diagnostics and basic inference

#. Better tools for selecting hyperparameters with transductive estimators

#. Better support for manual and automatic pipeline building
   (:issue:`7608`, :issue:`5082`, :issue:`8243`,
   `searchgrid <https://searchgrid.readthedocs.io/en/latest/>`_)

#. Improved tracking of fitting (:issue:`6929`, :issue:`78`)

#. Distributed parallelism (``__array_function__``)

#. A way forward for more out of core

#. Backwards-compatible de/serialization of some estimators

#. Documentation and tooling for model lifecycle management

   * Document good practices for model deployment and lifecycle management.
     Before deploying a model, snapshot the code versions (numpy, scipy,
     scikit-learn, custom code repo), the training script, and an alias
     describing how to retrieve the historical training data. Also snapshot a
     copy of a small validation set and the predictions (predicted
     probabilities for classifiers) on that validation set.

   * Documentation and tools to make it easy to manage upgrades of
     scikit-learn versions:

#. Everything in scikit-learn should probably conform to our API contract.
   We are still in the process of making decisions on some of these related
   issues.

   * :class:`Pipeline <pipeline.Pipeline>` and FeatureUnion modify their input
     parameters in fit. Fixing this requires a good grasp of their use cases,
     so that all current functionality is maintained. :issue:`8157`
     :issue:`7382`

#. (Optional) Improve the scikit-learn common test suite to make sure that
   models (at least the frequently used ones) have stable predictions across
   versions (to be discussed).

   * `ONNX <https://github.com/onnx/sklearn-onnx>`_, and use the above best
     practices to assess predictive consistency between scikit-learn and ONNX
     prediction functions on a validation set.
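
The snapshot practice described under model lifecycle management could look roughly like the sketch below. It is only an illustration, not an agreed design: the synthetic dataset, the ``snapshot`` dictionary layout and the helper name ``behaves_like_snapshot`` are assumptions made for this example, and a real deployment would also record the custom code revision and the training script.

```python
import pickle

import numpy as np
import scipy
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the real training data (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Snapshot taken before deployment: library versions, a small validation
# set, and the predicted probabilities on that validation set.
snapshot = {
    "versions": {
        "numpy": np.__version__,
        "scipy": scipy.__version__,
        "scikit-learn": sklearn.__version__,
    },
    "X_val": X_val,
    "proba_val": model.predict_proba(X_val),
}
serialized_model = pickle.dumps(model)


def behaves_like_snapshot(estimator, snapshot, atol=1e-8):
    """Return True if the estimator reproduces the snapshotted probabilities.

    Hypothetical helper for this sketch, not part of scikit-learn.
    """
    proba = estimator.predict_proba(snapshot["X_val"])
    return bool(np.allclose(proba, snapshot["proba_val"], atol=atol))


# Later, for example after upgrading scikit-learn: reload the model and
# compare its predictions with the ones recorded in the snapshot.
reloaded = pickle.loads(serialized_model)
print(behaves_like_snapshot(reloaded, snapshot))
```

Comparing predicted probabilities rather than hard class labels makes the check sensitive to small numerical changes; the tolerance ``atol`` is an arbitrary choice here and would need to be tuned per model.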