.. include:: _contributors.rst
.. currentmodule:: sklearn
.. _changes_0_17_1:
February 18, 2016
Bug fixes
.........
Upgrade vendored joblib to version 0.9.4, which fixes an important bug in
joblib.Parallel that can silently yield wrong results when working
on datasets larger than 1MB:
https://github.com/joblib/joblib/blob/0.9.4/CHANGES.rst
Fixed reading of Bunch pickles generated with scikit-learn
version <= 0.16. This can affect users who have already
downloaded a dataset with scikit-learn 0.16 and are loading it
with scikit-learn 0.17. See :issue:6196 for
how this affected :func:datasets.fetch_20newsgroups. By Loic Esteve_.
Fixed a bug that prevented using ROC AUC score to perform grid search on
several CPUs / cores on large arrays. See :issue:6147.
By Olivier Grisel_.
Fixed a bug that prevented the presort parameter from being set properly
in :class:ensemble.GradientBoostingRegressor. See :issue:5857.
By Andrew McCulloh.
Fixed a joblib error when evaluating the perplexity of a
:class:decomposition.LatentDirichletAllocation model. See :issue:6258.
By Chyi-Kwei Yau.
.. _changes_0_17:
November 5, 2015
New features
............
All the Scaler classes but :class:preprocessing.RobustScaler can be fitted online by
calling partial_fit. By :user:Giorgio Patrini <giorgiop>.
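
As an illustration of online fitting, the sketch below feeds a dataset to preprocessing.StandardScaler in mini-batches through partial_fit and compares the result with a single fit on the full array. The synthetic data and the number of batches are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # synthetic data (assumption)

# Fit incrementally, one mini-batch at a time.
online = StandardScaler()
for batch in np.array_split(X, 10):
    online.partial_fit(batch)

# Fit in a single pass for comparison.
full = StandardScaler().fit(X)

means_match = bool(np.allclose(online.mean_, full.mean_))
scales_match = bool(np.allclose(online.scale_, full.scale_))
print(means_match, scales_match)
```

The same loop should apply when the batches come from disk or a stream and never fit in memory at once.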
The new class :class:ensemble.VotingClassifier implements a
"majority rule" / "soft voting" ensemble classifier to combine
estimators for classification. By Sebastian Raschka_.
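
A minimal sketch of the two voting modes follows. The choice of base estimators and the iris dataset are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Base estimators are arbitrary; any classifiers could be combined.
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]

# "hard" voting takes the majority of the predicted labels.
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X, y)
hard_pred = hard.predict(X[:5])

# "soft" voting averages the predicted class probabilities.
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)
soft_proba = soft.predict_proba(X[:5])

rows_sum_to_one = bool(np.allclose(soft_proba.sum(axis=1), 1.0))
print(hard_pred, soft_proba.shape, rows_sum_to_one)
```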
The new class :class:preprocessing.RobustScaler provides an
alternative to :class:preprocessing.StandardScaler for feature-wise
centering and range normalization that is robust to outliers.
By :user:Thomas Unterthiner <untom>.
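
The effect of an outlier on the two scalers can be sketched as follows. The toy column with one extreme value is an assumption chosen so that the statistics are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme value (toy data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit(X)      # centers on the median, scales by the IQR
standard = StandardScaler().fit(X)  # centers on the mean, scales by the std

robust_center = float(robust.center_[0])   # median of the column
robust_scale = float(robust.scale_[0])     # interquartile range
standard_mean = float(standard.mean_[0])   # pulled towards the outlier
print(robust_center, robust_scale, standard_mean)
```

Here the median and interquartile range stay close to the bulk of the data, while the mean is dominated by the single outlier.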
The new class :class:preprocessing.MaxAbsScaler provides an
alternative to :class:preprocessing.MinMaxScaler for feature-wise
range normalization when the data is already centered or sparse.
By :user:Thomas Unterthiner <untom>.
The new class :class:preprocessing.FunctionTransformer turns a Python
function into a Pipeline-compatible transformer object.
By Joe Jevnik.
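
For example, a plain NumPy function can be placed in front of a scaler in a pipeline. The use of np.log1p and the toy matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Wrap an ordinary function so that it exposes fit / transform.
log_step = FunctionTransformer(np.log1p)

pipe = make_pipeline(log_step, StandardScaler())

X = np.array([[0.0, 1.0], [9.0, 99.0], [99.0, 999.0]])  # toy data
Xt = pipe.fit_transform(X)

columns_centered = bool(np.allclose(Xt.mean(axis=0), 0.0))
print(Xt.shape, columns_centered)
```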
The new classes cross_validation.LabelKFold and
cross_validation.LabelShuffleSplit generate train-test folds,
respectively similar to cross_validation.KFold and
cross_validation.ShuffleSplit, except that the folds are
conditioned on a label array. By Brian McFee, :user:Jean Kossaifi <JeanKossaifi> and Gilles Louppe.
:class:decomposition.LatentDirichletAllocation implements the Latent
Dirichlet Allocation topic model with online variational
inference. By :user:Chyi-Kwei Yau <chyikwei>, with code based on an implementation
by Matt Hoffman. (:issue:3659)
The new solver sag implements a Stochastic Average Gradient descent
and is available in both :class:linear_model.LogisticRegression and
:class:linear_model.Ridge. This solver is very efficient for large
datasets. By :user:Danny Sullivan <dsullivan7> and Tom Dupre la Tour_.
(:issue:4738)
The new solver cd implements a Coordinate Descent in
:class:decomposition.NMF. The previous solver, based on Projected Gradient, is
still available by setting the new parameter solver to pg, but is
deprecated and will be removed in 0.19, along with
decomposition.ProjectedGradientNMF and parameters sparseness,
eta, beta and nls_max_iter. New parameters alpha and
l1_ratio control L1 and L2 regularization, and shuffle adds a
shuffling step in the cd solver.
By Tom Dupre la Tour_ and Mathieu Blondel_.
Enhancements
............
:class:manifold.TSNE now supports approximate optimization via the
Barnes-Hut method, leading to much faster fitting. By Christopher Erick Moody.
(:issue:4025)
:class:cluster.MeanShift now supports parallel execution,
as implemented in the mean_shift function. By :user:Martino Sorbaro <martinosorb>.
:class:naive_bayes.GaussianNB now supports fitting with sample_weight.
By Jan Hendrik Metzen_.
:class:dummy.DummyClassifier now supports a prior fitting strategy.
By Arnaud Joly_.
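
A small sketch of the prior strategy: the classifier ignores the features, predicts the most frequent class, and reports the empirical class distribution as probabilities. The four-sample dataset is a toy assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((4, 1))           # features are ignored by the dummy estimator
y = np.array([0, 0, 0, 1])     # class 0 appears three times out of four

clf = DummyClassifier(strategy="prior").fit(X, y)

proba = clf.predict_proba(X[:1])  # empirical class prior
pred = clf.predict(X[:1])         # most frequent class
print(proba, pred)
```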
Added a fit_predict method for mixture.GMM and subclasses.
By :user:Cory Lorenz <clorenz7>.
Added the :func:metrics.label_ranking_loss metric.
By Arnaud Joly_.
Added the :func:metrics.cohen_kappa_score metric.
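
As a worked example, the sketch below compares the metric with a hand computation of kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The two annotation vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array([0, 0, 1, 1])  # hypothetical annotations
rater_b = np.array([0, 1, 1, 1])

kappa = cohen_kappa_score(rater_a, rater_b)

# Hand computation for comparison.
p_o = np.mean(rater_a == rater_b)
p_e = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in (0, 1))
kappa_manual = (p_o - p_e) / (1.0 - p_e)
print(kappa, kappa_manual)
```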
Added a warm_start constructor parameter to the bagging ensemble
models to increase the size of the ensemble. By :user:Tim Head <betatim>.
Added option to use multi-output regression metrics without averaging.
By Konstantin Shmelkov and :user:Michael Eickenberg <eickenberg>.
Added stratify option to cross_validation.train_test_split
for stratified splitting. By Miroslav Batchkarov.
The :func:tree.export_graphviz function now supports aesthetic
improvements for :class:tree.DecisionTreeClassifier and
:class:tree.DecisionTreeRegressor, including options for coloring nodes
by their majority class or impurity, showing variable names, and using
node proportions instead of raw sample counts. By Trevor Stephens_.
Improved speed of newton-cg solver in
:class:linear_model.LogisticRegression, by avoiding loss computation.
By Mathieu Blondel_ and Tom Dupre la Tour_.
The class_weight="auto" heuristic in classifiers supporting
class_weight was deprecated and replaced by the class_weight="balanced"
option, which has a simpler formula and interpretation.
By Hanna Wallach_ and Andreas Müller_.
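
The "balanced" weights can be sketched with the formula n_samples / (n_classes * np.bincount(y)), taken here from the scikit-learn documentation rather than from this changelog. The helper compute_class_weight from sklearn.utils.class_weight is used only to cross-check the hand computation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 1])  # imbalanced toy labels
classes = np.unique(y)

# Rare classes receive proportionally larger weights.
manual = len(y) / (len(classes) * np.bincount(y))

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

weights_agree = bool(np.allclose(manual, weights))
print(manual, weights, weights_agree)
```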
Add class_weight parameter to automatically weight samples by class
frequency for :class:linear_model.PassiveAggressiveClassifier. By
Trevor Stephens_.
Added backlinks from the API reference pages to the user guide. By
Andreas Müller_.
The labels parameter to :func:sklearn.metrics.f1_score,
:func:sklearn.metrics.fbeta_score,
:func:sklearn.metrics.recall_score and
:func:sklearn.metrics.precision_score has been extended.
It is now possible to ignore one or more labels, such as where
a multiclass problem has a majority class to ignore. By Joel Nothman_.
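
A minimal sketch of ignoring a majority class: class 0 is treated as the class to leave out, and only labels 1 and 2 enter the micro-averaged score. The label vectors are invented for this example.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 0]

# Micro-averaged F1 over every class (equals accuracy here).
f1_all = f1_score(y_true, y_pred, average="micro")

# Same metric restricted to the minority classes 1 and 2.
f1_minority = f1_score(y_true, y_pred, labels=[1, 2], average="micro")
print(f1_all, f1_minority)
```

With these vectors, the restricted score counts 2 true positives, 1 false positive and 1 false negative over labels 1 and 2.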
Add sample_weight support to :class:linear_model.RidgeClassifier.
By Trevor Stephens_.
Provide an option for sparse output from
:func:sklearn.metrics.pairwise.cosine_similarity. By
:user:Jaidev Deshpande <jaidevd>.
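
The option can be sketched as follows, assuming it is exposed as the dense_output keyword (the name is not given in the entry above). The small sparse matrix is a toy input.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

X = sparse.csr_matrix(
    np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0]])
)

dense = cosine_similarity(X)                           # dense ndarray
sparse_out = cosine_similarity(X, dense_output=False)  # stays sparse

is_sparse = sparse.issparse(sparse_out)
same_values = bool(np.allclose(sparse_out.toarray(), dense))
print(is_sparse, same_values)
```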
Add :func:preprocessing.minmax_scale to provide a function interface for
:class:preprocessing.MinMaxScaler. By :user:Thomas Unterthiner <untom>.
dump_svmlight_file now handles multi-label datasets.
By Chih-Wei Chang.
Added the RCV1 dataset loader (:func:sklearn.datasets.fetch_rcv1).
By Tom Dupre la Tour_.
The "Wisconsin Breast Cancer" classical two-class classification dataset
is now included in scikit-learn, available with
:func:datasets.load_breast_cancer.
Upgraded to joblib 0.9.3 to benefit from the new automatic batching of
short tasks. This makes it possible for scikit-learn to benefit from
parallelism when many very short tasks are executed in parallel, for
instance by the grid_search.GridSearchCV meta-estimator
with n_jobs > 1 used with a large grid of parameters on a small
dataset. By Vlad Niculae, Olivier Grisel and Loic Esteve_.
For more details about changes in joblib 0.9.3 see the release notes: https://github.com/joblib/joblib/blob/master/CHANGES.rst#release-093
Improved speed (3 times faster per iteration) of
decomposition.DictLearning with coordinate descent method
from :class:linear_model.Lasso. By :user:Arthur Mensch <arthurmensch>.
Added parallel (threaded) processing for nearest neighbors queries (using the ball-tree). By Nikolay Mayorov.
Allow :func:datasets.make_multilabel_classification to output
a sparse y. By Kashif Rasul.
:class:cluster.DBSCAN now accepts a sparse matrix of precomputed
distances, allowing memory-efficient distance precomputation. By
Joel Nothman_.
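
One way to build such a sparse distance matrix is neighbors.radius_neighbors_graph with mode="distance", which only stores distances below a chosen radius. This pairing is an assumption made for the sketch, as are the toy points and the eps value.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph

# Two well separated groups on a line (toy data).
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Only pairs closer than the radius are stored, so the matrix stays sparse.
distances = radius_neighbors_graph(X, radius=0.5, mode="distance")

db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit(distances)
labels = db.labels_
print(labels)
```

The radius used for the precomputation should be at least eps, otherwise neighbors that DBSCAN needs would be missing from the matrix.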
:class:tree.DecisionTreeClassifier now exposes an apply method
for retrieving the leaf indices samples are predicted as. By
:user:Daniel Galvez <galv> and Gilles Louppe_.
Speed up decision tree regressors, random forest regressors, extra trees
regressors and gradient boosting estimators by computing a proxy
of the impurity improvement during the tree growth. The proxy quantity is
such that the split that maximizes this value also maximizes the impurity
improvement. By Arnaud Joly, :user:Jacob Schreiber <jmschrei>
and Gilles Louppe.
Speed up tree based methods by reducing the number of computations needed
when computing the impurity measure taking into account linear
relationship of the computed statistics. The effect is particularly
visible with extra trees and on datasets with categorical or sparse
features. By Arnaud Joly_.
:class:ensemble.GradientBoostingRegressor and
:class:ensemble.GradientBoostingClassifier now expose an apply
method for retrieving the leaf indices each sample ends up in under
each tree. By :user:Jacob Schreiber <jmschrei>.
Add sample_weight support to :class:linear_model.LinearRegression.
By Sonny Hu. (:issue:4881)
Add n_iter_without_progress to :class:manifold.TSNE to control
the stopping criterion. By Santi Villalba. (:issue:5186)
Added optional parameter random_state in :class:linear_model.Ridge,
to set the seed of the pseudo random generator used in sag solver. By Tom Dupre la Tour_.
Added optional parameter warm_start in
:class:linear_model.LogisticRegression. If set to True, the solvers
lbfgs, newton-cg and sag will be initialized with the
coefficients computed in the previous fit. By Tom Dupre la Tour_.
Added sample_weight support to :class:linear_model.LogisticRegression for
the lbfgs, newton-cg, and sag solvers. By Valentin Stolbunov.
Support for the liblinear solver was added by Manoj Kumar.
Added optional parameter presort to :class:ensemble.GradientBoostingRegressor
and :class:ensemble.GradientBoostingClassifier, keeping default behavior
the same. This allows gradient boosters to turn off presorting when building
deep trees or using sparse data. By :user:Jacob Schreiber <jmschrei>.
Altered :func:metrics.roc_curve to drop unnecessary thresholds by
default. By :user:Graham Clenaghan <gclenaghan>.
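
The sketch below assumes the behavior is controlled by the drop_intermediate keyword (the name is not stated in the entry above). It checks that dropping thresholds shortens the curve without changing the area under it. The scores are toy values.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

# Default: thresholds that do not change the shape of the curve are dropped.
fpr_d, tpr_d, thr_d = roc_curve(y_true, scores)

# Keep every threshold.
fpr_a, tpr_a, thr_a = roc_curve(y_true, scores, drop_intermediate=False)

fewer_points = len(thr_d) <= len(thr_a)
same_area = bool(np.isclose(auc(fpr_d, tpr_d), auc(fpr_a, tpr_a)))
print(len(thr_d), len(thr_a), same_area)
```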
Added :class:feature_selection.SelectFromModel meta-transformer which can
be used along with estimators that have coef_ or feature_importances_
attribute to select important features of the input data. By
:user:Maheshakya Wijewardena <maheshakya>, Joel Nothman_ and Manoj Kumar_.
Added :func:metrics.pairwise.laplacian_kernel. By Clyde Fare <https://github.com/Clyde-fare>_.
covariance.GraphLasso allows separate control of the convergence criterion
for the Elastic-Net subproblem via the enet_tol parameter.
Improved verbosity in :class:decomposition.DictionaryLearning.
:class:ensemble.RandomForestClassifier and
:class:ensemble.RandomForestRegressor no longer explicitly store the
samples used in bagging, resulting in a much reduced memory footprint for
storing random forest models.
Added positive option to :class:linear_model.Lars and
:func:linear_model.lars_path to force coefficients to be positive.
(:issue:5131)
Added the X_norm_squared parameter to :func:metrics.pairwise.euclidean_distances
to provide precomputed squared norms for X.
Added the fit_predict method to :class:pipeline.Pipeline.
Added the :func:preprocessing.minmax_scale function.
Bug fixes
.........
Fixed non-determinism in :class:dummy.DummyClassifier with sparse
multi-label output. By Andreas Müller_.
Fixed the output shape of :class:linear_model.RANSACRegressor to
(n_samples, ). By Andreas Müller_.
Fixed bug in decomposition.DictLearning when n_jobs < 0. By
Andreas Müller_.
Fixed bug where grid_search.RandomizedSearchCV could consume a
lot of memory for large discrete grids. By Joel Nothman_.
Fixed bug in :class:linear_model.LogisticRegressionCV where penalty was ignored
in the final fit. By Manoj Kumar_.
Fixed bug in ensemble.forest.ForestClassifier when computing
oob_score with X given as a sparse.csc_matrix. By :user:Ankur Ankan <ankurankan>.
All regressors now consistently handle and warn when given y that is of
shape (n_samples, 1). By Andreas Müller_ and Henry Lin.
(:issue:5431)
Fixed cluster reassignment in :class:cluster.KMeans for sparse input. By
Lars Buitinck_.
Fixed a bug in :class:discriminant_analysis.LinearDiscriminantAnalysis that
could cause asymmetric covariance matrices when using shrinkage. By Martin Billinger_.
Fixed cross_validation.cross_val_predict for estimators with
sparse predictions. By Buddha Prakash.
Fixed the predict_proba method of :class:linear_model.LogisticRegression
to use soft-max instead of one-vs-rest normalization. By Manoj Kumar_.
(:issue:5182)
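
The difference between the two normalizations can be sketched with NumPy alone. The scores are hypothetical decision values for one sample and three classes, and the code does not call scikit-learn; it only contrasts the two formulas.

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])  # hypothetical per-class decision values

# One-vs-rest style: squash each score with the logistic function,
# then rescale so that the probabilities sum to one.
ovr = 1.0 / (1.0 + np.exp(-scores))
ovr = ovr / ovr.sum()

# Soft-max: exponentiate (shifted for numerical stability) and normalize.
softmax = np.exp(scores - scores.max())
softmax = softmax / softmax.sum()

both_normalized = bool(np.isclose(ovr.sum(), 1.0) and np.isclose(softmax.sum(), 1.0))
they_differ = not np.allclose(ovr, softmax)
print(ovr, softmax)
```

Both vectors rank the classes identically, but the probabilities themselves differ, which matters for anything that consumes predict_proba directly.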
Fixed the partial_fit method of :class:linear_model.SGDClassifier
when called with average=True. By :user:Andrew Lamb <andylamb>.
(:issue:5282)
Dataset fetchers use different filenames under Python 2 and Python 3 to
avoid pickling compatibility issues. By Olivier Grisel_.
(:issue:5355)
Fixed a bug in :class:naive_bayes.GaussianNB which caused classification
results to depend on scale. By Jake Vanderplas_.
Temporarily fixed :class:linear_model.Ridge, which was incorrect
when fitting the intercept in the case of sparse data. The fix
automatically changes the solver to 'sag' in this case.
:issue:5360 by Tom Dupre la Tour_.
Fixed a performance bug in decomposition.RandomizedPCA on data
with a large number of features and fewer samples. (:issue:4478)
By Andreas Müller, Loic Esteve and :user:Giorgio Patrini <giorgiop>.
Fixed bug in cross_decomposition.PLS that yielded unstable and
platform-dependent output, and failed on fit_transform.
By :user:Arthur Mensch <arthurmensch>.
Fixes to the Bunch class used to store datasets.
Fixed ensemble.plot_partial_dependence ignoring the
percentiles parameter.
Providing a set as vocabulary in CountVectorizer no longer
leads to inconsistent results when pickling.
Fixed the conditions on when a precomputed Gram matrix needs to
be recomputed in :class:linear_model.LinearRegression,
:class:linear_model.OrthogonalMatchingPursuit,
:class:linear_model.Lasso and :class:linear_model.ElasticNet.
Fixed inconsistent memory layout in the coordinate descent solver
that affected decomposition.DictionaryLearning and
covariance.GraphLasso. (:issue:5337)
By Olivier Grisel_.
:class:manifold.LocallyLinearEmbedding no longer ignores the reg
parameter.
Nearest Neighbor estimators with custom distance metrics can now be pickled.
(:issue:4362)
Fixed a bug in :class:pipeline.FeatureUnion where transformer_weights
were not properly handled when performing grid-searches.
Fixed a bug in :class:linear_model.LogisticRegression and
:class:linear_model.LogisticRegressionCV when using
class_weight='balanced' or class_weight='auto'.
By Tom Dupre la Tour_.
Fixed bug :issue:5495 when
doing OVR(SVC(decision_function_shape="ovr")). Fixed by
:user:Elvis Dohmatob <dohmatob>.
Attributes data_min, data_max and data_range in
:class:preprocessing.MinMaxScaler are deprecated and won't be available
from 0.19. Instead, the class now exposes data_min_, data_max_
and data_range_. By :user:Giorgio Patrini <giorgiop>.
All Scaler classes now have a scale_ attribute, the feature-wise
rescaling applied by their transform methods. The old attribute std_
in :class:preprocessing.StandardScaler is deprecated and superseded
by scale_; it won't be available in 0.19. By :user:Giorgio Patrini <giorgiop>.
:class:svm.SVC and :class:svm.NuSVC now have a decision_function_shape
parameter to make their decision function of shape (n_samples, n_classes)
by setting decision_function_shape='ovr'. This will be the default behavior
starting in 0.19. By Andreas Müller_.
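
The two shapes can be compared on a four-class problem, where they differ: one-vs-one yields one column per pair of classes, one-vs-rest one column per class. The blob dataset and the linear kernel are assumptions made for the sketch.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Four classes, so that the two shapes are distinguishable.
X, y = make_blobs(n_samples=80, centers=4, random_state=0)

ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)

ovo_shape = ovo.decision_function(X).shape  # n_classes * (n_classes - 1) / 2 columns
ovr_shape = ovr.decision_function(X).shape  # n_classes columns
print(ovo_shape, ovr_shape)
```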
Passing 1D data arrays as input to estimators is now deprecated as it
caused confusion in how the array elements should be interpreted
as features or as samples. All data arrays are now expected
to be explicitly shaped (n_samples, n_features).
By :user:Vighnesh Birodkar <vighneshbirodkar>.
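
In practice, a 1D array can be given an explicit shape with reshape before it is passed to an estimator. The sketch below shows both interpretations; the values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([1.0, 2.0, 3.0, 4.0])

as_samples = values.reshape(-1, 1)   # 4 samples described by 1 feature
as_features = values.reshape(1, -1)  # 1 sample described by 4 features

# With an explicit 2D shape there is no ambiguity for the estimator.
scaled = StandardScaler().fit_transform(as_samples)
print(as_samples.shape, as_features.shape, scaled.shape)
```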
lda.LDA and qda.QDA have been moved to
:class:discriminant_analysis.LinearDiscriminantAnalysis and
:class:discriminant_analysis.QuadraticDiscriminantAnalysis.
The store_covariance and tol parameters have been moved from
the fit method to the constructor in
:class:discriminant_analysis.LinearDiscriminantAnalysis and the
store_covariances and tol parameters have been moved from the
fit method to the constructor in
:class:discriminant_analysis.QuadraticDiscriminantAnalysis.
Models inheriting from _LearntSelectorMixin will no longer support the
transform methods (i.e., RandomForests, GradientBoosting, LogisticRegression,
DecisionTrees, SVMs and SGD related models). Instead, wrap these models in the
meta-transformer :class:feature_selection.SelectFromModel to remove
features (according to coef_ or feature_importances_)
which are below a certain threshold value.
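
A sketch of the replacement pattern: instead of calling transform on the forest itself, the forest is wrapped in feature_selection.SelectFromModel. The synthetic dataset, the forest settings and the "mean" threshold are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(forest, threshold="mean").fit(X, y)
X_reduced = selector.transform(X)

n_kept = X_reduced.shape[1]
print(X.shape, X_reduced.shape)
```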
:class:cluster.KMeans re-runs cluster-assignments in case of non-convergence,
to ensure consistency of predict(X) and labels_. By
:user:Vighnesh Birodkar <vighneshbirodkar>.
Classifier and Regressor models are now tagged as such using the
_estimator_type attribute.
Cross-validation iterators always provide indices into training and test set, not boolean masks.
The decision_function on all regressors was deprecated and will be
removed in 0.19. Use predict instead.
datasets.load_lfw_pairs is deprecated and will be removed in 0.19.
Use :func:datasets.fetch_lfw_pairs instead.
The deprecated hmm module was removed.
The deprecated Bootstrap cross-validation iterator was removed.
The deprecated Ward and WardAgglomerative classes have been removed.
Use :class:cluster.AgglomerativeClustering instead.
cross_validation.check_cv is now a public function.
The property residues_ of :class:linear_model.LinearRegression is deprecated
and will be removed in 0.19.
The deprecated n_jobs parameter of :class:linear_model.LinearRegression has been moved
to the constructor.
Removed deprecated class_weight parameter from :class:linear_model.SGDClassifier's fit
method. Use the constructor parameter instead.
The deprecated support for the sequence of sequences (or list of lists) multilabel
format was removed. To convert to and from the supported binary
indicator matrix format, use
:class:MultiLabelBinarizer <preprocessing.MultiLabelBinarizer>.
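
The conversion in both directions can be sketched as follows. The tag names are made up for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Sequence-of-sequences format (no longer accepted by estimators).
labels = [["news", "sports"], ["tech"], []]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)        # binary indicator matrix
restored = mlb.inverse_transform(Y)  # back to one tuple of labels per sample

print(list(mlb.classes_))
print(Y.tolist())
print(restored)
```

Columns follow the sorted order of the labels stored in classes_, and a sample without labels becomes an all-zero row.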
The behavior of calling the inverse_transform method of :class:pipeline.Pipeline will
change in 0.19. It will no longer reshape one-dimensional input to two-dimensional input.
The deprecated attributes indicator_matrix_, multilabel_ and classes_ of
:class:preprocessing.LabelBinarizer were removed.
Using gamma=0 in :class:svm.SVC and :class:svm.SVR to automatically set the
gamma to 1. / n_features is deprecated and will be removed in 0.19.
Use gamma="auto" instead.
Aaron Schumacher, Adithya Ganesh, akitty, Alexandre Gramfort, Alexey Grigorev, Ali Baharev, Allen Riddell, Ando Saabas, Andreas Mueller, Andrew Lamb, Anish Shah, Ankur Ankan, Anthony Erlinger, Ari Rouvinen, Arnaud Joly, Arnaud Rachez, Arthur Mensch, banilo, Barmaley.exe, benjaminirving, Boyuan Deng, Brett Naul, Brian McFee, Buddha Prakash, Chi Zhang, Chih-Wei Chang, Christof Angermueller, Christoph Gohlke, Christophe Bourguignat, Christopher Erick Moody, Chyi-Kwei Yau, Cindy Sridharan, CJ Carey, Clyde-fare, Cory Lorenz, Dan Blanchard, Daniel Galvez, Daniel Kronovet, Danny Sullivan, Data1010, David, David D Lowe, David Dotson, djipey, Dmitry Spikhalskiy, Donne Martin, Dougal J. Sutherland, Dougal Sutherland, edson duarte, Eduardo Caro, Eric Larson, Eric Martin, Erich Schubert, Fernando Carrillo, Frank C. Eckert, Frank Zalkow, Gael Varoquaux, Ganiev Ibraim, Gilles Louppe, Giorgio Patrini, giorgiop, Graham Clenaghan, Gryllos Prokopis, gwulfs, Henry Lin, Hsuan-Tien Lin, Immanuel Bayer, Ishank Gulati, Jack Martin, Jacob Schreiber, Jaidev Deshpande, Jake Vanderplas, Jan Hendrik Metzen, Jean Kossaifi, Jeffrey04, Jeremy, jfraj, Jiali Mei, Joe Jevnik, Joel Nothman, John Kirkham, John Wittenauer, Joseph, Joshua Loyal, Jungkook Park, KamalakerDadi, Kashif Rasul, Keith Goodman, Kian Ho, Konstantin Shmelkov, Kyler Brown, Lars Buitinck, Lilian Besson, Loic Esteve, Louis Tiao, maheshakya, Maheshakya Wijewardena, Manoj Kumar, MarkTab marktab.net, Martin Ku, Martin Spacek, MartinBpr, martinosorb, MaryanMorel, Masafumi Oyamada, Mathieu Blondel, Matt Krump, Matti Lyra, Maxim Kolganov, mbillinger, mhg, Michael Heilman, Michael Patterson, Miroslav Batchkarov, Nelle Varoquaux, Nicolas, Nikolay Mayorov, Olivier Grisel, Omer Katz, Óscar Nájera, Pauli Virtanen, Peter Fischer, Peter Prettenhofer, Phil Roth, pianomania, Preston Parry, Raghav RV, Rob Zinkov, Robert Layton, Rohan Ramanath, Saket Choudhary, Sam Zhang, santi, saurabh.bansod, scls19fr, Sebastian Raschka, Sebastian Saeger, Shivan 
Sornarajah, SimonPL, sinhrks, Skipper Seabold, Sonny Hu, sseg, Stephen Hoover, Steven De Gryze, Steven Seguin, Theodore Vasiloudis, Thomas Unterthiner, Tiago Freitas Pereira, Tian Wang, Tim Head, Timothy Hopper, tokoroten, Tom Dupré la Tour, Trevor Stephens, Valentin Stolbunov, Vighnesh Birodkar, Vinayak Mehta, Vincent, Vincent Michel, vstolbunov, wangz10, Wei Xue, Yucheng Low, Yury Zhauniarovich, Zac Stewart, zhai_pro, Zichen Wang