.. include:: _contributors.rst
.. currentmodule:: sklearn
.. _changes_0_15_2:
September 4, 2014
Fixed handling of the p parameter of the Minkowski distance that was
previously ignored in nearest neighbors models. By :user:Nikolay Mayorov <nmayorov>.
Fixed duplicated alphas in :class:linear_model.LassoLars with early
stopping on 32 bit Python. By Olivier Grisel_ and Fabian Pedregosa_.
Fixed the build under Windows when scikit-learn is built with MSVC while
NumPy is built with MinGW. By Olivier Grisel_ and :user:Federico Vaggi <FedericoV>.
Fixed an array index overflow bug in the coordinate descent solver. By
Gael Varoquaux_.
Better handling of numpy 1.9 deprecation warnings. By Gael Varoquaux_.
Removed unnecessary data copy in :class:cluster.KMeans.
By Gael Varoquaux_.
Explicitly close open files to avoid ResourceWarnings under Python 3.
By Calvin Giles.
The transform of :class:discriminant_analysis.LinearDiscriminantAnalysis
now projects the input on the most discriminant directions. By Martin Billinger.
Fixed potential overflow in _tree.safe_realloc by Lars Buitinck_.
Performance optimization in :class:isotonic.IsotonicRegression.
By Robert Bradshaw.
nose is no longer a runtime dependency to import sklearn, only for
running the tests. By Joel Nothman_.
Many documentation and website fixes by Joel Nothman, Lars Buitinck,
:user:Matt Pico <MattpSoftware>, and others.
.. _changes_0_15_1:
August 1, 2014
Made cross_validation.cross_val_score use
cross_validation.KFold instead of
cross_validation.StratifiedKFold on multi-output classification
problems. By :user:Nikolay Mayorov <nmayorov>.
Support unseen labels in :class:preprocessing.LabelBinarizer to restore
the default behavior of 0.14.1 for backward compatibility. By
:user:Hamzeh Alsalhi <hamsal>.
Fixed the :class:cluster.KMeans stopping criterion that prevented early
convergence detection. By Edward Raff and Gael Varoquaux_.
Fixed the behavior of :class:multiclass.OneVsOneClassifier
in case of ties at the per-class vote level by computing the correct
per-class sum of prediction scores. By Andreas Müller_.
Made cross_validation.cross_val_score and
grid_search.GridSearchCV accept Python lists as input data.
This is especially useful for cross-validation and model selection of
text processing pipelines. By Andreas Müller_.
Fixed data input checks of most estimators to accept input data that
implements the NumPy __array__ protocol. This is the case
for pandas.Series and pandas.DataFrame in recent versions of
pandas. By Gael Varoquaux_.
Fixed a regression for :class:linear_model.SGDClassifier with
class_weight="auto" on data with non-contiguous labels. By
Olivier Grisel_.
.. _changes_0_15:
July 15, 2014
Many speed and memory improvements all across the code
Huge speed and memory improvements to random forests (and extra trees), which also benefit more from parallel computing.
Incremental fit to :class:BernoulliRBM <neural_network.BernoulliRBM>
Added :class:cluster.AgglomerativeClustering for hierarchical
agglomerative clustering with average linkage, complete linkage and
ward strategies.
Added :class:linear_model.RANSACRegressor for robust regression
models.
Added dimensionality reduction with :class:manifold.TSNE which can be
used to visualize high-dimensional data.
New features
............
Added :class:ensemble.BaggingClassifier and
:class:ensemble.BaggingRegressor meta-estimators for ensembling
any kind of base estimator. See the :ref:Bagging <bagging> section of
the user guide for details and examples. By Gilles Louppe_.
New unsupervised feature selection algorithm
:class:feature_selection.VarianceThreshold, by Lars Buitinck_.
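As a minimal sketch of the variance-based selection mentioned above (the toy matrix is made up for illustration, and a recent scikit-learn is assumed), a feature whose variance is zero is dropped without looking at any target values:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (illustrative only): the first column is constant,
# so its variance is zero.
X = np.array([[0, 2, 0],
              [0, 1, 4],
              [0, 1, 1]])

# With the default threshold of 0.0, only zero-variance features are removed.
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # boolean mask of the columns that were kept
print(X_reduced.shape)
```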
Added :class:linear_model.RANSACRegressor meta-estimator for the robust
fitting of regression models. By :user:Johannes Schönberger <ahojnnes>.
Added :class:cluster.AgglomerativeClustering for hierarchical
agglomerative clustering with average linkage, complete linkage and
ward strategies, by Nelle Varoquaux_ and Gael Varoquaux_.
Shorthand constructors :func:pipeline.make_pipeline and
:func:pipeline.make_union were added by Lars Buitinck_.
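The following sketch illustrates what the shorthand constructors do; the choice of estimators is arbitrary and only serves as an example. Step names are derived automatically from the lowercased class names, so they do not have to be spelled out:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler

# make_pipeline chains estimators and names each step after its class.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe_names = [name for name, _ in pipe.steps]
print(pipe_names)

# make_union builds a FeatureUnion that concatenates transformer outputs.
union = make_union(PCA(n_components=2), SelectKBest(k=1))
union_names = [name for name, _ in union.transformer_list]
print(union_names)
```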
Shuffle option for cross_validation.StratifiedKFold.
By :user:Jeffrey Blackburne <jblackburne>.
Incremental learning (partial_fit) for Gaussian Naive Bayes by
Imran Haque.
Added partial_fit to :class:BernoulliRBM <neural_network.BernoulliRBM>
By :user:Danny Sullivan <dsullivan7>.
Added learning_curve utility to
chart performance with respect to training size. See
:ref:sphx_glr_auto_examples_model_selection_plot_learning_curve.py. By Alexander Fabisch.
Add positive option in :class:LassoCV <linear_model.LassoCV> and
:class:ElasticNetCV <linear_model.ElasticNetCV>.
By Brian Wignall and Alexandre Gramfort_.
Added :class:linear_model.MultiTaskElasticNetCV and
:class:linear_model.MultiTaskLassoCV. By Manoj Kumar_.
Added :class:manifold.TSNE. By Alexander Fabisch.
Enhancements
............
Add sparse input support to :class:ensemble.AdaBoostClassifier and
:class:ensemble.AdaBoostRegressor meta-estimators.
By :user:Hamzeh Alsalhi <hamsal>.
Memory improvements of decision trees, by Arnaud Joly_.
Decision trees can now be built in best-first manner by using max_leaf_nodes
as the stopping criterion. Refactored the tree code to use either a
stack or a priority queue for tree building.
By Peter Prettenhofer_ and Gilles Louppe_.
Decision trees can now be fitted on Fortran- and C-style arrays, and
non-contiguous arrays without the need to make a copy.
If the input array has a different dtype than np.float32, a
Fortran-style copy will be made since Fortran-style memory layout has speed
advantages. By Peter Prettenhofer_ and Gilles Louppe_.
Speed improvement of regression trees by optimizing
the computation of the mean square error criterion. This led
to speed improvements in the tree, forest and gradient boosting tree
modules. By Arnaud Joly_.
The img_to_graph and grid_to_graph functions in
:mod:sklearn.feature_extraction.image now return np.ndarray
instead of np.matrix when return_as=np.ndarray. See the
Notes section for more information on compatibility.
Changed the internal storage of decision trees to use a struct array.
This fixed some small bugs, while improving code and providing a small
speed gain. By Joel Nothman_.
Reduce memory usage and overhead when fitting and predicting with forests
of randomized trees in parallel with n_jobs != 1 by leveraging new
threading backend of joblib 0.8 and releasing the GIL in the tree fitting
Cython code. By Olivier Grisel_ and Gilles Louppe_.
Speed improvement of the sklearn.ensemble.gradient_boosting module.
By Gilles Louppe_ and Peter Prettenhofer_.
Various enhancements to the sklearn.ensemble.gradient_boosting
module: a warm_start argument to fit additional trees,
a max_leaf_nodes argument to fit GBM style trees,
a monitor fit argument to inspect the estimator during training, and
refactoring of the verbose code. By Peter Prettenhofer_.
Faster sklearn.ensemble.ExtraTrees by caching feature values.
By Arnaud Joly_.
Faster depth-based tree building algorithms such as decision tree,
random forest, extra trees or gradient tree boosting (with depth-based
growing strategy) by no longer trying to split on features already found
to be constant in the sample subset. By Arnaud Joly_.
Add min_weight_fraction_leaf pre-pruning parameter to tree-based
methods: the minimum weighted fraction of the input samples required to be
at a leaf node. By Noel Dawe_.
Added :func:metrics.pairwise_distances_argmin_min, by Philippe Gervais.
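A small sketch of how this function can be used (the points are invented for illustration): for each row of X it returns the index of the closest row of Y together with the corresponding distance, using the Euclidean metric by default.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

# Illustrative points: two queries in X, three candidates in Y.
X = np.array([[0.0, 0.0], [5.0, 5.0]])
Y = np.array([[1.0, 0.0], [4.0, 5.0], [10.0, 10.0]])

# argmin[i] is the index in Y of the point closest to X[i];
# distances[i] is the distance between X[i] and that point.
argmin, distances = pairwise_distances_argmin_min(X, Y)

print(argmin)
print(distances)
```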
Added predict method to :class:cluster.AffinityPropagation and
:class:cluster.MeanShift, by Mathieu Blondel_.
Vector and matrix multiplications have been optimised throughout the
library by Denis Engemann and Alexandre Gramfort.
In particular, they should take less memory with older NumPy versions
(prior to 1.7.2).
Precision-recall and ROC examples now use train_test_split, and have more
explanation of why these metrics are useful. By Kyle Kastner_.
The training algorithm for :class:decomposition.NMF is faster for
sparse matrices and has much lower memory complexity, meaning it will
scale up gracefully to large datasets. By Lars Buitinck_.
Added svd_method option, with default value "randomized", to
:class:decomposition.FactorAnalysis to save memory and
significantly speed up computation. By Denis Engemann and
Alexandre Gramfort.
Changed cross_validation.StratifiedKFold to try and
preserve as much of the original ordering of samples as possible so as
not to hide overfitting on datasets with a non-negligible level of
samples dependency.
By Daniel Nouri_ and Olivier Grisel_.
Add multi-output support to :class:gaussian_process.GaussianProcessRegressor
by John Novak.
Support for precomputed distance matrices in nearest neighbor estimators
by Robert Layton_ and Joel Nothman_.
Norm computations optimized for NumPy 1.6 and later versions by
Lars Buitinck_. In particular, the k-means algorithm no longer
needs a temporary data structure the size of its input.
:class:dummy.DummyClassifier can now be used to predict a constant
output value. By Manoj Kumar_.
:class:dummy.DummyRegressor now has a strategy parameter that allows
predicting the mean or the median of the training set, or a constant
output value. By :user:Maheshakya Wijewardena <maheshakya>.
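The three strategies can be compared on a toy target, as in the sketch below. The data and the constant 7.0 are arbitrary illustrative choices; the features are ignored by dummy estimators.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Illustrative data: X is a placeholder because dummy estimators ignore it.
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 10.0])

# Each strategy predicts a single fixed value learned from (or given for) y.
mean_pred = DummyRegressor(strategy="mean").fit(X, y).predict(X[:1])
median_pred = DummyRegressor(strategy="median").fit(X, y).predict(X[:1])
const_pred = DummyRegressor(strategy="constant", constant=7.0).fit(X, y).predict(X[:1])

print(mean_pred, median_pred, const_pred)
```

Such baselines are mainly useful as a sanity check against which a real regressor can be compared.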
Multi-label classification output in multilabel indicator format
is now supported by :func:metrics.roc_auc_score and
:func:metrics.average_precision_score by Arnaud Joly_.
Significant performance improvements (more than 100x speedup for
large problems) in :class:isotonic.IsotonicRegression by
Andrew Tulloch_.
Speed and memory usage improvements to the SGD algorithm for linear
models: it now uses threads, not separate processes, when n_jobs>1.
By Lars Buitinck_.
Grid search and cross validation allow NaNs in the input arrays so that
preprocessors such as preprocessing.Imputer can be trained within the cross
validation loop, avoiding potentially skewed results.
Ridge regression can now deal with sample weights in feature space
(previously only in sample space). Both solutions are provided by the
Cholesky solver. By :user:Michael Eickenberg <eickenberg>.
Several classification and regression metrics now support weighted
samples with the new sample_weight argument:
:func:metrics.accuracy_score,
:func:metrics.zero_one_loss,
:func:metrics.precision_score,
:func:metrics.average_precision_score,
:func:metrics.f1_score,
:func:metrics.fbeta_score,
:func:metrics.recall_score,
:func:metrics.roc_auc_score,
:func:metrics.explained_variance_score,
:func:metrics.mean_squared_error,
:func:metrics.mean_absolute_error,
:func:metrics.r2_score.
By Noel Dawe_.
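To illustrate the effect of the sample_weight argument, the sketch below compares weighted and unweighted scores on made-up labels and weights. A misclassified sample with a larger weight lowers the accuracy more than an unweighted one would.

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Illustrative classification labels; the third sample is misclassified
# and carries twice the weight of the others.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
weights = [1, 1, 2, 1]

acc_plain = accuracy_score(y_true, y_pred)
acc_weighted = accuracy_score(y_true, y_pred, sample_weight=weights)

# Illustrative regression targets; the sample with the error has low weight.
mse_plain = mean_squared_error([1.0, 2.0], [1.0, 4.0])
mse_weighted = mean_squared_error([1.0, 2.0], [1.0, 4.0], sample_weight=[3, 1])

print(acc_plain, acc_weighted)
print(mse_plain, mse_weighted)
```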
Speed up of the sample generator
:func:datasets.make_multilabel_classification. By Joel Nothman_.
Documentation improvements
...........................
The Working With Text Data tutorial
has now been worked into the main documentation's tutorial section.
Includes exercises and skeletons for tutorial presentation.
Original tutorial created by several authors including
Olivier Grisel, Lars Buitinck and many others.
Tutorial integration into the scikit-learn documentation
by Jaques Grobler.
Added :ref:Computational Performance <computational_performance>
documentation. Discussion and examples of prediction latency / throughput
and different factors that have influence over speed. Additional tips for
building faster models and choosing a relevant compromise between speed
and predictive power.
By :user:Eustache Diemert <oddskool>.
Bug fixes
.........
Fixed bug in :class:decomposition.MiniBatchDictionaryLearning:
partial_fit was not working properly.
Fixed bug in linear_model.stochastic_gradient:
l1_ratio was used as (1.0 - l1_ratio).
Fixed bug in :class:multiclass.OneVsOneClassifier with string
labels.
Fixed a bug in :class:LassoCV <linear_model.LassoCV> and
:class:ElasticNetCV <linear_model.ElasticNetCV>: they would not
pre-compute the Gram matrix with precompute=True or
precompute="auto" and n_samples > n_features. By Manoj Kumar_.
Fixed incorrect estimation of the degrees of freedom in
:func:feature_selection.f_regression when variates are not centered.
By :user:Virgile Fritsch <VirgileFritsch>.
Fixed a race condition in parallel processing with
pre_dispatch != "all" (for instance, in cross_val_score).
By Olivier Grisel_.
Raise error in :class:cluster.FeatureAgglomeration and
cluster.WardAgglomeration when no samples are given,
rather than returning meaningless clustering.
Fixed bug in gradient_boosting.GradientBoostingRegressor with
loss='huber': gamma might not have been initialized.
Fixed feature importances as computed with a forest of randomized trees
when fit with sample_weight != None and/or with bootstrap=True.
By Gilles Louppe_.
sklearn.hmm is deprecated. Its removal is planned
for the 0.17 release.
Use of covariance.EllipticEnvelop has now been removed after
deprecation.
Please use :class:covariance.EllipticEnvelope instead.
cluster.Ward is deprecated. Use
:class:cluster.AgglomerativeClustering instead.
cluster.WardClustering is deprecated. Use
:class:cluster.AgglomerativeClustering instead.
cross_validation.Bootstrap is deprecated.
cross_validation.KFold or
cross_validation.ShuffleSplit are recommended instead.
Direct support for the sequence of sequences (or list of lists) multilabel
format is deprecated. To convert to and from the supported binary
indicator matrix format, use
:class:preprocessing.MultiLabelBinarizer.
By Joel Nothman_.
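A minimal sketch of the conversion recommended above, using invented labels: the sequence-of-sequences representation is turned into a binary indicator matrix and back again.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative multilabel targets in the deprecated sequence-of-sequences format.
y_sequences = [("sci-fi", "thriller"), ("comedy",), ("comedy", "sci-fi")]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_sequences)  # binary indicator matrix, one column per label

print(mlb.classes_)  # column order: labels sorted alphabetically
print(Y)

# inverse_transform recovers one tuple of labels per row.
recovered = mlb.inverse_transform(Y)
print(recovered)
```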
Add score method to :class:decomposition.PCA following the model of
probabilistic PCA and deprecate
ProbabilisticPCA model whose
score implementation is not correct. The computation now also exploits the
matrix inversion lemma for faster computation. By Alexandre Gramfort_.
The score method of :class:decomposition.FactorAnalysis
now returns the average log-likelihood of the samples. Use score_samples
to get log-likelihood of each sample. By Alexandre Gramfort_.
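The relationship between the two methods can be checked on random data, as in this sketch (the data and the number of components are arbitrary choices for illustration): score should equal the mean of the per-sample values returned by score_samples.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Illustrative random data; any numeric matrix would do.
rng = np.random.RandomState(0)
X = rng.randn(50, 4)

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

per_sample = fa.score_samples(X)  # log-likelihood of each sample
average = fa.score(X)             # average log-likelihood over all samples

print(per_sample.shape, average)
```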
Generating boolean masks (the setting indices=False)
from cross-validation generators is deprecated.
Support for masks will be removed in 0.17.
The generators have produced arrays of indices by default since 0.10.
By Joel Nothman_.
1-d arrays containing strings with dtype=object (as used in Pandas)
are now considered valid classification targets. This fixes a regression
from version 0.13 in some classifiers. By Joel Nothman_.
Fix wrong explained_variance_ratio_ attribute in
RandomizedPCA.
By Alexandre Gramfort_.
Fit alphas for each l1_ratio instead of mean_l1_ratio in
:class:linear_model.ElasticNetCV and :class:linear_model.LassoCV.
This changes the shape of alphas_ from (n_alphas,) to
(n_l1_ratio, n_alphas) if the l1_ratio provided is a 1-D array like
object of length greater than one.
By Manoj Kumar_.
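The shape change described above can be observed with a short sketch. The synthetic data, the two l1_ratio values and cv=3 are illustrative assumptions, and a recent scikit-learn is assumed:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic regression data, for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.randn(40)

# Passing several l1_ratio values yields one grid of alphas per ratio.
model = ElasticNetCV(l1_ratio=[0.2, 0.8], cv=3).fit(X, y)

print(model.alphas_.shape)  # first axis indexes the l1_ratio values
```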
Fix :class:linear_model.ElasticNetCV and :class:linear_model.LassoCV
when fitting intercept and input data is sparse. The automatic grid
of alphas was not computed correctly and the scaling with normalize
was wrong. By Manoj Kumar_.
Fix wrong maximal number of features drawn (max_features) at each split
for decision trees, random forests and gradient tree boosting.
Previously, the count for the number of drawn features started only after
one non-constant feature in the split. This bug fix will affect
computational and generalization performance of those algorithms in the
presence of constant features. To get back previous generalization
performance, you should modify the value of max_features.
By Arnaud Joly_.
Fix wrong maximal number of features drawn (max_features) at each split
for :class:ensemble.ExtraTreesClassifier and
:class:ensemble.ExtraTreesRegressor. Previously, only non-constant
features in the split were counted as drawn. Now constant features are
counted as drawn. Furthermore, at least one feature must be non-constant
in order to make a valid split. This bug fix will affect
computational and generalization performance of extra trees in the
presence of constant features. To get back previous generalization
performance, you should modify the value of max_features.
By Arnaud Joly_.
Fix :func:utils.class_weight.compute_class_weight when class_weight=="auto".
Previously it was broken for input of non-integer dtype and the
weighted array that was returned was wrong. By Manoj Kumar_.
Fix cross_validation.Bootstrap to raise ValueError
when n_train + n_test > n. By :user:Ronald Phlypo <rphlypo>.
List of contributors for release 0.15 by number of commits.