.. include:: _contributors.rst
.. currentmodule:: sklearn
.. _changes_0_14:
August 7, 2013
Missing values in sparse and dense matrices can now be imputed with the
transformer preprocessing.Imputer. By Nicolas Trésegnie_.
The core implementation of decision trees has been rewritten from
scratch, allowing for faster tree induction and lower memory
consumption in all tree-based estimators. By Gilles Louppe_.
Added :class:ensemble.AdaBoostClassifier and
:class:ensemble.AdaBoostRegressor, by Noel Dawe_ and
Gilles Louppe_. See the :ref:AdaBoost <adaboost> section of the user
guide for details and examples.
Added grid_search.RandomizedSearchCV and
grid_search.ParameterSampler for randomized hyperparameter
optimization. By Andreas Müller_.
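The idea behind randomized search can be sketched in plain Python. This is a minimal illustration only, not scikit-learn's implementation: the helper ``sample_parameters`` and the convention of passing callables that take a ``random.Random`` instance are assumptions made for this example.

```python
import random


def sample_parameters(param_distributions, n_iter, seed=0):
    """Draw n_iter random parameter settings (hypothetical helper).

    Each value in param_distributions is either a list, sampled uniformly,
    or a callable that takes a random.Random instance and returns a draw.
    """
    rng = random.Random(seed)
    settings = []
    for _ in range(n_iter):
        params = {}
        # Iterate over sorted names so a given seed always yields the same draws.
        for name in sorted(param_distributions):
            dist = param_distributions[name]
            if callable(dist):
                params[name] = dist(rng)
            else:
                params[name] = rng.choice(dist)
        settings.append(params)
    return settings


search_space = {
    "C": lambda rng: 10 ** rng.uniform(-3, 3),  # log-uniform draw
    "kernel": ["linear", "rbf"],
}
candidates = sample_parameters(search_space, n_iter=5, seed=42)
for candidate in candidates:
    print(candidate)
```

Unlike an exhaustive grid, the number of candidates is fixed by ``n_iter`` regardless of how many parameters are searched.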
Added :ref:biclustering <biclustering> algorithms
(sklearn.cluster.bicluster.SpectralCoclustering and
sklearn.cluster.bicluster.SpectralBiclustering), data
generation methods (:func:sklearn.datasets.make_biclusters and
:func:sklearn.datasets.make_checkerboard), and scoring metrics
(:func:sklearn.metrics.consensus_score). By Kemal Eren_.
Added :ref:Restricted Boltzmann Machines<rbm>
(:class:neural_network.BernoulliRBM). By Yann Dauphin_.
Python 3 support by :user:Justin Vincent <justinvf>, Lars Buitinck,
:user:Subhodeep Moitra <smoitra87> and Olivier Grisel. All tests now pass under
Python 3.3.
Ability to pass one penalty (alpha value) per target in
:class:linear_model.Ridge, by @eickenberg and Mathieu Blondel_.
Fixed an L2 regularization issue in
sklearn.linear_model.stochastic_gradient.py (of minor practical significance).
By :user:Norbert Crombach <norbert> and Mathieu Blondel_.
Added an interactive version of Andreas Müller's
Machine Learning Cheat Sheet (for scikit-learn) <https://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html>
to the documentation. See :ref:Choosing the right estimator <ml_map>.
By Jaques Grobler_.
grid_search.GridSearchCV and
cross_validation.cross_val_score now support the use of advanced
scoring functions such as area under the ROC curve and f-beta scores.
See :ref:scoring_parameter for details. By Andreas Müller_
and Lars Buitinck_.
Passing a function from :mod:sklearn.metrics as score_func is
deprecated.
Multi-label classification output is now supported by
:func:metrics.accuracy_score, :func:metrics.zero_one_loss,
:func:metrics.f1_score, :func:metrics.fbeta_score,
:func:metrics.classification_report,
:func:metrics.precision_score and :func:metrics.recall_score
by Arnaud Joly_.
Two new metrics, :func:metrics.hamming_loss and
metrics.jaccard_similarity_score,
have been added with multi-label support. By Arnaud Joly_.
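As a rough illustration of what these two multi-label metrics measure, here is a pure-Python sketch that represents each sample's labels as a set. The function names and the handling of two empty label sets are assumptions made for this example, not the library's code.

```python
def multilabel_hamming_loss(y_true, y_pred, n_labels):
    """Fraction of (sample, label) slots that are predicted incorrectly."""
    # The symmetric difference holds the labels that are wrongly added or missed.
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * n_labels)


def multilabel_jaccard(y_true, y_pred):
    """Mean over samples of |intersection| / |union| of the label sets."""
    scores = []
    for t, p in zip(y_true, y_pred):
        union = t | p
        # Assumption: two empty label sets count as a perfect match.
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


y_true = [{0, 1}, {2}, {1, 2}]
y_pred = [{0}, {2}, {0, 1}]
print(multilabel_hamming_loss(y_true, y_pred, n_labels=3))  # 3 wrong slots out of 9
print(multilabel_jaccard(y_true, y_pred))  # mean of 1/2, 1 and 1/3
```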
Speed and memory usage improvements in
:class:feature_extraction.text.CountVectorizer and
:class:feature_extraction.text.TfidfVectorizer,
by Jochen Wersdörfer and Roman Sinayev.
The min_df parameter in
:class:feature_extraction.text.CountVectorizer and
:class:feature_extraction.text.TfidfVectorizer, which used to be 2,
has been reset to 1 to avoid unpleasant surprises (empty vocabularies)
for novice users who try it out on tiny document collections.
A value of at least 2 is still recommended for practical use.
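The effect of min_df on a tiny corpus can be sketched without the library. The helper ``build_vocabulary`` below is hypothetical and uses naive whitespace tokenization, which is a simplification of what the vectorizers do.

```python
from collections import Counter


def build_vocabulary(documents, min_df=1):
    """Keep the terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document: this is the document frequency.
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = ["the cat sat", "a dog barked", "birds fly south"]
print(build_vocabulary(docs, min_df=1))  # all nine terms are kept
print(build_vocabulary(docs, min_df=2))  # [] because no term occurs in two documents
```

With only three short documents, a threshold of 2 removes every term, which is the empty-vocabulary surprise described above.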
:class:svm.LinearSVC, :class:linear_model.SGDClassifier and
:class:linear_model.SGDRegressor now have a sparsify method that
converts their coef_ into a sparse matrix, meaning stored models
trained using these estimators can be made much more compact.
:class:linear_model.SGDClassifier now produces multiclass probability
estimates when trained under log loss or modified Huber loss.
Hyperlinks to documentation in example code on the website by
:user:Martin Luessi <mluessi>.
Fixed bug in :class:preprocessing.MinMaxScaler causing incorrect scaling
of the features for non-default feature_range settings. By Andreas Müller_.
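For reference, min-max scaling to an arbitrary feature_range can be written as a short pure-Python sketch for a single feature. The function name and the treatment of constant features are assumptions made for this example.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly map values so that min -> low and max -> high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Assumption: a constant feature is mapped to the lower bound.
        return [low for _ in values]
    # First rescale to [0, 1], then stretch and shift to [low, high].
    return [(v - v_min) / span * (high - low) + low for v in values]


print(min_max_scale([1.0, 2.0, 3.0], feature_range=(-1.0, 1.0)))  # [-1.0, 0.0, 1.0]
```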
max_features in :class:tree.DecisionTreeClassifier,
:class:tree.DecisionTreeRegressor and all derived ensemble estimators
now supports percentage values. By Gilles Louppe_.
Performance improvements in :class:isotonic.IsotonicRegression by
Nelle Varoquaux_.
:func:metrics.accuracy_score has an option normalize to return
the fraction or the number of correctly classified samples
by Arnaud Joly_.
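The difference between the two return values can be sketched as follows. This is a plain-Python stand-in written for illustration, not the library function itself.

```python
def accuracy(y_true, y_pred, normalize=True):
    """Return the fraction (normalize=True) or the count of correct predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true) if normalize else correct


y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]
print(accuracy(y_true, y_pred))                   # 0.5
print(accuracy(y_true, y_pred, normalize=False))  # 2
```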
Added :func:metrics.log_loss that computes log loss, aka cross-entropy
loss. By Jochen Wersdörfer and Lars Buitinck_.
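Log loss is the mean negative log-probability assigned to the true class. A minimal sketch follows, assuming integer class labels that index into each probability row. The function name and the clipping constant ``eps`` are choices made for this example.

```python
import math


def cross_entropy(y_true, probabilities, eps=1e-15):
    """Mean negative log-probability of the true class."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        # Clip to avoid log(0) when a probability is exactly 0 or 1.
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


y_true = [0, 1, 1]
probs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
print(cross_entropy(y_true, probs))  # roughly 0.28
```

Confident wrong predictions are penalized heavily, because the log of a probability near zero is a large negative number.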
A bug that caused :class:ensemble.AdaBoostClassifier to output
incorrect probabilities has been fixed.
Feature selectors now share a mixin providing consistent transform,
inverse_transform and get_support methods. By Joel Nothman_.
A fitted grid_search.GridSearchCV or
grid_search.RandomizedSearchCV can now generally be pickled.
By Joel Nothman_.
Refactored and vectorized implementation of :func:metrics.roc_curve
and :func:metrics.precision_recall_curve. By Joel Nothman_.
The new estimator :class:sklearn.decomposition.TruncatedSVD
performs dimensionality reduction using SVD on sparse matrices,
and can be used for latent semantic analysis (LSA).
By Lars Buitinck_.
Added self-contained example of out-of-core learning on text data
:ref:sphx_glr_auto_examples_applications_plot_out_of_core_classification.py.
By :user:Eustache Diemert <oddskool>.
The default number of components for
sklearn.decomposition.RandomizedPCA is now correctly documented
to be n_features. This was the default behavior, so programs using it
will continue to work as they did.
:class:sklearn.cluster.KMeans now fits several orders of magnitude
faster on sparse data (the speedup depends on the sparsity). By
Lars Buitinck_.
Reduce memory footprint of FastICA by Denis Engemann_ and
Alexandre Gramfort_.
Verbose output in sklearn.ensemble.gradient_boosting now uses
a column format and prints progress in decreasing frequency.
It also shows the remaining time. By Peter Prettenhofer_.
sklearn.ensemble.gradient_boosting now provides the out-of-bag improvement
(oob_improvement_)
rather than the OOB score for model selection. An example that shows
how to use OOB estimates to select the number of trees was added.
By Peter Prettenhofer_.
Most metrics now support string labels for multiclass classification
by Arnaud Joly_ and Lars Buitinck_.
New OrthogonalMatchingPursuitCV class by Alexandre Gramfort_
and Vlad Niculae_.
Fixed a bug in sklearn.covariance.GraphLassoCV: the
'alphas' parameter now works as expected when given a list of
values. By Philippe Gervais.
Fixed an important bug in sklearn.covariance.GraphLassoCV
that prevented all folds provided by a CV object from being used (only
the first 3 were used). When providing a CV object, execution
time may thus increase significantly compared to the previous
version (but results are correct now). By Philippe Gervais.
cross_validation.cross_val_score and the grid_search
module are now tested with multi-output data by Arnaud Joly_.
:func:datasets.make_multilabel_classification can now return
the output in label indicator multilabel format by Arnaud Joly_.
K-nearest neighbors, :class:neighbors.KNeighborsRegressor
and :class:neighbors.KNeighborsClassifier,
and radius neighbors, :class:neighbors.RadiusNeighborsRegressor and
:class:neighbors.RadiusNeighborsClassifier, now support multi-output data
by Arnaud Joly_.
Random state in LibSVM-based estimators (:class:svm.SVC, :class:svm.NuSVC,
:class:svm.OneClassSVM, :class:svm.SVR, :class:svm.NuSVR) can now be
controlled. This is useful to ensure consistency in the probability
estimates for the classifiers trained with probability=True. By
Vlad Niculae_.
Out-of-core learning support for the discrete naive Bayes classifiers
:class:sklearn.naive_bayes.MultinomialNB and
:class:sklearn.naive_bayes.BernoulliNB through the new partial_fit
method. By Olivier Grisel_.
New website design and navigation by Gilles Louppe, Nelle Varoquaux,
Vincent Michel and Andreas Müller_.
Improved documentation on :ref:multi-class, multi-label and multi-output classification <multiclass> by Yannick Schwartz_ and Arnaud Joly_.
Better input and error handling in the :mod:sklearn.metrics module by
Arnaud Joly_ and Joel Nothman_.
Speed optimization of the hmm module by :user:Mikhail Korobov <kmike>.
Significant speed improvements for :class:sklearn.cluster.DBSCAN
by cleverless <https://github.com/cleverless>_.
The auc_score was renamed :func:metrics.roc_auc_score.
Testing scikit-learn with sklearn.test() is deprecated. Use
nosetests sklearn from the command line.
Feature importances in :class:tree.DecisionTreeClassifier,
:class:tree.DecisionTreeRegressor and all derived ensemble estimators
are now computed on the fly when accessing the feature_importances_
attribute. Setting compute_importances=True is no longer required.
By Gilles Louppe_.
:class:linear_model.lasso_path and
:class:linear_model.enet_path can return their results in the same
format as that of :class:linear_model.lars_path. This is done by
setting the return_models parameter to False. By
Jaques Grobler_ and Alexandre Gramfort_.
grid_search.IterGrid was renamed to grid_search.ParameterGrid.
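Conceptually, a parameter grid enumerates every combination of the listed values. The sketch below shows that behavior with ``itertools.product``. The generator ``parameter_grid`` is a hypothetical stand-in for illustration, not the library class.

```python
from itertools import product


def parameter_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(param_grid)  # fixed order keeps the enumeration deterministic
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


grid = list(parameter_grid({"C": [1, 10], "kernel": ["linear", "rbf"]}))
for params in grid:
    print(params)
```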
Fixed bug in KFold causing imperfect class balance in some
cases. By Alexandre Gramfort_ and Tadej Janež.
:class:sklearn.neighbors.BallTree has been refactored, and a
:class:sklearn.neighbors.KDTree has been
added which shares the same interface. The Ball Tree now works with
a wide variety of distance metrics. Both classes have many new
methods, including single-tree and dual-tree queries, breadth-first
and depth-first searching, and more advanced queries such as
kernel density estimation and 2-point correlation functions.
By Jake Vanderplas_.
Support for scipy.spatial.cKDTree within neighbors queries has been
removed, and the functionality replaced with the new
:class:sklearn.neighbors.KDTree class.
:class:sklearn.neighbors.KernelDensity has been added, which performs
efficient kernel density estimation with a variety of kernels.
:class:sklearn.decomposition.KernelPCA now always returns output with
n_components components, unless the new parameter remove_zero_eig
is set to True. This new behavior is consistent with the way
kernel PCA was always documented; previously, the removal of components
with zero eigenvalues was tacitly performed on all data.
gcv_mode="auto" no longer tries to perform SVD on a densified
sparse matrix in :class:sklearn.linear_model.RidgeCV.
Sparse matrix support in sklearn.decomposition.RandomizedPCA
is now deprecated in favor of the new TruncatedSVD.
cross_validation.KFold and
cross_validation.StratifiedKFold now enforce n_folds >= 2;
otherwise a ValueError is raised. By Olivier Grisel_.
:func:datasets.load_files's charset and charset_errors
parameters were renamed encoding and decode_errors.
Attribute oob_score_ in :class:sklearn.ensemble.GradientBoostingRegressor
and :class:sklearn.ensemble.GradientBoostingClassifier
is deprecated and has been replaced by oob_improvement_.
Attributes in OrthogonalMatchingPursuit have been deprecated (copy_X, Gram, ...) and precompute_gram renamed precompute for consistency. See #2224.
:class:sklearn.preprocessing.StandardScaler now converts integer input
to float, and raises a warning. Previously it rounded for dense integer
input.
:class:sklearn.multiclass.OneVsRestClassifier now has a
decision_function method. This will return the distance of each
sample from the decision boundary for each class, as long as the
underlying estimators implement the decision_function method.
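To make the shape of that output concrete, here is a pure-Python sketch of a one-vs-rest decision function, assuming each class has a hypothetical linear scorer given as a (weights, bias) pair. Real underlying estimators may compute their scores differently.

```python
def ovr_decision_function(samples, class_models):
    """Return one signed score per (sample, class) pair."""
    scores = []
    for x in samples:
        # Each class has its own binary scorer: w . x + b.
        row = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b for w, b in class_models]
        scores.append(row)
    return scores


# Hypothetical per-class linear models as (weights, bias) pairs.
models = [([1.0, 0.0], -1.0), ([0.0, 1.0], -1.0), ([-1.0, -1.0], 1.0)]
scores = ovr_decision_function([[2.0, 0.0], [0.0, 3.0]], models)
predicted = [row.index(max(row)) for row in scores]
print(scores)     # [[1.0, -1.0, -1.0], [-1.0, 2.0, -2.0]]
print(predicted)  # [0, 1]
```

The result has one row per sample and one column per class, and the predicted class is the column with the largest score.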
By Kyle Kastner_.
Better input validation, warning on unexpected shapes for y.
List of contributors for release 0.14 by number of commits.