Back to Scikit Learn

Version 0.14

doc/whats_new/v0.14.rst

1.8.014.1 KB
Original Source

.. include:: _contributors.rst

.. currentmodule:: sklearn

============ Version 0.14

.. _changes_0_14:

Version 0.14

August 7, 2013

Changelog

  • Missing values with sparse and dense matrices can be imputed with the transformer preprocessing.Imputer by Nicolas Trésegnie_.

  • The core implementation of decision trees has been rewritten from scratch, allowing for faster tree induction and lower memory consumption in all tree-based estimators. By Gilles Louppe_.

  • Added :class:ensemble.AdaBoostClassifier and :class:ensemble.AdaBoostRegressor, by Noel Dawe_ and Gilles Louppe_. See the :ref:AdaBoost <adaboost> section of the user guide for details and examples.

  • Added grid_search.RandomizedSearchCV and grid_search.ParameterSampler for randomized hyperparameter optimization. By Andreas Müller_.

  • Added :ref:biclustering <biclustering> algorithms (sklearn.cluster.bicluster.SpectralCoclustering and sklearn.cluster.bicluster.SpectralBiclustering), data generation methods (:func:sklearn.datasets.make_biclusters and :func:sklearn.datasets.make_checkerboard), and scoring metrics (:func:sklearn.metrics.consensus_score). By Kemal Eren_.

  • Added :ref:Restricted Boltzmann Machines<rbm> (:class:neural_network.BernoulliRBM). By Yann Dauphin_.

  • Python 3 support by :user:Justin Vincent <justinvf>, Lars Buitinck, :user:Subhodeep Moitra <smoitra87> and Olivier Grisel. All tests now pass under Python 3.3.

  • Ability to pass one penalty (alpha value) per target in :class:linear_model.Ridge, by @eickenberg and Mathieu Blondel_.

  • Fixed sklearn.linear_model.stochastic_gradient.py L2 regularization issue (minor practical significance). By :user:Norbert Crombach <norbert> and Mathieu Blondel_ .

  • Added an interactive version of Andreas Müller's Machine Learning Cheat Sheet (for scikit-learn) <https://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html> to the documentation. See :ref:Choosing the right estimator <ml_map>. By Jaques Grobler_.

  • grid_search.GridSearchCV and cross_validation.cross_val_score now support the use of advanced scoring functions such as area under the ROC curve and f-beta scores. See :ref:scoring_parameter for details. By Andreas Müller_ and Lars Buitinck_. Passing a function from :mod:sklearn.metrics as score_func is deprecated.

  • Multi-label classification output is now supported by :func:metrics.accuracy_score, :func:metrics.zero_one_loss, :func:metrics.f1_score, :func:metrics.fbeta_score, :func:metrics.classification_report, :func:metrics.precision_score and :func:metrics.recall_score by Arnaud Joly_.

  • Two new metrics :func:metrics.hamming_loss and metrics.jaccard_similarity_score are added with multi-label support by Arnaud Joly_.

  • Speed and memory usage improvements in :class:feature_extraction.text.CountVectorizer and :class:feature_extraction.text.TfidfVectorizer, by Jochen Wersdörfer and Roman Sinayev.

  • The min_df parameter in :class:feature_extraction.text.CountVectorizer and :class:feature_extraction.text.TfidfVectorizer, which used to be 2, has been reset to 1 to avoid unpleasant surprises (empty vocabularies) for novice users who try it out on tiny document collections. A value of at least 2 is still recommended for practical use.

  • :class:svm.LinearSVC, :class:linear_model.SGDClassifier and :class:linear_model.SGDRegressor now have a sparsify method that converts their coef_ into a sparse matrix, meaning stored models trained using these estimators can be made much more compact.

  • :class:linear_model.SGDClassifier now produces multiclass probability estimates when trained under log loss or modified Huber loss.

  • Hyperlinks to documentation in example code on the website by :user:Martin Luessi <mluessi>.

  • Fixed bug in :class:preprocessing.MinMaxScaler causing incorrect scaling of the features for non-default feature_range settings. By Andreas Müller_.

  • max_features in :class:tree.DecisionTreeClassifier, :class:tree.DecisionTreeRegressor and all derived ensemble estimators now support percentage values. By Gilles Louppe_.

  • Performance improvements in :class:isotonic.IsotonicRegression by Nelle Varoquaux_.

  • :func:metrics.accuracy_score has an option normalize to return the fraction or the number of correctly classified samples by Arnaud Joly_.

  • Added :func:metrics.log_loss that computes log loss, aka cross-entropy loss. By Jochen Wersdörfer and Lars Buitinck_.

  • A bug that caused :class:ensemble.AdaBoostClassifier's to output incorrect probabilities has been fixed.

  • Feature selectors now share a mixin providing consistent transform, inverse_transform and get_support methods. By Joel Nothman_.

  • A fitted grid_search.GridSearchCV or grid_search.RandomizedSearchCV can now generally be pickled. By Joel Nothman_.

  • Refactored and vectorized implementation of :func:metrics.roc_curve and :func:metrics.precision_recall_curve. By Joel Nothman_.

  • The new estimator :class:sklearn.decomposition.TruncatedSVD performs dimensionality reduction using SVD on sparse matrices, and can be used for latent semantic analysis (LSA). By Lars Buitinck_.

  • Added self-contained example of out-of-core learning on text data :ref:sphx_glr_auto_examples_applications_plot_out_of_core_classification.py. By :user:Eustache Diemert <oddskool>.

  • The default number of components for sklearn.decomposition.RandomizedPCA is now correctly documented to be n_features. This was the default behavior, so programs using it will continue to work as they did.

  • :class:sklearn.cluster.KMeans now fits several orders of magnitude faster on sparse data (the speedup depends on the sparsity). By Lars Buitinck_.

  • Reduce memory footprint of FastICA by Denis Engemann_ and Alexandre Gramfort_.

  • Verbose output in sklearn.ensemble.gradient_boosting now uses a column format and prints progress in decreasing frequency. It also shows the remaining time. By Peter Prettenhofer_.

  • sklearn.ensemble.gradient_boosting provides out-of-bag improvement oob_improvement_ rather than the OOB score for model selection. An example that shows how to use OOB estimates to select the number of trees was added. By Peter Prettenhofer_.

  • Most metrics now support string labels for multiclass classification by Arnaud Joly_ and Lars Buitinck_.

  • New OrthogonalMatchingPursuitCV class by Alexandre Gramfort_ and Vlad Niculae_.

  • Fixed a bug in sklearn.covariance.GraphLassoCV: the 'alphas' parameter now works as expected when given a list of values. By Philippe Gervais.

  • Fixed an important bug in sklearn.covariance.GraphLassoCV that prevented all folds provided by a CV object to be used (only the first 3 were used). When providing a CV object, execution time may thus increase significantly compared to the previous version (bug results are correct now). By Philippe Gervais.

  • cross_validation.cross_val_score and the grid_search module is now tested with multi-output data by Arnaud Joly_.

  • :func:datasets.make_multilabel_classification can now return the output in label indicator multilabel format by Arnaud Joly_.

  • K-nearest neighbors, :class:neighbors.KNeighborsRegressor and :class:neighbors.RadiusNeighborsRegressor, and radius neighbors, :class:neighbors.RadiusNeighborsRegressor and :class:neighbors.RadiusNeighborsClassifier support multioutput data by Arnaud Joly_.

  • Random state in LibSVM-based estimators (:class:svm.SVC, :class:svm.NuSVC, :class:svm.OneClassSVM, :class:svm.SVR, :class:svm.NuSVR) can now be controlled. This is useful to ensure consistency in the probability estimates for the classifiers trained with probability=True. By Vlad Niculae_.

  • Out-of-core learning support for discrete naive Bayes classifiers :class:sklearn.naive_bayes.MultinomialNB and :class:sklearn.naive_bayes.BernoulliNB by adding the partial_fit method by Olivier Grisel_.

  • New website design and navigation by Gilles Louppe, Nelle Varoquaux, Vincent Michel and Andreas Müller_.

  • Improved documentation on :ref:multi-class, multi-label and multi-output classification <multiclass> by Yannick Schwartz_ and Arnaud Joly_.

  • Better input and error handling in the :mod:sklearn.metrics module by Arnaud Joly_ and Joel Nothman_.

  • Speed optimization of the hmm module by :user:Mikhail Korobov <kmike>

  • Significant speed improvements for :class:sklearn.cluster.DBSCAN by cleverless <https://github.com/cleverless>_

API changes summary

  • The auc_score was renamed :func:metrics.roc_auc_score.

  • Testing scikit-learn with sklearn.test() is deprecated. Use nosetests sklearn from the command line.

  • Feature importances in :class:tree.DecisionTreeClassifier, :class:tree.DecisionTreeRegressor and all derived ensemble estimators are now computed on the fly when accessing the feature_importances_ attribute. Setting compute_importances=True is no longer required. By Gilles Louppe_.

  • :class:linear_model.lasso_path and :class:linear_model.enet_path can return its results in the same format as that of :class:linear_model.lars_path. This is done by setting the return_models parameter to False. By Jaques Grobler_ and Alexandre Gramfort_

  • grid_search.IterGrid was renamed to grid_search.ParameterGrid.

  • Fixed bug in KFold causing imperfect class balance in some cases. By Alexandre Gramfort_ and Tadej Janež.

  • :class:sklearn.neighbors.BallTree has been refactored, and a :class:sklearn.neighbors.KDTree has been added which shares the same interface. The Ball Tree now works with a wide variety of distance metrics. Both classes have many new methods, including single-tree and dual-tree queries, breadth-first and depth-first searching, and more advanced queries such as kernel density estimation and 2-point correlation functions. By Jake Vanderplas_

  • Support for scipy.spatial.cKDTree within neighbors queries has been removed, and the functionality replaced with the new :class:sklearn.neighbors.KDTree class.

  • :class:sklearn.neighbors.KernelDensity has been added, which performs efficient kernel density estimation with a variety of kernels.

  • :class:sklearn.decomposition.KernelPCA now always returns output with n_components components, unless the new parameter remove_zero_eig is set to True. This new behavior is consistent with the way kernel PCA was always documented; previously, the removal of components with zero eigenvalues was tacitly performed on all data.

  • gcv_mode="auto" no longer tries to perform SVD on a densified sparse matrix in :class:sklearn.linear_model.RidgeCV.

  • Sparse matrix support in sklearn.decomposition.RandomizedPCA is now deprecated in favor of the new TruncatedSVD.

  • cross_validation.KFold and cross_validation.StratifiedKFold now enforce n_folds >= 2 otherwise a ValueError is raised. By Olivier Grisel_.

  • :func:datasets.load_files's charset and charset_errors parameters were renamed encoding and decode_errors.

  • Attribute oob_score_ in :class:sklearn.ensemble.GradientBoostingRegressor and :class:sklearn.ensemble.GradientBoostingClassifier is deprecated and has been replaced by oob_improvement_ .

  • Attributes in OrthogonalMatchingPursuit have been deprecated (copy_X, Gram, ...) and precompute_gram renamed precompute for consistency. See #2224.

  • :class:sklearn.preprocessing.StandardScaler now converts integer input to float, and raises a warning. Previously it rounded for dense integer input.

  • :class:sklearn.multiclass.OneVsRestClassifier now has a decision_function method. This will return the distance of each sample from the decision boundary for each class, as long as the underlying estimators implement the decision_function method. By Kyle Kastner_.

  • Better input validation, warning on unexpected shapes for y.

People

List of contributors for release 0.14 by number of commits.

  • 277 Gilles Louppe
  • 245 Lars Buitinck
  • 187 Andreas Mueller
  • 124 Arnaud Joly
  • 112 Jaques Grobler
  • 109 Gael Varoquaux
  • 107 Olivier Grisel
  • 102 Noel Dawe
  • 99 Kemal Eren
  • 79 Joel Nothman
  • 75 Jake VanderPlas
  • 73 Nelle Varoquaux
  • 71 Vlad Niculae
  • 65 Peter Prettenhofer
  • 64 Alexandre Gramfort
  • 54 Mathieu Blondel
  • 38 Nicolas Trésegnie
  • 35 eustache
  • 27 Denis Engemann
  • 25 Yann N. Dauphin
  • 19 Justin Vincent
  • 17 Robert Layton
  • 15 Doug Coleman
  • 14 Michael Eickenberg
  • 13 Robert Marchman
  • 11 Fabian Pedregosa
  • 11 Philippe Gervais
  • 10 Jim Holmström
  • 10 Tadej Janež
  • 10 syhw
  • 9 Mikhail Korobov
  • 9 Steven De Gryze
  • 8 sergeyf
  • 7 Ben Root
  • 7 Hrishikesh Huilgolkar
  • 6 Kyle Kastner
  • 6 Martin Luessi
  • 6 Rob Speer
  • 5 Federico Vaggi
  • 5 Raul Garreta
  • 5 Rob Zinkov
  • 4 Ken Geis
  • 3 A. Flaxman
  • 3 Denton Cockburn
  • 3 Dougal Sutherland
  • 3 Ian Ozsvald
  • 3 Johannes Schönberger
  • 3 Robert McGibbon
  • 3 Roman Sinayev
  • 3 Szabo Roland
  • 2 Diego Molla
  • 2 Imran Haque
  • 2 Jochen Wersdörfer
  • 2 Sergey Karayev
  • 2 Yannick Schwartz
  • 2 jamestwebber
  • 1 Abhijeet Kolhe
  • 1 Alexander Fabisch
  • 1 Bastiaan van den Berg
  • 1 Benjamin Peterson
  • 1 Daniel Velkov
  • 1 Fazlul Shahriar
  • 1 Felix Brockherde
  • 1 Félix-Antoine Fortin
  • 1 Harikrishnan S
  • 1 Jack Hale
  • 1 JakeMick
  • 1 James McDermott
  • 1 John Benediktsson
  • 1 John Zwinck
  • 1 Joshua Vredevoogd
  • 1 Justin Pati
  • 1 Kevin Hughes
  • 1 Kyle Kelley
  • 1 Matthias Ekman
  • 1 Miroslav Shubernetskiy
  • 1 Naoki Orii
  • 1 Norbert Crombach
  • 1 Rafael Cunha de Almeida
  • 1 Rolando Espinoza La fuente
  • 1 Seamus Abshere
  • 1 Sergey Feldman
  • 1 Sergio Medina
  • 1 Stefano Lattarini
  • 1 Steve Koch
  • 1 Sturla Molden
  • 1 Thomas Jarosch
  • 1 Yaroslav Halchenko