RELEASE.md
(uses catboostmodel native libraries from the main CatBoost release v1.2.10)
- Probability on CPUs that do not have the SSE4 instruction set (that includes all ARM CPUs): values with probability 0 had been erroneously computed as nan.
- The __SSE__ compiler flag was not enabled for Windows builds with the MSVC compiler. This affected code that relied on this flag, including quantization during model inference. Note that the compiler itself was configured for SSE support and could still apply automatic SSE optimizations.
- CMakeLists.txt files switched to the standard CMake variable CMAKE_CUDA_ARCHITECTURES, although the default value is non-standard and specified in cuda.cmake. #2540
- predictTransposed method #2927. Thanks to @levs2001.
- :warning: There are no JVM artifacts for this release due to issues with publishing. They will be updated in the next release.
[Python-package] Add polars input data support. #2524.
Polars data structures are supported for features, labels and auxiliary data like weight, timestamp etc.
- RMSPE metric and loss (both are CPU-only for now) #1767. Thanks to @ivan339339.
- LoadFullModelZeroCopy for mmap #2893. Thanks to @gakoshin.
- Lossguide grow policy on CPU #2883. Approximate speedup is 1.4x. Thanks to @Levachev.
- numpy numeric types in multithreaded native features data initialization. #1558, #2847
- pyproject.toml is now PEP 517 compliant.
- __sklearn_tags__ method to be compatible with scikit-learn >= 1.8.x. #2955
- __repr__ method with a meaningful description expected by scikit-learn #2307. Thanks to @besteady.
- dry_run parameters in setuptools 81.0. pypa/setuptools#4872
- CMakeLists.txt files switched to the standard CMake variable CMAKE_CUDA_ARCHITECTURES, although the default value is non-standard and specified in cuda.cmake. #2540
- wheel build dependency no longer required.
- The __SSE__ compiler flag was not enabled for Windows builds with the MSVC compiler. This affected code that relied on this flag, including some operations used during training and quantization during model inference. Note that the compiler itself was configured for SSE support and could still apply automatic SSE optimizations.
- CatBoostError was missing from __all__ in the catboost package. #2862
- log_cout was used instead of log_cerr by mistake. #2863
- get_params: deep parameter meaning was inconsistent with scikit-learn expectations. #2991
- _get_tags: Add missing tags. #3008
- _get_tags returned incorrect values for several tags. #3009
- timestamp parameters. #3019
- MultiRMSE
- devices parameter parsing. Parsing was non-robust: non-numeric values silently defaulted to 0, and device ids outside of the available range were silently ignored.
- GetErrorString in multithreaded programs. It is now thread-local.
- character and factor types (useful for classes). #1874
- leaf_estimation_iterations for Tweedie regression on GPU. #2812
- private by mistake.
- private by mistake.
- Probability on CPUs that do not have the SSE4 instruction set (that includes all ARM CPUs): values with probability 0 had been erroneously computed as nan.

(uses catboostmodel native libraries from the main CatBoost release v1.2.7)
- numpy dependency specification to prohibit numpy >= 2.0 for now. #2671
- APT_MULTI_PROBABILITY prediction type is now supported. #2639. Thanks to @aivarasbaranauskas.
- GroupQuantile metric
- QueryCrossEntropy (~3x faster on A100 for 6M samples, 350 features, query size near 1).
- PredictSpecificClassFlat added to calcer.exports. #2715
- numpy.ndarrays with float32 data type are processed multithreaded. Significant speedups of 5x up to 10x (on CPUs with many cores) can be expected. #385, #2542
- best_score_, evals_result_, best_iteration_ model attributes now work after model saving and loading. Can be removed by model metadata manipulation if needed. #1166
- Class predictions for models that have been trained with boolean targets will also be boolean instead of "True", "False" strings as before. Such models will be incompatible with previous versions of CatBoost appliers. If you want the old behavior, convert your target to "False", "True" strings before training. #1954
- jupyterlab version for setup limited to 3.x for now. Fixes #2530
- utils.read_cd: support CD files with non-increasing column indices.
- log_cout, log_cerr specification made consistent, avoiding reset in recursive calls.
- log_cout, log_cerr. #2195
- Cox, PairLogitPairwise, UserPerObjMetric, SurvivalAft.
- fit with Pool arguments) and Class prediction in Python. #1954
- Auxiliary columns by name in evaluation result output. #1659
- clang-cl from Visual Studio 2022 for the build without CUDA (build with CUDA still uses the standard Microsoft toolchain from Visual Studio 2019).
- os.version added to conan host settings to ensure version consistency.
- -mno-outline-atomics for modern versions of Clang and GCC to avoid unresolved symbol linking errors. #2527
- CMakeLists for unit tests for util. #2525
- Pool() when pairs_weight is a numpy array. #1913
- __call__ method. #2277
- "Targets are required for YetiRank loss function." error in cross-validation. #2083
- Pool.get_label() returns constant True for boolean labels. #2133
- best_score_, evals_result_, best_iteration_ attribute values anymore. #1793
- Precision metric default value in the absence of positive samples is changed to 0 and a warning is added (similar to the behavior of the scikit-learn implementation). #2422
- Target data is available.
- "Error: can't proceed some features" error on GPU. #1024
- allow_const_label=True for classification. #1933
- SurvivalAft objective/metric.
- eval_metric in binary python packages of version 1.2.1 on PyPI. #2486
- mode parameter. See the "Which Tricks are Important for Learning to Rank?" paper for details (this family of losses is called YetiLoss there). CPU-only for now.
- catboost.sample_gaussian_process function). #2408, thanks to @TakeOver. See the "Gradient Boosting Performs Gaussian Process Inference" paper for details.
- int instead of deprecated numpy.int. #2378
- ModelCalcerWrapper::CalcFlatTransposed, #2413 thanks to @faucct

CatBoost's build system has been switched from Ya Make (Yandex's build system) to CMake. This means more transparency in the build process and more familiar tools for Open Source developers. For now it is possible to build CatBoost for:
This allowed us to prepare the Python package in the source distribution form (also known as sdist). #830
- msvs subdirectory with the Microsoft Visual Studio solution has been removed. Visual Studio solutions can be generated using CMake instead.
- make subdirectory with Makefiles has been removed. Use CMake + ninja (recommended) or CMake + make instead.
- setup.py is used instead of the custom mk_wheel.py script. All common scenarios (sdist, build, install, editable install, bdist_wheel) are supported.
- manylinux1 changed to manylinux2014.
- fixed_binary_splits added to the regressor, classifier, and ranker.
- String and Vec types for features changed to AsRef of slices to make code more generic
- binary-classification-threshold parameter added to the CLI model applier.
- RMSEWithUncertainty loss function on GPU. #1573
- MultiLogloss and MultiCrossEntropy loss functions with numerical features on GPU.
- MultiLogloss loss function with text features on CPU and GPU. #1885
- Focal loss (CPU-only for now). #1807, thanks to @diditforlulz273.
- MultiLogloss on CPU sped up by 8% per tree (110K samples, 20 targets, 480 float features, 3 cat features, 16-core CPU).
- TFullModel::SetEvaluatorType (it was possible to get a segmentation fault when calling it for a non-available implementation). Add TFullModel::GetSupportedEvaluatorTypes.
- allow_write_files=True.
- _get_embedding_feature_indices. #2273
- set_feature_names with text or embedding features. #2090
- libs/model_interface applier always produced an error in CUDA mode.
- catboost/cuda/cuda_util/sort.cpp:166: CUDA error 9 on Nvidia Ampere-based GPUs.
- utils.eval_metrics for groupwise metrics when group data has not been specified. #2343
- GetModelUsedFeaturesNames. #2204
- utils.create_cd. #2193
- np.ndarray with dtype=object. #2201
- feature_names in utils.create_cd. #2211

Multiquantile regression
Now it's possible to train models with a shared tree structure and multiple predicted quantile values in each leaf. Currently this approach doesn't give a strong guarantee of consistency between the predicted quantile values, but it still provides more consistency than training a separate model for each quantile. You can read a short description in the documentation. Short example for Python: loss_function='MultiQuantile:alpha=0.2,0.4'. Supported only on CPU for now.
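As an illustration of what this objective optimizes (a sketch, not CatBoost's internal code), the multi-quantile loss can be written as the standard pinball (quantile) loss averaged over the requested alpha values, with one predicted value per alpha stored in each leaf:

```python
# Sketch: the pinball (quantile) loss averaged over several alpha values,
# i.e. what loss_function='MultiQuantile:alpha=0.2,0.4' asks the model
# to minimize. Illustration only, not CatBoost's implementation.

def pinball_loss(y_true, y_pred, alpha):
    """Pinball loss for a single quantile level alpha."""
    diff = y_true - y_pred
    return alpha * diff if diff >= 0 else (alpha - 1) * diff

def multi_quantile_loss(y_true, preds, alphas):
    """Average pinball loss over all quantile levels for one object.
    preds[i] is the value predicted for quantile level alphas[i]."""
    return sum(pinball_loss(y_true, p, a) for p, a in zip(preds, alphas)) / len(alphas)

loss = multi_quantile_loss(10.0, [8.0, 12.0], [0.2, 0.4])  # -> 0.8
```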
Support text and embedding features for regression and ranking.
Spark: Read/write Spark's Dataset-like API for Pool. #2030
Support HashedCateg column type. This allows using externally prehashed categorical features both in training and prediction.
New option plot_file in Python functions with plot parameter allows to save plots to file. #758
Add eval_fraction parameter. #1500
Non-symmetric trees model summation.
init_model parameter now works with non-symmetric trees.
Partial support for Apache Spark 3.3 (only for Scala 2.12 and without PySpark).
- --fixed-binary-splits or fixed_binary_splits in the Python package (by default, there are no fixed splits)
- fit for PySpark estimators. #1976.
- MAE, MAPE, Quantile on GPU.
- BrierScore. #1967.
- plot_tree example in documentation.
- cv.
- sort param for FilteredDCG metric.
- StochasticRank for FilteredDCG.
- loss_function.
- calc_feature_statistics
- calc_metrics mode.
- so in deployed Maven artifacts (no code changes)
- datetime.timedelta conversion.
- is_min_optimal, is_max_optimal for BuiltinMetrics. #1890
- libcatboostr-darwin.dylib instead of libcatboostr-darwin.so on macOS. #1834
- "CatBoostError: (No such file or directory) bad new file name" when using grid_search. #1893
- :warning: PySpark support is broken in this release. Please use release 1.0.3 instead.
- rsm < 1.
- calc_feature_statistics for cat features. #1882
- eval_metric for multitarget training

In this release we decided to increment the major version, as we think that CatBoost is ready for production usage. We know that CatBoost is used a lot in many different companies and individual projects, and it's not only "psychological" maturity - we think that all the features we added in the last year and in the current release are worth a major version update. And of course, like many programmers, we love the magic of binary numbers and want to celebrate the 100₂ anniversary since CatBoost's first release on GitHub :)
- use_best_model and early stopping work independently on each fold, as we are trying to make single-fold training as close to regular training as possible. If one model stops at iteration i, we use its last value in the mean score plot for points in [i+1; last iteration).
- MultiRMSEWithMissingValues loss function
- predict_proba function first argument renamed from X to data, fixes #1785
- eval_metrics. Thanks to @ebalukova.
- numba (if available)
- use_weights for some eval_metrics on GPU - use_weights=False is always respected now
- EvalMetricsResult.get_metric() by @Roffild

This release includes the CatBoost for Apache Spark package that supports training, model application and feature evaluation on the Apache Spark platform. We've prepared "CatBoost for Apache Spark introduction" and "CatBoost for Apache Spark Architecture" videos for introduction. More details are available at the CatBoost for Apache Spark home page.
CatBoost supports a recursive feature elimination procedure: when you have lots of feature candidates and want to select only the most influential ones by training models and keeping the strongest features by feature importance. You can find the details in our tutorial.
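The elimination loop described above can be sketched in pure Python. Here `train_and_rank` is a hypothetical stand-in for training a model and reading its feature importances; this is an illustration of the procedure, not CatBoost's actual feature-selection API:

```python
# Sketch of recursive feature elimination: repeatedly train, rank features
# by importance, and drop the weakest until n_keep features remain.

def recursive_feature_elimination(features, train_and_rank, n_keep, drop_per_step=1):
    """Return the n_keep strongest feature names."""
    features = list(features)
    while len(features) > n_keep:
        importances = train_and_rank(features)  # {feature_name: importance}
        # drop the weakest features, at most drop_per_step per iteration
        for weak in sorted(features, key=lambda f: importances[f])[:drop_per_step]:
            features.remove(weak)
    return features

# Toy importance function for demonstration: longer names are "more important".
kept = recursive_feature_elimination(
    ["a", "bb", "ccc", "dddd"], lambda fs: {f: len(f) for f in fs}, n_keep=2)
# kept == ["ccc", "dddd"]
```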
- leaf_estimation_method=Exact explicitly; in the next releases we are planning to set it by default.
- pathlib.Path in the python package
- dcg==1 when there are no relevant objects in a group (when ideal DCG equals zero); previously we used score==0 in that case.
- boost_from_average for MultiRMSE loss.
- feature_importances_ for fstr with texts
- score() method for RMSEWithUncertainty, issue #1482
- prediction_type in score()
- MultiRMSE loss function.
- group_weight parameter added to the catboost.utils.eval_metric method to allow passing weights for object groups. Allows correctly matching weighted ranking metrics computation when group weights are present.
- Pool constructor or fit function with the embedding_features=['EmbeddingFeaturesColumnName1, ...] parameter. Another way of adding your embedding vectors is a new type of column in the Column Description file, NumVector, with a semicolon-separated embeddings column in your XSV file: ClassLabel\t0.1;0.2;0.3\t....
- use_weights for metrics when the auto_class_weights parameter is set.
- plot_predictions function.
- average parameter is passed to the TotalF1 metric while training on GPU.

The main feature of this release is total uncertainty prediction support via virtual ensembles.
You can read the theoretical background in the preprint Uncertainty in Gradient Boosting via Ensembles from our research team.
We introduced a new training parameter, posterior_sampling, that allows estimating total uncertainty.
Setting posterior_sampling=True implies enabling Langevin boosting, setting model_shrink_rate to 1/(2*N) and setting diffusion_temperature to N, where N is dataset size.
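The implied settings above can be written out directly. The option names mirror CatBoost's, but the helper function itself is hypothetical and only illustrates the arithmetic:

```python
# Sketch of the parameters implied by posterior_sampling=True for a dataset
# of N objects: Langevin boosting enabled, model_shrink_rate = 1/(2*N),
# diffusion_temperature = N.

def posterior_sampling_params(dataset_size):
    n = dataset_size
    return {
        "langevin": True,                 # posterior sampling enables Langevin boosting
        "model_shrink_rate": 1 / (2 * n),
        "diffusion_temperature": n,
    }

params = posterior_sampling_params(1000)
# params["model_shrink_rate"] == 0.0005, params["diffusion_temperature"] == 1000
```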
CatBoost object method virtual_ensembles_predict splits model into virtual_ensembles_count submodels.
Calling model.virtual_ensembles_predict(.., prediction_type='TotalUncertainty') returns the mean prediction and variance (plus knowledge uncertainty for models trained with the RMSEWithUncertainty loss function).
Calling model.virtual_ensembles_predict(.., prediction_type='VirtEnsembles') returns virtual_ensembles_count predictions of virtual submodels for each object.
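Conceptually, the statistics behind 'TotalUncertainty' can be derived from the 'VirtEnsembles' output: for each object we get one prediction per virtual submodel, and the spread across submodels estimates the uncertainty. A minimal sketch (illustration only, not CatBoost's implementation):

```python
# Given the per-object predictions of the virtual submodels, compute the
# mean prediction and the variance across submodels.

def ensemble_stats(virt_preds):
    """virt_preds: predictions of the virtual submodels for one object.
    Returns (mean prediction, variance across submodels)."""
    k = len(virt_preds)
    mean = sum(virt_preds) / k
    var = sum((p - mean) ** 2 for p in virt_preds) / k
    return mean, var

mean, var = ensemble_stats([1.0, 2.0, 3.0])  # mean 2.0, variance 2/3
```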
- n_features_in_ attribute required for using CatBoost in sklearn pipelines. Issue #1363
- load_model(blob=b'....'); to deserialize from a file-like stream use load_model(stream=gzip.open('model.cbm.gz', 'rb'))
- RMSEWithUncertainty - it allows estimating data uncertainty for trained regression models. The trained model gives a two-element vector for each object: the first element is the regression prediction, the second is an estimate of data uncertainty for that prediction.
- model.feature_names_. Issue #1314
- model_sum() or as the base model in init_model=. Issue #1271
- plot_partial_dependence method in the python-package (now it works for models with symmetric trees trained on datasets with numerical features only). Implemented by @felixandrer.
- boost_from_average option together with the model_shrink_rate option. In this case shrinkage is applied to the starting value.
- auto_class_weights option in python-package, R-package and cli with possible values Balanced and SqrtBalanced. For Balanced, every class is weighted maxSumWeightInClass / sumWeightInClass, where sumWeightInClass is the sum of weights of all samples in this class (if no weights are present, each sample weight is 1), and maxSumWeightInClass is the maximum sum of weights among all classes. For SqrtBalanced the formula is sqrt(maxSumWeightInClass / sumWeightInClass). This option is supported in binclass and multiclass tasks. Implemented by @egiby.
- model_size_reg option on GPU. Set to 0.5 by default (same as on CPU). This regularization works slightly differently on GPU: feature combinations are regularized more aggressively than on CPU. On CPU, the cost of a combination equals the number of different values of this combination that are present in the training dataset. On GPU, the cost of a combination equals the number of all possible different values of this combination. For example, if the combination contains two categorical features c1 and c2, the cost will be #categories in c1 * #categories in c2, even though many of the values from this combination might not be present in the dataset.
- catboost.utils.convert_to_onnx_object method. Implemented by @monkey0head
- TotalF1 metric: CatBoost will print TotalF1:average=Weighted as the corresponding metric column header in error logs. Implemented by @ivanychev
- class_weights parameter accepts a dictionary with a class name to class weight mapping
- _get_tags() method for compatibility with sklearn (issue #1282). Implemented by @crazyleg
- loss_function param in the python cv method.
- catboost.utils.quantize function to create a quantized Pool this way. See the usage example in issue #1116.
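The Balanced and SqrtBalanced formulas for auto_class_weights above can be sketched as follows (illustration only, assuming unit sample weights; the real option also handles weighted samples):

```python
import math
from collections import Counter

# Sketch of the auto_class_weights formulas:
#   Balanced:     maxSumWeightInClass / sumWeightInClass
#   SqrtBalanced: sqrt(maxSumWeightInClass / sumWeightInClass)

def auto_class_weights(labels, mode="Balanced"):
    counts = Counter(labels)          # sumWeightInClass with unit weights
    max_count = max(counts.values())  # maxSumWeightInClass
    ratio = {c: max_count / n for c, n in counts.items()}
    if mode == "SqrtBalanced":
        return {c: math.sqrt(r) for c, r in ratio.items()}
    return ratio

weights = auto_class_weights([0, 0, 0, 0, 1])  # {0: 1.0, 1: 4.0}
```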
Implemented by @noxwell.
- save_quantization_borders method allows saving the resulting borders into a file and using them for quantization of other datasets. Quantization can be a bottleneck of training, especially on GPU. Doing quantization once for several trainings can significantly reduce running time. For a large dataset it is recommended to perform quantization first, save the quantization borders, use them to quantize the validation dataset, and then use the quantized training and validation datasets for further training.
  Use saved borders when quantizing other Pools by specifying the input_borders parameter of the quantize method.
  Implemented by @noxwell.
- border_count > 255 for GPU training. This might be useful if you have a "golden feature", see docs.
- feature_weights="FeatureName1:1.5,FeatureName2:0.5".
Scores for splits with these features will be multiplied by the corresponding weights.
Implemented by @Taube03.
- first_feature_use_penalties.
This parameter penalizes the first usage of a feature. It should be used when the calculation of the feature is costly.
The penalty value (the cost of using the feature) is subtracted from the scores of this feature's splits if the feature has not been used in the model yet.
Once the feature has been used, it is considered free to use from then on, so no further subtraction is done.
There is also a common multiplier for all first_feature_use_penalties; it can be specified by the penalties_coefficient parameter.
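The penalty scheme above can be sketched as a scoring rule (an illustration of the idea, not CatBoost's internal split-scoring code; the function name is hypothetical):

```python
# Sketch: the first-use penalty (scaled by penalties_coefficient) is
# subtracted from a split's score only while the feature is unused;
# once the feature is in the model, no penalty applies.

def penalized_split_score(score, feature, used_features, penalties, coefficient=1.0):
    if feature in used_features:
        return score  # feature already used: it is "free"
    return score - coefficient * penalties.get(feature, 0.0)

s_new = penalized_split_score(5.0, "F1", set(), {"F1": 1.5}, coefficient=2.0)   # 2.0
s_used = penalized_split_score(5.0, "F1", {"F1"}, {"F1": 1.5}, coefficient=2.0)  # 5.0
```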
Implemented by @Taube03 (issue #1155).
- recordCount attribute is added to PMML models (issue #1026).
- Tweedie loss is supported now. It can be a good solution for a right-skewed target with many zero values, see the tutorial.
  When using the CatBoostRegressor.predict function, the default prediction_type for this loss will be Exponent. Implemented by @ilya-pchelintsev (issue #577).
- proba_border. With this parameter you can set the decision boundary for treating a prediction as negative or positive. Implemented by @ivanychev.
- TotalF1 supports a new parameter average with possible values weighted, micro, macro. Implemented by @ilya-pchelintsev.
- eval_metric. It is not possible to use it as an optimization objective.
To write a multi-label metric, you need to define a python class which inherits from the MultiLabelCustomMetric class. Implemented by @azikmsu.
- class_weights parameter is now supported in grid/randomized search. Implemented by @vazgenk.
- get_best_score returns the train/validation best score after grid/randomized search (in case of refit=False). Implemented by @rednevaler.
- CatBoost.get_feature_importance to get a matrix of SHAP values for every prediction.
By default, SHAP interaction values are calculated for all features. You may specify features of interest using the interaction_indices argument.
Implemented by @IvanKozlov98.
- shap_calc_type parameter of the CatBoost.get_feature_importance function as "Approximate". Implemented by @LordProtoss (issue #1146).
- PredictionDiff model analysis method can now be used with models that contain non-symmetric trees. Implemented by @felixandrer.
- CatBoostRegressor.predict function: for models trained with Poisson loss, the default prediction_type will be Exponent (issue #1184). Implemented by @garkavem.

This release also contains bug fixes and performance improvements, including a major speedup for sparse data on GPU.
grow_policy parameter.
Starting from this release, non-symmetric trees are supported for both CPU and GPU training.
- to_regressor and to_classifier methods.

The release also contains a list of bug fixes.
- langevin option; tune diffusion_temperature and model_shrink_rate. See the corresponding paper for details.
- Logloss objective, but also for RMSE (on CPU and GPU) and MultiClass (on GPU).
- classes_ attribute and for prediction functions with prediction_type=Class. #305, #999, #1017.
Note: Class labels loaded from datasets in CatBoost dsv format always have string type now.
- boost_from_average=True. #1125
- catboost.get_feature_importance did not work after the model is loaded #1064
- catboost.train did not work when called with a single dataset parameter. #1162

## Other:

## Compatibility:
- classes_count and class_weight params can now be used with user-defined loss functions. #1119
- use_weights gets value by default. #1106
- model.classes_ attribute for binary classification (proper labels instead of always 0 and 1). #984
- model.classes_ attribute when the classes_count parameter was specified.
- leaf_estimation_method=Exact is the default for MAPE loss
- CatBoostClassifier.predict_log_proba(), PR #1095
- get_feature_importance, PR #1090
- boost_from_average mode

New submodule for text processing! It contains two classes to help you make text features ready for training:
- boost_from_average for MAPE loss function
- Pool creation from pandas.DataFrame with discontinuous columns, #1079
- standalone_evaluator, PR #1083
- Text features for classification on GPU. To specify text columns, use the text_features parameter. Achieve better quality by using text information of your dataset. See more in Learning CatBoost with text features
- MultiRMSE loss function is now available on CPU. Labels for the multi-regression mode should be specified in separate Label columns
- boost_from_average is now True by default for Quantile and MAE loss functions, which improves the resulting quality
- datasets.msrank() returns the full msrank dataset. Previously, it returned the first 10k samples. We have added the msrank_10k() dataset implementing the past behaviour.
- get_object_importance() now respects the parameter top_size, #1045 by @ibuda
- Plain boosting scheme for both small and large datasets. This change not only gives a huge speedup but also provides a quality improvement!
- boost_from_average parameter is available in CatBoostClassifier and CatBoostRegressor
- "(1,0,0,-1)" or "0:1,3:-1" or "FeatureName0:1,FeatureName3:-1" are all valid specifications. With Python and params-file json, lists and dictionaries can also be used
- Multiclass classifier training, #1040
- RuntimeError raised in StagedPredictIterator, #848
- "System of linear equations is not positive definite" when training MultiClass on Windows, #1022
- MultiClass with many classes
- sum_models in the R-package, #1007
- plot=True parameter in grid_search and randomized_search methods to show plots in jupyter notebook
- MultiClass objective doesn't give a constant 0 value for the last class in case of GPU training.
Shap values for the MultiClass objective are now calculated in the following way. First, predictions are normalized so that the average of all predictions is zero in each tree. The normalized predictions produce the same probabilities as the non-normalized ones. Then the shap values are calculated for every class separately. Note that since the shap values are calculated on the normalized predictions, their sum for every class is equal to the normalized prediction.
- per_float_feature_quantization parameter, #996
- leaf-estimation-method is now Exact
- LossFunctionChange feature strength computation
- eval_metric in output of get_all_params(), #940
- skip_train~false is ignored, #970
- numpy.ndarray with order='F'
- boost_from_average when baseline is specified
- pandas.DataFrame or numpy.ndarray with order='F').
- CrossEntropy loss on CPU
- datasets.rotten_tomatoes(), a textual dataset
- monotone_constraints, #950
- CrossEntropy metric on CPUs with SSE3
- boost_from_average in RMSE mode. It gives a boost in quality especially for a small number of iterations.
- pandas.Categorical. Hint: use pandas.Categorical instead of object to speed up loading up to 200x.
- boost_from_average parameter for RMSE training on CPU, which might give a boost in quality.
- model.load_model(model_path, format="onnx") for that.
- get_features_importance with ShapValues for MultiClass, #868
- __builtins__ import in Python3 in PR #957, thanks to @AbhinavanT
- Feature Index to Feature Id in prettified output of the python method get_feature_importance(), because it supports feature names now
- per_float_feature_binarization (--per-float-feature-binarization) renamed to per_float_feature_quantization (--per-float-feature-quantization)
- inverted removed from the python cv method. Added a type parameter instead, which can be set to Inverted
- get_features() now works only for datasets without categorical features
- AUC Mu, which was proposed by Ross S. Kleiman at NeurIPS 2019, link
- MeanWeightedTarget in fstat
- utils.get_confusion_matrix()
- get_group_id() and get_features() methods of the Pool class
- PredictionDiff type of the get_feature_importance() method, which is a new method for model analysis. The method shows how the features influenced the fact that among two samples one has a higher prediction.
It allows debugging ranking models: you find a pair of samples ranked incorrectly and look at which features have caused that.
- plot_predictions() method
- model.set_feature_names() method in Python
- catboost.load_model() from CPU snapshots for numerical-only datasets
- CatBoostClassifier.score() now supports y as DataFrame
- sampling_frequency, per_float_feature_binarization, monotone_constraints parameters added to CatBoostClassifier and CatBoostRegresssor
- score() for multiclassification, #924
- get_all_params() function, #926
- fold_count is now called cv in grid_search() and randomized_search
- grid_search() and randomized_search() in the res['cv_results'] field
- catboost.save_model() now supports PMML, ONNX and other formats
- monotone_constraints in the python API allows specifying numerical features that the prediction shall depend on monotonically
- eval_metric calculation for training with weights (in release 0.16, evaluation of a metric that was equal to an optimized loss did not use weights by default, so the overfitting detector worked incorrectly)
- verbose added to grid_search() and randomized_search()
- grid_search() and randomized_search()
- MultiClass loss now has the same sign as Logloss. It had the other sign before and was maximized; now it is minimized.
- CatBoostRegressor.score now returns the value of the $R^2$ metric instead of RMSE to be more consistent with the behavior of scikit-learn regressors.
- use_weights default value changed to false (except for ranking metrics)
- catboost.datasets.monotonic1() and catboost.datasets.monotonic2(). Before that, california_housing was the only open-source dataset with monotonic constraints.
Now you can use these two to benchmark algorithms with monotonic constraints.
- DCG, FairLoss, HammingLoss, NormalizedGini and FilteredNDCG
- GridSearch and RandomSearch implementations.
- get_all_params() Python function returns the values of all training parameters, both user-defined and default.
- Logloss or MultiClass loss function deduction for CatBoostClassifier.fit now also works if the training dataset is specified as a Pool or a filename string.
- get_feature_statistics is replaced by calc_feature_statistics
- Correlation is renamed to Cosine
- efb_max_conflict_fraction is renamed to sparse_features_conflict_fraction

Note: PMML does not have full categorical features support, so to have the model in PMML format for datasets with categorical features you need to set the one_hot_max_size parameter to some large value, so that all categorical features are one-hot encoded.
- fstr_type in Python and R interfaces
- Logloss, MultiClass and MultiClassOneVsAll.
- border parameter of the Logloss metric. You need to use target_border as a separate training parameter now.
- CatBoostClassifier now runs MultiClass if more than 2 different values are present in the training dataset labels.
- model.best_score_["validation_0"] is replaced with model.best_score_["validation"] if a single validation dataset is present.
- get_object_importance function parameter ostr_type is renamed to type in Python and R.
- plot parameter added to get_roc_curve, get_fpr_curve and get_fnr_curve functions from catboost.utils.

And a set of fixes for your issues.
- has_header parameter added to the CatboostEvaluation class.
- : to ; in the CatboostEvaluation class.
- --counter-calc-method option to SkipTest
- get_metadata function, for example print catboost_model.get_metadata()['model_guid']

GPU training now supports several tree learning strategies, selectable with the grow_policy parameter. Possible values:
- SymmetricTree -- The tree is built level by level until max_depth is reached. On each iteration, all leaves from the last tree level are split with the same condition. The resulting tree structure is always symmetric.
- Depthwise -- The tree is built level by level until max_depth is reached. On each iteration, all non-terminal leaves from the last tree level are split. Each leaf is split by the condition with the best loss improvement.
- Lossguide -- The tree is built leaf by leaf until the max_leaves limit is reached. On each iteration, the non-terminal leaf with the best loss improvement is split.

Note: the Depthwise and Lossguide grow policies currently support only training and prediction modes. They do not support model analysis (like feature importances and SHAP values) or saving to different model formats like CoreML, ONNX, and JSON.
max_leaves -- Maximum leaf count in the resulting tree, default 31. Used only for Lossguide grow policy. Warning: It is not recommended to set this parameter greater than 64, as this can significantly slow down training.
min_data_in_leaf -- Minimum number of training samples per leaf, default 1. CatBoost will not search for new splits in leaves with a sample count less than min_data_in_leaf. This option is available for the Lossguide and Depthwise grow policies only.

Note: the new types of trees will be at least 10x slower in prediction than the default symmetric trees.
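For example, the options above can be combined like this (the parameter names are CatBoost's; the concrete values are illustrative, not recommendations, and the dict would typically be passed as keyword arguments to a CatBoost model constructor):

```python
# Illustrative Lossguide configuration following the constraints described
# above: keep max_leaves at or below 64 to avoid a large slowdown, and use
# min_data_in_leaf to stop splitting small leaves.
lossguide_params = {
    "grow_policy": "Lossguide",
    "max_leaves": 31,        # leaf-count limit; values above 64 can slow training a lot
    "min_data_in_leaf": 10,  # do not split leaves with fewer than 10 samples
}
```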
GPU training also supports several score functions that might give your model a boost in quality. Use the score_function parameter to experiment with them.
Now you can use quantization with more than 255 borders and one_hot_max_size > 255 in CPU training.
- save_borders() function to write borders to a file after training.
- predict, predict_proba, staged_predict, and staged_predict_proba now support applying a model to a single object, in addition to the usual data matrices.
- None if not initialized.
- LossFunctionChange.
  This type of feature importance works well in all the modes, but is especially good for ranking. It is more expensive to calculate, thus we have not made it the default; but you can look at it by selecting this type of feature importance.
- QuerySoftMax mode on GPU.
- cat_features, PR #679 by @infected-mushroom - thanks a lot @infected-mushroom!
- MVS, which speeds up CPU training if you use it.
- classes_ attribute in python.
- ctr_target_border_count.
This option can be used if your initial target values are not binary and you do regression or ranking. It is equal to 1 by default, but you can try increasing it.
- sampling_unit allows switching sampling from individual objects to entire groups.
- skip_train property for loss functions in the cv method. Contributed by GitHub user @RakitinDen, PR #662, many thanks.
- leaf_estimation_backtracking parameter.
- __eq__ method for CatBoost* python classes (PR #654). Thanks @daskol for your contribution!
- stdout or stderr in command-line CatBoost in calc mode by specifying stream://stdout or stream://stderr in the --output-path parameter argument (PR #646). Thanks @towelenee for your contribution!
- one_hot_max_size training parameter for groupwise loss function training.
- SampleId is the new main name for the former DocId column in the input data format (DocId is still supported for compatibility). Contributed by GitHub user @daskol, PR #655, many thanks.
- -X/-Y options with --cv, PR #644. Thanks @tswr for your pr!
- eval_metrics: eval_period is now clipped by the total number of trees in the specified interval. PR #653. Thanks @AntPon for your contribution!

We have also done a list of fixes and data check improvements. Thanks @brazhenko, @Danyago98, @infected-mushroom for your contributions.
- epsilon dataset into memory
- sampling_type parameter for YetiRankPairwise loss
- catboost.datasets() -- dataset epsilon, a large dense dataset for binary classification.
- cv on GPU.
- Pool from pandas.DataFrame with pandas.Categorical columns.
- eval_metrics(), get_feature_importance(), and get_object_importance(). In previous versions the weights were ignored.
- random-strength for pairwise training (PairLogitPairwise, QueryCrossEntropy, YetiRankPairwise) is not supported anymore.
- MultiClass and MultiClassOneVsAll metrics are now deprecated.
- cv method is now supported on GPU.
- class_names parameter: specify which class is negative (0) and which is positive (1). You can also use class_names in multiclassification mode to pass all possible class names to the fit function.
- --output-borders-file.
To use the borders for training use cli option --input-borders-file.
This functionanlity is now supported on CPU and GPU (it was GPU-only in previous versions).
File format for the borders is described here.--eval-file is now supported on GPU.cv function (times fold count)We also made a list of stability improvements and stricter checks of input data and parameters.
And we are so grateful to our community members @canorbal and @neer201 for their contribution in this release. Thank you.
* model_sum mode to command line interface
* random_strength for Plain boosting (#448)
* best_score_ and evals_result_ (issue #539)
* 0 by default
* catboost.sum_models() to sum models with provided weights.
* In python 3 some functions returned dictionaries with keys of type bytes - particularly eval_metrics and get_best_score. These are fixed to have keys of type str.
* get_evals_result() method and evals_result_ property to model in python wrapper to allow users to access metric values
* catboost.FeaturesData
* CatBoostClassifier and CatBoostRegressor
* Warning and Error logs to stderr
* target to label in method save_pool()
* get_params() method now returns only the params that were explicitly set when constructing the object. That means that CatBoostClassifier and CatBoostRegressor get_params() will not contain 'loss_function' if it was not specified.
  This also means that this code:

  ```python
  model1 = CatBoostClassifier()
  params = model1.get_params()
  model2 = CatBoost(params)
  ```

  will create model2 with the default loss_function RMSE, not with Logloss. This breaking change is done to support the sklearn interface, so that sklearn GridSearchCV can work.
* is_fitted_ => is_fitted()
* metadata_ => get_metadata()
* use_weights parameter to metrics. By default all metrics except AUC use weights, but you can disable it. To calculate a metric value without weights, you need to set this parameter to false. Example: Accuracy:use_weights=false. This can be done only for custom_metrics or eval_metric, not for the objective function. The objective function always uses weights if they are present in the dataset.
* LogLikelihoodOfPrediction, RecallAt:top=k, PrecisionAt:top=k and MAP:top=k.
* QueryAverage is renamed to the clearer AverageGain. This is a very important ranking metric. It shows the average target value in the top k documents of a group.
* Introduced parameter best_model_min_trees - the minimal number of trees the best model should have.
* get_roc_curve.
* get_gpu_device_count() method to python package. This is a way to check if your CUDA devices are available.
* catboost.select_threshold(self, data=None, curve=None, FPR=None, FNR=None, thread_count=-1). You can also calculate FPR and FNR for each boundary value.
* pool.slice(doc_indices)
* task_type='GPU' to enable GPU training.
* We also did a lot of stability improvements, improved usability of the library, added new parameter synonyms and improved input data validations.
Thanks a lot to all people who created issues on github. And thanks a lot to our contributor @pukhlyakova who implemented many new useful metrics!
* GroupId and SubgroupId in python-package
* In this release we added several very powerful ranking objectives:

Other ranking improvements:

* MetricName:hints=skip_train~false (it might speed up your training if metric calculation is a bottleneck, for example, if you calculate many metrics or if you calculate metrics on GPU).
* MetricName:hints=skip_train~true. If you want to calculate AUC or PFound on the train dataset you can use MetricName:hints=skip_train~false.
* verbose=n parameter
* metric_period=something and MetricName:hints=skip_train~false
* prettified=True the function will return a list of features with names sorted in descending order by their importance.
* We added many new metrics that can be used for visualization, overfitting detection, selecting the best iteration of training or for cross-validation:
Added make files for binary with CUDA and for Python package
We created a new repo with tutorials, now you don't have to clone the whole catboost repo to run Jupyter notebook with a tutorial.
We also have a set of bugfixes, and we are grateful to everyone who has filed a bug report, helping us make the library better.
This release contains contributions from CatBoost team. We want to especially mention @pukhlyakova who implemented lots of useful metrics.
* get_cat_feature_indices() in Python wrapper.
* numpy.ndarray and pandas.DataFrame with string values that can cause slight inconsistency while using a trained model from older versions. Around 1% of cat feature hashes were treated incorrectly. If you experience a quality drop after the update, you should consider retraining your model.
* get_object_importance model method in Python package and ostr mode in cli-version. A tutorial for Python is available here. More details and examples will be published in documentation soon.
* _catboost reinitialization issues #268 and #269.
* catboost.util extended with create_cd. It creates a column description file.
* catboost.datasets.
* use_cpu_ram_for_cat_features renamed to gpu_cat_features_storage with possible values CpuPinnedMemory and GpuRam. Default is GpuRam.
* This release contains contributions from CatBoost team.
As usual we are grateful to all who filed issues or helped resolve them, asked and answered questions.
* DocParallel mode for tasks without categorical features or with categorical features and --max-ctr-complexity 1. Provides the best performance for pools with a large number of documents.
* catboost.datasets).
* catboost.utils.create_cd).
* GroupId column.
* train() function to be consistent with other GBDT libraries.
* use_best_model is set to True by default if eval_set labels are present.
* YetiRank optimizes NDCG and PFound.
* eval_metrics and cv in Jupyter notebook.
* verbose=int: if verbose > 1, metric_period is set to this value.
* eval_set) = list in python. Currently supporting only a single eval_set.
* model_size_reg parameter to control model size. Fix ctr_leaf_count_limit parameter, also to control model size.
* subgroupId to Python/R-packages.
* eval_metrics.
* This release contains contributions from CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
* boosting_type parameter value Dynamic is renamed to Ordered.
* query_id parameter renamed to group_id in Python and R wrappers.
* as_pandas.
* Target is changed to Label. It will still work with the previous name, but it is recommended to use the new one.
* eval-metrics mode added into the cmdline version. Metrics can be calculated for a given dataset using a previously trained model.
* CtrFactor is added.
* fit function using a file with the dataset: fit(train_path, eval_set=eval_path, column_description=cd_file). This will reduce memory consumption by up to two times.
* bootstrap_type parameter to CatBoostClassifier and Regressor (issue #263).
* This release contains contributions from newbfg and CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
* QueryId
* GreedyLogSum
* is_classification check and CV for Logloss
* QueryRMSE and calculation of querywise metrics.
* boosting-type to switch between the standard boosting scheme and dynamic boosting, described in the paper "Dynamic boosting".
* bootstrap_type, subsample. Using the Bernoulli bootstrap type with subsample < 1 might increase the training speed.
* logging_level and metric_period (should be set in training parameters) to cv.
* train function that receives the parameters and returns a trained model.
* QueryRMSE now supports default settings for dynamic boosting.
* QueryRMSE with weights.
* This release contains contributions from CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
* model.shrink function added in Python and R wrappers.
* metric_period that controls output frequency.
* QueryAverage.
* As usual we are grateful to all who filed issues, asked and answered questions.

Cmdline:

* gradient-iterations renamed to leaf-estimation-iterations.
* border option removed. If you want to specify the border for binary classification mode, you need to specify it in the following way: loss-function Logloss:Border=0.5
* priors, per-feature-priors, ctr-binarization;
* simple-ctr, combinations-ctr, per-feature-ctr;

More details will be published in our documentation.

Python:

* gradient_iterations renamed to leaf_estimation_iterations.
* border option removed. If you want to specify the border for binary classification mode, you need to specify it in the following way: loss_function='Logloss:Border=0.5'
* priors, per_feature_priors, ctr_binarization;
* simple_ctr, combinations_ctr, per_feature_ctr;

More details will be published in our documentation.

* eval_metrics: it is now possible for a given model to calculate specified metric values for each iteration on a specified dataset.
* task-type CPU or GPU (task_type 'CPU', 'GPU' in python bindings). The Windows build still contains two binaries.
* As usual we are grateful to all who filed issues, asked and answered questions.
FlatBuffers model format: new CatBoost versions wouldn’t break model compatibility anymore.
* PairLogit - pairwise comparison of objects from the input dataset. The algorithm maximises the probability of correctly ordering all dataset pairs.
* QueryRMSE - a mix of regression and ranking. It tries to produce the best ranking for each dataset query given the input labels.
* The Verbose flag is now deprecated, please use logging_level instead. You could set the following levels: Silent, Verbose, Info, Debug.
* This release contains contributions from: avidale, newbfg, KochetovNicolai and CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
GPU CUDA support is available. CatBoost supports multi-GPU training. Our GPU implementation is 2 times faster than LightGBM and more than 20 times faster than XGBoost. Check out the news with benchmarks on our site.
Stability improvements and bug fixes
This release contains contributions from: daskol and CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
* Iter. This type of detector was requested by our users. So now you can also stop training by a simple criterion: if after a fixed number of iterations there is no improvement of your evaluation function.
* train_dir when training your model and then run "tensorboard --logdir={train_dir}"
* nan_mode for that. When applying a model, NaNs will be treated in the same way as for the features where NaN values were seen in train. It is not allowed to have NaN values in test if no NaNs in train for this feature were provided.
* snapshot_file parameter - this way, after you restart your training it will start from the last completed iteration.
* allow_writing_files parameter. By default some files with logging and diagnostics are written on disk, but you can turn it off by setting this flag to False.
* MultiClassOneVsAll. We also added the class_names param - now you don't have to renumber your classes to be able to use multiclass. And we have added two new metrics for multiclass: TotalF1 and MCC metrics.
  You can use the metrics to see how their values change during training, or use them for overfitting detection or for cutting the model at the best value of a given metric.
* tsv format, CatBoost now supports files with any delimiters
* Stability improvements and bug fixes
This release contains contributions from: grayskripko, hadjipantelis and CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.