catboost/docs/en/concepts/python-reference_cv.md
cv(pool=None,
params=None,
dtrain=None,
iterations=None,
num_boost_round=None,
fold_count=3,
nfold=None,
inverted=False,
partition_random_seed=0,
seed=None,
shuffle=True,
logging_level=None,
stratified=None,
as_pandas=True,
metric_period=None,
verbose=None,
verbose_eval=None,
plot=False,
early_stopping_rounds=None,
folds=None,
type='Classical',
return_models=False)
{% include cv-cv__purpose %}
The dataset is split into N folds. N–1 folds are used for training, and one fold is used for model performance estimation. N models are updated on each iteration K. Each model is evaluated on its own validation dataset on each iteration. This produces N metric values on each iteration K.
The cv function calculates the average of these N values and the standard deviation. Thus, these two values are returned on each iteration.
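The per-iteration aggregation described above can be sketched in plain Python. This is an illustration, not CatBoost's implementation: the helper name `aggregate_fold_metrics` and the metric values are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator cv uses.

```python
from statistics import mean, pstdev


def aggregate_fold_metrics(fold_metrics):
    """fold_metrics[f][k] is the metric of fold f's model at iteration k.

    Returns one (mean, std) pair per iteration.
    """
    # zip(*...) regroups the values so that each tuple holds
    # the N fold values of a single iteration.
    return [(mean(values), pstdev(values)) for values in zip(*fold_metrics)]


# Three folds (N = 3), two iterations; hypothetical metric values.
per_fold = [
    [0.70, 0.60],
    [0.68, 0.62],
    [0.72, 0.64],
]
summary = aggregate_fold_metrics(per_fold)
```

Here `summary` holds two pairs, one per iteration, each combining the three fold values of that iteration.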
If the dataset contains group identifiers, all objects from one group are added to the same fold when partitioning is performed.
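To make the group constraint concrete, the sketch below assigns a fold to every object so that a group never spans two folds. The round-robin assignment and the name `assign_folds_by_group` are assumptions made for illustration; the text does not state how cv distributes groups across folds.

```python
def assign_folds_by_group(group_ids, fold_count):
    """Return a fold index per object; objects of one group share a fold."""
    fold_of_group = {}
    for group in group_ids:
        if group not in fold_of_group:
            # Each newly seen group is dealt to the next fold in round-robin order.
            fold_of_group[group] = len(fold_of_group) % fold_count
    return [fold_of_group[group] for group in group_ids]


# Six objects in four groups (hypothetical identifiers), split into three folds.
folds = assign_folds_by_group(["q1", "q1", "q2", "q3", "q3", "q4"], fold_count=3)
```

Both `q1` objects land in the same fold, as do both `q3` objects.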
Alias: dtrain
The input dataset to cross-validate.
Possible types
Default value
{{ python--required }}
{% include python__cv-python__cv__params__description__div %}
{% note info %}
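The following parameters are not supported in cross-validation mode: save_snapshot, --snapshot-file, snapshot_interval.

The behavior of the overfitting detector differs slightly from the training mode. Only one metric value is calculated at each iteration in the training mode, while fold_count metric values are calculated in the cross-validation mode. Therefore, all fold_count values are averaged and the best iteration is chosen based on the average metric value at each iteration.

{% endnote %}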
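The selection rule in the note can be sketched as follows: the fold values are averaged per iteration, and the best iteration is picked on that average rather than per fold. The function name `best_iteration` and the sample values are hypothetical.

```python
def best_iteration(fold_metrics, lower_is_better=True):
    """fold_metrics[f][k] is the metric of fold f at iteration k."""
    # Average the fold_count values of each iteration.
    means = [sum(values) / len(values) for values in zip(*fold_metrics)]
    pick = min if lower_is_better else max
    # Index of the iteration with the best average value.
    return pick(range(len(means)), key=means.__getitem__)


per_fold = [
    [0.70, 0.60, 0.65],
    [0.68, 0.66, 0.61],  # this fold alone would prefer iteration 2
    [0.72, 0.63, 0.66],
]
chosen = best_iteration(per_fold)
```

In this example the second fold has its lowest value at iteration 2, but the average over all folds is lowest at iteration 1, so that iteration is chosen.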
Possible types
{{ python-type--dict }}
Default value
{{ python--required }}
Aliases: num_boost_round, n_estimators, num_trees
The maximum number of trees that can be built when solving machine learning problems.
When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter.
Possible types
{{ python-type--int }}
Default value
{{ fit--iterations }}
Alias: nfold
The number of folds to split the dataset into.
Possible types
{{ python-type--int }}
Default value
3
Train on the test fold and evaluate the model on the training folds.
Possible types
{{ python-type--bool }}
Default value
False
Alias: seed
{% include reusage-cv-rand__desc_intro %}
{% include reusage-cv-rand__permutation-is-performed %}
{% include reusage-cv-rand__unique-data-splits %}
Possible types
{{ python-type--int }}
Default value
0
Shuffle the dataset objects before splitting into folds.
Possible types
{{ python-type--bool }}
Default value
True
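As a rough illustration of how shuffle and partition_random_seed interact, the sketch below permutes object indices with a seeded generator before dealing them into folds. It uses Python's `random` module and a strided split purely for illustration; CatBoost's actual permutation and partitioning are not specified here, and `split_into_folds` is a hypothetical helper.

```python
import random


def split_into_folds(n_objects, fold_count=3, shuffle=True, partition_random_seed=0):
    """Return fold_count lists of object indices that together cover the dataset."""
    indices = list(range(n_objects))
    if shuffle:
        # A dedicated generator makes the permutation reproducible for a given seed.
        random.Random(partition_random_seed).shuffle(indices)
    # Fold i takes every fold_count-th index, starting at position i.
    return [indices[i::fold_count] for i in range(fold_count)]


folds_a = split_into_folds(8, fold_count=3, partition_random_seed=42)
folds_b = split_into_folds(8, fold_count=3, partition_random_seed=42)
unshuffled = split_into_folds(8, fold_count=3, shuffle=False)
```

Repeating the call with the same seed reproduces the same split, which is the practical reason to fix the seed when comparing parameter sets.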
The logging level to output to stdout.
Possible values:
Silent — Do not output any logging information to stdout.
Verbose — Output the following data to stdout:
Info — Output additional information and the number of trees.
Debug — Output debugging information.
Possible types
{{ python-type--string }}
Default value
None (corresponds to the {{ fit--verbose }} logging level)
Perform stratified sampling.
It is turned on (True) by default if one of the following loss functions is selected: {{ error-function--Logit }}, {{ error-function--MultiClass }}, {{ error-function--MultiClassOneVsAll }}.
It is turned off (False) for all other loss functions by default.
Possible types
{{ python-type--bool }}
Default value
None
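Stratified sampling keeps the class proportions roughly equal across folds. A minimal sketch, assuming a simple per-class round-robin assignment (the helper `stratified_folds` is hypothetical and not CatBoost's procedure):

```python
from collections import defaultdict


def stratified_folds(labels, fold_count):
    """Return fold_count lists of indices with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(fold_count)]
    for indices in by_class.values():
        # Objects of one class are dealt to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % fold_count].append(index)
    return folds


# Four objects of class 0 and two of class 1, split into two folds.
folds = stratified_folds([0, 0, 0, 0, 1, 1], fold_count=2)
```

Each resulting fold contains two objects of class 0 and one of class 1, matching the 2:1 ratio of the full dataset.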
Sets the type of return value to {{ python-type--pandasDataFrame }}.
The type of return value is {{ python-type--dict }} if this parameter is set to False or the pandas{{ python-package }} is not installed.
Possible types
{{ python-type--bool }}
Default value
True
{% include reusage-cli__metric-period__desc__start %}
{% include reusage-cli__metric-period__desc__end %}
Possible types
{{ python-type--int }}
Default value
{{ fit__metric-period }}
Alias: verbose_eval
{% include sections-with-methods-desc-python__feature-importances__verbose__short-description__list-intro %}
{{ python-type--bool }} — Defines the logging level:
{{ python-type--int }} — Use the Verbose logging level and set the logging period to the value of this parameter.
Possible types
Default value
False
Plot the following information during training:
Possible types
{{ python-type--bool }}
Default value
{{ fit--plot }}
Sets the overfitting detector type to {{ fit--od-type-iter }} and stops the training after the specified number of iterations since the iteration with the optimal metric value.
Possible types
{{ python-type--int }}
Default value
False
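The stopping rule for early_stopping_rounds can be sketched as a loop over the averaged metric. This is a simplified model: the exact counting convention of CatBoost's overfitting detector is an assumption here, and `stopping_iteration` is a hypothetical helper.

```python
def stopping_iteration(mean_metric_per_iteration, early_stopping_rounds, lower_is_better=True):
    """Return (last_iteration_run, best_iteration) under the stopping rule."""
    best_value, best_iter = None, -1
    for iteration, value in enumerate(mean_metric_per_iteration):
        if best_value is None:
            improved = True
        elif lower_is_better:
            improved = value < best_value
        else:
            improved = value > best_value

        if improved:
            best_value, best_iter = value, iteration
        elif iteration - best_iter >= early_stopping_rounds:
            # No improvement for early_stopping_rounds iterations: stop here.
            return iteration, best_iter
    return len(mean_metric_per_iteration) - 1, best_iter


values = [0.70, 0.65, 0.66, 0.67, 0.64, 0.60]  # hypothetical averaged metric
stopped = stopping_iteration(values, early_stopping_rounds=2)
patient = stopping_iteration(values, early_stopping_rounds=3)
```

With a patience of 2 the loop stops at iteration 3 and never sees the later improvement; with a patience of 3 it runs to the end.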
Custom splitting indices.
The format of the input data depends on the type of the parameter:
This parameter has the highest priority among other data split parameters.
Possible types
Default value
None
The method to split the dataset into folds.
Possible values:
{{ cv__type__Classical }} — The dataset is split into fold_count folds, and fold_count trainings are performed. Each test set consists of a single fold, and the corresponding train set consists of the remaining fold_count–1 folds.
{{ cv__type__Inverted }} — The dataset is split into fold_count folds, and fold_count trainings are performed. Each train set consists of a single fold, and the corresponding test set consists of the remaining fold_count–1 folds.
{{ cv__type__TimeSeries }} — The dataset is split into (fold_count + 1) consecutive parts without shuffling the data, fold_count trainings are performed. The k-th train set consists of the first k folds, and the corresponding test set consists of the (k+1)-th fold.
Possible types
{{ python-type--string }}
Default value
{{ cv__type__default }}
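The three splitting methods can be compared with a small index-based sketch. It assumes the type values are the strings `Classical`, `Inverted`, and `TimeSeries` and uses contiguous, unshuffled parts for readability; `cv_splits` is a hypothetical helper, not part of the CatBoost API.

```python
def cv_splits(n_objects, fold_count, cv_type="Classical"):
    """Return a list of (train_indices, test_indices) pairs, one per training."""
    if cv_type not in ("Classical", "Inverted", "TimeSeries"):
        raise ValueError(f"unknown cv_type: {cv_type}")

    # TimeSeries needs one extra part so that every training has a later test part.
    parts_count = fold_count + 1 if cv_type == "TimeSeries" else fold_count
    bounds = [n_objects * i // parts_count for i in range(parts_count + 1)]
    parts = [list(range(bounds[i], bounds[i + 1])) for i in range(parts_count)]

    splits = []
    for k in range(fold_count):
        if cv_type == "TimeSeries":
            # Train on everything up to and including part k, test on the next part.
            train = [i for part in parts[: k + 1] for i in part]
            test = parts[k + 1]
        else:
            single = parts[k]
            rest = [i for j, part in enumerate(parts) if j != k for i in part]
            # Classical tests on the single fold; Inverted trains on it.
            train, test = (rest, single) if cv_type == "Classical" else (single, rest)
        splits.append((train, test))
    return splits


classical = cv_splits(6, fold_count=3, cv_type="Classical")
inverted = cv_splits(6, fold_count=3, cv_type="Inverted")
time_series = cv_splits(8, fold_count=3, cv_type="TimeSeries")
```

In the TimeSeries case the train set grows with each training, while the test set is always the part that immediately follows it.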
If return_models is True, cv also returns a list of the models fitted for each CV fold. The default is False.
Possible types
{{ python-type--bool }}
Default value
False
The return value depends on return_models, as_pandas, and the availability of the pandas{{ python-package }}:
If return_models is False, cv returns cv_results, which is a dict or a pandas frame (see the table below).
If return_models is True, cv returns a tuple (cv_results, fitted_models) containing, in addition to the regular cv_results, a list of models fitted for each fold.

| as_pandas value | pandas{{ python-package }} availability | Type of return value |
|---|---|---|
| True | Installed | {{ python-type--pandasDataFrame }} |
| True | Not installed | {{ python-type--dict }} |
| False | Unimportant | {{ python-type--dict }} |
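The decision table above, combined with the return_models rule, can be written as a small function. This only restates the documented rules; `cv_return_kind` and the string labels it returns are hypothetical and exist for illustration.

```python
def cv_return_kind(as_pandas=True, pandas_installed=True, return_models=False):
    """Describe what cv returns for a given combination of settings."""
    # A DataFrame is only possible when it is requested and pandas is available.
    results_kind = "pandas.DataFrame" if as_pandas and pandas_installed else "dict"
    if return_models:
        # The results are paired with the list of per-fold models.
        return (results_kind, "list of models")
    return results_kind
```

For example, `cv_return_kind(as_pandas=True, pandas_installed=False)` yields `"dict"`, which mirrors the second row of the table.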
The first key (if the output type is {{ python-type--dict }}) or column name (if the output type is {{ python-type--pandasDataFrame }}) contains the iteration number for the metric values on the corresponding line. Each subsequent key or column name is formed from the evaluation dataset type (train or test), the metric name, and the computed characteristic (std, mean, etc.). Each value is a list of the corresponding computed values.
For example, if only the {{ error-function--Logit }} metric is specified in the parameters, then the return value is:
iterations test-Logloss-mean test-Logloss-std train-Logloss-mean train-Logloss-std
0 0 0.693219 0.000101 0.684767 0.011851
1 1 0.682687 0.014995 0.674235 0.003043
2 2 0.672758 0.029630 0.655983 0.005906
3 3 0.668589 0.023734 0.648127 0.005204
Each key or column value contains the same number of calculated values as the number of training iterations (or fewer, if the overfitting detector is turned on and the threshold is reached earlier).
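The naming scheme for keys and columns can be sketched as below, using the Logloss values from the sample output as a dict-typed result. The helper `result_columns` is hypothetical, and the ordering it produces for more than one metric is an assumption.

```python
def result_columns(metric_names, dataset_types=("test", "train"), characteristics=("mean", "std")):
    """Build key names of the form <dataset>-<metric>-<characteristic>."""
    columns = ["iterations"]
    for metric in metric_names:
        for dataset in dataset_types:
            for characteristic in characteristics:
                columns.append(f"{dataset}-{metric}-{characteristic}")
    return columns


columns = result_columns(["Logloss"])

# The sample output above, expressed as a dict of equally long lists.
cv_results = {
    "iterations": [0, 1, 2, 3],
    "test-Logloss-mean": [0.693219, 0.682687, 0.672758, 0.668589],
    "test-Logloss-std": [0.000101, 0.014995, 0.029630, 0.023734],
    "train-Logloss-mean": [0.684767, 0.674235, 0.655983, 0.648127],
    "train-Logloss-std": [0.011851, 0.003043, 0.005906, 0.005204],
}

# Position of the lowest mean test Logloss, mapped back to its iteration number.
test_means = cv_results["test-Logloss-mean"]
best_row = min(range(len(test_means)), key=test_means.__getitem__)
best_iteration = cv_results["iterations"][best_row]
```

Because every list has one entry per iteration, a row position found in one column can be used to look up the matching value in any other column.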
{% include cv-usage-example-cv__usage-example %}
{% include cv-usage-example-cv_with_roc_curve__example %}