#########################
Introduction to Model IO
#########################
**Contents**

.. contents::
   :backlinks: none
   :local:
Since 2.1.0, the default model format for XGBoost is the UBJSON format. This format is used for serializing models to file, serializing models to buffer, and for memory snapshots (pickle and alike).
JSON and UBJSON share the same document structure with different representations, and we
will refer to them collectively as the JSON format. This tutorial aims to share some basic
insights into the JSON serialisation method used in XGBoost. Unless explicitly stated
otherwise, the following sections assume you are using one of the two output formats,
which can be enabled by providing the file name with ``.json`` (or ``.ubj`` for binary
JSON) as the file extension when saving/loading models: ``booster.save_model('model.json')``.
More details below.
Before we get started, XGBoost is a gradient boosting library with a focus on tree models, which means that inside XGBoost there are two distinct parts:

1. The model consisting of trees and
2. Hyperparameters and configurations used for building the model.

If you come from the Deep Learning community, then it should be clear to you that there are differences between the neural network structures composed of weights with fixed tensor operations, and the optimizers (like RMSprop) used to train them.
So when one calls ``booster.save_model`` (``xgb.save`` in R), XGBoost saves the trees,
some model parameters like the number of input columns in trained trees, and the objective
function, which combined represent the concept of a "model" in XGBoost. As for why we
save the objective as part of the model: the objective controls the transformation of the
global bias (called ``base_score`` or the intercept in XGBoost) and task-specific
information. Users can share this model with others for inference or evaluation, or
continue the training with a different set of hyper-parameters, etc.
However, this is not the end of the story. There are cases where we need to save something more than the model itself. For example, in distributed training, XGBoost performs checkpointing operations. Or for some reason, your favorite distributed computing framework decides to copy the model from one worker to another and continue the training there. In such cases, the serialisation output is required to contain enough information to continue the previous training without the user providing any parameters again. We consider such a scenario as a memory snapshot (or memory-based serialisation method) and distinguish it from the normal model IO operation. Currently, memory snapshot is used in the following places:

* Python package: the ``Booster`` object is pickled with the built-in ``pickle`` module.
* R package: the ``xgb.Booster`` object is persisted with the built-in functions ``saveRDS`` or ``save``.
* JVM packages: the ``Booster`` object is serialized with the built-in function ``saveModel``.
To enable JSON format support for model IO (saving only the trees and objective), provide
a filename with ``.json`` or ``.ubj`` as the file extension, the latter being the extension for
`Universal Binary JSON <https://ubjson.org/>`__.
.. code-block:: python
   :caption: Python

   bst.save_model('model_file_name.json')
.. code-block:: r
   :caption: R

   xgb.save(bst, 'model_file_name.json')
.. code-block:: scala
   :caption: Scala

   val format = "json"  // or val format = "ubj"
   model.write.option("format", format).save("model_directory_path")
.. note::

   Only load models from JSON files that were produced by XGBoost. Attempting to load
   JSON files that were produced by an external source may lead to undefined behaviors
   and crashes.
When loading the model back, XGBoost recognizes the file extensions ``.json`` and
``.ubj``, and can dispatch accordingly. If the extension is not specified, XGBoost tries
to guess the right one.
***************************************************************
A note on backward compatibility of models and memory snapshots
***************************************************************
We guarantee backward compatibility for models but not for memory snapshots.
Models (trees and objective) use a stable representation, so that models produced in earlier
versions of XGBoost are accessible in later versions of XGBoost. If you'd like to store or archive
your model for long-term storage, use ``save_model`` (Python) and ``xgb.save`` (R).
On the other hand, a memory snapshot (serialisation) captures many things internal to XGBoost, and its
format is not stable and is subject to frequent changes. Therefore, memory snapshots are suitable for
checkpointing only, where you persist the complete snapshot of the training configuration so that
you can recover robustly from possible failures and resume the training process. Loading a memory
snapshot generated by an earlier version of XGBoost may result in errors or undefined behaviors.
If a model is persisted with ``pickle.dump`` (Python) or ``saveRDS`` (R), then the model may
not be accessible in later versions of XGBoost.
.. _custom-obj-metric:
***************************
Custom objective and metric
***************************
XGBoost accepts user-provided objective, metric, and callback functions as extensions. These functions are not saved in the model file, as they are language-dependent features. With Python, users can pickle the model to include these functions in the saved binary. One drawback is that the output from pickle is not a stable serialization format and doesn't work across different Python versions or XGBoost versions, not to mention different language environments. Another way to work around this limitation is to provide these functions again after the model is loaded, separating the serialization of the XGBoost built-in model from the auxiliary methods.

If the customized function is useful, please consider making a PR for implementing it inside XGBoost; this way we can have your functions working with different language bindings. See the next section for more about pickling.
**********************************
Loading pickled files or RDS files
**********************************

From a different XGBoost version
================================
As noted, a pickled model is neither portable nor stable, but in some cases pickled
models are valuable. One way to restore such a model in the future is to load it back with that
specific version of Python and XGBoost, and then export the model by calling
:py:meth:`xgboost.Booster.save_model` or :py:meth:`xgboost.XGBModel.save_model`.
.. note:: Pickle is not secure

   Only load pickled files from a trusted source. The ``pickle`` Python module is NOT
   secure. By extension, ``joblib`` and ``cloudpickle`` are also not safe when loading
   files from unknown sources. See https://docs.python.org/3/library/pickle.html for more
   information.
A similar procedure may be used to recover a model persisted in an old RDS file. In R,
you can install an older version of XGBoost using the ``remotes`` package:
.. code-block:: r

   library(remotes)
   remotes::install_version("xgboost", "0.90.0.1")  # Install version 0.90.0.1
Once the desired version is installed, you can load the RDS file with ``readRDS`` and
recover the ``xgb.Booster`` object. Then call ``xgb.save`` to export the model using the
stable representation. Now you should be able to use the model in the latest version of
XGBoost.
********************************************************
Saving and Loading the internal parameters configuration
********************************************************
XGBoost's C API, Python API and R API support saving and loading the internal
configuration directly as a JSON string. In Python package:
.. code-block:: python

   bst = xgboost.train(...)
   config = bst.save_config()
   print(config)
or in R:
.. code-block:: R

   config <- xgb.config(bst)
   print(config)
This will print out something similar to (not the actual output, as it is too long for demonstration):
.. code-block:: javascript
{
"Learner": {
"generic_parameter": {
"device": "cuda:0",
"gpu_page_size": "0",
"n_jobs": "0",
"random_state": "0",
"seed": "0",
"seed_per_iteration": "0"
},
"gradient_booster": {
"gbtree_train_param": {
"num_parallel_tree": "1",
"process_type": "default",
"tree_method": "hist",
"updater": "grow_gpu_hist",
"updater_seq": "grow_gpu_hist"
},
"name": "gbtree",
"updater": {
"grow_gpu_hist": {
"gpu_hist_train_param": {
"debug_synchronize": "0",
},
"train_param": {
"alpha": "0",
"cache_opt": "1",
"colsample_bylevel": "1",
"colsample_bynode": "1",
"colsample_bytree": "1",
"default_direction": "learn",
...
"subsample": "1"
}
}
}
},
"learner_train_param": {
"booster": "gbtree",
"disable_default_eval_metric": "0",
"objective": "reg:squarederror"
},
"metrics": [],
"objective": {
"name": "reg:squarederror",
"reg_loss_param": {
"scale_pos_weight": "1"
}
}
},
"version": [1, 0, 0]
}
You can load it back into a model generated by the same version of XGBoost by:
.. code-block:: python
bst.load_config(config)
This way users can study the internal representation more closely. Please note that some JSON generators use locale-dependent floating point serialization methods, which is not supported by XGBoost.
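Since the configuration is a plain JSON string, it can be inspected with standard tools. A small sketch using a shortened, hand-written stand-in for the real output (which is much longer):

```python
import json

# A shortened, hand-written stand-in for the string returned by
# ``bst.save_config()``; the real output contains many more fields.
config = """
{
  "Learner": {
    "learner_train_param": {
      "booster": "gbtree",
      "objective": "reg:squarederror"
    }
  },
  "version": [1, 0, 0]
}
"""
parsed = json.loads(config)
print(parsed["Learner"]["learner_train_param"]["objective"])
# reg:squarederror
```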
**************************************************
Difference between saving model and dumping model
**************************************************
XGBoost has a function called ``dump_model`` in the ``Booster`` class, which lets you
export the model in a readable format like ``text``, ``json`` or ``dot`` (graphviz). The
primary use case for it is model interpretation and visualization; the output is not
supposed to be loaded back into XGBoost.
**********
Categories
**********
Since 3.1, the categories encoding from a training dataframe is stored in the booster to
provide test-time re-coding support; see :ref:`cat-recode` for more info about how the
re-coder works. We will briefly explain the JSON format for the serialized category index.
The categories are saved in a JSON object named ``cats`` under the ``gbtree`` model. It contains three keys.

The first key, ``feature_segments``, is a CSR-like pointer that stores the number of categories for each feature. It starts with zero and ends with the total number of categories from all features. For example:
.. code-block:: python

   feature_segments = [0, 3, 3, 5]
The ``feature_segments`` list represents a dataset with two categorical features and one
numerical feature. The first feature contains three categories, the second feature is
numerical and thus has no categories, and the last feature includes two categories.
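The per-feature category counts can be recovered from the pointer exactly as with a CSR matrix, by differencing adjacent entries:

```python
# feature_segments as stored in the "cats" object (values from the text above).
feature_segments = [0, 3, 3, 5]

# The number of categories per feature is the difference of adjacent
# pointers, just like row lengths in a CSR matrix.
n_categories = [b - a for a, b in zip(feature_segments, feature_segments[1:])]
print(n_categories)  # [3, 0, 2]
```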
The second key is an array that stores the sorted indices (argsort) of categories across all
features, segmented by ``feature_segments``. Given a feature with the categories ``["b", "c", "a"]``, the sorted index is ``[2, 0, 1]``.
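In plain Python, the argsort for that example can be reproduced as:

```python
categories = ["b", "c", "a"]

# argsort: the positions that would sort the categories lexicographically.
# "a" is at index 2, "b" at index 0, "c" at index 1.
sorted_idx = sorted(range(len(categories)), key=categories.__getitem__)
print(sorted_idx)  # [2, 0, 1]
```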
The third key, ``enc``, is an array with a length equal to the number of features, storing all the categories
in the same order as the input dataframe. The storage schema depends on whether the
categories are strings (XGBoost also supports numerical categories, such as integers). For
string categories, we use a schema similar to the arrow format for a string array. The
categories of each feature are represented by two arrays, namely ``offsets`` and
``values``. The format is also similar to a CSR matrix. The ``values`` field is a
``uint8`` array storing the characters from all category names. Given a feature with three
categories ``["bb", "c", "a"]``, the ``values`` field is ``[98, 98, 99, 97]``. Then
``offsets`` segments the ``values`` array similarly to a CSR pointer: ``[0, 2, 3, 4]``. We
chose not to store the values as a JSON string to avoid handling special characters
and string encoding. The string names are stored exactly as given by the dataframe.
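Decoding the ``offsets``/``values`` pair back into strings is a straightforward CSR-style walk; a sketch using the example above:

```python
# offsets/values for the feature ["bb", "c", "a"] from the text above.
values = [98, 98, 99, 97]   # uint8 character codes from all category names
offsets = [0, 2, 3, 4]      # CSR-style pointers into ``values``

# Each adjacent pair of offsets delimits one category name.
decoded = [
    bytes(values[start:end]).decode("utf-8")
    for start, end in zip(offsets, offsets[1:])
]
print(decoded)  # ['bb', 'c', 'a']
```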
As for numerical categories, the ``enc`` contains two keys: ``type`` and ``values``. The
``type`` field is an integer ID that identifies the type of the categories, such as 64-bit
integers and 32-bit floating points (note that they are all ``f32`` inside a decision
tree). The exact mapping between the type and the integer ID is internal but stable. The
``values`` field is an array storing all categories in a feature.
*************
Brief History
*************