doc/source/ray-overview/examples/llamafactory-llm-fine-tune/notebooks/cpt_deepspeed.ipynb
This guide provides a step-by-step workflow for continued pre-training (CPT) of the google/gemma-3-4b-pt model on a multi-GPU Anyscale cluster. It uses LLaMA-Factory as the training framework and DeepSpeed to manage memory efficiently and scale the training process.
CPT is a technique to further adapt a pre-trained base model on large-scale unlabeled text. By continuing to train on high-quality corpora, you adapt the model to new domain knowledge and improve generalization. This notebook performs full fine-tuning of the base model instead of using parameter-efficient fine-tuning (PEFT) techniques.
First, ensure your environment has the correct libraries. Start with a pre-built container image and install LLaMA-Factory and DeepSpeed on top of it.
Recommended container image:
anyscale/ray-llm:2.48.0-py311-cu128
Execute the following commands to install the required packages and optional tools for experiment tracking and faster model downloads:
%%bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory==0.9.3
# Install DeepSpeed for large-scale training
pip install -q deepspeed==0.16.9
# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9
# (Optional) Experiment tracking library
pip install -q mlflow==3.4.0
DeepSpeed ZeRO-3 partitions parameters, gradients, and optimizer states across multiple GPUs, enabling CPT of mid-sized LLMs on just 4 GPUs.
| Item | Value |
|---|---|
| Base model | google/gemma-3-4b-pt |
| Worker nodes | 4 × L40S / 4 × A100-40G |
This tutorial uses a simple JSONL corpus (C4) containing cleaned English web text derived from Common Crawl, widely used for language-model pretraining. Each line is a JSON object with at least a text field. For demo purposes, the sample c4.jsonl contains only the first 100 records from the original C4 dataset (hosted on S3) to enable quick runs.
Dataset example
{"text": "Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.", "timestamp": "2019-04-25 12:57:54", "url": "https://klyq.com/beginners-bbq-class-taking-place-in-missoula/"}
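Before registering the dataset, it can help to confirm that every line parses as JSON and carries a non-empty text field, since that's the column CPT reads. The following is a minimal sketch; the validate_cpt_records helper is hypothetical, and in practice you would pass it the lines of /mnt/cluster_storage/c4.jsonl:

```python
import json

def validate_cpt_records(lines):
    """Check that each JSONL line parses and contains a non-empty 'text' field."""
    records = []
    for i, line in enumerate(lines):
        obj = json.loads(line)
        if not obj.get("text"):
            raise ValueError(f"Line {i} is missing a non-empty 'text' field")
        records.append(obj)
    return records

# Demo with an in-memory sample; point this at the lines of c4.jsonl in practice.
sample = ['{"text": "Beginners BBQ Class Taking Place in Missoula!", "url": "https://example.com"}']
print(len(validate_cpt_records(sample)))  # 1
```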
To specify new datasets that are accessible across Ray worker nodes, you must first add a dataset_info.json to storage shared across nodes such as /mnt/cluster_storage. This configuration file acts as a central registry for all your datasets. It maps a custom name to your dataset file location, format, and column structure.
If you plan to run CPT on this text dataset, first complete the setup steps below. Ensure that you place the dataset files in a storage location that all workers can access (for example, a shared mount or object storage). Avoid storing large files on the head node.
dataset_info.json
{
"my_cpt_c4": {
"file_name": "/mnt/cluster_storage/c4.jsonl",
"columns": {
"prompt": "text"
}
}
}
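If you prefer to generate the registry programmatically (for example, when scripting setup for several datasets), a short sketch like the following produces the same dataset_info.json; the output path here is a placeholder for the shared storage location:

```python
import json

# Build the same registry entry as the dataset_info.json shown above.
registry = {
    "my_cpt_c4": {
        "file_name": "/mnt/cluster_storage/c4.jsonl",
        "columns": {"prompt": "text"},  # map the registry's prompt column to the JSONL 'text' field
    }
}

# On the cluster, write to /mnt/cluster_storage/dataset_info.json instead.
out_path = "dataset_info.json"
with open(out_path, "w") as f:
    json.dump(registry, f, indent=2)
```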
For a more detailed dataset preparation and formatting guide, see Choose your data format.
%%bash
# Make sure all files are accessible to worker nodes
# Create a copy of the data in /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/alpaca/c4.jsonl -O /mnt/cluster_storage/c4.jsonl
# Create a copy of the dataset registry in /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/
Next, create the main YAML configuration file—the master recipe for your pre-training job. It specifies the base model, the training method (full fine-tuning), the dataset, training hyperparameters, cluster resources, and more.
Important notes:
- The config sets report_to: mlflow. If you don't want to use MLflow, set report_to: none to avoid errors.
- All paths (dataset_dir, output_dir) must reside on storage reachable by all workers (for example, /mnt/cluster_storage/).
- For gated models, set HF_TOKEN in the runtime environment.
- The config requests the accelerator shape (anyscale/accelerator_shape:4xL40S) so that all 4 GPUs are on the same machine, which is important for efficient DeepSpeed ZeRO-3 communication. You can switch to other multi-GPU nodes such as 4xA100-40GB or any other node type with comparable or more VRAM, depending on your cloud availability.

Note: To customize the training configuration, edit train-configs/cpt_deepspeed.yaml.
# cpt_deepspeed.yaml
### model
model_name_or_path: google/gemma-3-4b-pt
trust_remote_code: true
### method
stage: pt
do_train: true
finetuning_type: full
### deepspeed
deepspeed: /mnt/cluster_storage/ds_z3_config.json # path to the DeepSpeed config
### dataset
dataset: my_cpt_c4
dataset_dir: /mnt/cluster_storage
template: gemma
cutoff_len: 512
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: gemma3_4b_full_cpt
logging_steps: 2
save_steps: 50
plot_loss: true
report_to: mlflow # or none
### train
per_device_train_batch_size: 1 # Adjust this depending on your GPU memory and sequence length
gradient_accumulation_steps: 2
num_train_epochs: 2.0
learning_rate: 1.0e-4
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000
### ray
ray_run_name: gemma3_4b_full_cpt
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4 # Number of GPUs to use
resources_per_worker:
GPU: 1
# accelerator_type:L40S: 0.001 # Use this to simply specify a GPU type (may place GPUs on separate nodes).
anyscale/accelerator_shape:4xL40S: 0.001 # Prefer this for DeepSpeed so all 4 GPUs are on the same node.
# See https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types.
ray_init_kwargs:
runtime_env:
env_vars:
# If using gated models like google/gemma-3-4b-pt
HF_TOKEN: <your_huggingface_token>
# If hf_transfer is installed
HF_HUB_ENABLE_HF_TRANSFER: '1'
# If using mlflow for experiments tracking
MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
Note:
This configuration assumes 4xL40S GPUs are available in your cloud environment. If not, you can substitute with 4xA100-40G (or another supported accelerator with similar VRAM).
Together, stage: pt and finetuning_type: full configure this run as full continued pre-training on this C4-based corpus, producing full model checkpoints rather than lightweight adapters.
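The training hyperparameters above also determine the effective global batch size: per-device batch size times gradient-accumulation steps times the number of workers. A quick sanity check with the values from this config:

```python
# Effective global batch size for the configuration above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
ray_num_workers = 4  # one GPU per worker

global_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * ray_num_workers
)
print(global_batch_size)  # 8
```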
DeepSpeed is an open-source deep-learning optimization library developed by Microsoft, aimed at enabling large-model training. Higher ZeRO stages (1→3) and CPU offload reduce GPU VRAM usage at the cost of training speed.
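To see why ZeRO-3 makes a ~4B-parameter full fine-tune fit on 4 GPUs, a rough back-of-envelope estimate helps. This sketch assumes bf16 weights and gradients plus fp32 Adam states (the usual mixed-precision accounting), and deliberately ignores activations, buffers, and fragmentation:

```python
# Rough per-GPU memory estimate for ZeRO-3 full fine-tuning with Adam in bf16.
# Back-of-envelope only: ignores activations, buffers, and fragmentation.
params = 4e9          # ~4B parameters (gemma-3-4b-pt)
bytes_per_param = (
    2     # bf16 weights
    + 2   # bf16 gradients
    + 12  # fp32 Adam states: master weights, momentum, variance
)
num_gpus = 4          # ZeRO-3 partitions all of the above across GPUs

per_gpu_gb = params * bytes_per_param / num_gpus / 1024**3
print(f"{per_gpu_gb:.1f} GB per GPU")  # ~14.9 GB before activations
```

That leaves comfortable headroom on a 48 GB L40S or a 40 GB A100 for activations and communication buffers; without partitioning, the full ~60 GB of model state would not fit on a single GPU of either type.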
To enable DeepSpeed, create a separate DeepSpeed config file on storage shared across nodes, and reference it from your main training YAML config with:
deepspeed: /mnt/cluster_storage/ds_z3_config.json
Below is a sample ZeRO-3 config:
ds_z3_config.json
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"overlap_comm": false,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
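If you want to experiment with different ZeRO stages, it can be convenient to generate the config programmatically rather than hand-editing JSON. The following is a sketch (the make_zero_config helper is hypothetical) that mirrors the key fields of ds_z3_config.json:

```python
import json

# Sketch: build a ZeRO config dict so the stage can be changed in one place.
# Mirrors the key fields of the ds_z3_config.json shown above.
def make_zero_config(stage=3, gather_16bit=True):
    return {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "zero_allow_untested_optimizer": True,
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": stage,
            "overlap_comm": False,
            "contiguous_gradients": True,
            "stage3_gather_16bit_weights_on_model_save": gather_16bit,
        },
    }

# On the cluster, write to /mnt/cluster_storage/ds_z3_config.json instead.
with open("ds_z3_config.json", "w") as f:
    json.dump(make_zero_config(), f, indent=2)
```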
For a more detailed guide on acceleration and optimization methods including DeepSpeed on Ray, see Speed and memory optimizations.
%%bash
# Create a copy of the DeepSpeed configuration file in /mnt/cluster_storage
cp ../deepspeed-configs/ds_z3_config.json /mnt/cluster_storage/
Note: For gated models such as google/gemma-3-4b-pt, ensure that you accept the license agreement for the models on the Hugging Face site and set HF_TOKEN in the runtime environment. If you installed MLflow, configure its credentials. Otherwise, set report_to: none in cpt_deepspeed.yaml to avoid api_token not set errors.
With all configurations in place, you can launch pre-training in one of two ways:
The USE_RAY=1 prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.
%%bash
USE_RAY=1 llamafactory-cli train ../train-configs/cpt_deepspeed.yaml
For longer or production runs, submit the training as an Anyscale job. Jobs run outside your interactive session for better stability, retries, and durable logs. You package LLaMA-Factory and other libraries in a container image and launch with a short job config. See Run LLaMA-Factory as an Anyscale job for the step-by-step guide.
If you enabled MLflow logging (report_to: mlflow in your YAML), LLaMA-Factory logs metrics (loss, learning rate, etc.), parameters, and artifacts to your configured MLflow tracking server.
Example YAML snippet:
report_to: mlflow
ray_init_kwargs:
runtime_env:
env_vars:
MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
MLflow example
For a more detailed guide on tracking experiments with other tools such as Weights & Biases or MLflow, see Observability and tracking.
Ray Train writes checkpoints under ray_storage_path/ray_run_name. In this example run, the path is: /mnt/cluster_storage/gemma3_4b_full_cpt.
Inside, you see a trainer session directory named like:
TorchTrainer_8c6a5_00000_0_2025-09-09_09-53-45/.
- Ray Train creates the TorchTrainer_* directory when the trainer starts; the suffix encodes a short run ID and the start timestamp.
- Saved checkpoints appear inside as checkpoint_000xxx/ subdirectories, numbered in the order they're saved.

Control the save cadence with save_strategy and save_steps. For instructions on how to resume interrupted training with resume_from_checkpoint and more, see Understand the artifacts directory.
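Because the checkpoint directories are zero-padded, the latest one sorts last lexicographically, which makes resuming straightforward. A small sketch (the latest_checkpoint helper is hypothetical; the demo uses a temporary directory in place of the real trainer session directory):

```python
from pathlib import Path
import tempfile

# Sketch: find the latest checkpoint under a trainer session directory.
# Zero-padded names (checkpoint_000000, checkpoint_000050, ...) sort correctly.
def latest_checkpoint(session_dir):
    ckpts = sorted(Path(session_dir).glob("checkpoint_*"))
    return ckpts[-1] if ckpts else None

# Demo with a fake session directory standing in for
# /mnt/cluster_storage/gemma3_4b_full_cpt/TorchTrainer_*/.
tmp = Path(tempfile.mkdtemp())
for i in (0, 50, 100):
    (tmp / f"checkpoint_{i:06d}").mkdir()
print(latest_checkpoint(tmp).name)  # checkpoint_000100
```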
If you use LoRA, you can keep the base model and adapters separate for multi-LoRA deployment or merge the adapters into the base model for low-latency inference.
For full fine-tuning or freeze-tuning, export the fine-tuned model directly.
You may optionally apply post-training quantization on merged or full models before serving.