Continued Pre-Training (CPT) at scale with DeepSpeed


This guide provides a step-by-step workflow for continued pre-training of the google/gemma-3-4b-pt model on a multi-GPU Anyscale cluster. It uses LLaMA-Factory as the training framework and DeepSpeed to efficiently manage memory and scale the training process.

CPT is a technique to further adapt a pre-trained base model on large-scale unlabeled text. By continuing to train on high-quality corpora, you adapt the model to new domain knowledge and improve generalization. This notebook performs full fine-tuning of the base model instead of using parameter-efficient fine-tuning (PEFT) techniques.

  • Full fine-tuning vs LoRA: Full fine-tuning generally yields the best quality but requires significantly more compute, longer training, and large checkpoints. LoRA is much faster and cheaper with small adapter checkpoints, but typically shows the most improvement on curated, simplified corpora (gains on broad/noisy corpora may be limited). See Compare full vs freeze vs PEFT and LoRA speed and memory optimizations.

Step 1: Set up your environment

Dependencies

First, ensure your environment has the correct libraries. Start with a pre-built container image and install LLaMA-Factory and DeepSpeed on top of it.

Recommended container image:

bash
anyscale/ray-llm:2.48.0-py311-cu128

Execute the following commands to install the required packages and optional tools for experiment tracking and faster model downloads:

python
%%bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory==0.9.3

# Install DeepSpeed for large-scale training
pip install -q deepspeed==0.16.9

# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9

# (Optional) Experiment tracking library
pip install -q mlflow==3.4.0

Model and compute resources

DeepSpeed ZeRO-3 partitions parameters, gradients, and optimizer states across multiple GPUs, enabling CPT of mid-sized LLMs on just 4 GPUs.
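As a rough back-of-envelope check (assuming bf16 weights and gradients with fp32 Adam optimizer states, and ignoring activations and buffers), you can estimate the per-GPU model-state footprint under ZeRO-3:

```python
# Rough per-GPU memory estimate for ZeRO-3 model-state partitioning.
# Assumes bf16 weights (2 B) + bf16 gradients (2 B) + fp32 Adam states
# (fp32 param copy, momentum, variance: 12 B) per parameter, all
# partitioned evenly across GPUs. Activations are extra.
def zero3_model_state_gb(num_params: float, num_gpus: int) -> float:
    bytes_per_param = 2 + 2 + 12
    return num_params * bytes_per_param / num_gpus / 1e9

# A 4B-parameter model sharded across 4 GPUs:
print(round(zero3_model_state_gb(4e9, 4), 1))  # ~16.0 GB of model states per GPU
```

This estimate explains why a 4B model fits comfortably on 4 × 40 GB-class GPUs with ZeRO-3, while leaving headroom for activations.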

  • Base model: google/gemma-3-4b-pt
  • Worker nodes: 4 × L40S or 4 × A100-40G

Step 2: Prepare the dataset

Understand the dataset

This tutorial uses a simple JSONL corpus (C4) containing cleaned English web text derived from Common Crawl, widely used for language-model pretraining. Each line is a JSON object with at least a text field. For demo purposes, the sample c4.jsonl contains only the first 100 records from the original C4 dataset (hosted on S3) to enable quick runs.

Dataset example

json
{"text": "Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.", "timestamp": "2019-04-25 12:57:54", "url": "https://klyq.com/beginners-bbq-class-taking-place-in-missoula/"}
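Each line is an independent JSON object, and only the text field feeds pre-training. A minimal sketch of streaming such a corpus (the iter_texts helper is illustrative, not part of LLaMA-Factory):

```python
import json

# Iterate over a JSONL corpus line by line; only the "text" field is
# used for pre-training (fields like timestamp/url are ignored).
def iter_texts(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["text"]
```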

Register the dataset

To specify new datasets that are accessible across Ray worker nodes, you must first add a dataset_info.json to storage shared across nodes such as /mnt/cluster_storage. This configuration file acts as a central registry for all your datasets. It maps a custom name to your dataset file location, format, and column structure.

If you plan to run CPT on this text dataset, first complete the setup steps below. Ensure that you place the dataset files in a storage location that all workers can access (for example, a shared mount or object storage). Avoid storing large files on the head node.

dataset_info.json

json
{
  "my_cpt_c4": {
      "file_name": "/mnt/cluster_storage/c4.jsonl",
      "columns": {
          "prompt": "text"
      }
  }
}
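If you prefer to generate the registry programmatically rather than hand-editing JSON, a small sketch (the write_registry helper is hypothetical, not a LLaMA-Factory API):

```python
import json

# Sketch: build a dataset_info.json registry entry programmatically so
# the dataset name, file location, and text column stay in one place.
# The registry key is the name referenced by `dataset:` in the training YAML.
def write_registry(registry_path: str, name: str, file_name: str,
                   text_column: str = "text") -> None:
    registry = {name: {"file_name": file_name,
                       "columns": {"prompt": text_column}}}
    with open(registry_path, "w") as f:
        json.dump(registry, f, indent=2)

# e.g. write_registry("/mnt/cluster_storage/dataset_info.json",
#                     "my_cpt_c4", "/mnt/cluster_storage/c4.jsonl")
```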

For a more detailed dataset preparation and formatting guide, see Choose your data format.

python
%%bash
# Make sure all files are accessible to worker nodes
# Create a copy of the data in /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/alpaca/c4.jsonl -O /mnt/cluster_storage/c4.jsonl
# Create a copy of the dataset registry in /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/

Step 3: Create the pre-training config (CPT with DeepSpeed)

Next, create the main YAML configuration file, the master recipe for your pre-training job. It specifies the base model, the training method (full fine-tuning), the dataset, training hyperparameters, cluster resources, and more.

Important notes:

  • MLflow tracking: To track experiments with MLflow, set report_to: mlflow in the config. If you don't want to use MLflow, set report_to: none to avoid errors.
  • Access and paths: The YAML only needs to be on the head node, but any referenced paths (dataset_dir, output_dir) must reside on storage reachable by all workers (for example, /mnt/cluster_storage/).
  • Gated models: If your base model has gated access (for example, Gemma) on Hugging Face, set HF_TOKEN in the runtime environment.
  • GPU selection and placement: The config uses a 4xL40S node (anyscale/accelerator_shape:4xL40S) so that all 4 GPUs are on the same machine, which is important for efficient DeepSpeed ZeRO-3 communication. You can switch to other multi-GPU nodes such as 4xA100-40GB or any other node type with comparable or more VRAM, depending on your cloud availability.

Configure LLaMA-Factory with Ray

Note: To customize the training configuration, edit train-configs/cpt_deepspeed.yaml.

yaml
# cpt_deepspeed.yaml

### model
model_name_or_path: google/gemma-3-4b-pt
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: full

### deepspeed
deepspeed: /mnt/cluster_storage/ds_z3_config.json # path to the DeepSpeed config

### dataset
dataset: my_cpt_c4
dataset_dir: /mnt/cluster_storage

template: gemma
cutoff_len: 512
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: gemma3_4b_full_cpt
logging_steps: 2
save_steps: 50
plot_loss: true
report_to: mlflow   # or none

### train
per_device_train_batch_size: 1 # Adjust this depending on your GPU memory and sequence length
gradient_accumulation_steps: 2
num_train_epochs: 2.0
learning_rate: 1.0e-4
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000

### ray
ray_run_name: gemma3_4b_full_cpt
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4  # Number of GPUs to use
resources_per_worker:
  GPU: 1
  # accelerator_type:L40S: 0.001            # Use this to simply specify a GPU type (may place GPUs on separate nodes).
  anyscale/accelerator_shape:4xL40S: 0.001  # Prefer this for DeepSpeed so all 4 GPUs are on the same node.
  # See https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types.
ray_init_kwargs:
  runtime_env:
    env_vars:
      # If using gated models like google/gemma-3-4b-pt
      HF_TOKEN: <your_huggingface_token>
      # If hf_transfer is installed
      HF_HUB_ENABLE_HF_TRANSFER: '1'
      # If using mlflow for experiments tracking
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"

Note: This configuration assumes 4xL40S GPUs are available in your cloud environment. If not, you can substitute with 4xA100-40G (or another supported accelerator with similar VRAM).

Together, stage: pt and finetuning_type: full configure this run as full continued pre-training on this C4-based corpus, producing full model checkpoints rather than lightweight adapters.
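The train section above also implies an effective global batch size, which you can sanity-check with simple arithmetic:

```python
# Effective global batch size implied by the YAML above:
# per-device batch × gradient-accumulation steps × number of workers.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
ray_num_workers = 4

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * ray_num_workers)
print(effective_batch_size)  # 8 sequences per optimizer step
```

If you raise per_device_train_batch_size to use spare VRAM, consider lowering gradient_accumulation_steps to keep the effective batch size, and thus the learning-rate schedule behavior, comparable.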

DeepSpeed configuration

DeepSpeed is an open-source deep-learning optimization library developed by Microsoft, aimed at enabling large-model training. Higher ZeRO stages (1→3) and enabling CPU offload reduce GPU VRAM usage, but might cause slower training.

To enable DeepSpeed, create a separate DeepSpeed config file in storage shared across nodes, and reference it from your main training YAML config:

yaml
deepspeed: /mnt/cluster_storage/ds_z3_config.json

Below is a sample ZeRO-3 config:

ds_z3_config.json

json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

For a more detailed guide on acceleration and optimization methods including DeepSpeed on Ray, see Speed and memory optimizations.

python
%%bash
# Create a copy of the DeepSpeed configuration file in /mnt/cluster_storage
cp ../deepspeed-configs/ds_z3_config.json /mnt/cluster_storage/

Step 4: Train and monitor

Note: For gated models such as google/gemma-3-4b-pt, ensure that you accept the license agreement for the models on the Hugging Face site and set HF_TOKEN in the runtime environment. If you installed MLflow, configure its credentials. Otherwise, set report_to: none in cpt_deepspeed.yaml to avoid api_token not set errors.

With all configurations in place, you can launch pre-training in one of two ways:

Option A: Run from a workspace (quickstart)

The USE_RAY=1 prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.

python
%%bash
USE_RAY=1 llamafactory-cli train ../train-configs/cpt_deepspeed.yaml

Option B: Run as an Anyscale job (production)

For longer or production runs, submit the training as an Anyscale job. Jobs run outside your interactive session for better stability, retries, and durable logs. You package LLaMA-Factory and other libraries in a container image and launch with a short job config. See Run LLaMA-Factory as an Anyscale job for the step-by-step guide.

Tracking with MLflow

If you enabled MLflow logging (report_to: mlflow in your YAML), LLaMA-Factory logs metrics (loss, learning rate, etc.), parameters, and artifacts to your configured MLflow tracking server.

Example YAML snippet:

yaml
report_to: mlflow

ray_init_kwargs:
  runtime_env:
    env_vars:
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"

MLflow example

For a more detailed guide on tracking experiments with other tools such as Weights & Biases or MLflow, see Observability and tracking.

Step 5: Locate checkpoints

Ray Train writes checkpoints under ray_storage_path/ray_run_name. In this example run, the path is: /mnt/cluster_storage/gemma3_4b_full_cpt.

Inside, you see a trainer session directory named like: TorchTrainer_8c6a5_00000_0_2025-09-09_09-53-45/.

  • Ray Train creates TorchTrainer_* when the trainer starts; the suffix encodes a short run ID and the start timestamp.
  • Within that directory, Ray Train names checkpoints checkpoint_000xxx/, where the number increments with each saved checkpoint.

Control the save cadence with save_strategy and save_steps. For instructions on how to resume interrupted training with resume_from_checkpoint and more, see Understand the artifacts directory.

Step 6: Export the model

If you use LoRA, you can keep the base model and adapters separate for multi-LoRA deployment or merge the adapters into the base model for low-latency inference.

For full fine-tuning or freeze-tuning, export the fine-tuned model directly.

You may optionally apply post-training quantization on merged or full models before serving.