# PSOFT: Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation
PSOFT aims to preserve the geometric relationships among pre-trained weight column vectors—a core principle of OFT—while achieving a balanced trade-off across parameter, computation, and memory efficiency. Unlike existing OFT variants (e.g., OFTv2, BOFT, and GOFT) that rely on sparsity-based designs, PSOFT adopts a low-rank principal subspace perspective, bridging the gap between LoRA and OFT. PSOFT confines orthogonal fine-tuning to a principal subspace, offering theoretical guarantees via orthogonality constraints on the down-projection matrix, while enabling practical adaptability through two low-dimensional tunable vectors.
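The exact PSOFT parameterization lives in the PEFT implementation; the snippet below is only a small numerical illustration of the property PSOFT builds on, namely that an orthogonal transform confined to a rank-r principal subspace of a weight matrix leaves the pairwise angles between its column vectors unchanged. All names in it are illustrative and are not PEFT API.

```python
# Illustrative only: NOT the PEFT PSOFT implementation. A numeric check that an
# orthogonal transform restricted to a rank-r principal subspace of the weight
# preserves the pairwise angles (cosine similarities) between weight columns.
import torch

torch.manual_seed(0)
W = torch.randn(256, 256)                    # stand-in for a pre-trained weight
r = 32                                       # rank of the principal subspace

U, _, _ = torch.linalg.svd(W, full_matrices=False)
U_r = U[:, :r]                               # top-r left singular vectors

# A random r x r orthogonal matrix (the Q factor of a QR decomposition).
R_sub, _ = torch.linalg.qr(torch.randn(r, r))

# Acts as R_sub inside the principal subspace and as the identity outside it.
Q = torch.eye(256) + U_r @ (R_sub - torch.eye(r)) @ U_r.T
W_rotated = Q @ W

def column_cosines(M):
    M = M / M.norm(dim=0, keepdim=True)
    return M.T @ M

# Angles between columns are unchanged because Q is orthogonal.
print(torch.allclose(column_cosines(W), column_cosines(W_rotated), atol=1e-4))
```

In practice you only interact with `PsoftConfig` and `get_peft_model`, as in the minimal training example below.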
```python
import torch
from datasets import load_dataset
from peft import PsoftConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

# PSOFT adapter configuration.
psoft_config = PsoftConfig(
    r=32,
    psoft_alpha=32,
)
peft_model = get_peft_model(model, psoft_config)
peft_model.print_trainable_parameters()

# Small IMDB slice for a quick demonstration run.
dataset = load_dataset("imdb", split="train[:1%]")
training_args = SFTConfig(dataset_text_field="text", max_length=128)
trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
peft_model.save_pretrained("psoft-opt-125m")
```
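After training, the adapter saved in `psoft-opt-125m` can be loaded back onto the base model. This is a minimal sketch, assuming PSOFT adapters load through the standard `PeftModel.from_pretrained` path like other PEFT methods:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Re-attach the saved PSOFT adapter and run a quick generation check.
model = PeftModel.from_pretrained(base, "psoft-opt-125m")
model.eval()

inputs = tokenizer("The movie was", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```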
You can also launch training through the example script, e.g. to fine-tune `meta-llama/Llama-3.2-3B` on the same IMDB slice:

```bash
python psoft_finetuning.py \
    --base_model_name_or_path meta-llama/Llama-3.2-3B \
    --output_dir ./outputs/psoft-llama3.2-3b-imdb \
    --data_path imdb \
    --dataset_split "train[:1%]" \
    --max_length 128 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-4 \
    --bits bf16 \
    --r 128 \
    --psoft_alpha 128 \
    --target_modules q_proj v_proj
```
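As a rough guide, the PSOFT-specific flags above correspond to the following Python-API configuration. This mapping is assumed from the flag names rather than taken from the script, and the script may handle dtype, tokenization, and training details differently:

```python
import torch
from peft import PsoftConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model loaded in bf16, mirroring `--bits bf16` above.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B", torch_dtype=torch.bfloat16
)

# Mirrors `--r 128 --psoft_alpha 128 --target_modules q_proj v_proj`.
psoft_config = PsoftConfig(
    r=128,
    psoft_alpha=128,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, psoft_config)
model.print_trainable_parameters()
```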
Some practical guidance for choosing PSOFT hyperparameters:

- Rank `r`: smaller ranks (e.g., 32–128) are suitable for simpler tasks, while larger ranks (e.g., 64–256) provide greater expressiveness for more complex tasks at the cost of increased parameters and computation.
- Learning rate: a moderate range (e.g., 1e-4 to 5e-3) works well for stable training.
- The `lowrank` option is more memory- and compute-efficient than `full`, making it more suitable for large models.
- Where an iterative approximation is used, a small number of iterations (e.g., 5) usually provides a good balance between accuracy and efficiency.

If you use PSOFT in your work, please cite:

```bibtex
@inproceedings{wu2026efficient,
  title={Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation},
  author={Wu, Fei and Hu, Jia and Min, Geyong and Wang, Shiqiang},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=FSHrinMArK}
}
```