examples/dna_language_models/dna_lm.ipynb
This notebook demonstrates how to use parameter-efficient fine-tuning (PEFT) techniques from the PEFT library to fine-tune a DNA Language Model (DNA-LM). The fine-tuned DNA-LM is then applied to a task from the Nucleotide Transformer downstream tasks benchmark. PEFT techniques are crucial for adapting large pre-trained models to specific tasks with limited computational resources.
We'll start by importing the required libraries, including the PEFT library and other dependencies.
import torch
import transformers
import peft
import tqdm
import numpy as np
We'll load a pre-trained DNA Language Model, "SpeciesLM", that serves as the base for fine-tuning. This is done using the transformers library from HuggingFace.
The tokenizer and the model come from the paper "Species-aware DNA language models capture regulatory elements and their evolution" (Paper Link, Code Link). The paper introduces a species-aware DNA language model trained on more than 800 species spanning over 500 million years of evolution.
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("gagneurlab/SpeciesLM", revision = "downstream_species_lm")
lm = AutoModelForMaskedLM.from_pretrained("gagneurlab/SpeciesLM", revision = "downstream_species_lm")
lm.eval()
# Pick the best available device; fall back to CPU if no accelerator is present.
if hasattr(torch, "accelerator") and torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator().type
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"
lm.to(device);
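As an optional sanity check (not part of the original notebook), we can run a short, hand-made input through the model to confirm that the tokenizer and the masked language model load correctly. The species prefix and the space-separated 6-mers mirror the preprocessing used later in this notebook.
# Optional sanity check: tokenize a toy input formatted as a species token followed by 6-mers.
example = "candida_glabrata ATGCAA TGCAAT GCAATT"
enc = tokenizer(example, return_tensors="pt").to(device)
with torch.no_grad():
    out = lm(**enc)
print(enc["input_ids"].shape, out.logits.shape)  # (1, seq_len) and (1, seq_len, vocab_size)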
We'll load the nucleotide_transformer_downstream_tasks dataset, which contains 18 downstream tasks from the Nucleotide Transformer paper. This dataset provides a consistent genomics benchmark with binary classification tasks.
from datasets import load_dataset
raw_data_full = load_dataset("InstaDeepAI/nucleotide_transformer_downstream_tasks")
raw_data = raw_data_full.filter(lambda example: example['task'] == 'H3')
We'll use the "H3" subset of this dataset, which contains a total of 13,468 rows in the training data, and 1497 rows in the test data.
raw_data
The dataset consists of three columns: sequence, name, and label. A row in this dataset looks like:
raw_data['train'][0]
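Before training, it can also help to check the class balance of the H3 subset. The following snippet is a small optional addition (not part of the original notebook) that counts the labels in the training split.
# Optional: count how many examples belong to each class in the training split.
from collections import Counter

label_counts = Counter(raw_data["train"]["label"])
print(label_counts)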
We split the provided training data into training and validation sets, and keep the provided test split as our test set.
from datasets import Dataset, DatasetDict

train_valid_split = raw_data['train'].train_test_split(test_size=0.15, seed=42)
train_valid_split = DatasetDict({
    'train': train_valid_split['train'],
    'validation': train_valid_split['test']
})

ds = DatasetDict({
    'train': train_valid_split['train'],
    'validation': train_valid_split['validation'],
    'test': raw_data['test']
})
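As a quick optional check (an addition for illustration), we can print the number of rows in each split after re-splitting.
# Optional: number of rows in each split.
print({split: len(ds[split]) for split in ds})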
Then, we use the tokenizer together with a small utility function, get_kmers, to generate the final inputs and labels. get_kmers produces the overlapping 6-mers that the language model expects: with k=6 and stride=1, the model receives continuous, overlapping subsequences that capture the local context within the biological sequence. Because SpeciesLM is species-aware, each sequence is also prefixed with a species token (here, candida_glabrata) before tokenization. A short usage example follows the function definition below.
def get_kmers(seq, k=6, stride=1):
    return [seq[i:i + k] for i in range(0, len(seq), stride) if i + k <= len(seq)]
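For example, applied to a short toy sequence, get_kmers produces overlapping 6-mers that shift by one base at a time:
# Toy example: overlapping 6-mers from a 9-base sequence.
print(get_kmers("ATGCAATTG"))
# ['ATGCAA', 'TGCAAT', 'GCAATT', 'CAATTG']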
test_sequences = []
train_sequences = []
val_sequences = []

# NOTE: The dataset limit is set to 200 so that the training runs faster.
# Set it to None to use the entire dataset.
dataset_limit = 200

for i in range(0, len(ds['train'])):
    if dataset_limit and i == dataset_limit:
        break
    sequence = ds['train'][i]['sequence']
    sequence = "candida_glabrata " + " ".join(get_kmers(sequence))
    sequence = tokenizer(sequence)["input_ids"]
    train_sequences.append(sequence)

for i in range(0, len(ds['validation'])):
    if dataset_limit and i == dataset_limit:
        break
    sequence = ds['validation'][i]['sequence']
    sequence = "candida_glabrata " + " ".join(get_kmers(sequence))
    sequence = tokenizer(sequence)["input_ids"]
    val_sequences.append(sequence)

for i in range(0, len(ds['test'])):
    if dataset_limit and i == dataset_limit:
        break
    sequence = ds['test'][i]['sequence']
    sequence = "candida_glabrata " + " ".join(get_kmers(sequence))
    sequence = tokenizer(sequence)["input_ids"]
    test_sequences.append(sequence)

train_labels = ds['train']['label']
test_labels = ds['test']['label']
val_labels = ds['validation']['label']

if dataset_limit:
    train_labels = train_labels[0:dataset_limit]
    test_labels = test_labels[0:dataset_limit]
    val_labels = val_labels[0:dataset_limit]
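Since the three loops above are identical apart from the split they process, the same preprocessing could also be written as a small helper. This is just an optional refactor under the same assumptions (species token candida_glabrata, 6-mer preprocessing, optional dataset_limit), not part of the original notebook.
def preprocess_split(split, limit=dataset_limit):
    # Prefix each sequence with the species token, convert it to overlapping 6-mers,
    # tokenize, and return (input_ids, labels) for the requested split.
    rows = ds[split] if limit is None else ds[split].select(range(min(limit, len(ds[split]))))
    input_ids = [
        tokenizer("candida_glabrata " + " ".join(get_kmers(row["sequence"])))["input_ids"]
        for row in rows
    ]
    return input_ids, rows["label"]

# Usage: train_sequences, train_labels = preprocess_split("train")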
Finally, we create a Dataset object for each of our splits.
from datasets import Dataset
train_dataset = Dataset.from_dict({"input_ids": train_sequences, "labels": train_labels})
val_dataset = Dataset.from_dict({"input_ids": val_sequences, "labels": val_labels})
test_dataset = Dataset.from_dict({"input_ids": test_sequences, "labels": test_labels})
Now, we'll fine-tune our DNA Language Model on the training dataset. We add a linear classification head on top of the language model and then train all of the model's parameters.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
import torch
from torch import nn
class DNA_LM(nn.Module):
    def __init__(self, model, num_labels):
        super(DNA_LM, self).__init__()
        self.model = model.bert
        self.in_features = model.config.hidden_size
        self.out_features = num_labels
        self.classifier = nn.Linear(self.in_features, self.out_features)

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True)
        sequence_output = outputs.hidden_states[-1]
        # Use the [CLS] token for classification
        cls_output = sequence_output[:, 0, :]
        logits = self.classifier(cls_output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.out_features), labels.view(-1))

        return (loss, logits) if loss is not None else logits
# Number of classes for your classification task
num_labels = 2
classification_model = DNA_LM(lm, num_labels)
classification_model.to(device);
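To confirm the classification head is wired up correctly, we can optionally run one collated batch through the model and look at the output shapes. This check is an addition for illustration and follows the same path the Trainer will use.
# Optional sanity check: collate two training examples and run a forward pass.
batch = data_collator([train_dataset[i] for i in range(2)]).to(device)
with torch.no_grad():
    loss, logits = classification_model(**batch)
print(loss.item(), logits.shape)  # scalar loss and logits of shape (2, num_labels)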
from transformers import Trainer, TrainingArguments
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_steps=1,
    logging_steps=1,
)
# Initialize Trainer
trainer = Trainer(
    model=classification_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
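If you want accuracy reported on the validation set at every evaluation, you could optionally pass a compute_metrics function when constructing the Trainer. This is a small sketch (not part of the original notebook) that works with the (loss, logits) outputs of our DNA_LM wrapper.
# Hypothetical helper: pass compute_metrics=compute_metrics to the Trainer to log
# validation accuracy at each evaluation.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}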
# Train the model
trainer.train()
# Generate predictions
predictions = trainer.predict(test_dataset)
logits = predictions.predictions
predicted_labels = logits.argmax(axis=-1)
print(predicted_labels)
Then, we define a function to calculate the accuracy from the true and predicted labels.
def calculate_accuracy(true_labels, predicted_labels):
    assert len(true_labels) == len(predicted_labels), "Arrays must have the same length"
    correct_predictions = np.sum(true_labels == predicted_labels)
    accuracy = correct_predictions / len(true_labels)
    return accuracy
accuracy = calculate_accuracy(test_labels, predicted_labels)
print(f"Accuracy: {accuracy:.2f}")
The results aren't great, which we can largely attribute to the small subset of the data (dataset_limit = 200) used for training.
In this section, we demonstrate how to employ parameter-efficient fine-tuning (PEFT) techniques to adapt a pre-trained model for specific genomics tasks using the PEFT library.
We first re-instantiate the classification model with a fresh head, then create a LoraConfig object to configure the LoRA adapters: r sets the rank of the low-rank update matrices, lora_alpha scales their contribution, target_modules selects which attention projections receive adapters, and lora_dropout applies dropout inside the adapters.
# Number of classes for your classification task
num_labels = 2
classification_model = DNA_LM(lm, num_labels)
classification_model.to(device);
from peft import LoraConfig, TaskType
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query", "key", "value"],
    lora_dropout=0.01,
)
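Note that get_peft_model freezes all parameters of the wrapped model except the injected LoRA matrices, so with the configuration above the newly added classification head stays at its random initialization. If you also want the head to be trained, PEFT supports listing it in modules_to_save; this optional variant (it would increase the trainable-parameter count reported later) could look like:
# Optional variant: also train the classification head alongside the LoRA adapters.
peft_config_with_head = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query", "key", "value"],
    lora_dropout=0.01,
    modules_to_save=["classifier"],
)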
from peft import get_peft_model
peft_model = get_peft_model(classification_model, peft_config)
peft_model.print_trainable_parameters()
peft_model
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_steps=1,
    logging_steps=1,
)
# Initialize Trainer
trainer = Trainer(
    model=peft_model.model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
# Train the model
trainer.train()
# Generate predictions
predictions = trainer.predict(test_dataset)
logits = predictions.predictions
predicted_labels = logits.argmax(axis=-1)
print(predicted_labels)
accuracy = calculate_accuracy(test_labels, predicted_labels)
print(f"Accuracy: {accuracy:.2f}")
As we can see, the PEFT model achieves similar performance to the baseline model, demonstrating the effectiveness of PEFT in adapting pre-trained models to specific tasks with limited computational resources.
With PEFT, we only train 442,368 parameters, which is 0.49% of the total parameters in the model. This is a significant reduction in computational cost compared to fine-tuning the entire model.
We can improve the results by using the full dataset (setting dataset_limit to None), fine-tuning for more epochs, or adjusting the hyperparameters (LoRA rank, learning rate, etc.).
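Finally, once you are happy with the adapters, PEFT lets you persist just the small set of LoRA weights rather than a full model checkpoint. A minimal sketch follows; the output path and the base-model reconstruction are illustrative assumptions, not part of the original notebook.
# Save only the LoRA adapter weights (a few MB) to a local directory.
peft_model.save_pretrained("./dna_lm_lora_adapters")

# Later, rebuild the base model and attach the saved adapters.
from peft import PeftModel

base = DNA_LM(AutoModelForMaskedLM.from_pretrained("gagneurlab/SpeciesLM", revision="downstream_species_lm"), num_labels)
restored = PeftModel.from_pretrained(base, "./dna_lm_lora_adapters")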