docs/lora/experiment.html
Here's a Colab notebook for fine-tuning GPT-2 with LoRA on the Tiny Shakespeare dataset.
import torch
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForCausalLM

from labml import lab, monit, tracker
from labml.configs import BaseConfigs, option
from labml.utils.download import download_file
from labml_nn.helpers.device import DeviceConfigs
from labml_nn.lora.gpt2 import GPTModel
The default configs can and will be overridden when we start the experiment (see the launch sketch at the end).
class Trainer(BaseConfigs):
    device: torch.device = DeviceConfigs()
GPT-2 configs
    layer_norm_epsilon: float = 1e-05
    d_model: int = 768
    n_layers: int = 12
    n_heads: int = 12
    n_positions: int = 1024
    vocab_size: int = 50257
Training configs
    epochs: int = 10
    batch_size: int = 32
    learning_rate: float = 1e-4
    context_len: int = 512
LoRA rank (illustrated with a small sketch after initialize below)
    lora_r: int = 32
Dataset
    text: TensorDataset = "tiny_shakespeare"
Huggingface tokenizer
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model: GPTModel
Optimizer
    optimizer: torch.optim.Adam
Cross entropy loss
    loss_func = torch.nn.CrossEntropyLoss()
Dataloader
    data_loader: DataLoader
    def _load_pretrained_weights(self):
Load the Hugging Face model and get its parameters
        hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
        state_dict = hf_model.state_dict()
Transformer embedding and prediction layer parameter mapping (hf → ours)
        mapping = {
            'transformer.wte.weight': 'token_embedding.weight',
            'transformer.wpe.weight': 'position_embedding.weight',
            'transformer.ln_f.weight': 'final_norm.weight',
            'transformer.ln_f.bias': 'final_norm.bias',
            'lm_head.weight': 'lm_head.weight'
        }
Mapping (hf → ours) of decoder layer parameters
        for i in range(12):
            mapping[f'transformer.h.{i}.ln_1.weight'] = f'blocks.{i}.attn_norm.weight'
            mapping[f'transformer.h.{i}.ln_1.bias'] = f'blocks.{i}.attn_norm.bias'
            mapping[f'transformer.h.{i}.attn.c_attn.weight'] = f'blocks.{i}.attn.qkv_projection.weight'
            mapping[f'transformer.h.{i}.attn.c_attn.bias'] = f'blocks.{i}.attn.qkv_projection.bias'
            mapping[f'transformer.h.{i}.attn.c_proj.weight'] = f'blocks.{i}.attn.output_projection.weight'
            mapping[f'transformer.h.{i}.attn.c_proj.bias'] = f'blocks.{i}.attn.output_projection.bias'
            mapping[f'transformer.h.{i}.ln_2.weight'] = f'blocks.{i}.ffn_norm.weight'
            mapping[f'transformer.h.{i}.ln_2.bias'] = f'blocks.{i}.ffn_norm.bias'
            mapping[f'transformer.h.{i}.mlp.c_fc.weight'] = f'blocks.{i}.ffn.linear_in.weight'
            mapping[f'transformer.h.{i}.mlp.c_fc.bias'] = f'blocks.{i}.ffn.linear_in.bias'
            mapping[f'transformer.h.{i}.mlp.c_proj.weight'] = f'blocks.{i}.ffn.linear_out.weight'
            mapping[f'transformer.h.{i}.mlp.c_proj.bias'] = f'blocks.{i}.ffn.linear_out.bias'
Copy the parameters over based on the mapping
        new_state_dict = {}
        for old_key, new_key in mapping.items():
            if old_key in state_dict:
                new_state_dict[new_key] = state_dict[old_key]
The Hugging Face GPT-2 model uses 1D convolution (Conv1D) layers. We need to transpose those weights since we use linear layers; a quick shape check follows this method.
        convo_layers = ([f'blocks.{i}.ffn.linear_in.weight' for i in range(12)] +
                        [f'blocks.{i}.ffn.linear_out.weight' for i in range(12)] +
                        [f'blocks.{i}.attn.qkv_projection.weight' for i in range(12)] +
                        [f'blocks.{i}.attn.output_projection.weight' for i in range(12)])

        for layer in convo_layers:
            new_state_dict[layer] = torch.transpose(new_state_dict[layer], 0, 1)
Load our model. We use strict=False because the state dict does not have the LoRA weights
        missing_keys, unexpected_keys = self.model.load_state_dict(new_state_dict, strict=False)
Make sure that only the LoRA weights are missing, and that there are no unexpected keys
        assert all('lora' in key for key in missing_keys)
        assert not unexpected_keys
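To make the Conv1D point above concrete, here is a quick shape check. This is a hypothetical snippet, not part of the trainer: Hugging Face's Conv1D modules store their weights as [in_features, out_features], while nn.Linear stores [out_features, in_features], which is why the weights are transposed before loading.

from transformers import AutoModelForCausalLM

hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
# Conv1D weights are stored as [in_features, out_features] ...
print(hf_model.transformer.h[0].attn.c_attn.weight.shape)  # torch.Size([768, 2304])
print(hf_model.transformer.h[0].mlp.c_fc.weight.shape)     # torch.Size([768, 3072])
# ... whereas an equivalent nn.Linear(768, 2304) keeps its weight as [2304, 768],
# so these matrices must be transposed before loading them into our linear layers.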
    def initialize(self):
Initialize the GPT-2 model
        self.model = GPTModel(
            layer_norm_epsilon=self.layer_norm_epsilon,
            d_model=self.d_model,
            n_layers=self.n_layers,
            n_heads=self.n_heads,
            n_positions=self.n_positions,
            vocab_size=self.vocab_size,
            r=self.lora_r,
        )
        self.model.to(self.device)
Load pre-trained model weights
        self._load_pretrained_weights()
Initialize the optimizer
        self.optimizer = Adam(self.model.parameters(), lr=self.learning_rate)
Initialize the data loader
        self.data_loader = DataLoader(self.text, batch_size=self.batch_size, shuffle=True)
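The r=self.lora_r passed to GPTModel above is the LoRA rank. As a rough illustration of what the rank controls, here is a minimal sketch of a LoRA linear layer. This is an illustration only, not the implementation in labml_nn.lora.gpt2; the class name, the initialization, and the alpha scaling are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinearSketch(nn.Module):
    # Hypothetical LoRA linear layer: frozen base weight plus a rank-r update.
    def __init__(self, in_features: int, out_features: int, r: int, alpha: int = 32):
        super().__init__()
        # Frozen pre-trained weight (would be loaded from the checkpoint)
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        # Low-rank factors: B @ A has the same shape as weight,
        # but only r * (in_features + out_features) trainable parameters
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at the start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = F.linear(x, self.weight)
        update = F.linear(F.linear(x, self.lora_a), self.lora_b)
        return base + self.scaling * update

A larger lora_r gives the update more capacity at the cost of more trainable parameters; the default here is 32.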
    def run(self):
        for _ in monit.loop(self.epochs):
inputs has shape [batch_size, seq_len]
            for (inputs,) in monit.iterate('Train', self.data_loader):
Move inputs to device
                inputs = inputs.to(self.device)
Call the model with all but the last token (a toy example of this shift follows the training loop)
                logits = self.model(inputs[:, :-1])
Get cross entropy loss
                loss = self.loss_func(logits.reshape(-1, logits.shape[-1]), inputs[:, 1:].reshape(-1))
Make gradients 0
                self.optimizer.zero_grad()
Compute gradients
                loss.backward()
Optimize
                self.optimizer.step()
Log the loss
                tracker.save({'loss': loss})
                tracker.add_global_step()
            tracker.new_line()
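The slicing in the training step above implements standard next-token prediction: the logits at position t are scored against the token at position t + 1. A toy illustration (hypothetical, not part of the trainer):

import torch

tokens = torch.tensor([[11, 22, 33, 44]])  # [batch_size=1, seq_len=4]
model_inputs = tokens[:, :-1]              # [[11, 22, 33]] - fed to the model
targets = tokens[:, 1:]                    # [[22, 33, 44]] - each position's next token
# With logits of shape [batch_size, seq_len - 1, vocab_size], the loss is
# loss_func(logits.reshape(-1, vocab_size), targets.reshape(-1)), exactly as in run().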
Tiny Shakespeare dataset. It will be downloaded from the URL if it is not already present
@option(Trainer.text)
def tiny_shakespeare(c: Trainer):
    path = lab.get_data_path() / 'tiny_shakespeare.txt'
    if not path.exists():
        download_file("https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt", path)
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()

    tokens = c.tokenizer.encode(text)
    num_batches = len(tokens) // (c.batch_size * c.context_len)
    tokens = tokens[:num_batches * c.batch_size * c.context_len]
    input_ids = torch.tensor(tokens).view(-1, c.context_len)
    return TensorDataset(input_ids)
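Finally, to tie this back to the note that the default configs can be overridden when the experiment is started, a launch script along the following lines could be used. This is a sketch following the usual labml pattern; the experiment name and the overridden values are illustrative.

from labml import experiment


def main():
    # Create an experiment (the name is illustrative)
    experiment.create(name="lora_gpt2")
    conf = Trainer()
    # Override some of the default configs declared above
    experiment.configs(conf, {'epochs': 3, 'lora_r': 8})
    conf.initialize()
    # Start the experiment and run the training loop
    with experiment.start():
        conf.run()


if __name__ == '__main__':
    main()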