BERT Embeddings of Chunks of Text

This is the code to get BERT embeddings of chunks for the RETRO model.

from typing import List

import torch
from transformers import BertTokenizer, BertModel

from labml import lab, monit

BERT Embeddings

For a given chunk of text N, this class generates the BERT embedding BERT(N). BERT(N) is the average of the BERT embeddings of all the tokens in N.
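Written as a formula, the definition above might look like the following. This is only a restatement under assumed notation: T (the number of BERT tokens in N) and h_i (the final-layer BERT embedding of the i-th token) are symbols introduced here and do not appear in the original.

```latex
\mathrm{BERT}(N) = \frac{1}{T} \sum_{i=1}^{T} h_i
```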

class BERTChunkEmbeddings:

    def __init__(self, device: torch.device):
        self.device = device

Load the BERT tokenizer from HuggingFace

        with monit.section('Load BERT tokenizer'):
            self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                                           cache_dir=str(
                                                               lab.get_data_path() / 'cache' / 'bert-tokenizer'))

Load the BERT model from HuggingFace

        with monit.section('Load BERT model'):
            self.model = BertModel.from_pretrained("bert-base-uncased",
                                                   cache_dir=str(lab.get_data_path() / 'cache' / 'bert-model'))

Move the model to device

        self.model.to(device)

In this implementation, we do not make chunks with a fixed number of tokens. One of the reasons is that this implementation uses character-level tokens, while BERT uses its own sub-word tokenizer.

So this method truncates the text to make sure there are no partial tokens.

For instance, a chunk could look like "s a popular programming la", with partial words (partial sub-word tokens) at the ends. We strip them off to get better BERT embeddings. As mentioned earlier, this would not be necessary if we broke the text into chunks after tokenizing.
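To see why character-based chunk boundaries leave partial words, here is a minimal sketch. The helper `fixed_char_chunks`, the sample sentence, and the chunk length of 26 are all hypothetical choices for illustration; the code that actually builds the chunks is not shown in this section.

```python
from typing import List


def fixed_char_chunks(text: str, chunk_len: int) -> List[str]:
    """Split text into consecutive chunks of chunk_len characters (the last may be shorter)."""
    return [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]


# Hypothetical sample text and chunk length, chosen only for illustration.
text = "Python is a popular programming language"
chunks = fixed_char_chunks(text, chunk_len=26)

# The chunk boundary falls inside the word "programming",
# so the first chunk ends mid-word and the second starts mid-word.
print(chunks)

# A 26-character window that starts mid-word reproduces the example chunk.
window = text[8:34]
print(window)
```

Because the boundary is chosen by character count, BERT's sub-word tokenizer would see fragments such as "progra" or "la" that were never whole words in the source text.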

    @staticmethod
    def _trim_chunk(chunk: str):

Strip whitespace

        stripped = chunk.strip()

Break into words

        parts = stripped.split()

Remove the first and last pieces

        stripped = stripped[len(parts[0]):-len(parts[-1])]

Remove whitespace

        stripped = stripped.strip()

If the result is empty, return the original string

        if not stripped:
            return chunk

Otherwise, return the stripped string

        else:
            return stripped
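The trimming steps above can be tried in isolation with a standalone sketch like the one below. The function name `trim_chunk` is hypothetical, and the guard for chunks that contain no words is an addition made for this sketch, not part of the original method.

```python
def trim_chunk(chunk: str) -> str:
    """Drop the first and last whitespace-separated pieces of a chunk."""
    stripped = chunk.strip()
    parts = stripped.split()
    # Guard added in this sketch: an empty or whitespace-only chunk has no pieces to trim.
    if not parts:
        return chunk
    # Cut the first piece from the front and the last piece from the back.
    stripped = stripped[len(parts[0]):-len(parts[-1])].strip()
    # Fall back to the original chunk if trimming removed everything.
    return stripped if stripped else chunk


trimmed = trim_chunk("s a popular programming la")
single = trim_chunk("hello")
blank = trim_chunk("   ")
print(repr(trimmed), repr(single), repr(blank))
```

Note that the first and last pieces are dropped even when they happen to be complete words, since the method has no way to tell a whole word from a fragment. Chunks with one or two words come back unchanged because trimming would leave nothing.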

Get BERT(N) for a list of chunks.

    def __call__(self, chunks: List[str]):

We don't need to compute gradients

        with torch.no_grad():

Trim the chunks

            trimmed_chunks = [self._trim_chunk(c) for c in chunks]

Tokenize the chunks with the BERT tokenizer

            tokens = self.tokenizer(trimmed_chunks, return_tensors='pt', add_special_tokens=False, padding=True)

Move the token ids, attention mask, and token types to the device

            input_ids = tokens['input_ids'].to(self.device)
            attention_mask = tokens['attention_mask'].to(self.device)
            token_type_ids = tokens['token_type_ids'].to(self.device)

Evaluate the model

            output = self.model(input_ids=input_ids,
                                attention_mask=attention_mask,
                                token_type_ids=token_type_ids)

Get the token embeddings

            state = output['last_hidden_state']

Calculate the average of the token embeddings. Note that the attention mask is 0 where a token is padding. We get padding tokens because the chunks have different lengths.

            emb = (state * attention_mask[:, :, None]).sum(dim=1) / attention_mask[:, :, None].sum(dim=1)

            return emb
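The masked average used above can be checked on toy tensors without loading BERT. The sketch below is an illustration only: `masked_mean` is a hypothetical helper, and the tensor values are made up so that the result is easy to verify by hand.

```python
import torch


def masked_mean(state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) positions only.

    state has shape [batch, seq_len, d_model];
    attention_mask has shape [batch, seq_len], with 1 for real tokens and 0 for padding.
    """
    # Add a trailing dimension so the mask broadcasts over the embedding dimension.
    mask = attention_mask[:, :, None].to(state.dtype)
    # Zero out padded positions, sum over the sequence, divide by the real token count.
    return (state * mask).sum(dim=1) / mask.sum(dim=1)


# Toy batch: two "chunks", three token positions, embedding size 2.
# The last position of the first chunk is padding and holds junk values.
state = torch.tensor([[[1., 2.], [3., 4.], [100., 100.]],
                      [[2., 2.], [4., 6.], [6., 10.]]])
attention_mask = torch.tensor([[1, 1, 0],
                               [1, 1, 1]])

emb = masked_mean(state, attention_mask)
print(emb)
```

The junk values at the padded position do not affect the first row, because they are multiplied by a mask of 0 before summing. This sketch does not handle a chunk with no real tokens at all, where the denominator would be zero.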

Code to test the BERT embeddings

def _test():

    from labml.logger import inspect

Initialize

    device = torch.device('cuda:0')
    bert = BERTChunkEmbeddings(device)

Sample

    text = ["Replace me by any text you'd like.",
            "Second sentence"]

Check the BERT tokenizer

    encoded_input = bert.tokenizer(text, return_tensors='pt', add_special_tokens=False, padding=True)

    inspect(encoded_input, _expand=True)

Check the BERT model outputs

    output = bert.model(input_ids=encoded_input['input_ids'].to(device),
                        attention_mask=encoded_input['attention_mask'].to(device),
                        token_type_ids=encoded_input['token_type_ids'].to(device))

    inspect({'last_hidden_state': output['last_hidden_state'],
             'pooler_output': output['pooler_output']},
            _expand=True)

Check reconstructing the text from token ids

    inspect(bert.tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0]), _n=-1)
    inspect(bert.tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][1]), _n=-1)

Get the chunk embeddings

    inspect(bert(text))

if __name__ == '__main__':
    _test()
