This model was released on 2020-03-02 and added to Hugging Face Transformers on 2020-11-16.

PhoBERT

Overview

The PhoBERT model was proposed in PhoBERT: Pre-trained language models for Vietnamese by Dat Quoc Nguyen and Anh Tuan Nguyen.

The abstract from the paper is the following:

We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.

This model was contributed by dqnguyen. The original code can be found here.

Usage example

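```python
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

# The input text must already be word-segmented:
# syllables of a multi-syllable word are joined with "_".
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

# Encode the sentence, add a batch dimension, and place it on the model's device.
input_ids = torch.tensor([tokenizer.encode(line)]).to(phobert.device)

with torch.no_grad():
    # The output is a model output object; features.last_hidden_state
    # holds one embedding per input token.
    features = phobert(input_ids)
```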
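The example above expects text in which the syllables of each multi-syllable word are joined by underscores and words are separated by single spaces. As a minimal sketch of that format, assuming an external word segmenter has already grouped syllables into words, the input line could be assembled as follows. The helper `to_segmented_text` and the `segmented_words` list are hypothetical names used for illustration only and are not part of Transformers.

```python
def to_segmented_text(words):
    """Join the syllables of each word with "_" and the words with a space.

    `words` is a list of words, each given as a list of syllables.
    This is an illustrative helper, not a Transformers API.
    """
    return " ".join("_".join(syllables) for syllables in words)


# Hypothetical segmenter output: each inner list is one Vietnamese word.
segmented_words = [
    ["Tôi"], ["là"], ["sinh", "viên"], ["trường"],
    ["đại", "học"], ["Công", "nghệ"], ["."],
]

line = to_segmented_text(segmented_words)
print(line)  # Tôi là sinh_viên trường đại_học Công_nghệ .
```

The resulting `line` matches the string passed to the tokenizer in the usage example. The segmentation itself has to come from a dedicated Vietnamese word segmenter, which this sketch does not implement.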

<Tip>

The PhoBERT implementation is the same as BERT's, except for tokenization. Refer to the BERT documentation for information on configuration classes and their parameters. The PhoBERT-specific tokenizer is documented below.

</Tip>

PhobertTokenizer

[[autodoc]] PhobertTokenizer