Lexical Normalization

Lexical normalization is the task of translating/transforming non-standard text into a standard register.

Example:

new pix comming tomoroe
new pictures coming tomorrow

Datasets usually consist of tweets, since these naturally contain a fair amount of such phenomena.

For lexical normalization, only word-level replacements are annotated. Some corpora include annotation for 1-N and N-1 replacements; however, word insertion, deletion, and reordering are not part of the task.

LexNorm

The LexNorm corpus was originally introduced by Han and Baldwin (2011). Several annotation mistakes were resolved by Yang and Eisenstein (2013); on this page, we only report results on the new dataset. For this dataset, the 2,577 tweets from Li and Liu (2014) are often used as training data because of their similar annotation style.

This dataset is commonly evaluated with accuracy on the non-standard words, which means the system knows in advance which words need normalization.
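This evaluation setting can be sketched as follows. The representation below (dicts mapping token positions to normalized forms) is an assumption for illustration, not the dataset's actual file format:

```python
def accuracy(gold_norms, sys_norms):
    """Accuracy over the words annotated as non-standard.

    gold_norms: dict mapping token position -> gold normalization
    sys_norms:  dict mapping token position -> system normalization
    Only positions present in gold_norms are scored, mirroring the
    setting where the system knows which words need normalization.
    """
    correct = sum(1 for i, g in gold_norms.items() if sys_norms.get(i) == g)
    return correct / len(gold_norms)

# Hypothetical run on "new pix comming tomoroe":
gold = {1: "pictures", 2: "coming", 3: "tomorrow"}
sys_out = {1: "pictures", 2: "coming", 3: "tomorow"}
print(accuracy(gold, sys_out))  # 2 of 3 normalizations correct
```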

| Model | Accuracy | Paper / Source | Code |
| --- | --- | --- | --- |
| MoNoise (van der Goot & van Noord, 2017) | 87.63 | MoNoise: Modeling Noise Using a Modular Normalization System | Official |
| Joint POS + Norm in a Viterbi decoding (Li & Liu, 2015) | 87.58* | Joint POS Tagging and Text Normalization for Informal Text | |
| Syllable based (Xu et al., 2015) | 86.08 | Tweet Normalization with Syllables | |
| unLOL (Yang & Eisenstein, 2013) | 82.06 | A Log-Linear Model for Unsupervised Text Normalization | |

* used a slightly different version of the data

LexNorm2015

The LexNorm2015 dataset was introduced for the shared task on lexical normalization hosted at WNUT 2015 (Baldwin et al., 2015). In this dataset, 1-N and N-1 replacements are included in the annotation. The evaluation metrics used are precision, recall, and F1 score. However, these are calculated somewhat unusually:

Precision: out of all normalizations made by the system, how many are correct

Recall: out of all necessary replacements, how many were found

This means that if the system replaces a word that needs normalization but chooses the wrong normalization, it is penalized twice: once in precision and once in recall.
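The double penalty can be made concrete with a short sketch. As above, the dict-based representation is an assumption for illustration; a replacement counts as correct only when the system normalized the right token and produced the gold form:

```python
def precision_recall_f1(gold_norms, sys_norms):
    """Precision, recall, and F1 over word-level normalizations.

    gold_norms / sys_norms: dicts mapping token position -> normalized
    form, for the tokens the gold data / the system chose to normalize.
    A wrong normalization of a token that did need changing counts
    against both precision and recall.
    """
    correct = sum(1 for i, s in sys_norms.items() if gold_norms.get(i) == s)
    precision = correct / len(sys_norms) if sys_norms else 0.0
    recall = correct / len(gold_norms) if gold_norms else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: one correct, one wrong form, one spurious edit.
gold = {1: "pictures", 2: "coming", 3: "tomorrow"}
sys_out = {1: "pictures", 3: "tomorow", 4: "you"}
print(precision_recall_f1(gold, sys_out))  # the wrong form at position 3
                                           # lowers both precision and recall
```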

| Model | F1 | Precision | Recall | Paper / Source | Code |
| --- | --- | --- | --- | --- | --- |
| MoNoise (van der Goot & van Noord, 2017) | 86.39 | 93.53 | 80.26 | MoNoise: Modeling Noise Using a Modular Normalization System | Official |
| Random Forest + novel similarity metric (Jin, 2017) | 84.21 | 90.61 | 78.65 | NCSU-SAS-Ning: Candidate Generation and Feature Engineering for Supervised Lexical Normalization | |

Go back to the README