A Recipe for the Wall Street Journal (WSJ) corpus.

The WSJ corpus consists of about 80 hours of read sentences taken from the Wall Street Journal. It can be purchased from the LDC as two releases, WSJ0 (LDC93S6A) and WSJ1 (LDC94S13A).

In these experiments, we use three subsets following the Kaldi WSJ recipe:

  • train: 37416 utterances, referred to as si284 in Kaldi
  • dev: 503 utterances, referred to as nov93dev in Kaldi
  • test: 333 utterances, referred to as nov92 in Kaldi
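
Once the data has been prepared, the subset sizes above can serve as a quick sanity check. The sketch below is not part of the recipe. It assumes that each generated list file (for example lists/si284.lst) holds one utterance per non-empty line, which this README does not state, and the names EXPECTED_UTTERANCES, count_utterances and check_subsets are hypothetical.

```python
from pathlib import Path

# Subset sizes quoted in this README, keyed by the Kaldi subset name.
EXPECTED_UTTERANCES = {
    "si284": 37416,    # train
    "nov93dev": 503,   # dev
    "nov92": 333,      # test
}


def count_utterances(list_file):
    """Count non-empty lines (assumption: one utterance per line)."""
    with Path(list_file).open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_subsets(lists_dir):
    """Return {subset: (found, expected)} for every subset whose count differs."""
    mismatches = {}
    for name, expected in EXPECTED_UTTERANCES.items():
        found = count_utterances(Path(lists_dir) / f"{name}.lst")
        if found != expected:
            mismatches[name] = (found, expected)
    return mismatches
```

An empty result from check_subsets("[...]/lists") would mean the three subsets match the counts listed above, provided the one-utterance-per-line assumption holds.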

Wav2Letter models

| NOV93DEV WER % / LER % | NOV92 WER % / LER % | MODEL | PAPER |
|---|---|---|---|
| 9.8 / 7.2 | 5.6 / - | conv_glu | Wav2Letter: an End-to-End ConvNet-based Speech Recognition System, Letter-Based Speech Recognition with Gated ConvNets |

Prerequisites

Here and below, we assume that the complete version of the data has been downloaded; the same steps can be used for the B-version of WSJ.

  • download the WSJ0 and WSJ1 data from the LDC. You will have csr_1_LDC93S6A.tar and csr_2_comp_LDC94S13A.tar.

  • unpack the files

tar -xf csr_1_LDC93S6A.tar
tar -xf csr_2_comp_LDC94S13A.tar

  • download and build sph2pipe, which is used to convert the SPHERE audio files

wget https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/ctools/sph2pipe_v2.5.tar.gz
tar -xzf sph2pipe_v2.5.tar.gz && cd sph2pipe_v2.5
gcc -o sph2pipe *.c -lm

Preparation of audio and text data

To prepare the audio and text data for training/evaluation, run the command below, replacing each [...] with the necessary path. If you are using the B-version of WSJ, call it with --wsj1_type LDC94S13B instead.

python3 prepare.py --wsj0 [...]/csr_1 --wsj1 [...]/csr_2_comp --sph2pipe [...]/sph2pipe_v2.5/sph2pipe --dst [...] --wsj1_type LDC94S13A
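
When the preparation step is scripted, the same call can be assembled as an argument list. The following is a minimal sketch that uses only the flags shown in the command above. The helper names build_prepare_command and run_prepare and the b_version switch are hypothetical, and it assumes prepare.py is in the current working directory.

```python
import subprocess


def build_prepare_command(wsj0, wsj1, sph2pipe, dst, b_version=False):
    """Assemble the prepare.py call shown above as an argument list."""
    # LDC94S13A is the complete WSJ1 release, LDC94S13B the B-version.
    wsj1_type = "LDC94S13B" if b_version else "LDC94S13A"
    return [
        "python3", "prepare.py",
        "--wsj0", str(wsj0),
        "--wsj1", str(wsj1),
        "--sph2pipe", str(sph2pipe),
        "--dst", str(dst),
        "--wsj1_type", wsj1_type,
    ]


def run_prepare(wsj0, wsj1, sph2pipe, dst, b_version=False):
    """Run prepare.py; raises CalledProcessError if it exits with an error."""
    cmd = build_prepare_command(wsj0, wsj1, sph2pipe, dst, b_version)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.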

The following structure will be generated:

tree -L 2
.
├── audio
│   ├── nov92
│   ├── nov92_5k
│   ├── nov93
│   ├── nov93_5k
│   ├── nov93dev
│   ├── nov93dev_5k
│   ├── si284
│   └── si84
├── csr_1
│   ├── 11-10.1
│   ├── ...
│   └── readme.txt
├── csr_1_LDC93S6A.tar
├── csr_2_comp
│   ├── 13-10.1
│   ├── ...
├── csr_2_comp_LDC94S13A.tar
├── lists
│   ├── nov92_5k.lst
│   ├── nov92.lst
│   ├── nov93_5k.lst
│   ├── nov93dev_5k.lst
│   ├── nov93dev.lst
│   ├── nov93.lst
│   ├── si284.lst
│   └── si84.lst
└── text
    ├── lm.txt
    ├── nov92_5k.txt
    ├── nov92.txt
    ├── nov93_5k.txt
    ├── nov93dev_5k.txt
    ├── nov93dev.txt
    ├── nov93.txt
    ├── si284.txt
    └── si84.txt
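
To confirm that the preparation finished, the generated entries in the tree above can be checked programmatically. This is only a sketch: the subset names are copied from the listing, the function names expected_paths and missing_paths are hypothetical, and the check covers only the audio, lists and text directories, not the unpacked csr_1 and csr_2_comp inputs.

```python
from pathlib import Path

# Subset names as they appear in the tree listing above.
SUBSETS = [
    "nov92", "nov92_5k", "nov93", "nov93_5k",
    "nov93dev", "nov93dev_5k", "si284", "si84",
]


def expected_paths(dst):
    """List the generated entries shown in the tree, relative to dst."""
    dst = Path(dst)
    paths = [dst / "text" / "lm.txt"]
    for subset in SUBSETS:
        paths.append(dst / "audio" / subset)            # audio directory
        paths.append(dst / "lists" / f"{subset}.lst")   # list file
        paths.append(dst / "text" / f"{subset}.txt")    # text file
    return paths


def missing_paths(dst):
    """Return the expected entries that do not exist under dst."""
    return [p for p in expected_paths(dst) if not p.exists()]
```

An empty list from missing_paths("[...]") indicates that every entry from the listing is present in the destination directory.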