Tutorials/CNTK_208_Speech_Connectionist_Temporal_Classification.ipynb
This tutorial assumes familiarity with the 10* CNTK tutorials and basic knowledge of data representation in acoustic modelling tasks. It introduces some CNTK building blocks that can be used to train deep networks for speech recognition, using the CTC training criterion as an example.
The CNTK implementation of CTC is based on the paper by A. Graves et al., "Connectionist temporal classification: labeling unsegmented sequence data with recurrent neural networks". CTC is a popular training criterion for sequence learning tasks, such as speech or handwriting recognition. It requires neither segmentation of the training data nor post-processing of the network outputs to convert them to labels. It thereby significantly simplifies training and decoding while achieving state-of-the-art accuracy.
CTC training runs on several sequences in parallel either on GPU or CPU, achieving maximal utilization of the hardware.
First, let us import the necessary libraries, including CNTK, and set up the testing environment.
import os
import cntk as C
import numpy as np
# Select the right target device
import cntk.tests.test_utils
cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)
data_dir = os.path.join("..", "Tests", "EndToEndTests", "Speech", "Data")
print("Current directory {0}".format(os.getcwd()))
if os.path.exists(data_dir):
    if os.path.realpath(data_dir) != os.path.realpath(os.getcwd()):
        os.chdir(data_dir)
        print("Changed to data directory {0}".format(data_dir))
else:
    print("Data directory not available locally. Downloading data.")
    try:
        from urllib.request import urlretrieve
    except ImportError:
        from urllib import urlretrieve
    for dir in ['GlobalStats', 'Features']:
        if not os.path.exists(dir):
            os.mkdir(dir)
    for file in ['glob_0000.scp', 'glob_0000.write.scp', 'glob_0000.mlf', 'state_ctc.list', 'GlobalStats/mean.363', 'GlobalStats/var.363', 'Features/000000000.chunk']:
        if os.path.exists(file):
            print('Already downloaded %s' % file)
        else:
            print('Downloading %s' % file)
            urlretrieve('https://github.com/Microsoft/CNTK/raw/release/2.7/Tests/EndToEndTests/Speech/Data/%s' % file, file)
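CNTK consumes Acoustic Model (AM) training data in HTK/MLF format and typically expects 3 input files: an SCP file that lists the features (glob_0000.scp below), an MLF file with the labels (glob_0000.mlf), and a state list file that maps label names to indices (state_ctc.list).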
CNTK provides flexible and efficient readers, HTKFeatureDeserializer and HTKMLFDeserializer, for acoustic features and labels. These readers follow the convention-over-configuration principle and greatly simplify the training procedure. At the same time, they take care of various optimizations when reading from disk or network, including asynchronous CPU and GPU prefetching, which results in a significant speed-up of model training.
Note: Currently, CTC training expects label and feature inputs of the same sequence length (one label per feature frame), yet the labels don't have to be aligned. An easy way to generate the label file is to distribute the labels uniformly (equally) across the feature frames. Some labels will obviously be misaligned with this setup, but the CTC criterion takes care of that during training; see the original publication for reference.
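The uniform distribution of labels mentioned in the note can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: `uniform_alignment` is a hypothetical helper, not part of CNTK, and it returns one label id per frame rather than writing an actual MLF file.

```python
import numpy as np

def uniform_alignment(label_ids, num_frames):
    """Spread label_ids evenly over num_frames frames (hypothetical helper)."""
    if not label_ids or num_frames < len(label_ids):
        raise ValueError("need at least one frame per label")
    # Frame t gets label index floor(t * L / T), so every label covers
    # roughly T / L consecutive frames.
    frame_to_label = (np.arange(num_frames) * len(label_ids)) // num_frames
    return [label_ids[i] for i in frame_to_label]

# Three labels spread over eight frames.
alignment = uniform_alignment([7, 42, 7], num_frames=8)
print(alignment)  # [7, 7, 7, 42, 42, 42, 7, 7]
```

The boundaries produced this way will rarely match the true acoustic boundaries, which is acceptable here because CTC only uses the order of the labels, not their timing.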
# Type of features/labels and dimensions are application specific
# Here we use rather small dimensional feature and the label set for the sake of keeping the train set compact.
feature_dimension = 33
feature = C.sequence.input((feature_dimension))
label_dimension = 133
label = C.sequence.input((label_dimension))
train_feature_filepath = "glob_0000.scp"
train_label_filepath = "glob_0000.mlf"
mapping_filepath = "state_ctc.list"
try:
    train_feature_stream = C.io.HTKFeatureDeserializer(
        C.io.StreamDefs(speech_feature = C.io.StreamDef(shape = feature_dimension, scp = train_feature_filepath)))
    train_label_stream = C.io.HTKMLFDeserializer(
        mapping_filepath, C.io.StreamDefs(speech_label = C.io.StreamDef(shape = label_dimension, mlf = train_label_filepath)), True)
    train_data_reader = C.io.MinibatchSource([train_feature_stream, train_label_stream], frame_mode = False)
    train_input_map = {feature: train_data_reader.streams.speech_feature, label: train_data_reader.streams.speech_label}
except RuntimeError:
    print ("ERROR: not able to read features or labels")
In this block we first normalize the features and define a model with LSTM Layers. We normalize the input features to zero mean and unit variance by subtracting the mean vector and multiplying by inverse standard deviation, which are stored in separate files.
feature_mean = np.fromfile(os.path.join("GlobalStats", "mean.363"), dtype=float, count=feature_dimension)
feature_inverse_stddev = np.fromfile(os.path.join("GlobalStats", "var.363"), dtype=float, count=feature_dimension)
feature_normalized = (feature - feature_mean) * feature_inverse_stddev
with C.default_options(activation=C.sigmoid):
    z = C.layers.Sequential([
        C.layers.For(range(3), lambda: C.layers.Recurrence(C.layers.LSTM(1024))),
        C.layers.Dense(label_dimension)
    ])(feature_normalized)
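To make the normalization step above concrete, here is a small NumPy sketch on synthetic data. The array `frames` and its statistics are made up for illustration; in the tutorial the mean and inverse standard deviation come from the files in GlobalStats instead of being computed on the fly.

```python
import numpy as np

# Synthetic "features": 1000 frames with 3 dimensions (illustrative only).
rng = np.random.default_rng(0)
frames = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# Per-dimension statistics, analogous to the precomputed global statistics.
mean = frames.mean(axis=0)
inverse_stddev = 1.0 / frames.std(axis=0)

# Same formula as in the tutorial: subtract the mean, multiply by 1/stddev.
normalized = (frames - mean) * inverse_stddev

print(normalized.mean(axis=0))  # close to 0 in every dimension
print(normalized.std(axis=0))   # close to 1 in every dimension
```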
The CTC criterion (loss) function is implemented by a combination of the labels_to_graph and forward_backward functions. These functions are designed to generalize forward-backward, Viterbi-like procedures, which are very common in sequential modelling problems such as speech or handwriting recognition. labels_to_graph converts the input label sequence into a graph representation suitable for a particular forward-backward procedure, and forward_backward performs the procedure itself. Currently, these functions support only CTC, which is their default configuration.
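For intuition about what a CTC forward pass computes, the following is a minimal NumPy sketch of the forward (alpha) recursion from the Graves et al. paper. It is not CNTK's implementation: it handles a single sequence, ignores the delay constraint, and the function name `ctc_neg_log_likelihood` is our own.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, labels, blank):
    """Return -log p(labels | input) for one sequence.

    log_probs: (T, V) array of per-frame log-probabilities (log-softmax).
    labels:    list of label ids, without blanks.
    blank:     id of the blank symbol.
    """
    T = log_probs.shape[0]
    # Extended label sequence: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            candidates = [alpha[t - 1, s]]            # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[t - 1, s - 1])  # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(candidates) + log_probs[t, ext[s]]

    # A valid path ends on the last label or on the trailing blank.
    ends = [alpha[T - 1, S - 1]]
    if S > 1:
        ends.append(alpha[T - 1, S - 2])
    return -np.logaddexp.reduce(ends)

# Two frames, two symbols (0 = "a", 1 = blank), uniform outputs.
# The alignments "aa", "a-" and "-a" each have probability 0.25.
log_probs = np.log(np.full((2, 2), 0.5))
nll = ctc_neg_log_likelihood(log_probs, labels=[0], blank=1)
print(nll)  # -log(0.75), about 0.2877
```

The loss sums over all frame-level alignments that collapse to the label sequence, which is why the training labels only need the right order and not the right timing.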
mbsize = 1024
mbs_per_epoch = 10
max_epochs = 5
criteria = C.forward_backward(C.labels_to_graph(label), z, blankTokenId=132, delayConstraint=3)
err = C.edit_distance_error(z, label, squashInputs=True, tokensToIgnore=[132])
# Learning rate parameter schedule per sample:
# Use 0.01 for the first 3 epochs, followed by 0.001 for the remaining
lr = C.learning_parameter_schedule_per_sample([(3, .01), (1,.001)])
mm = C.momentum_schedule([(1000, 0.9), (0, 0.99)], mbsize)
learner = C.momentum_sgd(z.parameters, lr, mm)
trainer = C.Trainer(z, (criteria, err), learner)
C.logging.log_number_of_parameters(z)
progress_printer = C.logging.progress_print.ProgressPrinter(tag='Training', num_epochs = max_epochs)
for epoch in range(max_epochs):
    for mb in range(mbs_per_epoch):
        minibatch = train_data_reader.next_minibatch(mbsize, input_map = train_input_map)
        trainer.train_minibatch(minibatch)
        progress_printer.update_with_trainer(trainer, with_metric = True)
    print('Trained on a total of ' + str(trainer.total_number_of_samples_seen) + ' frames')
    progress_printer.epoch_summary(with_metric = True)
# Uncomment to save the model
# z.save('CTC_' + str(max_epochs) + 'epochs_' + str(mbsize) + 'mbsize_' + str(mbs_per_epoch) + 'mbs.model')
test_feature_filepath = "glob_0000.write.scp"
test_feature_stream = C.io.HTKFeatureDeserializer(
C.io.StreamDefs(speech_feature = C.io.StreamDef(shape = feature_dimension, scp = test_feature_filepath)))
test_data_reader = C.io.MinibatchSource([test_feature_stream, train_label_stream], frame_mode = False)
test_input_map = {feature: test_data_reader.streams.speech_feature, label: test_data_reader.streams.speech_label}
num_test_minibatches = 2
test_result = 0.0
for i in range(num_test_minibatches):
    test_minibatch = test_data_reader.next_minibatch(mbsize, input_map = test_input_map)
    eval_error = trainer.test_minibatch(test_minibatch)
    test_result = test_result + eval_error
# Average of evaluation errors of all test minibatches
round(test_result / num_test_minibatches,2)
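The evaluation metric used above is an edit distance computed after squashing repeated tokens and ignoring the blank token. The pure-Python sketch below shows that idea on toy strings. It reflects our understanding of the squashInputs and tokensToIgnore options rather than CNTK's actual code; the helper names are hypothetical, and how CNTK normalizes the resulting distance is not shown here.

```python
def squash(tokens, ignore):
    """Merge runs of identical tokens, then drop the tokens in `ignore`."""
    out = []
    prev = None
    for tok in tokens:
        if tok != prev and tok not in ignore:
            out.append(tok)
        prev = tok
    return out

def edit_distance(a, b):
    """Classic Levenshtein distance with unit costs, one DP row at a time."""
    prev_row = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        row = [i]
        for j, y in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (x != y)))  # substitution
        prev_row = row
    return prev_row[-1]

BLANK = "-"  # stands in for the blank token (id 132 in the tutorial)
hyp = squash(list("1-12-"), {BLANK})
ref = squash(list("-11--122"), {BLANK})
print(hyp, ref)                 # both become ['1', '1', '2']
print(edit_distance(hyp, ref))  # 0
```

Because the blank separates the two "1" tokens before it is dropped, a genuinely repeated label survives squashing, while frame-level repetitions of the same label are merged.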