Aligning Text and Phonemes for Speech Technology Applications Using an EM-Like Algorithm |
| |
Authors: | R. I. Damper (1), Y. Marchand (2), J.-D. S. Marsters (3), A. I. Bazin (3) |
| |
Institution: | (1) Image, Speech and Intelligent Systems (ISIS) Research Group, School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK; (2) Institute for Biodiagnostics (Atlantic), National Research Council Canada, Neuroimaging Research Laboratory, 1796 Summer Street, Suite 3900, Halifax, Nova Scotia, Canada, B3H 3A7; (3) Image, Speech and Intelligent Systems (ISIS) Research Group, School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK |
| |
Abstract: | A common requirement in speech technology is to align two different symbolic representations of the same linguistic ‘message’.
For instance, we often need to align letters of words listed in a dictionary with the corresponding phonemes specifying their
pronunciation. As dictionaries become ever bigger, manual alignment becomes less and less tenable, yet automatic alignment
is a hard problem for a language like English. In this paper, we describe the use of a form of the expectation-maximization
(EM) algorithm to learn alignments of English text and phonemes, starting from a variety of initializations. We use the British
English Example Pronunciation (BEEP) dictionary of almost 200,000 words in this work. The quality of alignment is difficult
to determine quantitatively since no ‘gold standard’ correct alignment exists. We evaluate the success of our algorithm indirectly
from the performance of a pronunciation by analogy system using the aligned dictionary data as a knowledge base for inferring
pronunciations. We find excellent performance—the best so far reported in the literature. There is very little dependence
on the start point for alignment, indicating that the EM search space is strongly convex. Since the aligned BEEP dictionary
is a potentially valuable resource, it is made freely available for research use. |
| |
Keywords: | text-to-speech synthesis; string alignment; dynamic programming; EM algorithm; pronunciation by analogy |
This article is indexed in SpringerLink and other databases. |
|
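
The EM-like alignment idea summarised in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a letter-phoneme association table initialised from naive co-occurrence counts, a dynamic-programming step that pads each phoneme string with a null symbol so that every letter receives exactly one symbol, and a re-estimation step that recounts associations from the current alignments. The names `NULL`, `best_alignment`, and `em_align` are hypothetical, and the sketch only handles words with at least as many letters as phonemes.

```python
NULL = "-"  # hypothetical null-phoneme symbol (an assumption, not from the paper)


def best_alignment(letters, phones, assoc):
    """Pad `phones` with NULLs so each letter gets one symbol,
    maximising the summed association scores (dynamic programming)."""
    n, m = len(letters), len(phones)
    if m > n:
        raise ValueError("sketch assumes no more phonemes than letters")
    neg = float("-inf")
    # score[i][j]: best score aligning the first i letters with the first j phonemes
    score = [[neg] * (m + 1) for _ in range(n + 1)]
    took_phone = [[False] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(0, min(i, m) + 1):
            # Option 1: letter i-1 maps to the null phoneme.
            if score[i - 1][j] > neg:
                s = score[i - 1][j] + assoc.get((letters[i - 1], NULL), 0.0)
                if s > score[i][j]:
                    score[i][j], took_phone[i][j] = s, False
            # Option 2: letter i-1 maps to phoneme j-1.
            if j > 0 and score[i - 1][j - 1] > neg:
                s = score[i - 1][j - 1] + assoc.get((letters[i - 1], phones[j - 1]), 0.0)
                if s > score[i][j]:
                    score[i][j], took_phone[i][j] = s, True
    # Trace back from the full-word cell to recover the aligned phoneme string.
    aligned, i, j = [], n, m
    while i > 0:
        if took_phone[i][j]:
            aligned.append(phones[j - 1])
            j -= 1
        else:
            aligned.append(NULL)
        i -= 1
    return aligned[::-1]


def em_align(lexicon, iterations=5):
    """Alternate between aligning every word and recounting associations."""
    # Naive initialisation: every letter is associated with every phoneme of its word.
    assoc = {}
    for letters, phones in lexicon:
        for l in letters:
            for p in phones:
                assoc[(l, p)] = assoc.get((l, p), 0.0) + 1.0
    alignments = []
    for _ in range(iterations):
        # Alignment step: best alignment of each word under the current table.
        alignments = [best_alignment(l, p, assoc) for l, p in lexicon]
        # Re-estimation step: rebuild the table from the alignments just found.
        assoc = {}
        for (letters, _), aligned in zip(lexicon, alignments):
            for l, p in zip(letters, aligned):
                assoc[(l, p)] = assoc.get((l, p), 0.0) + 1.0
    return alignments
```

Calling `em_align` on a list of `(spelling, phoneme_list)` pairs returns one padded phoneme list per word, the same length as its spelling. Summing raw counts rather than log-probabilities is a simplification chosen to keep the sketch short.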