Description of a speech synthesis engine
The aim of this document is to provide an overview of what a speech synthesis
engine is and what its main components are. Hopefully this can lead to a discussion on where mimic should focus and improve. I bet we will need to improve speech reproduction (easy to mute, etc.) and work on the other areas to provide multilingual support to mimic.
This is a rough division of what a speech synthesis engine can be. It may not
be complete, but it still gives an overall view of the system. Not all the
modules mentioned below actually have to be in the flite/mimic code, but it would
be good to identify most of them to get familiar with the code.
This block of the engine deals with the conversion from raw text to a series
of phonemes plus the additional contextual information needed to actually speak.
When we (humans) read, we all read the same text and extract similar information
about which phonemes we should utter, although each one of us has a different
voice (pitch, etc.).
Phrasing: Sentence detection. Splits a whole text into sentences that can be
processed individually. Can provide some pausing information.
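As a rough sketch of what sentence detection might look like (the regex and its limitations are illustrative, not mimic's actual logic):

```python
import re

def split_sentences(text):
    """Split raw text into sentences on ., ! or ? followed by whitespace.

    A toy sketch: a real engine must also handle abbreviations ("Dr."),
    decimal numbers and quotations, which this regex does not.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello there! How are you? I am fine."))
# → ['Hello there!', 'How are you?', 'I am fine.']
```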
Tokenizer: Detects “tokens” in the text; for example, spaces can be used
to split a sentence into individual tokens. Between two sentences the audio
engine could be interrupted for 0.5 seconds without serious concerns, but
an interruption between two consecutive tokens is annoying.
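A whitespace tokenizer of this kind could be sketched as follows (stripping the surrounding punctuation is a simplifying assumption; a real tokenizer would keep it as context, e.g. for pausing):

```python
def tokenize(sentence):
    """Split a sentence into whitespace-separated tokens, stripping
    surrounding punctuation. Purely illustrative: a real tokenizer keeps
    punctuation as context for later stages rather than discarding it."""
    tokens = [chunk.strip('.,;:!?"') for chunk in sentence.split()]
    return [t for t in tokens if t]

print(tokenize("Henry VIII was king."))
# → ['Henry', 'VIII', 'was', 'king']
```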
Token to words: Normalizes the text, for instance turns “1st” to “first”,
or “Henry” “VIII” to “Henry” “the” “eighth”.
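A toy sketch of this normalization step, with tiny hand-written tables standing in for the rule sets a real engine uses (all table entries and the capitalized-name heuristic are illustrative assumptions):

```python
# Illustrative lookup tables; a real engine uses much larger rule sets.
ORDINALS = {"1st": "first", "2nd": "second", "3rd": "third"}
ROMAN_AFTER_NAME = {"II": "the second", "III": "the third", "VIII": "the eighth"}

def token_to_words(token, prev_token=None):
    """Expand one token into pronounceable words."""
    if token in ORDINALS:
        return [ORDINALS[token]]
    # Roman numerals after a capitalized name are read as "the Nth".
    if prev_token and prev_token[0].isupper() and token in ROMAN_AFTER_NAME:
        return ROMAN_AFTER_NAME[token].split()
    return [token]

print(token_to_words("VIII", prev_token="Henry"))
# → ['the', 'eighth']
```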
Part of speech tagger (POS tagger): Helps to distinguish homographs,
words that are spelled the same but may have different meanings and pronunciations. Not crucial, but helpful in some languages.
Words to phonemes: Provides a phonetic transcription of the words. This
transcription may be based on a dictionary (Lexicon) and/or on a decision tree
able to predict the pronunciation of unknown words based on how they are written.
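A minimal sketch of the lexicon-plus-fallback idea, using an ARPAbet-like toy lexicon and a naive per-letter fallback in place of a trained letter-to-sound tree (all entries here are illustrative):

```python
# Toy lexicon with ARPAbet-like phonemes; entries are illustrative.
LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "first": ["F", "ER1", "S", "T"],
}

# Crude per-letter fallback for unknown words. Real engines instead train
# a letter-to-sound decision tree on the lexicon itself.
LETTER_TO_PHONE = {"a": "AE", "e": "EH", "i": "IH", "o": "AA", "u": "AH"}

def word_to_phonemes(word):
    """Lexicon lookup with a naive letter-by-letter fallback."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_TO_PHONE.get(ch, ch.upper()) for ch in word if ch.isalpha()]

print(word_to_phonemes("hello"))
# → ['HH', 'AH0', 'L', 'OW1']
```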
Phoneme duration prediction: Not all phonemes in all contexts have
the same length. Some speech synthesis models capture that information directly
from recordings, whereas other speech synthesis models may expect more information
from the text parser.
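One simple way to picture duration prediction is a base-duration table plus contextual adjustments; the numbers and the single phrase-final feature below are made up for illustration (real models condition on many more contextual features):

```python
# Hypothetical mean durations in seconds per phoneme (illustrative values).
BASE_DURATION = {"AY": 0.14, "L": 0.07, "K": 0.08, "B": 0.06, "IH": 0.09}

def predict_duration(phoneme, phrase_final=False):
    """Base duration from a table, lengthened ~40% before a pause.
    A sketch only: real models use many contextual features, not one."""
    d = BASE_DURATION.get(phoneme, 0.08)
    return d * 1.4 if phrase_final else d
```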
Intonation: Every sentence has prosody, not only questions like "How do you do?"
but also statements like “I like beer.” or “I like beer if it’s after work”. The prosody is
the “melody” of the sentence. Some speech synthesis models capture that
information from recordings; other models may require that information from the text.
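A toy model of intonation might assign each phoneme a target pitch (F0): gently declining for statements, rising at the end for questions. The shapes and values below are illustrative only:

```python
def f0_contour(n_phonemes, is_question=False, base_hz=120.0):
    """Generate one target pitch (F0, in Hz) per phoneme.
    Illustrative shapes: real prosody models are far richer."""
    contour = []
    for i in range(n_phonemes):
        t = i / max(n_phonemes - 1, 1)       # position in the sentence, 0..1
        if is_question:
            hz = base_hz + 40.0 * t          # rising toward the end
        else:
            hz = base_hz - 20.0 * t          # gradual declination
        contour.append(round(hz, 1))
    return contour

print(f0_contour(3))
# → [120.0, 110.0, 100.0]
```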
The speech synthesis block deals with the generation of the sound wave. The
traditional workflow for speech synthesis training consists of getting some
speech recordings of known, phonetically balanced sentences and using those
recordings to compose a new wave.
There are many methods that differ in the fundamental approach used to build waves
(concatenating segments of audio recordings vs. creating a statistical model for
each contextualized phoneme vs. others). The choice of speech synthesis method
has implications for:
- The latency (computation required to generate speech)
- The memory requirements
- The disk requirements
- The voice intelligibility (whether it is easy to understand)
- The voice naturalness (whether it sounds like a real human)
This is a list of speech synthesis engines I have heard about or worked
with. I could try to find some examples for some of the engines, or make a
chart of the pros and cons of each of them, if you are interested.
Based on concatenation
- diphone based speech synthesis: Phonemes are not always pronounced equally. In a diphone speech synthesis engine, we try to capture all possible pairs of phonemes from recordings and we concatenate those phonemes to create new wave sounds. If the diphone we want is not in our recordings, we use a similar one based on phoneme features. *Sounds a bit like a robot* (read with robot voice)
- clunits: Based on the cluster unit selection algorithm (Black 1997).

Based on statistical models
- HTS: Hidden Markov Model speech synthesis. Instead of concatenating recorded
  segments, we generate a model of how the recording sounds for each
  contextualized phoneme. A known FOSS engine is hts_engine, and there is
  a version of flite+hts_engine available.
- clustergen: Statistical parametric synthesis engine.
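The concatenative approaches above all depend on joining short recorded units smoothly. A minimal sketch of that join, with plain lists of samples standing in for recorded diphones and a simple linear cross-fade at each boundary (the overlap length and weights are illustrative):

```python
def crossfade_concat(units, overlap=4):
    """Join sample lists, blending `overlap` samples at each boundary.
    A toy version of the smoothing a concatenative engine performs."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)            # fade-in weight for the new unit
            out[-n + i] = out[-n + i] * (1 - w) + unit[i] * w
        out.extend(unit[n:])
    return out

wave_out = crossfade_concat([[1.0] * 6, [0.0] * 6], overlap=2)
print(len(wave_out))
# → 10
```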
This block deals with the steps required to go from a waveform to the actual
sound that comes through the speakers. It can be fairly simple:
- Open audio device
- Write to audio device
But it can get complicated:
- Asynchronous output
- Pause / Resume.
- Pause / Say another thing.
- Repeat the last sentence (“Mycroft, can you repeat please?”)
- Volume control
- Integration with other audio systems (recording/speaking at the same time…)
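As a minimal sketch of the simple case, here is code that renders PCM samples and writes them to a WAV file with Python's standard `wave` module; a real output layer would stream the same buffers to an audio device (ALSA, PulseAudio, etc.) and layer pause/resume, interruption and volume control on top:

```python
import math
import struct
import wave

def write_wav(path, samples, rate=16000):
    """Write mono float samples in [-1, 1] as 16-bit PCM. Writing to a
    file stands in for the 'open device / write to device' steps above."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(frames)

# A 440 Hz test tone, 0.2 s long at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(3200)]
write_wav("tone.wav", tone)
```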