Speech synthesis

There are three approaches to producing digitized speech:

Concatenative: Concatenate digitized utterances produced by a human to make up canned messages.
Diphone: Use a library of diphones obtained by sampling a large number of utterances spoken by a human.
Formant: Model the human vocal tract with a series of cascaded filters that produce the right waveforms and hence intelligible speech.

The concatenative approach is space intensive. It works in only a limited number of cases, but it has the advantage of producing the most natural-sounding speech within a restricted domain.
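The idea of building canned messages from recorded utterances can be sketched as follows. This is a minimal illustration, not a description of any particular system: the RECORDINGS table, the concatenate_message function and the gap parameter are hypothetical names, and short integer lists stand in for real digitized audio.

```python
# Hypothetical library of prerecorded utterances. Each entry stands in for
# digitized audio samples (real recordings would be far longer).
RECORDINGS = {
    "the time is": [1, 2, 3],
    "ten": [4, 5],
    "thirty": [6, 7, 8],
}


def concatenate_message(parts, recordings=RECORDINGS, gap=2):
    """Join prerecorded utterances, with `gap` silent samples between them."""
    samples = []
    for i, part in enumerate(parts):
        if part not in recordings:
            # The restricted domain: anything not recorded cannot be spoken.
            raise KeyError(f"no recording for {part!r}")
        if i > 0:
            samples.extend([0] * gap)
        samples.extend(recordings[part])
    return samples


message = concatenate_message(["the time is", "ten", "thirty"])
```

The sketch makes both properties visible: every phrase that may ever be spoken has to be stored in full, and a request for an unrecorded phrase simply fails.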

The diphone approach is more widely applicable and provides an unlimited vocabulary. It is memory intensive, since the diphones (numbering about [tex2html_wrap5910] for English) need to be accessed frequently, but it is not compute intensive. Quality varies widely, from barely intelligible to human-intelligible. Apple has applied this approach commercially in MacinTalk-[tex2html_wrap5912] and MacinTalk-[tex2html_wrap5914]. MacinTalk-[tex2html_wrap5916], also known as GalaTea, is fairly memory intensive, but its quality is among the best that has been achieved with this method of synthesis. The principal drawback of diphone synthesis is that the underlying model is fairly restrictive. Though systems like GalaTea achieve a fair amount of intonation, the intonational structure they generate still leaves much to be desired. The model also allows only minimal variations in voice, e.g., pitch and speech rate; changing voice parameters significantly degrades the output.
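A rough sketch of how a diphone synthesizer assembles an utterance is given below. It assumes a diphone is the transition from one phoneme into the next, with a silence marker at either end; the names to_diphones, synthesize and TOY_LIBRARY are hypothetical, the phoneme labels are illustrative, and the one- or two-sample entries stand in for recorded audio segments.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone names spanning a phoneme sequence.

    Each diphone covers the transition between two adjacent units;
    the utterance is padded with silence at both ends.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]


def synthesize(phonemes, library):
    """Look up every diphone in the library and join the stored samples."""
    samples = []
    for name in to_diphones(phonemes):
        samples.extend(library[name])
    return samples


# Toy library covering only the word "cat" (illustrative phoneme labels).
TOY_LIBRARY = {"_-k": [1], "k-ae": [2, 3], "ae-t": [4], "t-_": [5]}

units = to_diphones(["k", "ae", "t"])
audio = synthesize(["k", "ae", "t"], TOY_LIBRARY)
```

Every diphone in an utterance triggers one lookup in the library, which is why the whole library has to stay readily accessible, while the computation itself is little more than copying samples.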

The formant approach, which models the human vocal tract, is compute intensive but not memory intensive. It is also the most flexible approach to speech synthesis. Since it is based on a mathematical model of the human vocal tract, it permits a large number of variations in voice quality (see [Her91][Her90][Her89][Kla87] for details). What is more, it allows us to scale and otherwise transform the voice in the same way that we manipulate fonts in the visual setting. This is particularly useful in conveying complex information and is exploited in our own work in presenting spoken mathematics.
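The cascade of filters can be illustrated with a small sketch. It assumes a standard second-order digital resonator per formant, driven by an impulse train standing in for the glottal source; this is a simplification for illustration, not the synthesizer used in this work. The function names, the formant frequencies and the bandwidths below are illustrative assumptions.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz, duration_s, sample_rate):
    """Crude voicing source: one unit impulse per pitch period."""
    period = int(round(sample_rate / pitch_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(pitch_hz, formants, duration_s=0.1, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Assumed formant values roughly in the range of an open vowel.
FORMANTS_A = [(730, 90), (1090, 110), (2440, 170)]

low_voice = synthesize_vowel(100, FORMANTS_A)
high_voice = synthesize_vowel(200, FORMANTS_A)
```

Nothing is stored beyond a handful of numbers per sound, but every output sample costs several multiplications per filter. Changing the voice amounts to changing parameters such as the pitch or the formant frequencies, with no new recordings required.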

TV Raman
Thu Mar 9 20:10:41 EST 1995