There are three approaches to producing digitized speech:
Approach [*] is space intensive. It works in a limited number of cases, but it has the advantage of producing the most natural sounding speech in a restricted domain.
Approach [*] is more widely applicable and provides an unlimited vocabulary. It is memory intensive, since the diphones (numbering about [tex2html_wrap5910] for English) need to be accessed frequently, but it is not compute intensive. Quality varies widely, from barely intelligible to human-intelligible. This approach has been commercially applied by Apple in the form of MacinTalk-[tex2html_wrap5912] and MacinTalk-[tex2html_wrap5914]. MacinTalk-[tex2html_wrap5916], also known as GalaTea, is fairly memory intensive, but its quality is among the best that has been achieved with this method of synthesis. The principal drawback of diphone synthesis is that the underlying model is fairly restrictive. Though systems like GalaTea achieve a fair amount of intonation, the intonational structure they generate still leaves much to be desired. The model also allows only minimal variations in voice, e.g., in pitch and speech rate, and changing these voice parameters produces a significant deterioration in output.
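To make the idea behind diphone synthesis concrete, the following minimal sketch shows how an arbitrary phoneme sequence can be covered by units drawn from a fixed diphone inventory, which is why the vocabulary is unlimited even though the inventory is finite. It is an illustration only, not the method used by MacinTalk or GalaTea; the function name `to_diphones`, the silence symbol `_`, and the phoneme labels are all assumptions introduced for this example.

```python
from typing import List, Tuple

# Hypothetical symbol for the silence at the edges of an utterance.
SILENCE = "_"


def to_diphones(phonemes: List[str]) -> List[Tuple[str, str]]:
    """Return the diphone units needed to speak a phoneme sequence.

    A diphone spans the transition from one phoneme to the next, so a
    sequence of n phonemes, padded with silence on both sides, is
    covered by n + 1 diphones.
    """
    if not phonemes:
        return []
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    # Pair every symbol with its successor.
    return list(zip(padded, padded[1:]))


if __name__ == "__main__":
    # Illustrative phoneme labels for the word "cat".
    for left, right in to_diphones(["k", "ae", "t"]):
        print(f"{left}-{right}")
```

A synthesizer built this way would look up a stored waveform for each pair and concatenate the results. Because each unit is a fixed recording, this also suggests why the model leaves little room for varying the voice.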
Approach [*], which models the human vocal tract, is compute intensive
but not memory intensive. It is also the most flexible
approach to speech synthesis. Since it is based on a
mathematical model of the human vocal tract, it permits a large number
of variations in voice quality
(see [Her91][Her90][Her89][Kla87] for details). What is more,
it lets us apply to the voice the same kinds of transformations, such as
scaling, that we apply to fonts in the visual setting. This is
particularly useful in conveying complex
information and is exploited in our own work in presenting spoken