Formant-based speech synthesizer employing demi-syllable concatenation with independent cross fade in the filter parameter and source domains

Status: Inactive
Publication Date: 2006-10-10
PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA

AI Technical Summary

Benefits of technology

[0006] Our synthesis technique involves identifying and extracting the formants from an actual speech signal (labeled to identify approximate demi-syllable areas) and then using this information to construct demi-syllable segments, each represented by a set of filter parameters and a source signal waveform. The invention provides a novel cross fade technique to smoothly concatenate consecutive demi-syllable segments. Unlike conventional blending techniques, our system allows us to perform cross fade in the filter parameter domain while simultaneously but independently performing “cross fade” (parameter interpolation) of the source waveforms in the time domain. The filter parameters model vocal tract effects, while the source waveforms model the glottal source. The technique has the advantage of restricting prosodic modification to only the glottal source, if desired. This can reduce the distortion usually associated with conventional blending techniques.
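The independence of the two cross fades can be illustrated with a minimal Python sketch. This is an illustration only, not the patented implementation: the linear ramp, the function names (`linear_ramp`, `crossfade_source`, `crossfade_filter_params`), the frame layout (one list of formant values per frame), and the overlap lengths are all assumptions made for the example.

```python
def linear_ramp(n):
    """Return n weights rising linearly from 0.0 to 1.0 (assumed fade shape, n >= 2)."""
    return [i / (n - 1) for i in range(n)]


def crossfade_source(left, right, overlap):
    """Time-domain cross fade of two source waveforms.

    The last `overlap` samples of `left` are blended with the first
    `overlap` samples of `right`.
    """
    w = linear_ramp(overlap)
    start = len(left) - overlap
    blended = [(1 - w[i]) * left[start + i] + w[i] * right[i]
               for i in range(overlap)]
    return left[:start] + blended + right[overlap:]


def crossfade_filter_params(left_frames, right_frames, overlap):
    """Parameter-domain cross fade: interpolate filter parameters frame by frame.

    Each frame is a list of filter parameters (here, hypothetical formant
    frequencies in Hz). The overlap is counted in frames, not samples, and
    can be chosen independently of the source overlap.
    """
    w = linear_ramp(overlap)
    start = len(left_frames) - overlap
    blended = []
    for i in range(overlap):
        a, b = left_frames[start + i], right_frames[i]
        blended.append([(1 - w[i]) * x + w[i] * y for x, y in zip(a, b)])
    return left_frames[:start] + blended + right_frames[overlap:]


# Source waveforms are joined with a 3-sample overlap ...
source = crossfade_source([1.0] * 4, [0.0] * 4, overlap=3)

# ... while the filter parameters are joined with their own 3-frame overlap.
frames = crossfade_filter_params(
    [[500.0, 1500.0]] * 3, [[700.0, 1100.0]] * 3, overlap=3)
```

Because the two functions share no state, the source overlap and the filter-parameter overlap can differ in length and position, which is the property the paragraph above describes as performing the two cross fades simultaneously but independently.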

Problems solved by technology

However, for larger vocabularies it is not feasible to store complete word samples of actual human speech.
Unfortunately, when concatenating sub-word units, speech synthesis must confront several very difficult problems.
However, such versatile sub-word units often do not concatenate well.
During playback of concatenated sub-word units, there is often a very noticeable distortion or glitch where the sub-word units are joined.
Also, because the sub-word units must be modified in pitch and duration to realize the intended prosodic pattern, current techniques for making these modifications usually introduce distortion.
Finally, since most speech segments are influenced strongly by neighboring segments, there is not a simple set of concatenation units (such as phonemes or diphones) which can adequately represent human speech.
A number of speech synthesists have suggested various solutions to the above concatenation problems, but so far no one has successfully solved the problem.
Human speech generates complex time-varying waveforms that defy simple signal processing solutions.
The syllable is a natural unit for this purpose, but choosing the syllable requires a large amount of memory.



Embodiment Construction

[0017] While many speech synthesis models have been proposed in the past, most share the following two-component signal processing structure. As shown in FIG. 1, speech can be modeled as an initial source component 10 processed through a subsequent filter component 12.

[0018] Depending on the model, either the source or the filter, or both, can be very simple or very complex. For example, one earlier form of speech synthesis concatenated highly complex PCM (Pulse Code Modulated) waveforms as the source and used a very simple (unity gain) filter. In the PCM synthesizer, all a priori knowledge was embedded in the source and none in the filter. By comparison, another synthesis method used a simple repeating pulse train as the source and a comparatively complex filter based on LPC (Linear Predictive Coding). Note that neither of these conventional synthesis techniques attempted to model the physical structures within the human vocal tract that are responsible for producing human speech.
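The two-component structure can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not code from the patent: the function names are hypothetical, and the generic two-pole resonator stands in for "a comparatively complex filter" only to show where the modeling effort sits. It is not an LPC filter and not the filter used by the invention.

```python
import math


def pulse_train(n_samples, period):
    """Simple source: a unit impulse every `period` samples."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def unity_filter(signal):
    """Trivial filter: unity gain, the signal passes through unchanged."""
    return list(signal)


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Generic two-pole resonator: y[n] = x[n] + b1*y[n-1] + b2*y[n-2].

    The pole radius is set from the bandwidth and the pole angle from the
    center frequency (both are assumed example parameters).
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    b2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Complex source, simple filter: a stored waveform passed through unity gain.
stored_waveform = [0.0, 0.3, 0.8, 0.2, -0.5, -0.1]  # made-up sample values
pcm_style = unity_filter(stored_waveform)

# Simple source, more complex filter: a pulse train shaped by a resonance.
pulse_style = resonator(pulse_train(8, 4), freq_hz=500.0,
                        bandwidth_hz=80.0, sample_rate=8000.0)
```

In both cases the output is the source passed through the filter; the two conventional approaches differ only in which of the two components carries the knowledge about the speech signal.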

[0019...


Abstract

The concatenative speech synthesizer employs demi-syllable subword units to generate speech. The synthesizer is based on a source-filter model that uses source signals that correspond closely to the human glottal source and filter parameters that correspond closely to the human vocal tract. Concatenation of the demi-syllable units is facilitated by two separate cross fade techniques, one applied in the time domain to the demi-syllable source signal waveforms, and one applied in the frequency domain by interpolating the corresponding filter parameters of the concatenated demi-syllables. The dual cross fade technique results in natural sounding synthesis that avoids time-domain glitches without degrading or smearing characteristic resonances in the filter domain.

Description

BACKGROUND AND SUMMARY OF THE INVENTION

[0001] The present invention relates generally to speech synthesis and more particularly to a concatenative synthesizer based on a source-filter model in which the source signal and filter parameters are generated by independent cross fade mechanisms.

[0002] Modern day speech synthesis involves many tradeoffs. For limited vocabulary applications, it is usually feasible to store entire words as digital samples to be concatenated into sentences for playback. Given a good prosody algorithm to place the stress on the appropriate words, these systems tend to sound quite natural, because the individual words can be accurate reproductions of actual human speech. However, for larger vocabularies it is not feasible to store complete word samples of actual human speech. Therefore, a number of speech synthesists have been experimenting with breaking speech into smaller units and concatenating those units into words, phrases and ultimately sentences.

[0003] Unfor...


Application Information

IPC(8): G06F15/00, G10L13/00, G10L13/04, G10L13/06
CPC: G10L13/07
Inventors: PEARSON, STEVE; KIBRE, NICHOLAS; NIEDZIELSKI, NANCY
Owner PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA