A method and an apparatus for improved duration modeling of phonemes in a
speech synthesis system are provided. According to one aspect, text is received into a processor of a
speech synthesis system. The received text is processed using a sum-of-products phoneme duration model that is used in either the
formant method or the concatenative method of
speech generation. The phoneme duration model, which is used along with a phoneme
pitch model, is produced by developing a non-exponential functional transformation form for use with a
generalized additive model. The non-exponential functional transformation form comprises a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration. The minimum and maximum phoneme durations are observed in training data. The received text is processed by specifying at least one of a number of contextual factors for the
generalized additive model. An inverse of the non-exponential functional transformation is applied to the duration observations, that is, the training data. Coefficients are generated for use with the
generalized additive model. The generalized
additive model comprising the coefficients is applied to at least one phoneme of the received text, resulting in the generation of at least one phoneme having a duration. An acoustic sequence comprising speech signals that are representative of the received text is then generated.
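
By way of illustration only, the steps summarized above (applying the inverse transformation to the duration observations, generating coefficients, and applying the generalized additive model) may be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The specific sinusoidal form, the exponents `alpha` and `beta`, the backfitting procedure used to obtain the coefficients, the factor names, and the toy observations are all hypothetical and are not taken from the text above. For brevity the sketch uses main effects only and omits the product terms of a sum-of-products model.

```python
import math

def transform(s, d_min, d_max, alpha=1.0, beta=1.0):
    """Map a model score s to a duration in [d_min, d_max] (assumed sinusoidal form)."""
    s = min(max(s, 0.0), 1.0)  # clamp so the result stays inside the observed range
    return d_min + (d_max - d_min) * math.sin(0.5 * math.pi * s ** alpha) ** beta

def inverse_transform(d, d_min, d_max, alpha=1.0, beta=1.0):
    """Map an observed duration back onto the additive model's score scale."""
    u = (d - d_min) / (d_max - d_min)
    return ((2.0 / math.pi) * math.asin(u ** (1.0 / beta))) ** (1.0 / alpha)

def fit(observations, n_iter=20):
    """Fit additive coefficients by simple backfitting (an assumed estimation method).

    observations: list of (factors, duration) pairs, where factors is a dict
    mapping a contextual factor name to its level.
    """
    durations = [d for _, d in observations]
    d_min, d_max = min(durations), max(durations)
    if d_min == d_max:
        raise ValueError("need at least two distinct durations")
    # Apply the inverse transformation to the training durations.
    targets = [inverse_transform(d, d_min, d_max) for d in durations]
    intercept = sum(targets) / len(targets)
    effects = {}  # (factor name, level) -> additive coefficient
    for _ in range(n_iter):
        for name in observations[0][0]:
            sums, counts = {}, {}
            for (factors, _), t in zip(observations, targets):
                # Residual after removing the intercept and all other factors.
                others = sum(effects.get((k, v), 0.0)
                             for k, v in factors.items() if k != name)
                level = factors[name]
                sums[level] = sums.get(level, 0.0) + t - intercept - others
                counts[level] = counts.get(level, 0) + 1
            for level in sums:
                effects[(name, level)] = sums[level] / counts[level]
    return d_min, d_max, intercept, effects

def predict(model, factors):
    """Sum the coefficients for the given factors, then transform to a duration."""
    d_min, d_max, intercept, effects = model
    score = intercept + sum(effects.get((k, v), 0.0) for k, v in factors.items())
    return transform(score, d_min, d_max)

# Toy observations invented for this sketch (durations in milliseconds).
toy_observations = [
    ({"stress": "yes", "position": "final"}, 180.0),
    ({"stress": "no", "position": "final"}, 120.0),
    ({"stress": "yes", "position": "medial"}, 100.0),
    ({"stress": "no", "position": "medial"}, 60.0),
]
model = fit(toy_observations)
predicted = predict(model, {"stress": "yes", "position": "final"})
```

Because the transformation is bounded by the minimum and maximum observed durations, a predicted duration in this sketch cannot fall outside the range seen in the training data, whatever the sum of the coefficients.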