Speech synthesis method and system for new tone generation

A speech synthesis and timbre technology, applied to speech synthesis methods and systems for timbre generation. It addresses the high complexity of speech synthesis models and the over-reliance of existing methods on the sound library, with the effects of saving computing cost, offering flexible and diverse methods, and improving application prospects.

Active Publication Date: 2021-05-14
HANGZHOU YIWISE INTELLIGENT TECH CO LTD
Cites: 11; Cited by: 7

AI Technical Summary

Problems solved by technology

[0004] The purpose of the present invention is to solve two problems in the prior art: multi-speaker speech synthesis models have high complexity, and methods for manipulating and generating the timbre of synthesized speech rely too heavily on the sound library. The present invention provides a speech synthesis method and system for new timbre generation that can generate additional new timbres by adjusting a small number of speaker vectors in the speech synthesis model, which makes it convenient to control the timbre of the synthesized speech.



Examples


Detailed description of embodiments

[0032] As shown in Figure 1, the speech synthesis method for new timbre generation of the present invention comprises the following steps:

[0033] Step 1. Obtain the sample text with its corresponding real speech audio and speaker label, convert the real speech audio into a real Mel spectrum, process the sample text to obtain a phoneme sequence, and extract the pronunciation duration of each phoneme in the text;

[0034] Step 2. Build the speech synthesis model for new timbre generation, comprising a speaker embedding layer, a neural network encoder, a duration prediction module, and a decoder; the neural network encoder consists of a phoneme embedding layer and a CBHG module;

[0035] Step 3. Train the speech synthesis model for new timbre generation using the phoneme sequence and the speaker label;

[0036] Step 4. The text to be synthesized, after preprocessing, is used together with the specified speaker label as the input of the speech synthesis model...

Embodiment

[0074] The present invention was tested on a dataset of 46,500 audio clips with corresponding text, covering 8 speakers. The dataset was preprocessed as follows:

[0075] 1) Extract the phoneme files and the corresponding audio, and use the open-source tool Montreal Forced Aligner to extract the pronunciation duration of each phoneme.

[0076] 2) Extract the Mel spectrum of each audio clip, with a window size of 50 milliseconds, a frame shift of 12.5 milliseconds, and 80 dimensions.

[0077] 3) Sum the Mel spectrum extracted from each audio clip over its dimensions to obtain the energy of the Mel spectrum.
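The framing parameters in step 2) and the energy computation in step 3) can be illustrated with a small sketch. The sample rate of 16 kHz is an assumption (the text does not state one), the helper names are hypothetical, and "summing over dimensions" is read here as summing the 80 Mel bins of each frame, which is one plausible interpretation rather than a confirmed detail of the patent.

```python
# Assumption: 16 kHz audio. The source does not state the sample rate.
SAMPLE_RATE = 16000
N_MELS = 80  # Mel dimension given in the text


def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000.0))


WIN_LENGTH = ms_to_samples(50.0)   # 50 ms analysis window
HOP_LENGTH = ms_to_samples(12.5)   # 12.5 ms frame shift


def frame_energy(mel):
    """Sum the Mel bins of each frame to get one energy value per frame.

    `mel` is a list of frames; each frame is a list of N_MELS values.
    """
    return [sum(frame) for frame in mel]


# Toy Mel spectrum with made-up values: 3 frames of 80 bins each.
toy_mel = [[0.5] * N_MELS, [1.0] * N_MELS, [0.0] * N_MELS]
energy = frame_energy(toy_mel)
```

With these assumptions the window spans 800 samples and consecutive frames start 200 samples apart, so adjacent windows overlap by 75%.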

[0078] During training, the text information is encoded and used as the input of the neural network encoder, the speaker label of the audio corresponding to the text is used as the input of the speaker embedding layer, and the speaker vector and the duration-adjusted text encoding are concatenated and ...
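A minimal sketch of the duration adjustment and concatenation described above follows. It assumes the adjustment repeats each phoneme encoding once per frame of its duration, a common length-regulation scheme that the text does not spell out; the function names and toy values are hypothetical.

```python
def regulate_length(encodings, durations):
    """Repeat each phoneme encoding once per frame of its duration.

    Assumption: durations are whole numbers of Mel frames per phoneme.
    """
    if len(encodings) != len(durations):
        raise ValueError("need exactly one duration per phoneme encoding")
    expanded = []
    for vec, n_frames in zip(encodings, durations):
        expanded.extend(list(vec) for _ in range(n_frames))
    return expanded


def concat_speaker(frames, speaker_vec):
    """Append the same speaker vector to every frame-level encoding."""
    return [frame + list(speaker_vec) for frame in frames]


# Toy values: 3 phonemes with 2-dimensional encodings.
phoneme_encodings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]        # frames per phoneme
speaker_vec = [1.0, -1.0]    # output of the speaker embedding layer

decoder_input = concat_speaker(
    regulate_length(phoneme_encodings, durations), speaker_vec
)
```

After expansion the sequence has one entry per Mel frame (6 here), which is what lets the text encoding align in length with the Mel spectrum before decoding.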



Abstract

The invention discloses a speech synthesis method and system for new tone generation, and belongs to the field of speech synthesis. The method comprises the following steps: first, the phoneme pronunciation durations and the Mel spectrum are extracted from text and audio as a training set, and a text encoding aligned in length with the Mel spectrum is learned; the speaker label is passed through an embedding layer to generate a speaker encoding; the speaker encoding and the text encoding are combined, a synthesized Mel spectrum is output through a decoder, and speech is synthesized through a vocoder. The speaker encodings are linearly combined to obtain diverse new speaker encodings, so that speech with a new tone is synthesized. This saves the time and cost of recording a speech training database, and the newly generated tone can be regulated and controlled. The complexity of the model is also reduced, so that a speech synthesis model with the tone generation function can be deployed on hardware with limited computing resources, which facilitates wider application of the model in more scenarios.
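The linear combination of speaker encodings can be sketched as a weighted sum over rows of a trained speaker embedding table. The table values, the function name, and the choice of weights that sum to 1 are illustrative assumptions; the abstract only states that the encodings are linearly combined.

```python
def mix_speakers(embedding_table, weights):
    """Return the weighted sum of the selected speaker vectors.

    `embedding_table` is a list of speaker vectors indexed by speaker id.
    `weights` maps speaker id -> mixing weight.
    """
    dim = len(embedding_table[0])
    mixed = [0.0] * dim
    for speaker_id, weight in weights.items():
        for i, value in enumerate(embedding_table[speaker_id]):
            mixed[i] += weight * value
    return mixed


# Made-up embedding table for 3 trained speakers, 3 dimensions each.
speaker_table = [
    [1.0, 0.0, 2.0],
    [0.0, 4.0, -2.0],
    [2.0, 2.0, 2.0],
]

# A new voice halfway between speaker 0 and speaker 1.
new_voice = mix_speakers(speaker_table, {0: 0.5, 1: 0.5})
```

The resulting vector would stand in for a trained speaker vector at synthesis time, and changing the weights changes the generated timbre without recording new training data.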

Description

Technical field

[0001] The invention belongs to the field of speech synthesis, and relates to a speech synthesis method and system for timbre generation.

Background

[0002] In recent years, with the development of deep learning, speech synthesis technology has improved greatly. Speech synthesis has moved from the traditional parametric and concatenative methods to end-to-end methods. These usually use an encoder-attention-decoder mechanism for autoregressive generation: to generate the current data point, all previous data points in the time series must first be generated and fed back as model input. Examples include Tacotron, Tacotron 2, Deep Voice 3, ClariNet, and Transformer TTS. Although autoregressive models can produce satisfactory results, if the alignment produced by the attention mechanism is not good enough, the synthesized speech may contain repeated or missing words.

[0003] With the development of speech synthesis technology, people h...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L13/08, G10L25/30, G06N3/08
CPC: G10L13/08, G10L25/30, G06N3/08
Inventor: 盛乐园
Owner: HANGZHOU YIWISE INTELLIGENT TECH CO LTD