Speech synthesis method and system for new tone generation

A speech synthesis and timbre technology, applied in the field of speech synthesis methods and systems for timbre generation. It addresses the high complexity of multi-speaker speech synthesis models and the over-reliance of existing methods on the sound library, with the effects of saving computing cost, offering flexible and diverse methods, and improving application prospects.

Active Publication Date: 2021-05-14

HANGZHOU YIWISE INTELLIGENT TECH CO LTD

11 Cites, 7 Cited by

Problems solved by technology

[0004] The purpose of the present invention is to solve two problems in the prior art: multi-speaker speech synthesis models are highly complex, and methods that generate timbres by manipulating speech synthesis data rely too heavily on the sound library. The present invention provides a speech synthesis method and system for new timbre generation that can generate additional new timbres by adjusting a small number of speaker vectors in the speech synthesis model, which makes it convenient to control the timbre of the synthesized speech.

Examples

preparation example Construction

[0032] As shown in Figure 1, the speech synthesis method for new timbre generation of the present invention comprises the following steps:

[0033] Step 1. Obtain the sample text together with the corresponding real voice audio and speaker label, convert the real voice audio into a real Mel spectrum, process the sample text to obtain a phoneme sequence, and extract the pronunciation duration of each phoneme corresponding to the text;

[0034] Step 2. Build the speech synthesis model for new timbre generation, comprising a speaker embedding layer, a neural network encoder, a duration prediction module, and a decoder; the neural network encoder consists of a phoneme embedding layer and a CBHG module;

[0035] Step 3. Train the speech synthesis model for new timbre generation using the phoneme sequence and the speaker label;

[0036] Step 4. The text to be synthesized is preprocessed and, together with the specified speaker label, used as the input of the speech synthesis model...
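The inputs gathered in Step 1 can be pictured as one record per utterance. The following is only a minimal sketch: the class name, the field names, and the convention that durations are counted in Mel frames and sum to the length of the Mel spectrum are assumptions for illustration, not details stated in the patent text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSample:
    """One training example from Step 1 (hypothetical field names)."""
    phonemes: List[str]      # phoneme sequence obtained from the sample text
    durations: List[int]     # duration of each phoneme, assumed to be in Mel frames
    mel: List[List[float]]   # real Mel spectrum, shape [num_frames][num_mel_bins]
    speaker_id: int          # speaker label

    def validate(self) -> None:
        # Each phoneme needs exactly one duration.
        if len(self.phonemes) != len(self.durations):
            raise ValueError("each phoneme needs exactly one duration")
        # Assumed alignment convention: durations cover the Mel spectrum exactly.
        if sum(self.durations) != len(self.mel):
            raise ValueError("durations must sum to the number of Mel frames")


# Toy example: 4 phonemes spread over 10 frames of an 80-bin Mel spectrum.
sample = TrainingSample(
    phonemes=["n", "i", "h", "ao"],
    durations=[2, 3, 1, 4],
    mel=[[0.0] * 80 for _ in range(10)],
    speaker_id=3,
)
sample.validate()
```

Checking the alignment up front is a design choice of this sketch; it catches mismatched phoneme and frame counts before they reach training.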

Embodiment

[0074] The present invention was tested on a dataset of 46,500 audio clips with corresponding text, covering 8 speakers. The dataset is preprocessed as follows:

[0075] 1) Extract the phoneme file and the corresponding audio, and use the open-source tool Montreal Forced Aligner to extract the pronunciation duration of each phoneme.

[0076] 2) Extract the Mel spectrum for each audio clip, with a window size of 50 milliseconds, a frame shift of 12.5 milliseconds, and 80 Mel dimensions.

[0077] 3) Sum the Mel spectrum extracted from each audio clip over the Mel dimension to obtain the energy of the Mel spectrum.
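The preprocessing parameters above can be turned into concrete numbers with a short sketch. The sample rate of 16 kHz is an assumption (the text does not state one), and the helper names `ms_to_samples`, `seconds_to_frames`, and `mel_energy` are hypothetical. The energy is computed by summing each frame over its Mel bins, which is one reading of step 3).

```python
from typing import List

SAMPLE_RATE = 16000      # assumption: the source does not state a sample rate
WINDOW_MS = 50.0         # window size from [0076]
FRAME_SHIFT_MS = 12.5    # frame shift from [0076]
N_MELS = 80              # Mel dimensions from [0076]


def ms_to_samples(ms: float, sample_rate: int) -> int:
    """Convert a length in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000.0))


win_length = ms_to_samples(WINDOW_MS, SAMPLE_RATE)       # 800 samples at 16 kHz
hop_length = ms_to_samples(FRAME_SHIFT_MS, SAMPLE_RATE)  # 200 samples at 16 kHz


def seconds_to_frames(duration_s: float, hop: int, sample_rate: int) -> int:
    """Convert an aligner duration in seconds to a frame count.

    Rounding each phoneme separately may not sum to the exact frame total,
    so a real pipeline would need to reconcile the difference.
    """
    return int(round(duration_s * sample_rate / hop))


def mel_energy(mel: List[List[float]]) -> List[float]:
    """Per-frame energy: sum each frame over its Mel bins."""
    return [sum(frame) for frame in mel]


# Toy spectrum with 2 frames and 2 bins (instead of N_MELS) for readability.
energy = mel_energy([[1.0, 2.0], [0.5, 0.5]])
```

With a different sample rate, only `SAMPLE_RATE` changes; the millisecond values stay as given in the text.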

[0078] During training, the text information is encoded and used as the input of the neural network encoder, and the speaker label of the audio corresponding to the text is used as the input of the speaker embedding layer. The speaker vector and the duration-adjusted text encoding are concatenated and ...
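As an illustration of the concatenation described in this paragraph, the sketch below expands phoneme-level encodings to frame level and appends a speaker vector to every frame. Expanding by repeating each encoding for its duration is an assumption (a common length-regulation scheme); the patent excerpt does not say how the duration adjustment is done, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def length_regulate(encodings: List[Vector], durations: List[int]) -> List[Vector]:
    """Repeat the i-th phoneme encoding durations[i] times (assumed scheme)."""
    if len(encodings) != len(durations):
        raise ValueError("need one duration per phoneme encoding")
    expanded: List[Vector] = []
    for vec, d in zip(encodings, durations):
        expanded.extend(list(vec) for _ in range(d))
    return expanded


def concat_speaker(frames: List[Vector], speaker_vec: Vector) -> List[Vector]:
    """Append the same speaker vector to every frame-level text encoding."""
    return [frame + list(speaker_vec) for frame in frames]


# Toy example: 2 phonemes with 2-dim encodings and a 2-dim speaker vector.
text_enc = [[0.1, 0.2], [0.3, 0.4]]
frames = length_regulate(text_enc, durations=[2, 1])
decoder_input = concat_speaker(frames, speaker_vec=[1.0, 0.0])
```

Here `decoder_input` has three frames of four values each: the frame-level text encoding followed by the speaker vector.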

Abstract

The invention discloses a speech synthesis method and system for new tone (timbre) generation, and belongs to the field of speech synthesis. The method comprises the following steps: first, phoneme pronunciation durations and a Mel spectrum are extracted from text and audio as a training set, and a text encoding aligned in length with the Mel spectrum is learned; a speaker label is passed through an embedding layer to generate a speaker encoding; the speaker encoding and the text encoding are combined, a synthesized Mel spectrum is output through a decoder, and speech is synthesized through a vocoder. The speaker encodings are linearly combined to obtain diverse new speaker encodings, so that speech with a new timbre is synthesized. This saves the time and cost of recording a voice training database, and the newly generated timbre can be adjusted and controlled. The complexity of the model is also reduced, so that a speech synthesis model with the timbre generation function can be deployed on hardware with low computing resources, which facilitates wider application of the model in more scenarios.
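The linear combination of speaker encodings can be sketched in a few lines. This is an illustration only: the embedding values, the 2-dimensional size, the function name `mix_speakers`, and the choice of weights that sum to 1 are assumptions; the abstract only states that the speaker encodings are linearly combined.

```python
from typing import Dict, List

Vector = List[float]


def mix_speakers(table: List[Vector], weights: Dict[int, float]) -> Vector:
    """Return the weighted sum of the selected speaker vectors.

    table[s] is the learned embedding of speaker s; weights maps a speaker
    label to its mixing coefficient.
    """
    dim = len(table[0])
    return [sum(w * table[s][i] for s, w in weights.items()) for i in range(dim)]


# Toy embedding table for 3 speakers (values are made up for illustration).
speaker_table = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

# A new speaker code halfway between speaker 0 and speaker 1.
new_speaker = mix_speakers(speaker_table, {0: 0.5, 1: 0.5})
```

The resulting vector would stand in for a trained speaker embedding at synthesis time, so changing the weights changes the timbre without recording new data.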

Description

Technical field

[0001] The invention belongs to the field of speech synthesis, and relates to a speech synthesis method and system for timbre generation.

Background

[0002] In recent years, with the development of deep learning, speech synthesis technology has improved greatly. Speech synthesis has moved from traditional parametric and concatenative methods to end-to-end methods. These usually use an encoder-attention-decoder mechanism for autoregressive generation: to generate the current data point, all previous data points in the time series must first be generated and used as model input. Examples include Tacotron, Tacotron 2, Deep Voice 3, ClariNet, and Transformer TTS. Although autoregressive models can produce satisfactory results, a poorly learned attention alignment may lead to repeated or missing words in the synthesized speech.

[0003] With the development of speech synthesis technology, people h...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G10L13/08, G10L25/30, G06N3/08

CPC: G10L13/08, G10L25/30, G06N3/08

Inventor: 盛乐园

Owner: HANGZHOU YIWISE INTELLIGENT TECH CO LTD
