
Multi-mode voice separation method and system

A speech-separation and multi-modal technology, applied in speech analysis, neural learning methods, and character and pattern recognition. It addresses the problems that a single full-band feature extractor yields insufficient performance on the data and that the number of inputs cannot be changed flexibly, and achieves the effect of improved separation performance.

Active Publication Date: 2021-06-25
SHANDONG UNIV

AI Technical Summary

Problems solved by technology

[0014] To sum up, the literature above uses only one feature extractor for the full frequency band, that is, the same feature extractor is applied at every frequency, so the extracted features are not effective enough. In addition, in existing models the number of speakers is fixed when the network parameters are designed; the model is static, the same fixed number of speakers must be used during training and testing, and the number of inputs cannot be changed flexibly. The existing speech separation technology therefore needs further improvement.

Method used



Examples


Embodiment 1

[0060] As shown in figure 1, this embodiment discloses a multi-modal speech separation method, including:

[0061] Receiving the mixed speech of the objects to be identified and their facial visual information, and obtaining the number of speakers through face detection;

[0062] Processing the above information to obtain complex spectrograms and face images, transmitting them to the multi-modal speech separation model, and using the number of speakers to dynamically adjust the structure of the model. During training, the multi-modal speech separation model uses the complex ideal ratio mask (cIRM) as the training target. The cIRM is defined in the complex domain as the ratio between the clean-sound spectrogram and the mixed-sound spectrogram; it consists of a real part and an imaginary part, and carries both the amplitude and the phase information of the sound;
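The cIRM defined above can be sketched in a few lines of numpy. This is a minimal illustration, not the patent's implementation: the spectrograms here are random complex arrays standing in for STFT outputs, and the variable names (`S_clean`, `S_mix`, `eps`) are illustrative.

```python
import numpy as np

# Synthetic stand-ins for clean and mixed complex spectrograms
# (in practice these would come from an STFT of the waveforms).
rng = np.random.default_rng(0)
shape = (257, 100)  # (frequency bins, time frames), illustrative
S_clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
S_mix = S_clean + noise

# cIRM: ratio of clean to mixed spectrogram in the complex domain.
# eps guards against division by near-zero bins.
eps = 1e-8
cIRM = S_clean / (S_mix + eps)

# Because the mask carries both magnitude and phase, applying it to the
# mixed spectrogram recovers the clean spectrogram exactly.
S_rec = cIRM * (S_mix + eps)
print(np.allclose(S_rec, S_clean))
```

Note that, unlike a magnitude-only ideal ratio mask, the complex ratio preserves phase, which is why multiplying it back onto the mixture is an exact inverse here.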

[0063] The multimodal speech separation model outputs a cIRM corr...

Embodiment 2

[0078] The purpose of this embodiment is to provide a computing device, including a memory, a processor, and a computer program stored in the memory and operable on the processor, and the processor implements the steps of the above method when executing the program.

Embodiment 3

[0080] The purpose of this embodiment is to provide a computer-readable storage medium.

[0081] A computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the steps of the above-mentioned method are executed.



Abstract

The invention provides a multi-modal voice separation method and system. The method comprises the following steps: receiving the mixed voice of the objects to be recognized and their facial visual information; performing face detection with the Dlib library to obtain the number of speakers; processing the information to obtain a complex spectrogram and the speakers' face images, transmitting them to a multi-modal voice separation model, and dynamically adjusting the structure of the model according to the number of speakers. During training, the model uses the complex-domain ideal ratio mask (cIRM) as its target; the cIRM is defined in the complex domain as the ratio between the clean-sound spectrogram and the mixed-sound spectrogram, consists of a real part and an imaginary part, and contains both the amplitude and phase information of the sound. The model outputs one time-frequency mask per detected face; complex multiplication of each output mask with the spectrogram of the mixed sound yields the spectrogram of the clean sound, and a short-time inverse Fourier transform of that spectrogram yields the time-domain signal of the clean sound, completing the voice separation. The model is thereby suited to most application scenarios.
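The separation step at the end of the abstract, mask times mixed spectrogram followed by an inverse short-time Fourier transform, can be sketched as below. This is an illustration under simplifying assumptions: two synthetic sinusoids stand in for two speakers, and the "predicted" mask is replaced by the oracle cIRM, not the output of the patent's trained model.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)             # stand-in for the target speaker
interferer = 0.8 * np.sin(2 * np.pi * 220 * t)  # stand-in for a second speaker
mix = clean + interferer

# Complex spectrograms of the clean and mixed signals.
_, _, S_clean = stft(clean, fs=fs, nperseg=512)
_, _, S_mix = stft(mix, fs=fs, nperseg=512)

# Oracle cIRM (a trained model would predict this instead).
mask = S_clean / (S_mix + 1e-8)

# Complex multiplication of mask and mixed spectrogram, then iSTFT.
S_sep = mask * S_mix
_, separated = istft(S_sep, fs=fs, nperseg=512)

err = np.max(np.abs(separated[:fs] - clean))
print(err < 1e-2)  # near-perfect oracle reconstruction
```

With an oracle mask the reconstruction error is essentially numerical noise; in the method proper, the quality of separation is bounded by how closely the model's predicted masks approximate the cIRM.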

Description

Technical field

[0001] The disclosure belongs to the technical field of speech separation, and in particular relates to a multi-modal speech separation method and system.

Background technique

[0002] The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

[0003] In daily life we often encounter a variety of mixed sounds, and mixtures of human voices are the ones we must deal with most often. In an environment where multiple voices are present, people can focus on one individual voice while ignoring other voices and ambient noise, a phenomenon known as the cocktail party effect. Because the human auditory system has powerful sound-signal processing capability, it separates mixed sounds with ease. As everyday life becomes more intelligent, voice separation technology has come to play an important role in various voice interaction devices. For computers, however, how to ...

Claims


Application Information

Patent Type & Authority Applications(China)
IPC(8): G10L21/0272; G10L21/0208; G10L25/30; G06K9/00; G06K9/48; G06K9/62; G06N3/04; G06N3/08
CPC: G10L21/0272; G10L21/0208; G10L25/30; G06N3/08; G10L2021/02087; G06V40/161; G06V40/20; G06V10/478; G06N3/045; G06F18/253
Inventor: 魏莹, 刘洋
Owner SHANDONG UNIV