A multi-modal speech separation method and system

A multi-modal speech separation technology, applied in speech analysis, neural learning methods, character and pattern recognition, etc. It addresses the problems of insufficiently effective feature extraction and the inability to flexibly change the number of inputs, and it improves separation performance.

Active Publication Date: 2022-02-11

SHANDONG UNIV


Problems solved by technology

[0013] To sum up, the above literature uses only one feature extractor for the full frequency band; that is, the same feature extractor is applied at every frequency, so the extracted features are not good enough. In addition, in existing methods the number of speakers is fixed when the network parameters are designed; that is, the model is static, a fixed number of speakers must be used in both training and testing, and the number of inputs cannot be flexibly changed. Therefore, existing speech separation technology needs further improvement.


Examples

Embodiment 1

[0059] As shown in Figure 1, this embodiment discloses a multimodal speech separation method, including:

[0060] Receiving the mixed speech of the objects to be recognized and their facial visual information, and obtaining the number of speakers through face detection;

[0061] Processing the above information to obtain a complex spectrogram and face images, which are passed to a multimodal speech separation model whose structure is dynamically adjusted according to the number of speakers. During training, the model uses the complex ideal ratio mask (cIRM) as the training target. The cIRM is defined in the complex domain as the ratio between the clean speech spectrogram and the mixed speech spectrogram; it consists of real and imaginary parts and carries both the amplitude and the phase information of the speech;

[0062] The multimodal speech separation model outputs cIRM corresponding to the number of objects t...
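The cIRM definition in [0061] can be illustrated with a minimal NumPy sketch. This is not the patent's implementation: the function names `compute_cirm` and `apply_mask`, the `eps` guard, and the random test spectrograms are all assumptions made for illustration. The only thing taken from the text is the definition itself, the complex-domain ratio of the clean spectrogram to the mixed spectrogram.

```python
import numpy as np


def compute_cirm(clean_spec, mixed_spec, eps=1e-8):
    """Complex ideal ratio mask: clean / mixed, per time-frequency bin.

    The division is written out in real and imaginary parts by multiplying
    with the conjugate of the mixture. `eps` is an assumed guard against
    bins where the mixture is close to zero.
    """
    denom = mixed_spec.real ** 2 + mixed_spec.imag ** 2 + eps
    real = (clean_spec.real * mixed_spec.real
            + clean_spec.imag * mixed_spec.imag) / denom
    imag = (clean_spec.imag * mixed_spec.real
            - clean_spec.real * mixed_spec.imag) / denom
    return real + 1j * imag


def apply_mask(mask, mixed_spec):
    """Complex multiplication of a mask with the mixed spectrogram."""
    return mask * mixed_spec


# Hypothetical data: two random complex "spectrograms" standing in for
# two speakers (frequency bins x time frames).
rng = np.random.default_rng(0)
shape = (5, 4)
speaker_a = rng.normal(size=shape) + 1j * rng.normal(size=shape)
speaker_b = rng.normal(size=shape) + 1j * rng.normal(size=shape)
mixed = speaker_a + speaker_b

mask_a = compute_cirm(speaker_a, mixed)
mask_b = compute_cirm(speaker_b, mixed)
recovered_a = apply_mask(mask_a, mixed)
```

Because the mask is complex, multiplying it with the mixture restores both magnitude and phase of the target speaker, which is the property the text attributes to the cIRM. In training, a network would be asked to predict `mask_a` and `mask_b` rather than compute them from the clean signals.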

Embodiment 2

[0077] The purpose of this embodiment is to provide a computing device, including a memory, a processor, and a computer program stored in the memory and operable on the processor, and the processor implements the steps of the above method when executing the program.

Embodiment 3

[0079] The purpose of this embodiment is to provide a computer-readable storage medium.

[0080] A computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the steps of the above-mentioned method are executed.

Abstract

This disclosure proposes a multi-modal speech separation method and system, including: receiving the mixed speech of the objects to be recognized and their facial visual information; using the Dlib library to perform face detection to obtain the number of speakers; processing the above information to obtain the complex spectrogram and the speakers' face images, passing them to the multi-modal speech separation model, and dynamically adjusting the structure of the model according to the number of speakers, wherein the model uses the complex ideal ratio mask (cIRM) as the training target, which is defined in the complex domain as the ratio between the clean speech spectrogram and the mixed speech spectrogram, consists of real and imaginary parts, and contains the amplitude and phase information of the sound; outputting from the model the time-frequency masks corresponding to the number of faces; and multiplying each output mask with the spectrogram of the mixed speech to obtain the spectrogram of the clean speech, then applying the inverse short-time Fourier transform to obtain the time-domain signal of the clean speech, thereby completing the speech separation. Because the number of speakers is not fixed in advance, the disclosed model suits a wider range of application scenarios.
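The final reconstruction step of the abstract (multiply each mask with the mixed spectrogram, then apply the inverse short-time Fourier transform) can be sketched as follows. This is a hedged illustration, not the disclosed system: it assumes SciPy's `scipy.signal.stft` and `istft`, an assumed sample rate and window length, and "oracle" masks computed from known sources in place of the output of the trained model. The helper names `separate` and `oracle_mask` are hypothetical.

```python
import numpy as np
from scipy.signal import stft, istft

FS = 16000      # assumed sample rate; not stated in the source
NPERSEG = 512   # assumed STFT window length; not stated in the source


def separate(mixture, masks, fs=FS, nperseg=NPERSEG):
    """Apply one complex mask per speaker and return time-domain signals."""
    _, _, mixed_spec = stft(mixture, fs=fs, nperseg=nperseg)
    outputs = []
    for mask in masks:
        clean_spec = mask * mixed_spec              # masked spectrogram
        _, signal = istft(clean_spec, fs=fs, nperseg=nperseg)
        outputs.append(signal[: len(mixture)])      # trim STFT padding
    return outputs


def oracle_mask(source, mixture, fs=FS, nperseg=NPERSEG, eps=1e-12):
    """Stand-in for the model output: ratio of source to mixture spectra."""
    _, _, src_spec = stft(source, fs=fs, nperseg=nperseg)
    _, _, mix_spec = stft(mixture, fs=fs, nperseg=nperseg)
    return src_spec * np.conj(mix_spec) / (np.abs(mix_spec) ** 2 + eps)


# Hypothetical signals: two noise sources standing in for two speakers.
rng = np.random.default_rng(1)
source_a = rng.normal(size=FS)
source_b = rng.normal(size=FS)
mixture = source_a + source_b

masks = [oracle_mask(s, mixture) for s in (source_a, source_b)]
estimates = separate(mixture, masks)
```

The list `masks` can hold any number of entries, one per detected face, so the same loop serves two, three, or more speakers. In the disclosed method that count would come from face detection rather than from known sources.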

Description

Technical field

[0001] The present disclosure belongs to the technical field of speech separation and, in particular, relates to a multimodal speech separation method and system.

Background

[0002] In daily life, people often encounter mixed sounds of many kinds, of which overlapping human voices are the most common. In a mixed environment, people can focus on one person's voice while ignoring other voices and ambient noise, a phenomenon known as the cocktail party effect. Thanks to the powerful signal processing capabilities of the human auditory system, people separate mixed sounds easily. As daily life becomes more intelligent, speech separation technology plays an important role in various voice interaction devices. For computers, however, separating speech efficiently has always been a difficult problem.

[0003] At present, speech separation technology has a very wide range of applications. For examp...


Application Information

Patent Type & Authority: Patents (China)

IPC (8): G10L21/0272, G10L21/0208, G10L25/30, G06V40/16, G06V40/20, G06V10/46, G06V10/80, G06V10/82, G06N3/04, G06N3/08

CPC: G10L21/0272, G10L21/0208, G10L25/30, G06N3/08, G10L2021/02087, G06V40/161, G06V40/20, G06V10/478, G06N3/045, G06F18/253

Inventor: 魏莹, 刘洋

Owner: SHANDONG UNIV
