
A multi-modal speech separation method and system

A speech separation and multi-modal technology, applied in speech analysis, neural learning methods, character and pattern recognition, etc. It addresses the problems that a single full-band feature extractor does not extract features from the data well enough and that the number of speaker inputs cannot be changed flexibly, and achieves the effect of improved separation performance.

Active Publication Date: 2022-02-11
SHANDONG UNIV

AI Technical Summary

Problems solved by technology

[0013] To sum up, the literature above uses only one feature extractor for the full frequency band, that is, the same feature extractor is applied at every frequency, so the features extracted from the data are not good enough. In addition, in existing methods the number of speakers is fixed when the network parameters are designed, that is, the model is static: a fixed number of speakers must be used during training and testing, and the number of inputs cannot be changed flexibly. Therefore, existing speech separation technology requires further improvement.
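To make the two criticized limitations concrete, the following is a minimal sketch, not the patented architecture: it contrasts a single extractor shared across the full frequency band with separate extractors for each sub-band, so that low and high frequencies are no longer forced through identical weights. The band split, layer sizes, and module names are illustrative assumptions.

# Illustrative sketch only: contrasts a single full-band feature extractor
# with per-sub-band extractors. Layer sizes and the band split are assumptions,
# not the architecture claimed by this patent.
import torch
import torch.nn as nn


class FullBandExtractor(nn.Module):
    """One shared extractor applied to the whole frequency axis."""
    def __init__(self, freq_bins: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(freq_bins, hidden), nn.ReLU())

    def forward(self, spec):           # spec: (batch, time, freq_bins)
        return self.net(spec)


class SubBandExtractor(nn.Module):
    """A separate small extractor per frequency band, so each band
    is processed by its own weights instead of one shared extractor."""
    def __init__(self, freq_bins: int, n_bands: int = 4, hidden: int = 64):
        super().__init__()
        assert freq_bins % n_bands == 0
        self.n_bands = n_bands
        band = freq_bins // n_bands
        self.nets = nn.ModuleList(
            nn.Sequential(nn.Linear(band, hidden), nn.ReLU())
            for _ in range(n_bands)
        )

    def forward(self, spec):           # spec: (batch, time, freq_bins)
        chunks = spec.chunk(self.n_bands, dim=-1)
        return torch.cat([net(c) for net, c in zip(self.nets, chunks)], dim=-1)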



Examples


Embodiment 1

[0059] Referring to Figure 1, this embodiment discloses a multimodal speech separation method, including:

[0060] Receiving the mixed speech of the objects to be recognized and their facial visual information, and obtaining the number of speakers through face detection;

[0061] Processing the above information to obtain the complex spectrogram of the mixed speech and the speakers' face images, and transmitting them to a multimodal speech separation model whose structure is dynamically adjusted according to the number of speakers; during training, the multimodal speech separation model uses the complex-domain ideal ratio mask (cIRM) as the training target, where the cIRM is defined in the complex domain as the ratio between the clean speech spectrogram and the mixed speech spectrogram, consists of a real part and an imaginary part, and carries both the amplitude and phase information of the speech (a code sketch of this mask follows at the end of this embodiment);

[0062] The multimodal speech separation model outputs cIRMs corresponding to the number of objects to be recognized; each output mask is multiplied with the spectrogram of the mixed speech to obtain the spectrogram of a clean speech signal, and an inverse short-time Fourier transform is applied to that spectrogram to obtain the time-domain clean signal, thereby completing the speech separation.
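A minimal sketch of the cIRM steps in this embodiment, assuming standard STFT conventions from the librosa library; the FFT size, hop length, and the uncompressed form of the mask are assumptions, not parameters given in the patent.

# Minimal sketch of the cIRM-based separation steps in Embodiment 1,
# assuming NumPy/librosa STFT conventions; window/hop sizes and the
# uncompressed cIRM form are assumptions, not values from the patent.
import numpy as np
import librosa

N_FFT, HOP = 512, 128


def cirm(clean_wav: np.ndarray, mix_wav: np.ndarray, eps: float = 1e-8):
    """Training target: complex-domain ratio of clean to mixed spectrogram.
    Returned as (real, imag), which together carry magnitude and phase."""
    S_clean = librosa.stft(clean_wav, n_fft=N_FFT, hop_length=HOP)
    S_mix = librosa.stft(mix_wav, n_fft=N_FFT, hop_length=HOP)
    mask = S_clean / (S_mix + eps)          # complex-domain ideal ratio mask
    return mask.real, mask.imag


def apply_mask(mix_wav: np.ndarray, mask_real: np.ndarray, mask_imag: np.ndarray):
    """Inference: multiply the predicted cIRM with the mixed spectrogram,
    then invert with an inverse STFT to get the time-domain clean signal."""
    S_mix = librosa.stft(mix_wav, n_fft=N_FFT, hop_length=HOP)
    S_est = (mask_real + 1j * mask_imag) * S_mix
    return librosa.istft(S_est, hop_length=HOP, length=len(mix_wav))

In this reading, the real and imaginary parts returned by cirm are what the separation network learns to predict, and apply_mask reproduces the mask-multiplication and inverse-STFT steps described in paragraph [0062].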

Embodiment 2

[0077] The purpose of this embodiment is to provide a computing device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the above method when executing the program.

Embodiment 3

[0079] The purpose of this embodiment is to provide a computer-readable storage medium.

[0080] A computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the steps of the above-mentioned method are executed.



Abstract

This disclosure proposes a multi-modal speech separation method and system, including: receiving the mixed speech of the objects to be recognized and their facial visual information; using the Dlib library to perform face detection and obtain the number of speakers; processing the above information to obtain the complex spectrogram of the mixed speech and the speakers' face images and transmitting them to a multi-modal speech separation model whose structure is dynamically adjusted according to the number of speakers, wherein during training the model uses the complex-domain ideal ratio mask as the training target, defined in the complex domain as the ratio between the clean speech spectrogram and the mixed speech spectrogram, consisting of a real part and an imaginary part and containing both the amplitude and phase information of the speech; the multi-modal speech separation model outputs time-frequency masks corresponding to the number of detected faces; each output mask is multiplied with the spectrogram of the mixed speech to obtain the spectrogram of a clean speech signal, and an inverse short-time Fourier transform is applied to that spectrogram to obtain the time-domain clean signal, thereby completing the speech separation. The disclosed model is thus better suited to most application scenarios.
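The abstract names the Dlib library for obtaining the number of speakers by face detection. Below is a minimal sketch of that step, assuming a single video frame read from an image file; the file name is hypothetical, and the one-mask-per-detected-face convention follows the abstract's description of the dynamically adjusted model.

# Sketch of the speaker-counting step: detect faces in one video frame with Dlib
# and use the count to size the model's output. The frame file name and the
# one-mask-per-speaker convention are illustrative assumptions.
import dlib
import numpy as np
from PIL import Image

detector = dlib.get_frontal_face_detector()


def count_speakers(frame_path: str) -> int:
    """Return the number of frontal faces detected in one video frame."""
    frame = np.array(Image.open(frame_path).convert("RGB"))
    faces = detector(frame, 1)   # 1 = upsample once to catch smaller faces
    return len(faces)


n_speakers = count_speakers("frame_0001.png")   # hypothetical frame file
# The separation model would then be configured to emit n_speakers cIRM masks.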

Description

Technical Field

[0001] The present disclosure belongs to the technical field of speech separation, and in particular relates to a multimodal speech separation method and system.

Background

[0002] Daily life frequently involves mixtures of many sounds, among which mixed human voices are the most common. In such an environment, people can focus on one person's voice while ignoring other voices and ambient noise, a phenomenon known as the cocktail party effect. Thanks to the powerful signal-processing capability of the human auditory system, humans separate mixed sounds with ease, and as everyday devices become more intelligent, speech separation technology plays an important role in a variety of voice interaction devices. For computers, however, performing speech separation efficiently has long been a difficult problem.

[0003] At present, speech separation technology has a very wide range of applications. For examp...

Claims


Application Information

Patent Type & Authority Patents(China)
IPC IPC(8): G10L21/0272G10L21/0208G10L25/30G06V40/16G06V40/20G06V10/46G06V10/80G06V10/82G06N3/04G06N3/08
CPCG10L21/0272G10L21/0208G10L25/30G06N3/08G10L2021/02087G06V40/161G06V40/20G06V10/478G06N3/045G06F18/253
Inventor 魏莹刘洋
Owner SHANDONG UNIV