Multi-modal emotion recognition method

A multi-modal emotion recognition technology, applied in the field of data processing, that addresses the limited ability of existing methods to model context in long sequences and thereby improves recognition accuracy.

Active Publication Date: 2021-03-26
INST OF AUTOMATION CHINESE ACAD OF SCI
Cites: 3; Cited by: 16

AI Technical Summary

Problems solved by technology

Beyond multimodal fusion, in terms of model architecture, current multimodal emotion recognition methods mainly use recurrent neural networks to capture temporal context information, and these networks have limited ability to model context over long sequences.

Examples

Embodiment 1

[0059] As shown in Figure 1, the multimodal emotion recognition method provided by this embodiment of the present application includes:

[0060] S1: Input the audio file, video file and corresponding text file of the sample to be tested, and perform feature extraction on each of them to obtain frame-level audio features, frame-level video features and word-level text features.

[0061] In some embodiments, feature extraction from the audio file, video file and text file specifically includes:

[0062] Segmenting the audio file to obtain frame-level short-time audio clips, and inputting each short-time audio clip into a pre-trained audio feature extraction network to obtain the frame-level audio features;

[0063] Using a face detection tool to extract frame-level face images from the video file, and inputting the frame-level face images into a pre-trained...
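
The audio segmentation step in [0062] can be illustrated with a minimal sketch. The function name `split_into_frames` and the parameters `frame_len` and `hop_len` are hypothetical; the source does not state the clip length, the overlap, or how a trailing partial clip is handled, so the values and the drop-the-remainder behavior below are assumptions.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop_len: int) -> List[List[float]]:
    """Cut a 1-D waveform into fixed-length, possibly overlapping clips.

    frame_len and hop_len are measured in samples. Both are illustrative
    assumptions; the source does not specify them.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    # Slide a window over the waveform and drop any trailing partial window.
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(list(samples[start:start + frame_len]))
    return frames


# Ten dummy samples, 4-sample clips, 2-sample hop.
clips = split_into_frames(list(range(10)), frame_len=4, hop_len=2)
```

Each clip in `clips` would then be passed to the pre-trained audio feature extraction network, which is not modeled here.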

Embodiment 2

[0110] The present application also discloses an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps of the method described in the above embodiments are implemented.

Embodiment 3

[0112] The multimodal emotion recognition method includes the following steps:

[0113] S1-1: Input the audio to be tested, the video to be tested, and the text to be tested. The audio to be tested, the video to be tested, and the text to be tested are three different modalities.

[0114] In this embodiment, the audio to be tested and the video to be tested come from the same segment, the text to be tested corresponds to the audio to be tested and the video to be tested, and audio, video, and text are the three modalities of this segment.

[0115] In this embodiment, the data of these three modalities need to be analyzed to detect the emotional state of the character in the input segment.

[0116] According to the above scheme, further, a segment can be input in which a character speaks: the continuous picture of this character speaking is the video to be tested, the audio that appears in the segment is the audio to be tested, th...

Abstract

The invention relates to a multi-modal emotion recognition method. The method comprises: extracting frame-level audio features, frame-level video features and word-level text features; inputting each set of extracted features into a feature encoder for modeling to obtain encoded audio, video and text features; passing the encoded features through their respective self-attention modules to model the interactions within each modality; arranging the self-attention outputs into pairwise combinations and inputting them into a cross-modal attention module to model the interactions between every two modalities; performing temporal pooling on the outputs of the self-attention modules and the cross-modal attention module to obtain global intra-modal interaction features and global inter-modal interaction features for every pair of modalities; using an attention mechanism to perform weighted fusion of the global intra-modal features and of the global inter-modal features, respectively, to obtain intra-modal and inter-modal feature representations of the whole sample to be tested; and concatenating the two representations and passing them through a fully connected network to obtain the final emotion classification result.
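
The data flow described in the abstract can be sketched in plain Python. This is a toy illustration under several assumptions, not the patented implementation: single-head scaled dot-product attention stands in for the self-attention and cross-modal attention modules, mean pooling stands in for temporal pooling, ordered modality pairs stand in for the pairwise combinations, and `score_vec` is a placeholder for learned fusion parameters. The feature encoders and the final fully connected classifier are omitted, and all numeric values are made up.

```python
import math
from itertools import permutations
from typing import Dict, List

Matrix = List[List[float]]  # shape: (time steps, feature dim)


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: every query step attends over all key steps."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def mean_pool(seq: Matrix) -> List[float]:
    """Temporal pooling (assumed to be a mean): one global vector per sequence."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def weighted_fusion(vectors: Matrix, score_vec: List[float]) -> List[float]:
    """Attention-weighted sum of global vectors; score_vec stands in for learned weights."""
    weights = softmax([sum(a * b for a, b in zip(v, score_vec)) for v in vectors])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(vectors[0]))]


# Toy encoder outputs (hypothetical values); sequence lengths differ per modality.
encoded: Dict[str, Matrix] = {
    "audio": [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    "video": [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]],
    "text": [[1.0, 1.0], [0.0, 0.5]],
}

# Intra-modal interactions: self-attention within each modality.
intra_seq = {name: attention(x, x, x) for name, x in encoded.items()}

# Inter-modal interactions: cross-modal attention over ordered modality pairs.
pairs = list(permutations(encoded, 2))
inter_seq = [attention(intra_seq[a], intra_seq[b], intra_seq[b]) for a, b in pairs]

# Temporal pooling gives one global vector per module output.
intra_global = [mean_pool(s) for s in intra_seq.values()]
inter_global = [mean_pool(s) for s in inter_seq]

score_vec = [1.0, -1.0]  # placeholder for a learned scoring vector
intra_repr = weighted_fusion(intra_global, score_vec)
inter_repr = weighted_fusion(inter_global, score_vec)

# Concatenated representation that a fully connected classifier would consume.
sample_repr = intra_repr + inter_repr
```

Because attention never requires the query and key sequences to have the same length, the audio, video and text sequences do not need to be time-aligned before they interact, which is the property the cross-modal step relies on in this sketch.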

Description

Technical field

[0001] The present application relates to the field of data processing, in particular to a multi-modal emotion recognition method.

Background

[0002] Traditional emotion recognition is often limited to a single modality, such as speech emotion recognition, facial expression recognition, or text emotion analysis. With the development of computer science and technology, multi-modal emotion recognition methods based on audio, video and text have emerged and are expected to be widely used in smart home, education, and finance. Existing multi-modal emotion recognition methods usually use feature-level fusion or decision-level fusion to integrate information from multiple modalities. Each approach has its own advantages and disadvantages. Although feature-level fusion can model the interaction between modalities, it requires the features of different modalities to be aligned in time beforehand. Decision-level fusion is the oppos...
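
The two fusion strategies named in the background can be contrasted with a small sketch. It follows the common usage of the terms rather than any detail given in the source: feature-level fusion is shown as per-step concatenation, which is why time alignment is required, and decision-level fusion is shown as averaging class probabilities predicted separately per modality. The function names and the choice of averaging are assumptions.

```python
from typing import List


def feature_level_fusion(audio: List[List[float]], video: List[List[float]]) -> List[List[float]]:
    """Concatenate per-step features; the sequences must already be time-aligned."""
    if len(audio) != len(video):
        raise ValueError("feature-level fusion needs time-aligned sequences")
    return [a + v for a, v in zip(audio, video)]


def decision_level_fusion(per_modality_probs: List[List[float]]) -> List[float]:
    """Average class probabilities that each modality predicted on its own."""
    n = len(per_modality_probs)
    return [sum(col) / n for col in zip(*per_modality_probs)]


# Two aligned steps: 1-D audio features and 2-D video features (dummy values).
fused_features = feature_level_fusion([[1.0], [2.0]], [[3.0, 4.0], [5.0, 6.0]])

# Two modalities, two emotion classes (dummy probabilities).
fused_decision = decision_level_fusion([[0.8, 0.2], [0.4, 0.6]])
```

In the first function the modalities meet before any prediction is made, while in the second they meet only after each modality has produced its own prediction.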

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/906, G06K9/62, G06N3/04, G06N3/08
CPC: G06F16/906, G06N3/049, G06N3/08, G06N3/048, G06N3/045, G06F18/241, G06F18/25
Inventors: 陶建华, 孙立才, 刘斌, 柳雪飞
Owner: INST OF AUTOMATION CHINESE ACAD OF SCI