
Audio and video multi-mode sentiment classification method and system

A sentiment classification and multimodal technology, applied in speech analysis, neural learning methods, speech recognition and related fields. It addresses the lack of a unified method for processing raw audio and video data and the inability to extract facial features when a video contains no face, with the effect of improving information processing efficiency, reducing computing overhead and improving accuracy.

Active Publication Date: 2021-09-17
SOUTH CHINA UNIV OF TECH
7 Cites, 17 Cited by

AI Technical Summary

Problems solved by technology

[0012] 5. The invention application in question offers no unified approach to processing raw audio and video data, even though the format and content of such data vary widely. For example, a video may not contain a human face, in which case facial features cannot be extracted with the method described in that application.



Examples


Embodiment 1

[0032] As shown in Figure 1, in this embodiment the audio-video multimodal emotion classification method comprises the following steps:

[0033] S1. Processing and calculation of raw video data

[0034] Obtain key frames and the audio signal from the input original video clip. Each key frame is scaled and input to the face detection module: if the frame does not contain a face, the frame is divided into equal-sized segments; if the frame contains a human face, the Megvii Face++ open source API is used to extract the key points of the face. A Mel spectrogram and MFCCs (Mel-frequency cepstral coefficients) are computed from the audio signal. The open-source speech-to-text toolkit DeepSpeech converts the audio into text, and the related functions provided in Transformers (self-attention transformation network) convert the text into word vectors and generate sentence symbols according to the sentence structure of the text....
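The key-frame branch in step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: `detect_face` and `extract_keypoints` are hypothetical callables standing in for the face detection module and the key-point extraction service, the 2 x 2 grid is an assumed choice for the equal-sized segmentation, frames are represented as plain nested lists, and the scaling step is omitted.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def split_equal(height: int, width: int, rows: int, cols: int) -> List[Box]:
    """Return the boxes of a rows x cols grid of equal-sized tiles."""
    if height % rows or width % cols:
        raise ValueError("frame size must be divisible by the grid size")
    tile_h, tile_w = height // rows, width // cols
    return [
        (r * tile_h, c * tile_w, (r + 1) * tile_h, (c + 1) * tile_w)
        for r in range(rows)
        for c in range(cols)
    ]


def preprocess_frame(
    frame: List[list],
    detect_face: Callable[[List[list]], Optional[Box]],    # hypothetical detector
    extract_keypoints: Callable[[List[list], Box], list],  # hypothetical extractor
    grid: Tuple[int, int] = (2, 2),                        # assumed grid size
):
    """Return ("keypoints", points) if a face is found, else ("tiles", tiles)."""
    face = detect_face(frame)
    if face is not None:
        return "keypoints", extract_keypoints(frame, face)
    # No face in the frame: fall back to equal-sized segmentation.
    boxes = split_equal(len(frame), len(frame[0]), *grid)
    tiles = [[row[left:right] for row in frame[top:bottom]]
             for (top, left, bottom, right) in boxes]
    return "tiles", tiles
```

With a detector that returns `None`, a 4 x 4 frame is split into four 2 x 2 tiles; with a detector that returns a box, the frame is passed on to the key-point extractor instead.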

Embodiment 2

[0060] Based on the same inventive concept as Embodiment 1, this embodiment provides an audio-video multimodal emotion classification system which, as shown in Figure 2, includes:

[0061] The data preprocessing module implements step S1 of Embodiment 1: it processes and computes the original video data to obtain video data samples, audio data samples and text feature samples;

[0062] The emotional feature extraction module implements step S2 of Embodiment 1: it constructs an emotional feature extraction network and performs feature extraction on the video data samples, audio data samples and text feature samples respectively to obtain visual modal features, audio features and text features;

[0063] The feature fusion and classification module implements step S3 of Embodiment 1: it unifies the dimensions of the extracted visual modal features, audio features and text features through the fully connected layer, inputs them into the tensor fusion network for f...
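The fusion step can be illustrated with a small sketch. The source text is truncated before it describes the fusion itself, so the code below assumes a commonly used formulation of tensor fusion: each modality vector is extended with a constant 1 and the three vectors are combined by an outer product, which keeps the unimodal, bimodal and trimodal interaction terms. The function names and the pure-Python list representation are illustrative assumptions.

```python
from typing import List, Sequence


def linear(x: Sequence[float],
           weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Fully connected layer y = W x + b, used to bring each modality to a common size."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def tensor_fusion(visual: Sequence[float],
                  audio: Sequence[float],
                  text: Sequence[float]) -> List[float]:
    """Flattened outer product of the three vectors, each extended with a leading 1.

    Assumed formulation: for inputs of size d the result has (d + 1) ** 3 entries.
    """
    v = [1.0, *visual]
    a = [1.0, *audio]
    t = [1.0, *text]
    return [vi * aj * tk for vi in v for aj in a for tk in t]
```

For one-dimensional inputs `[2.0]`, `[3.0]` and `[5.0]`, the fused vector has eight entries: the constant 1, the three single features, their three pairwise products and the triple product 30. A classifier would then operate on this fused vector.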



Abstract

The invention relates to the field of speech and image processing and pattern recognition, and in particular to an audio and video multimodal sentiment classification method and system. The method comprises the following steps: processing and computing the original video data to obtain video data samples, audio data samples and text feature samples; constructing an emotion feature extraction network and performing feature extraction on the video data samples, audio data samples and text feature samples to obtain visual modal features, audio features and text features; and unifying the dimensions of the extracted visual modal features, audio features and text features, inputting them into a tensor fusion network for fusion learning, and finally performing classification to output a multimodal sentiment classification probability result. The method effectively integrates cross-modal emotion information: it extracts high-dimensional spatio-temporal features from video, audio and text, concatenates them into multimodal feature vectors, performs fusion learning, and then classifies emotion.

Description

Technical field

[0001] The invention relates to the fields of speech and image processing and pattern recognition, and specifically to an audio and video multimodal emotion classification method and system based on an open source deep learning framework.

Background technique

[0002] With the advent of the 5G era and the growth of the emerging Internet entertainment industry represented by short videos, the lifting of network speed restrictions will further establish short video as a new mainstream information carrier. With the ensuing explosive growth in the volume of video-borne data, "information overload" has become an unavoidable problem. Content-based personalized recommendation systems are playing an increasingly important role, so demand for tagging, describing and classifying videos is also increasing. Second, due to the continuing spread of 4G and 5G networks and the increase in the number of active online ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/62, G06N3/04, G06N3/08, G10L15/26, G10L25/03, G10L25/24, G10L25/30, G10L25/63
CPC: G06N3/08, G10L25/63, G10L25/30, G10L25/03, G10L25/24, G10L15/26, G06N3/044, G06N3/045, G06F18/2415
Inventor: 岑敬伦, 李志鹏, 青春美, 罗万相
Owner: SOUTH CHINA UNIV OF TECH