Open domain video natural language description generation method based on multi-modal feature fusion

A feature fusion and natural language technology, applied in the field of video analysis, that addresses limitations of earlier methods, such as using only RGB image features and paying little attention to other information in the video, so as to improve robustness, speed, and accuracy.

Active Publication Date: 2018-10-12

NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

7 Cites, 37 Cited by

AI Technical Summary

Problems solved by technology

The S2VT model achieved a METEOR value of 29.8% on a standard video description dataset, higher than all previous models, but S2VT considers only the image features and optical flow features of the video; other information in the video has not been studied much.

[0004] Later, further models were proposed, such as the bidirectional LSTM model (Yi B, Yang Y, Shen F, et al. Bidirectional Long-Short Term Memory for Video Description[C]//ACM on Multimedia Conference. ACM, 2016: 436-440.) and the multi-scale multi-instance model (Xu H, Venugopalan S, Ramanishka V, et al. A Multi-scale Multiple Instance Video Description Network[J]. Computer Science, 2015, 6738: 272-279.), but these did not consider features other than image and optical flow features.

In 2017, Pasunuru et al. proposed a multi-task model (Pasunuru R, Bansal M. Multi-Task Video Captioning with Video and Entailment Generation[J]. 2017.) that shares parameters between an unsupervised video prediction task (encoding) and a language generation task (decoding). It achieved the best result to date, with a METEOR value of 36%, but the model used only RGB image features.

Method used

Examples

Embodiment Construction

[0033] As shown in Figure 1, the open-domain video natural language description model based on multimodal feature fusion is divided into two main models: a feature extraction model and a natural language model. The present invention mainly studies the feature extraction model, which is introduced in four parts.

[0034] Part 1: ResNet152 extracts RGB image features and optical flow features.

[0035] (1) Extraction of RGB image features.

[0036] The ResNet model is pre-trained on the ImageNet image database. ImageNet contains 12,000,000 images divided into 1,000 categories, which makes the model more accurate at identifying objects in open-domain videos. The batch size of the neural network model is set to 50, and the initial learning rate is set to 0.0001. The MSVD (Microsoft Research Video Description Corpus) dataset contains 1970 video clips, with durations between 8 and 25 seconds, corres...

Abstract

The invention discloses an open-domain video natural language description method based on multi-modal feature fusion. A deep convolutional neural network model is used to extract RGB image features and grayscale optical flow image features, and video spatio-temporal information and audio information are added to form a multi-modal feature system. When the C3D feature is extracted, the overlap between the consecutive frame blocks fed into the three-dimensional convolutional neural network model is adjusted dynamically. This removes the limitation imposed by the size of the training data and makes the method robust to the length of the videos it can process. The audio information compensates for what the visual features miss, and finally the multi-modal features are fused. The method standardizes the feature values of each modality to within a certain range, which resolves the differences in scale between feature values. It reduces the dimension of each individual modality with PCA while retaining 99% of the important information, which avoids the training failures caused by excessively large dimensions. The accuracy of the generated open-domain video description sentences is effectively improved, and the method is highly robust to scenes, people, and events.
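The steps named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`clip_starts`, `scale_to_unit_range`, `pca_reduce`, `fuse`) are hypothetical, the 16-frame clip length and the fixed number of clips per video are assumed, "standardization within a certain range" is read as min-max scaling to [0, 1], "99% of the important information" is read as 99% of the explained variance, and fusion is assumed to be simple concatenation.

```python
import numpy as np


def clip_starts(num_frames, clip_len=16, num_clips=8):
    """Start frames for a fixed number of clips spread over the whole video.

    The overlap between consecutive clips adapts to the video length:
    short videos give heavily overlapping clips, long videos give gaps.
    (clip_len=16 and a fixed num_clips are assumptions for illustration.)
    """
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")
    if num_clips == 1:
        return [0]
    last_start = num_frames - clip_len
    return [(i * last_start) // (num_clips - 1) for i in range(num_clips)]


def scale_to_unit_range(x):
    """Min-max scale each feature column to [0, 1] (assumed standardization)."""
    lo = x.min(axis=0)
    hi = x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (x - lo) / span


def pca_reduce(x, keep=0.99):
    """Project x onto the fewest principal components explaining `keep` variance."""
    centered = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(ratio, keep)) + 1, len(s))
    return centered @ vt[:k].T


def fuse(modalities, keep=0.99):
    """Scale and reduce each modality separately, then concatenate (assumed fusion)."""
    reduced = [pca_reduce(scale_to_unit_range(m), keep) for m in modalities]
    return np.concatenate(reduced, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 64))    # stand-in RGB features
    audio = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 32))  # stand-in audio features
    print(clip_starts(44))           # 16-frame clips overlapping by 12 frames
    print(fuse([rgb, audio]).shape)  # far fewer than 64 + 32 columns
```

Reducing each modality before concatenation keeps one high-dimensional modality from dominating the fused vector. The feature matrices in the usage section are random stand-ins, not data from the patent.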

Description

Technical field

[0001] The invention belongs to the field of video analysis technology, and in particular relates to an open-domain video natural language description generation method based on multimodal feature fusion.

Background

[0002] With the popularity of smart mobile devices in recent years, the large amount of video data on network platforms urgently needs to be analyzed and managed. Studying natural language description technology for videos is therefore of great practical value. Illegal videos emerge endlessly on social platforms such as Weibo and WeChat, yet controlling their spread currently relies mainly on manual methods such as user reports, which is not effective. In addition to controlling the dissemination of pornographic, violent, reactionary, and other illegal videos and maintaining network security, the language description of videos can also provide intelligent technology for the blind and other people with visual impairments t...

Claims

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G10L15/00, G10L15/02, G10L15/06, G10L15/18, G10L15/26, G10L17/26, G06K9/46, G06K9/62, G06N3/04

CPC: G10L15/005, G10L15/02, G10L15/063, G10L15/18, G10L15/26, G10L17/26, G06V10/56, G06N3/045, G06F18/214

Inventors: 袁家斌, 杜晓童

Owner: NANJING UNIV OF AERONAUTICS & ASTRONAUTICS
