Multi-modal Transformer image description method based on dynamic word embedding

An image description technology based on multimodal learning, applied in image coding, image data processing, character and pattern recognition, and related fields. It addresses the model's insufficient semantic understanding and the resulting poor description quality, improving semantic understanding and semantic description ability and reducing the semantic gap problem.

Pending Publication Date: 2021-09-03

KUNMING UNIV OF SCI & TECH

Cites: 4; Cited by: 4


Problems solved by technology

[0005] The present invention provides a multimodal Transformer image description method based on dynamic word embedding. The method uses a multimodal deep learning model that jointly models inter-modal and intra-modal attention over the input data to generate the corresponding description. This addresses a weakness of traditional methods, which use only inter-modal attention and therefore understand semantics insufficiently and perform poorly. The specific steps are given in Embodiment 1.
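Independently of the patent's own steps, the difference between intra-modal and inter-modal attention named in [0005] can be illustrated with a minimal NumPy sketch. It is not the patented model: it uses single-head scaled dot-product attention without learned projections, and the array sizes (46 regions, 12 tokens, 64 dimensions) and all variable names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(46, 64))  # assumed: 46 image regions, 64-dim
text_feats = rng.normal(size=(12, 64))   # assumed: 12 word embeddings, 64-dim

# Intra-modal attention: each modality attends to itself.
image_ctx, _ = attention(image_feats, image_feats, image_feats)
text_ctx, _ = attention(text_feats, text_feats, text_feats)

# Inter-modal attention: text queries attend to image keys and values.
fused, cross_weights = attention(text_ctx, image_ctx, image_ctx)
```

Here `fused` holds one image-informed vector per word, and each row of `cross_weights` is a distribution over the image regions.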


Examples


Embodiment 1

[0048] An image description method using a multimodal Transformer based on dynamic word embedding, as shown in Figures 1 and 2, specifically includes the following steps:

[0049] (1) Use the image feature extractor component to select the salient regions of the image and extract their features: feature extraction is performed on the target inside each target box in the image, as shown in Figure 4, producing a more meaningful image feature matrix of size (1024*46).

[0050] Here, feature extraction on the salient regions works as follows: for each target region obtained from the image, PCA is used to extract the main information in that region.
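The PCA step in [0050] can be sketched as follows. This is a generic PCA via singular value decomposition, not the patent's exact procedure: the source does not say what matrix PCA is applied to, so the sketch assumes a matrix with one 1024-dimensional row per region (46 rows), and the number of retained components (16) and the name `pca_reduce` are hypothetical.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project the rows of `features` onto the top principal directions."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of `vt` are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    reduced = centered @ components.T
    # Fraction of total variance captured by each retained component.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return reduced, components, explained[:n_components]

rng = np.random.default_rng(0)
region_feats = rng.normal(size=(46, 1024))  # assumed layout: regions x features
reduced, components, explained = pca_reduce(region_feats, n_components=16)
```

`reduced` keeps one row per region with 16 values each, and `explained` shows how much of the variance those components retain.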

[0051]

[0052] The main information obtained is then linearly transformed to the same feature dimension as the input to the nex...
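A linear transformation of this kind is typically a single affine map applied to every region vector. The sketch below shows only the shape bookkeeping; the input size (16), the target size (512), the random weights, and the name `linear_project` are assumptions, since the source text is truncated before it gives the actual dimensions.

```python
import numpy as np

def linear_project(features, weight, bias):
    """Apply y = x W + b to every row (one row per image region)."""
    return features @ weight + bias

rng = np.random.default_rng(0)
n_regions, d_in, d_model = 46, 16, 512  # all three sizes are illustrative
main_info = rng.normal(size=(n_regions, d_in))
# In a trained model these parameters would be learned; here they are random.
weight = rng.normal(scale=d_in ** -0.5, size=(d_in, d_model))
bias = np.zeros(d_model)

projected = linear_project(main_info, weight, bias)
```

After the projection, every region vector has the assumed target dimension `d_model`, so it can be consumed by a layer that expects inputs of that width.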


Abstract

The invention discloses a multi-modal Transformer image description method based on dynamic word embedding, belonging to the field of artificial intelligence. The invention constructs a model that performs intra-modal attention and inter-modal attention simultaneously, realizing the fusion of multi-modal information. It bridges the convolutional neural network with the Transformer and fuses image information and text information in the same vector space, which improves the accuracy of the model's language descriptions and reduces the semantic gap problem in the field of image description. Compared with a baseline model using Bottom-up and LSTM, BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, and CIDEr-D are all improved.
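For readers unfamiliar with the BLEU-n scores named in the abstract, the following sketch computes an unsmoothed, sentence-level BLEU from clipped n-gram precision and a brevity penalty. It is a simplified illustration, not the evaluation code behind the reported results: published scores are normally computed at corpus level with standard toolkits, and the function names and example sentences here are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against the references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(candidate, references):
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n):
    """Unsmoothed BLEU-max_n with uniform weights over n = 1..max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)

candidate = "a dog runs on the grass".split()
references = ["a dog is running on the grass".split()]
bleu_1 = bleu(candidate, references, max_n=1)
bleu_2 = bleu(candidate, references, max_n=2)
```

BLEU-1 through BLEU-4 differ only in `max_n`; higher orders reward longer matching word sequences and are therefore harder to score well on.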

Description

Technical field

[0001] The invention relates to an image description method using a multi-modal Transformer based on dynamic word embedding, and belongs to the technical field of artificial intelligence.

Background

[0002] Multimodal deep learning aims to process and understand information from multiple source modalities through deep learning methods. With rapid social and economic development, multimodal deep learning has been widely used in many areas of social production and has achieved remarkable results. Popular research directions at present include multimodal learning across image, video, audio and semantics. For example, in speech recognition, humans understand speech by combining audio and visual information. The visual modality provides information on the place of articulation and on muscle movements, which can help disambiguate similar-sounding speech; the speaker's emotions can also be judged from body language and voice, and so on.

[0003] Using natural...

Claims


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06K9/62, G06T9/00, G06F40/30

CPC: G06T9/00, G06F40/30, G06F18/2135

Inventor: 曾凯, 杨文瑞, 朱艳, 沈韬, 刘英莉

Owner: KUNMING UNIV OF SCI & TECH
