
Multi-modal Transformer image description method based on dynamic word embedding

An image description and multi-modal technology, applied in image coding, image data processing, and character and pattern recognition. It addresses problems such as the model's insufficient semantic understanding and poor description quality, with the effect of improving semantic understanding and semantic description ability and reducing the semantic-gap problem.

Pending Publication Date: 2021-09-03
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0005] The present invention provides a multi-modal Transformer image description method based on dynamic word embedding. The method uses a multi-modal deep learning model that jointly applies inter-modal and intra-modal attention to model the input data and generate the corresponding description. This solves the problem that traditional methods use only inter-modal attention, which leads to insufficient semantic understanding and poor description quality. The specific steps are detailed in the embodiments below.
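
As a rough illustration of this joint modeling, the following PyTorch sketch applies intra-modal self-attention within each modality and inter-modal cross-attention between them. All module names, dimensions and the residual/normalization layout are assumptions for illustration, not details taken from the patent:

```python
# Minimal sketch of a joint intra-/inter-modal attention block (assumed
# PyTorch layout; not the patent's exact architecture).
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Intra-modal attention: each modality attends within itself.
        self.img_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inter-modal attention: text queries attend to image keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_norm = nn.LayerNorm(d_model)
        self.txt_norm = nn.LayerNorm(d_model)
        self.cross_norm = nn.LayerNorm(d_model)

    def forward(self, img, txt):
        # img: (batch, n_regions, d_model); txt: (batch, seq_len, d_model)
        a, _ = self.img_self_attn(img, img, img)
        img = self.img_norm(img + a)            # intra-modal, image stream
        a, _ = self.txt_self_attn(txt, txt, txt)
        txt = self.txt_norm(txt + a)            # intra-modal, text stream
        a, _ = self.cross_attn(txt, img, img)   # inter-modal, text -> image
        return img, self.cross_norm(txt + a)

# Usage with random stand-in features:
block = JointAttentionBlock()
img_feats, txt_feats = torch.randn(2, 46, 512), torch.randn(2, 20, 512)
img_out, txt_out = block(img_feats, txt_feats)  # shapes preserved
```

Stacking several such blocks would give an encoder in which each modality refines its own representation while the text stream repeatedly attends to the image regions; this is one common layout, not necessarily the patent's.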



Examples


Embodiment 1

[0048] An image description method based on a dynamic-word-embedding multi-modal Transformer, as shown in Figures 1 and 2, specifically includes the following steps:

[0049] (1) Use the image feature extractor component to select the salient regions of the image and extract its image features: perform feature extraction on the targets in the image to generate a more meaningful image feature matrix, as shown in Figure 4. Feature extraction is performed on the target in each target frame of the image, generating the image feature matrix, of size (1024*46).
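
The patent does not name the detector used to obtain the target frames. As a stand-in, the sketch below pools one feature vector per detected region from a torchvision ResNet-50 backbone with RoIAlign; the boxes, image size and backbone are assumptions, and the resulting dimensions differ from the (1024*46) matrix reported above:

```python
# Stand-in region feature extraction: ResNet-50 feature map + RoIAlign
# (assumed pipeline; the patent's detector and dimensions may differ).
import torch
import torchvision
from torchvision.ops import roi_align

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
body = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()  # drop pool/fc

image = torch.randn(1, 3, 640, 640)                 # stand-in input image
boxes = [torch.tensor([[ 50.,  60., 300., 400.],    # hypothetical target frames
                       [120., 200., 500., 600.]])]  # (x1, y1, x2, y2)

with torch.no_grad():
    fmap = body(image)                              # (1, 2048, 20, 20), stride 32
    pooled = roi_align(fmap, boxes, output_size=1, spatial_scale=1 / 32)
    region_feats = pooled.flatten(1)                # one row per region: (2, 2048)
```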

[0050] Wherein, for the salient regions of the image, feature extraction is performed on the targets: PCA is applied to each obtained target region of the image to extract its principal information.
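
A minimal sketch of this PCA step using scikit-learn, assuming the region features from the previous step; the number of retained components is an assumption:

```python
# Minimal sketch of the PCA step with scikit-learn (component count assumed).
import numpy as np
from sklearn.decomposition import PCA

region_feats = np.random.randn(46, 2048)     # stand-in: 46 regions x 2048 dims
pca = PCA(n_components=32)                   # keep the leading principal components
main_info = pca.fit_transform(region_feats)  # (46, 32)
print(main_info.shape, pca.explained_variance_ratio_.sum())
```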

[0051]

[0052] The principal information thus obtained is then linearly transformed to the same feature dimension as the input of the nex...
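
Assuming the truncated sentence describes a learned linear projection into the Transformer's input dimension, a minimal sketch (all dimensions hypothetical):

```python
# Minimal sketch of the linear transformation to the Transformer's input
# dimension (all dimensions hypothetical).
import torch
import torch.nn as nn

d_pca, d_model = 32, 512
project = nn.Linear(d_pca, d_model)

main_info = torch.randn(46, d_pca)  # PCA output from the previous step
img_tokens = project(main_info)     # (46, d_model), ready for the attention layers
```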



Abstract

The invention discloses a multi-modal Transformer image description method based on dynamic word embedding, belonging to the field of artificial intelligence. A model that performs intra-modal and inter-modal attention simultaneously is constructed, realizing the fusion of multi-modal information: the convolutional neural network is bridged with the Transformer, image information and text information are fused in the same vector space, the accuracy of the model's language description is improved, and the semantic-gap problem in the field of image description is reduced. Compared with a baseline model using Bottom-Up features and an LSTM, the method improves BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L and CIDEr-D scores.
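
For reference, BLEU-n scores of the kind reported above are conventionally computed as the geometric mean of 1- to n-gram precisions; a minimal sketch with NLTK (the sentences are invented, not the patent's evaluation data):

```python
# Minimal sketch of BLEU-n scoring with NLTK (invented sentences, not the
# patent's evaluation data).
from nltk.translate.bleu_score import corpus_bleu

references = [[["a", "man", "is", "riding", "a", "brown", "horse"]]]
hypotheses = [["a", "man", "is", "riding", "a", "horse"]]

for n in range(1, 5):
    weights = tuple(1 / n for _ in range(n))  # uniform weights -> BLEU-n
    print(f"BLEU-{n}: {corpus_bleu(references, hypotheses, weights=weights):.3f}")
```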

Description

Technical field

[0001] The invention relates to a multi-modal Transformer image description method based on dynamic word embedding, and belongs to the technical field of artificial intelligence.

Background technique

[0002] Multi-modal deep learning aims to process and understand multi-source modal information through deep learning methods. With the rapid development of society and the economy, multi-modal deep learning has been widely applied in many aspects of social production and has achieved remarkable results. A currently popular research direction is multi-modal learning across image, video, audio and semantics. For example, in speech recognition, humans understand speech by combining audio and visual information: the visual modality provides information about articulation positions and muscle movements, which helps disambiguate similar-sounding speech, and the speaker's emotion can also be judged from body behavior and voice.

[0003] Using natural...


Application Information

Patent Type & Authority: Application (China)
IPC (8): G06K9/62; G06T9/00; G06F40/30
CPC: G06T9/00; G06F40/30; G06F18/2135
Inventors: 曾凯, 杨文瑞, 朱艳, 沈韬, 刘英莉
Owner: KUNMING UNIV OF SCI & TECH