Multi-modal pre-training method based on image-text linear combination
A multi-modal pre-training technology based on image-text linear combination, applicable to neural learning methods, character and pattern recognition, and biological neural network models, aiming to improve processing speed, performance, and accuracy
Examples
Embodiment 1
[0091] For the image-to-text retrieval scenario, the hyper-parameters, namely the number of images a, the number of image-related descriptions (text annotations/sentences) b, the image weight ξ, and the text weight μ, are set to a=1, b=3, ξ=μ=1; this strategy obtains a better recall rate. Figure 5 shows the structure of the entire pre-trained model. A detailed description with examples follows:
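As an illustrative aside (not part of the patent text), the per-task hyper-parameter strategies quoted in Embodiments 1-3 can be collected in a small configuration sketch; the dictionary and function names below are hypothetical and only restate the values given in the embodiments.

    # Hypothetical summary of the hyper-parameter strategies stated in Embodiments 1-3:
    # a = number of images, b = number of image-related descriptions (text annotations),
    # xi = image weight, mu = text weight. All names here are illustrative, not from the patent.
    TASK_STRATEGIES = {
        "image_to_text_retrieval": {"a": 1, "b": 3, "xi": 1.0, "mu": 1.0},  # Embodiment 1
        "text_to_image_retrieval": {"a": 2, "b": 3, "xi": 1.0, "mu": 1.0},  # Embodiment 2
        "vqa_classification":      {"a": 1, "b": 2, "xi": 1.0, "mu": 1.0},  # Embodiment 3
    }

    def select_strategy(task: str) -> dict:
        """Return the hyper-parameter setting reported for the given downstream task."""
        return TASK_STRATEGIES[task]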
[0092] S1-S2.1: The image-text pair (Image-Text Pair), as shown in Figure 6 and Figure 7, is subjected to the feature extraction operation and then spliced to obtain the feature sequence Y:
[0093] Y = [V_type + v; T_type + t] = [0.87, 0.15, ..., 0.857]
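A minimal sketch of this splicing step, assuming the image feature v and text feature t have already been extracted and that V_type and T_type are modality-type embeddings added element-wise; the function name, shapes, and toy values are illustrative assumptions, not the patent's exact implementation.

    import numpy as np

    def splice_features(v, t, V_type, T_type):
        """Sketch of S1-S2.1: Y = [V_type + v; T_type + t], concatenating the
        type-augmented image tokens and text tokens along the sequence axis."""
        return np.concatenate([V_type + v, T_type + t], axis=0)

    # Toy usage: 2 image tokens and 2 text tokens, each 4-dimensional.
    rng = np.random.default_rng(0)
    v, t = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
    V_type, T_type = np.zeros((2, 4)), np.ones((2, 4))
    Y = splice_features(v, t, V_type, T_type)   # shape (4, 4): image tokens then text tokens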
[0094] S2.2: The feature sequence Y is input to the Transformer Encoder interaction layer, where the attention value is computed through the attention mechanism; the final feature sequence Y_p is then obtained through the nonlinear activation function tanh().
[0096] Y_p = tanh(Attention(Q, K, V)) = [0.108, 0.732, -0.852, ...]
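The interaction step S2.2 can be sketched with standard scaled dot-product attention followed by tanh(); the single attention head and the projection matrices W_q, W_k, W_v used below are assumptions for illustration rather than the patent's exact Transformer Encoder configuration.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def interaction_layer(Y, W_q, W_k, W_v):
        """Single-head sketch of S2.2: Y_p = tanh(Attention(Q, K, V)) over the spliced sequence Y."""
        Q, K, V = Y @ W_q, Y @ W_k, Y @ W_v
        d_k = K.shape[-1]
        attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # scaled dot-product attention
        return np.tanh(attn)                         # nonlinear activation tanh()

    # Toy usage with random projections on a 4-token, 4-dimensional sequence Y.
    rng = np.random.default_rng(1)
    d = 4
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Y = rng.normal(size=(4, d))
    Y_p = interaction_layer(Y, W_q, W_k, W_v)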
Embodiment 2
[0112] For the text-to-image retrieval scenario, the input strategy a=2, b=3, ξ=μ=1 obtains a better recall rate. A detailed description with examples follows:
[0113] S1-S2.1: The image-text pair (Image-Text Pair), as shown in Figure 6 and Figure 7, is subjected to the feature extraction operation and then spliced to obtain the feature sequence Y:
[0114] Y = [V_type + v; T_type + t] = [0.27, 0.59, ..., 0.437]
[0115] S2.2: The feature sequence Y is input to the Transformer Encoder interaction layer, where the attention value is computed through the attention mechanism; the final feature sequence Y_p is then obtained through the nonlinear activation function tanh().
[0117] Y_p = tanh(Attention(Q, K, V)) = [0.271, -0.842, -0.312, ..., 0.662].
[0118] S3: After the interacted feature sequence of the two modalities is obtained, different downstream tasks can be attached. In this Embodiment 2, as mentioned above, in the application scenario of text search...
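The paragraph above is truncated, so the concrete retrieval head is not shown here; as one hedged possibility, a text-to-image retrieval task could rank candidate images by pooling the interacted sequence Y_p and applying a linear matching scorer. The pooling choice and the matching_score name below are assumptions, not the patent's disclosed head.

    import numpy as np

    def matching_score(Y_p, w, bias=0.0):
        """Hypothetical downstream head for text-to-image retrieval: mean-pool the
        interacted sequence Y_p and map it to a scalar image-text matching score.
        Candidate images would be ranked by this score for a given text query."""
        pooled = Y_p.mean(axis=0)          # pooled representation of the image-text pair
        return float(pooled @ w + bias)    # higher score = better match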
Embodiment 3
[0132] In the image-text multi-modal classification task, for the VQA application scenario, the input strategy a=1, b=2, ξ=μ=1 obtains better accuracy. A detailed description with examples follows:
[0133] S1-S2.1: The image-text pair (Image-Text Pair), as shown in Figure 6 and Figure 7, is subjected to the feature extraction operation and then spliced to obtain the feature sequence Y:
[0134] Y = [V_type + v; T_type + t] = [0.821, -0.159, ..., -0.825]
[0135] S2.2: The feature sequence Y is input to the Transformer Encoder interaction layer, where the attention value is computed through the attention mechanism; the final feature sequence Y_p is then obtained through the nonlinear activation function tanh().
[0137] Y_p = tanh(Attention(Q, K, V)) = [0.172, -0.451, -0.312, ..., -0.662].
[0138] S3: As mentioned above for Embodiment 3, in the VQA application scenario, the strategy a=1, b=2, ξ=μ=1 obtains a better accuracy rate. V...
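For the VQA classification scenario in Embodiment 3, the paragraph is likewise truncated; a commonly used (here assumed) downstream head is a classifier over a fixed answer vocabulary applied to the pooled interacted features, sketched below with hypothetical names.

    import numpy as np

    def vqa_answer_logits(Y_p, W_cls, b_cls):
        """Hypothetical VQA head: mean-pool Y_p and map it to logits over an answer
        vocabulary; the argmax over the logits gives the predicted answer."""
        pooled = Y_p.mean(axis=0)          # pooled image-question representation
        return pooled @ W_cls + b_cls      # shape: (number_of_candidate_answers,)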