Adversarial double-contrast self-supervised learning method for cross-modal lip reading

A self-supervised and adversarial learning technology, applied in the field of image processing, which addresses two problems of prior contrastive approaches: their dependence on the validity of manually selected negative samples, and their neglect of additional self-supervised signals.

Active Publication Date: 2021-08-10
NAT UNIV OF DEFENSE TECH
AI Technical Summary

Problems solved by technology

Previous studies in this area have used pairwise contrastive strategies to pull visual embeddings closer to their corresponding audio embeddings and push them away from non-corresponding audio embeddings. First, such pairwise contrastive learning requires manual selection of negative samples, and its effect largely depends on the effectiveness of those negative samples. Second, representation learning in prior work relies only on synchronized audio-video data pairs; other self-supervised signals, such as speaker-related information and modality information, could also be used to improve the quality of the learned representations, but these signals are usually ignored in previous work.

Method used



Embodiment Construction

[0040] As shown in figure 1, given a video of a talking mouth and the corresponding audio, a visual encoder and an audio encoder are first introduced to extract the audio-visual (A-V) embeddings. To keep the A-V embeddings temporally consistent, both the audio encoder network and the visual encoder network ingest clips of the same duration, typically 0.2 seconds. Specifically, the input to the audio encoder is 13-dimensional Mel-frequency cepstral coefficients (MFCCs), extracted every 10 ms with a frame length of 25 ms. The input to the visual encoder is 5 consecutive mouth-centered cropped video frames (fps = 25).
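The windowing described above implies that one 0.2-second clip yields 20 MFCC vectors for the audio branch and 5 frames for the visual branch. A minimal sketch of that arithmetic (illustrative helper names, not the patent's code):

```python
# Sketch of the A-V input windowing: a 0.2 s clip gives 20 MFCC frames
# (10 ms hop) and 5 video frames (25 fps), so both encoders see the
# same span of time. Helper names are hypothetical.

def audio_frames(clip_s=0.2, hop_ms=10):
    """Number of MFCC vectors in a clip, given the frame hop."""
    return int(clip_s * 1000 / hop_ms)

def video_frames(clip_s=0.2, fps=25):
    """Number of video frames in a clip."""
    return int(clip_s * fps)

audio_shape = (audio_frames(), 13)  # 20 x 13 MFCC matrix per clip
n_video = video_frames()            # 5 mouth-centered crops per clip
```

Note that this gives a fixed 4:1 ratio of audio frames to video frames, which is what lets the two embeddings be compared clip-for-clip.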

[0041] To learn effective visual representations for lip reading, three pretext tasks are introduced. The double contrastive learning objectives aim to make the visual embeddings more closely resemble the corresponding audio embeddings on both short and long time scales. The adversarial learning objectives make the learned embeddings independent of modality info...
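The noise contrastive estimation objective mentioned in the abstract can be sketched as an InfoNCE-style loss over a batch of synchronized clips: each visual embedding should score highest against its own audio embedding and low against the other (noise) samples. This is an illustrative reconstruction under assumed dot-product similarity, not the patent's implementation:

```python
import numpy as np

def info_nce(visual, audio, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    visual, audio: (N, D) embedding matrices where row i of `visual`
    is synchronized with row i of `audio`. The matched pairs sit on
    the diagonal of the similarity matrix; all other entries act as
    negative (noise) samples.
    """
    logits = visual @ audio.T / temperature          # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # NLL of matched pairs
```

In the double-contrast setting described here, such a loss would be applied twice: once on the short clip-level embeddings and once on longer, temporally pooled embeddings.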



Abstract

The invention provides an adversarial double-contrast self-supervised learning method for cross-modal lip reading. The method comprises a visual encoder, an audio encoder, two multi-scale temporal convolution networks with average pooling, an identity discriminator, and a modality classifier. It learns an effective visual representation by combining double contrastive learning based on audio-visual synchronization, identity adversarial training, and modality adversarial training. In double contrastive learning, noise contrastive estimation is used as the training objective to distinguish real samples from noise samples. In adversarial training, an identity discriminator and a modality classifier are applied to the audio-visual representations: the identity discriminator judges whether input visual features share a common identity, the modality classifier predicts whether input features belong to the visual or the audio modality, and a momentum gradient reversal layer is then used to realize the adversarial training.

Description

Technical Field

[0001] The invention belongs to the field of image processing, and in particular relates to an adversarial double-contrast self-supervised learning method for cross-modal lip reading.

Background

[0002] Supervised deep learning has made revolutionary progress in many fields, such as image classification, object detection and segmentation, speech recognition, and machine translation. Although supervised learning has made remarkable progress in the past few years, its success largely relies on large amounts of human-annotated training data. However, for some specific tasks, such as lip reading, the cost of annotation can be prohibitively expensive. In recent years, self-supervised learning has attracted increasing attention due to its high labeling efficiency and good generalization ability. Self-supervised learning methods have shown great potential in natural language processing, computer vision, and cross-modal representation learning.

[0003]...

Claims


Application Information

IPC(8): G06K9/00; G06K9/62; G06N3/08; G10L15/06; G10L15/16; G10L15/25
CPC: G06N3/084; G10L15/063; G10L15/16; G10L15/25; G06V40/20; G06F18/22; G06F18/214; G06F18/24
Inventor 张雪毅, 刘丽, 常冲, 刘忠, 龙云利
Owner NAT UNIV OF DEFENSE TECH