Adversarial double-contrast self-supervised learning method for cross-modal lip reading

A self-supervised and adversarial learning technology, applied in the field of image processing, which addresses two problems of prior contrastive approaches: their dependence on the validity of manually selected negative samples, and their neglect of additional self-supervised signals.

Active Publication Date: 2021-08-10
NAT UNIV OF DEFENSE TECH
AI Technical Summary

Problems solved by technology

Previous studies in this area have used pairwise contrastive strategies to pull visual embeddings closer to their corresponding audio embeddings and push them away from non-corresponding audio embeddings. First, such pairwise contrastive learning requires manual selection of negative samples, and its effect largely depends on the effectiveness of those negative samples. Second, representation learning in prior work relies only on synchronized audio-video data pairs; other self-supervised signals, such as speaker-related information and modality information, could also be used to improve the quality of the learned representations, but these signals are usually ignored in previous work.

Method used



Embodiment Construction

[0040] As shown in figure 1, given a video of a talking mouth and the corresponding audio, a visual encoder and an audio encoder are first introduced to extract the audio-visual (A-V) embeddings. To keep the A-V embeddings temporally consistent, both the audio encoder network and the visual encoder network ingest clips of the same duration, typically 0.2 seconds. Specifically, the input to the audio encoder is 13-dimensional Mel-frequency cepstral coefficients (MFCCs), extracted every 10 ms with a frame length of 25 ms. The input to the visual encoder is 5 consecutive mouth-centered cropped video frames (fps = 25).
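The windowing described above implies that one 0.2-second clip yields 20 MFCC vectors for the audio branch and 5 frames for the visual branch. A minimal sketch of that arithmetic (illustrative helper names, not the patent's code):

```python
# Sketch of the A-V input windowing: a 0.2 s clip gives 20 MFCC frames
# (10 ms hop) and 5 video frames (25 fps), so both encoders see the
# same span of time. Helper names are hypothetical.

def audio_frames(clip_s=0.2, hop_ms=10):
    """Number of MFCC vectors in a clip, given the frame hop."""
    return int(clip_s * 1000 / hop_ms)

def video_frames(clip_s=0.2, fps=25):
    """Number of video frames in a clip."""
    return int(clip_s * fps)

audio_shape = (audio_frames(), 13)  # 20 x 13 MFCC matrix per clip
n_video = video_frames()            # 5 mouth-centered crops per clip
```

Note that this gives a fixed 4:1 ratio of audio frames to video frames, which is what lets the two embeddings be compared clip-for-clip.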

[0041] To learn effective visual representations for lip reading, three pretext tasks are introduced. The double contrastive learning objectives aim to make the visual embeddings more closely resemble the corresponding audio embeddings on both short and long time scales. The adversarial learning objectives make the learned embeddings independent of modality info...
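The noise contrastive estimation objective mentioned in the abstract can be sketched as an InfoNCE-style loss over a batch of synchronized clips: each visual embedding should score highest against its own audio embedding and low against the other (noise) samples. This is an illustrative reconstruction under assumed dot-product similarity, not the patent's implementation:

```python
import numpy as np

def info_nce(visual, audio, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    visual, audio: (N, D) embedding matrices where row i of `visual`
    is synchronized with row i of `audio`. The matched pairs sit on
    the diagonal of the similarity matrix; all other entries act as
    negative (noise) samples.
    """
    logits = visual @ audio.T / temperature          # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # NLL of matched pairs
```

In the double-contrast setting described here, such a loss would be applied twice: once on the short clip-level embeddings and once on longer, temporally pooled embeddings.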



Abstract

The invention provides an adversarial double-contrast self-supervised learning method for cross-modal lip reading. The method comprises a visual encoder, an audio encoder, two multi-scale temporal convolution networks with average pooling, an identity discriminator, and a modality classifier. It learns an effective visual representation by combining double contrastive learning based on audio-visual synchronization, identity adversarial training, and modality adversarial training. In double contrastive learning, noise contrastive estimation is used as the training objective to distinguish real samples from noise samples. In adversarial training, an identity discriminator and a modality classifier are applied to the audio-visual representations: the identity discriminator judges whether input visual features share a common identity, the modality classifier predicts whether input features belong to the visual or the audio modality, and a momentum gradient reversal layer is then used to realize the adversarial training.

Description

Technical Field

[0001] The invention belongs to the field of image processing, and in particular relates to an adversarial double-contrast self-supervised learning method for cross-modal lip reading.

Background

[0002] Supervised deep learning has made revolutionary progress in many fields, such as image classification, object detection and segmentation, speech recognition, and machine translation. Although supervised learning has made remarkable progress in the past few years, its success largely relies on large amounts of human-annotated training data. However, for some specific tasks, such as lip reading, the cost of annotation can be prohibitively expensive. In recent years, self-supervised learning has attracted increasing attention due to its high labeling efficiency and good generalization ability. Self-supervised learning methods have shown great potential in natural language processing, computer vision, and cross-modal representation learning.

[0003]...

Claims


Application Information

IPC(8): G06K9/00; G06K9/62; G06N3/08; G10L15/06; G10L15/16; G10L15/25
CPC: G06N3/084; G10L15/063; G10L15/16; G10L15/25; G06V40/20; G06F18/22; G06F18/214; G06F18/24
Inventor 张雪毅, 刘丽, 常冲, 刘忠, 龙云利
Owner NAT UNIV OF DEFENSE TECH