The invention discloses an open-domain video natural-language description method based on multi-modal feature fusion. In the method, a deep convolutional neural network model is adopted to extract RGB image features and grayscale optical-flow image features; video spatio-temporal (C3D) information and audio information are then added to form a multi-modal feature system. When the C3D features are extracted, the overlap between the consecutive frame blocks fed into the three-dimensional convolutional neural network model is dynamically adjusted, which overcomes the limitation imposed by the size of the training data and makes the method robust to the length of the videos that can be processed; the audio information compensates for what the visual modality misses; and finally the multi-modal features are fused. In the method, a data standardization step scales the feature values of each modality into a common range, resolving the differences in value ranges across modalities, and the dimension of each modal feature is reduced with PCA while 99% of the important information is effectively retained, which avoids the training failures caused by excessively high dimensionality. The method effectively improves the accuracy of the generated open-domain video description sentences and is highly robust to scenes, figures and events.
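The dynamic adjustment of the overlap between consecutive frame blocks can be sketched as follows. The clip length of 16 frames and the fixed number of clips per video are illustrative assumptions (C3D networks conventionally take 16-frame inputs); the invention does not fix these values. The idea is that a fixed clip count is drawn from any video, with the stride (and hence the overlap) computed from the video length:

```python
def clip_starts(num_frames, clip_len=16, num_clips=8):
    """Choose start indices for `num_clips` frame blocks of `clip_len` frames.

    The stride between blocks is derived from the video length, so shorter
    videos yield heavily overlapping blocks and longer videos yield sparse
    ones -- a fixed-size C3D input regardless of video length (assumed scheme).
    """
    if num_frames <= clip_len:
        # Video shorter than one clip: reuse the same block repeatedly.
        return [0] * num_clips
    if num_clips == 1:
        return [0]
    # Stride may be smaller than clip_len, in which case blocks overlap.
    stride = (num_frames - clip_len) / (num_clips - 1)
    return [int(round(i * stride)) for i in range(num_clips)]


# 100-frame video: stride 12, so consecutive 16-frame blocks overlap by 4 frames.
print(clip_starts(100))  # → [0, 12, 24, 36, 48, 60, 72, 84]
```

With this scheme every video, long or short, produces the same number of C3D feature vectors, which is one way to realize the length-robustness claimed above.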
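The data standardization step can be sketched as a per-dimension min-max rescaling, assuming (as the abstract's "within a certain range" suggests) that each modality's feature values are mapped into a common interval before fusion; the target range [0, 1] is an assumption for illustration:

```python
import numpy as np

def minmax_standardize(features, lo=0.0, hi=1.0):
    """Scale each feature dimension of a (samples x dims) matrix into [lo, hi].

    This puts modalities with very different value ranges (e.g. CNN
    activations vs. audio descriptors) on a comparable scale before fusion.
    """
    fmin = features.min(axis=0)
    fmax = features.max(axis=0)
    # Guard against constant dimensions to avoid division by zero.
    span = np.where(fmax > fmin, fmax - fmin, 1.0)
    return lo + (features - fmin) / span * (hi - lo)
```

Applying the same rescaling independently to each modality's feature matrix removes the scale differences that would otherwise let one modality dominate the fused representation.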
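The PCA reduction retaining 99% of the information can be sketched with a plain SVD-based implementation; reading "99% of the important information" as 99% of the explained variance is an interpretation, and the low-rank test data below is synthetic:

```python
import numpy as np

def pca_reduce(X, var_keep=0.99):
    """Project X (samples x dims) onto the fewest principal components
    whose cumulative explained variance reaches `var_keep`.

    Returns the reduced matrix and the number of components kept.
    """
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data give the component variances.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = S**2 / (S**2).sum()
    k = int(np.searchsorted(np.cumsum(var_ratio), var_keep) + 1)
    return Xc @ Vt[:k].T, k


# Synthetic example: 10-dim features that are really rank-3 plus tiny noise,
# so roughly 3 components suffice to keep 99% of the variance.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 10))
X += 1e-6 * rng.standard_normal(X.shape)
reduced, k = pca_reduce(X)
```

Reducing each modality's dimension this way before fusion keeps the concatenated feature vector small enough to train on, which is the training-failure problem the abstract refers to.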