The invention discloses a video description method that combines multi-modal features with a multi-layer attention mechanism. First, the words appearing in the description sentences are counted to form a vocabulary, and each word is numbered to facilitate vector representation. Three kinds of feature data are then extracted: semantic attribute features, image information features extracted by a 2D-CNN, and video motion information features extracted by a 3D-CNN. The multi-modal data are dynamically fused through the multi-layer attention mechanism to obtain visual information, and the use of that visual information is adjusted according to the current context. Finally, the words of the video description are generated according to the current context and the visual information. Because the multi-modal features of the video are fused through the multi-layer attention mechanism before the semantic description is generated from them, the method effectively improves the accuracy of the video description.
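
The vocabulary step and the attention-based fusion step can be illustrated with a minimal sketch in plain Python. It is not the claimed implementation. Several details are assumptions made only for illustration: the function names (`build_vocab`, `attend`, `fuse_modalities`, `gate_visual`), the special tokens, dot-product attention scoring, a shared feature dimension across the three modalities, and a scalar sigmoid gate standing in for "adjusting the use of visual information according to the current context". The feature vectors are toy values, not real CNN outputs.

```python
import math
from collections import Counter


def build_vocab(sentences, specials=("<pad>", "<bos>", "<eos>", "<unk>")):
    """Count the words in the description sentences and number each one."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Most frequent first; ties broken alphabetically so numbering is stable.
    words = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(list(specials) + words)}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, vectors):
    """Dot-product attention (assumed scoring): weighted average of vectors."""
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def fuse_modalities(context, modalities):
    """Two attention layers: within each modality, then across modalities."""
    names = list(modalities)
    # Layer 1: summarise each modality's feature list with respect to the context.
    summaries = [attend(context, modalities[n])[0] for n in names]
    # Layer 2: weigh the per-modality summaries against each other.
    visual, weights = attend(context, summaries)
    return visual, dict(zip(names, weights))


def gate_visual(context, visual):
    """Assumed scalar sigmoid gate controlling how much visual info is used."""
    g = 1.0 / (1.0 + math.exp(-dot(context, visual)))
    return [g * v for v in visual], g


# Toy usage: two captions and 2-dimensional stand-in features.
vocab = build_vocab(["a man is cooking", "a man is running"])
context = [0.5, -0.2]  # stand-in for the decoder's current hidden state
modalities = {
    "semantic": [[1.0, 0.0]],
    "image_2d": [[0.2, 0.8], [0.6, 0.1]],
    "motion_3d": [[0.0, 1.0]],
}
visual, modality_weights = fuse_modalities(context, modalities)
gated_visual, gate = gate_visual(context, visual)
```

In a trained system the attention scores and the gate would come from learned parameters, and the gated visual vector together with the context would feed a decoder that predicts the next word index from `vocab`. The sketch only shows how the data flows through the steps.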