The invention provides a method for extracting
semantic information from video images, and relates to the technical field of video description and
annotation. First, a frame sequence is extracted from a video at a given inter-frame interval, and a
feature vector of each frame image is extracted by a
convolutional neural network. The feature vectors are used as the input of an LSTM network
encoder. The output of each
time step of the LSTM network
encoder and the output of the previous
time step of an LSTM network decoder are used as the input of an
external storage EMM, and the contents of the storage matrix in the
external storage EMM are updated. The
external storage EMM outputs two read vectors, which are used as the input vectors for decoding and encoding at the subsequent
time step, respectively. Reading from and writing to the external storage EMM are dynamically controlled by the two LSTM networks, so that the
feature vector of each frame image of the video is stored during the encoding phase. During the decoding phase, the predicted words are fed back to adjust the output of the external storage EMM at the subsequent time step, so that when an
annotation of the video is generated, the context feature vectors are adjusted according to the generated word sequence.
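
By way of a non-limiting illustration, the data flow described above may be sketched in NumPy as follows. The sketch is not the claimed implementation and rests on several assumptions that do not come from the text: a random linear projection stands in for the convolutional neural network, the weights are untrained, and the external storage EMM is modelled as a slot matrix with cosine-similarity addressing. All names (`sample_frames`, `ExternalMemory`, `LSTMCell`, `describe`) and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frames(video, stride):
    """Keep every `stride`-th frame (the inter-frame interval)."""
    return video[::stride]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ExternalMemory:
    """Assumed stand-in for the EMM: a slot matrix with content-based addressing."""
    def __init__(self, slots, width):
        self.M = rng.normal(scale=0.1, size=(slots, width))

    def _address(self, key):
        # Cosine similarity between the key and every memory row.
        norms = np.linalg.norm(self.M, axis=1) * np.linalg.norm(key) + 1e-8
        return softmax(self.M @ key / norms)

    def step(self, enc_out, dec_prev):
        # Write: blend a key built from both controller outputs into the memory.
        key = np.tanh(enc_out + dec_prev)
        w = self._address(key)
        self.M = (1 - w[:, None]) * self.M + w[:, None] * key
        # Read: one vector for the encoder, one for the decoder.
        return self._address(enc_out) @ self.M, self._address(dec_prev) @ self.M

class LSTMCell:
    """Plain LSTM cell with randomly initialised (untrained) weights."""
    def __init__(self, in_dim, hid):
        self.W = rng.normal(scale=0.1, size=(4 * hid, in_dim + hid))
        self.b = np.zeros(4 * hid)

    def __call__(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        sig = lambda v: 1.0 / (1.0 + np.exp(-v))
        c = sig(f) * c + sig(i) * np.tanh(g)
        return sig(o) * np.tanh(c), c

def describe(video, stride=5, feat_dim=16, hid=16, slots=8, vocab=10, max_words=4):
    frames = sample_frames(video, stride)
    # Assumption: a random projection replaces the CNN feature extractor.
    proj = rng.normal(scale=0.1, size=(feat_dim, frames[0].size))
    feats = [proj @ f.ravel() for f in frames]

    enc = LSTMCell(feat_dim + hid, hid)
    dec = LSTMCell(vocab + hid, hid)
    out = rng.normal(scale=0.1, size=(vocab, hid))
    mem = ExternalMemory(slots, hid)

    h_e, c_e, h_d, c_d = (np.zeros(hid) for _ in range(4))
    r_enc = r_dec = np.zeros(hid)

    # Encoding phase: each frame feature is written into the memory.
    for x in feats:
        h_e, c_e = enc(np.concatenate([x, r_enc]), h_e, c_e)
        r_enc, r_dec = mem.step(h_e, h_d)

    # Decoding phase: each predicted word is fed back and changes the next read.
    words, prev = [], np.zeros(vocab)
    for _ in range(max_words):
        h_d, c_d = dec(np.concatenate([prev, r_dec]), h_d, c_d)
        word = int(np.argmax(out @ h_d))
        words.append(word)
        prev = np.eye(vocab)[word]
        r_enc, r_dec = mem.step(h_e, h_d)
    return words, len(frames)

video = rng.random((20, 8, 8))      # 20 synthetic 8x8 grayscale frames
word_ids, n_frames = describe(video)
```

Because the weights are untrained, the returned word indices carry no meaning; the sketch only shows how the encoder output, the previous decoder output and the two read vectors are passed between the components at each time step.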