The invention relates to an
image description generation method based on a deep LSTM network, comprising the following steps: (1) extracting the CNN features of each image in an
image description dataset, and obtaining embedding vectors for the words of the reference
sentence describing that image; (2) building a two-layer LSTM network, and performing sequence modeling with the two-layer LSTM network and a CNN to generate a multimodal LSTM model; (3) training the multimodal LSTM model by joint training; (4) gradually increasing the number of
layers of the LSTM network in the multimodal LSTM model, training the model each time a layer is added, and finally obtaining a gradual multi-objective optimization and multilayer probability fused
image description model; and (5) fusing the probability scores output by the branches of the multilayer LSTM network in the gradual multi-objective optimization and multilayer probability fused image
description model, and outputting the word with the maximum fused probability by joint decision. Compared with the prior art, the method has the advantages of a multilayer network,
improved expressive ability, effective updating, and high accuracy.
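
As an illustration of step (5), the following is a minimal sketch of how the probability scores from several LSTM branches might be fused before choosing a word. The abstract does not state the fusion rule, so a weighted arithmetic mean is assumed here; the function names (`softmax`, `fuse_branch_probabilities`, `predict_word`), the toy vocabulary, and the logit values are all hypothetical and are not taken from the invention.

```python
import math


def softmax(logits):
    """Convert raw branch scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branch_probabilities(branch_probs, weights=None):
    """Fuse per-branch distributions with a weighted mean (an assumed rule)."""
    if not branch_probs:
        raise ValueError("at least one branch is required")
    n = len(branch_probs)
    if weights is None:
        weights = [1.0 / n] * n  # equal weight per branch by default
    vocab_size = len(branch_probs[0])
    fused = [0.0] * vocab_size
    for w, probs in zip(weights, branch_probs):
        if len(probs) != vocab_size:
            raise ValueError("all branches must share one vocabulary")
        for i, p in enumerate(probs):
            fused[i] += w * p
    return fused


def predict_word(branch_logits, vocab, weights=None):
    """Return the word with the highest fused probability, plus the fused scores."""
    branch_probs = [softmax(logits) for logits in branch_logits]
    fused = fuse_branch_probabilities(branch_probs, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return vocab[best], fused


# Toy example: three branches (one per LSTM layer) scoring a 3-word vocabulary.
vocab = ["a", "dog", "runs"]
branch_logits = [
    [2.0, 1.0, 0.0],  # shallowest branch prefers "a"
    [0.0, 3.0, 0.0],  # middle branch prefers "dog"
    [0.0, 2.5, 1.0],  # deepest branch prefers "dog"
]
word, fused = predict_word(branch_logits, vocab)
print(word)  # dog
```

With equal weights, a single branch that disagrees is outvoted by the other two, which is one plausible reading of deciding the output word jointly. Passing a `weights` list would let deeper branches count for more, if that were desired.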