The present invention discloses a method for constructing a deep visual question-answering (VQA) system for visually impaired persons.

In the training phase, the method comprises: assembling the collected pictures and the corresponding question-and-answer text into a training set; extracting picture features from the pictures by using a convolutional neural network; converting each question text into a list of word vectors by using a word-embedding technique, and taking the word-vector list as the input of an LSTM so as to extract question features; and finally, taking the element-wise product of the picture features and the question features, classifying the fused result so as to obtain an answer prediction value, comparing the answer prediction value with the answer label in the training set to calculate the loss, and using the back-propagation algorithm to optimize the model.

In the running phase, the method comprises: the client obtains a photo taken by the user together with a question text, and uploads the photo and the question text to the server; the server inputs the uploaded photo and question text into the trained model, extracts the picture and question features in the same manner, outputs the corresponding answer prediction value through the classifier, and returns the answer prediction value to the client; and the client delivers the answer to the user in the form of voice output.
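The fusion-and-classification step described above can be sketched as follows. This is a minimal illustration only: the CNN and LSTM feature extractors are replaced by stub feature vectors, and all dimensions, weights, and the ground-truth label are hypothetical values chosen for the example, not part of the disclosed method.

```python
import math
import random

random.seed(0)

FEATURE_DIM = 8   # illustrative; real systems use e.g. 1024- or 2048-d features
NUM_ANSWERS = 4   # illustrative answer-vocabulary size

def elementwise_product(img_feat, q_feat):
    """Fuse picture and question features by element-wise multiplication."""
    return [i * q for i, q in zip(img_feat, q_feat)]

def softmax(logits):
    """Convert raw classifier scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class LinearClassifier:
    """Minimal linear layer standing in for the answer classifier."""
    def __init__(self, in_dim, out_dim):
        self.w = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def forward(self, x):
        logits = [sum(wi * xi for wi, xi in zip(row, x)) + b
                  for row, b in zip(self.w, self.b)]
        return softmax(logits)

# Stub vectors in place of the CNN picture features and LSTM question features.
img_feat = [random.uniform(-1, 1) for _ in range(FEATURE_DIM)]
q_feat = [random.uniform(-1, 1) for _ in range(FEATURE_DIM)]

fused = elementwise_product(img_feat, q_feat)
probs = LinearClassifier(FEATURE_DIM, NUM_ANSWERS).forward(fused)
answer_idx = max(range(NUM_ANSWERS), key=lambda i: probs[i])

# Training-phase loss: cross-entropy against a hypothetical answer label.
label = 2
loss = -math.log(probs[label])
```

In the running phase, the same forward pass is executed on the server with the trained weights; only the final `answer_idx` (mapped to its answer string) needs to be returned to the client for voice output.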