The invention discloses a multimodal voice recognition method for complex scenes. The method comprises the following steps: when a change is detected in a collected lip image of a user, synchronously collecting an audio signal, a lip image signal and a facial electromyography (EMG) signal corresponding to the voice input; determining multi-source data features of the signals in the spatial domain and the time domain; encoding and modeling the multi-source data features with a speech recognition model to obtain the information common to the content expressed by the different modalities, thereby obtaining multimodal speech information; and synthesizing text with a language model. The invention further discloses a complex-scene voice recognition device based on
multiple modalities. The device comprises a data acquisition module, a
feature extraction module, an encoding and decoding module, a text synthesis module and an interaction module. The invention achieves efficient, accurate and robust voice recognition in complex scenes involving vocal cord damage, high noise, highly enclosed environments, high privacy requirements and the like, and provides a more reliable voice interaction technology and system for complex human-machine interaction scenes.
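
The acquisition step, in which a detected change in the lip image triggers synchronous collection of the three signals, can be illustrated with a minimal sketch. It is not the disclosed implementation. The names MultimodalSegment, lip_changed and collect_on_lip_change are hypothetical, and so are the frame-differencing test, the threshold value, and the assumption that the three streams have already been resampled to one sample per time step.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Hypothetical representation: a flattened grayscale lip-region image, values in [0, 1].
Frame = Sequence[float]


@dataclass
class MultimodalSegment:
    """Time-aligned samples from the three modalities, one entry per time step."""
    audio: List[float]
    lip_frames: List[Frame]
    emg: List[float]


def lip_changed(prev: Frame, curr: Frame, threshold: float = 0.05) -> bool:
    """Return True if the mean absolute pixel difference exceeds the threshold.

    Frame differencing and the default threshold are illustrative assumptions.
    """
    if len(prev) != len(curr) or not curr:
        raise ValueError("frames must be non-empty and of equal size")
    diff = sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
    return diff > threshold


def collect_on_lip_change(
    audio: Sequence[float],
    lip_frames: Sequence[Frame],
    emg: Sequence[float],
    threshold: float = 0.05,
) -> Optional[MultimodalSegment]:
    """Start synchronous collection at the first time step where the lips move.

    Assumes all three streams share one time base. Returns None when no lip
    movement is detected, so nothing is collected.
    """
    if not (len(audio) == len(lip_frames) == len(emg)):
        raise ValueError("streams must have the same number of time steps")
    for t in range(1, len(lip_frames)):
        if lip_changed(lip_frames[t - 1], lip_frames[t], threshold):
            # Slice every stream from the same index so they stay aligned.
            return MultimodalSegment(
                audio=list(audio[t:]),
                lip_frames=list(lip_frames[t:]),
                emg=list(emg[t:]),
            )
    return None


# Example: the lips are still for two frames, then move at the third.
frames = [[0.0, 0.0], [0.0, 0.0], [0.5, 0.5], [0.6, 0.4]]
segment = collect_on_lip_change([0.1, 0.2, 0.3, 0.4], frames, [1.0, 1.1, 1.2, 1.3])
```

In this sketch the segment begins at the first frame that differs from its predecessor, and all three streams are cut at the same index so that later feature extraction sees aligned data. Feature extraction, encoding and text synthesis are outside its scope.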