A fast variational on-line learning technique for training a transformed
hidden Markov model. A simplified general model and an associated
estimation algorithm are provided for modeling visual data such as a
video sequence. Specifically, once the model has been initialized, an expectation-maximization (“EM”)
algorithm is used to learn the one or more
object class models, so that the
video sequence has high marginal probability under the model. In the expectation step (the “E-Step”), the
model parameters are assumed to be correct, and for an input image,
probabilistic inference is used to fill in the values of the unobserved or hidden variables, e.g., the
object class and appearance. In one embodiment of the invention, a
Viterbi algorithm and a
latent image are employed for this purpose. In the maximization step (the “M-Step”), the
model parameters are adjusted using the values of the unobserved variables calculated in the previous E-Step. Instead of the
batch processing typically used in EM
algorithms, the
system and method according to the invention employ an on-line
algorithm that passes through the data only once and introduces new classes as new data is observed. Through parameter
estimation and
inference in the model, visual data is segmented into components, which facilitates sophisticated applications in video or
image editing, such as object removal or
insertion, tracking and
visual surveillance,
video browsing, photo organization, video
compositing, and metadata creation.
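
The single-pass idea described above (an E-Step on each incoming observation, an incremental M-Step, and the introduction of a new class when no existing class explains the data) can be illustrated with a minimal sketch. This is not the transformed hidden Markov model itself: it assumes a plain mixture of isotropic Gaussians with a fixed variance and omits transformations, temporal transitions, the Viterbi pass and the latent image. The class name `OnlineMixture`, its parameters `var` and `new_class_threshold`, and the thresholding rule for creating classes are all hypothetical choices made for illustration.

```python
import math


class OnlineMixture:
    """Illustrative single-pass EM-style learner for isotropic Gaussian classes."""

    def __init__(self, var=1.0, new_class_threshold=-10.0):
        self.var = var                        # fixed isotropic variance (assumption)
        self.threshold = new_class_threshold  # log-likelihood below which a class is added
        self.means = []                       # one mean vector per class
        self.counts = []                      # accumulated soft counts per class

    def _log_lik(self, x, mean):
        # Log density of x under an isotropic Gaussian centred at `mean`.
        sq = sum((a - b) ** 2 for a, b in zip(x, mean))
        return -0.5 * (sq / self.var + len(x) * math.log(2 * math.pi * self.var))

    def update(self, x):
        """Process one observation and return the index of its most likely class."""
        logs = [self._log_lik(x, m) for m in self.means]

        # Introduce a new class when no existing class explains the observation.
        if not logs or max(logs) < self.threshold:
            self.means.append(list(x))
            self.counts.append(1.0)
            return len(self.means) - 1

        # E-step: posterior responsibilities, with class priors taken from soft counts.
        total = sum(self.counts)
        scores = [math.log(c / total) + l for c, l in zip(self.counts, logs)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # subtract max for stability
        norm = sum(exps)
        resp = [e / norm for e in exps]

        # M-step: incremental (running) update of each mean, weighted by responsibility.
        for k, r in enumerate(resp):
            self.counts[k] += r
            step = r / self.counts[k]
            self.means[k] = [m + step * (a - m) for m, a in zip(self.means[k], x)]

        return max(range(len(resp)), key=resp.__getitem__)
```

Calling `update` once per observation, in arrival order, is the whole training loop: each data point is seen only once and no batch of past data is stored. With well-separated groups of points, the sketch would be expected to create one class per group.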