According to one aspect, action prediction may be implemented via a spatio-temporal feature pyramid graph convolutional network (ST-FP-GCN) including a first pyramid layer, a second pyramid layer, a third pyramid layer, etc. The first pyramid layer may include a first graph convolution network (GCN), a fusion gate, and a first long-short-term-memory (LSTM) gate. The second pyramid layer may include a first convolution operator, a first summation operator, a first mask pool operator, a second GCN, a first upsampling operator, and a second LSTM gate. An output summation operator may sum a first LSTM output and a second LSTM output to generate an output indicative of an action prediction for an inputted image sequence and an inputted pose sequence.