Leveraging transfer learning for spatio-temporal human activity recognition from video sequences