Modeling sub-event dynamics in first-person action recognition