Text this: Learning a deeply supervised multi-modal RGB-D embedding for semantic scene and object category recognition