Text this: Adopting multiple vision transformer layers for fine-grained image representation