Associating multiple vision transformer layers for fine-grained image representation