Associating multiple vision transformer layers for fine-grained image representation

Accurate discriminative region proposal plays an important role in fine-grained image recognition. The vision transformer (ViT) has achieved striking results in computer vision thanks to its innate multi-head self-attention mechanism. However, the attention maps grow increasingly similar after a certain number of layers, and because ViT relies on a single classification token for classification, it cannot effectively select discriminative image patches for fine-grained image classification. To detect discriminative regions accurately, we propose a novel network, AMTrans, which efficiently adds layers to learn diverse features and integrates the raw attention maps to capture more salient features. Specifically, we employ DeepViT as the backbone to address the attention-collapse issue. We then fuse the attention weights of all heads within each layer to produce a per-layer attention weight map. After that, we apply recurrent residual refinement blocks to enhance salient features, and finally use a semantic grouping method to propose the discriminative feature region. Extensive experiments show that AMTrans achieves state-of-the-art performance under the same settings on four widely used fine-grained datasets: Stanford-Cars, Stanford-Dogs, CUB-200-2011, and ImageNet.
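The fusion step named in the abstract — combining the attention weights of every head within a layer into a single attention weight map — can be pictured with a short sketch. The code below is an assumed illustration only, not the authors' method: the paper defines the exact fusion rule, whereas here the heads are simply averaged and the rows re-normalised. The function name fuse_head_attention and all array shapes are hypothetical.

```python
import numpy as np

def fuse_head_attention(attn):
    """Fuse per-head attention weights into one map per layer.

    attn: array of shape (num_layers, num_heads, num_tokens, num_tokens),
          the raw self-attention weights of a ViT-style encoder.
    Returns an array of shape (num_layers, num_tokens, num_tokens).
    """
    # Average across heads within each layer (one simple fusion choice).
    fused = attn.mean(axis=1)
    # Re-normalise rows so each fused map is again a distribution over tokens.
    return fused / fused.sum(axis=-1, keepdims=True)

# Toy usage: 12 layers, 12 heads, 197 tokens (1 class token + 196 patches).
rng = np.random.default_rng(0)
raw = rng.random((12, 12, 197, 197))
raw = raw / raw.sum(axis=-1, keepdims=True)  # rows sum to 1, like softmax output
maps = fuse_head_attention(raw)
print(maps.shape)  # (12, 197, 197)
```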

Bibliographic Details

Main Authors: Sun, Fayou; Ngo, Hea Choon; Sek, Yong Wee; Zuqiang, Meng
Format: Article
Language: English
Published: KeAi Communications Co., 2023
Published in: AI OPEN, 4, pp. 130-136. ISSN 2666-6510
DOI: 10.1016/j.aiopen.2023.09.001
Content Source: UTEM Institutional Repository, Universiti Teknikal Malaysia Melaka
Online Access: http://eprints.utem.edu.my/id/eprint/28202/2/013022106202410613870.pdf
http://eprints.utem.edu.my/id/eprint/28202/
https://doi.org/10.1016/j.aiopen.2023.09.001