[D] VIT16 – Should I use all or only final attention MHA to generate attention heatmap?

Hello,

I’m currently extracting attention heatmaps from pretrained ViT-16 models (which I then fine-tune) to see which regions of the image the model used to make its prediction.

Many research papers and other sources suggest that I should only extract attention scores from the final layer, but in my experiments so far, averaging the MHA scores across all layers actually gave a “better” heatmap than the final layer alone (image attached).

Additionally, I am a bit confused as to why there is consistently attention on the image padding (the black border).

The two methods give very different results, and I’m not sure which attention heatmap I should trust.
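For concreteness, here is a minimal sketch of the two aggregation strategies being compared (final layer only vs. mean over all layers). It assumes per-layer attention tensors of shape `(heads, tokens, tokens)` with the CLS token at index 0 and a 14×14 patch grid (ViT-B/16 at 224×224); the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def cls_heatmap(attn_layers, grid=14, use="final"):
    """Build a CLS-to-patch heatmap from per-layer attention maps.

    attn_layers: list of arrays, each (heads, N+1, N+1), CLS at index 0.
    use: "final" -> last layer only; "mean" -> average over all layers.
    """
    stacked = np.stack(attn_layers)       # (layers, heads, N+1, N+1)
    per_layer = stacked.mean(axis=1)      # average over heads -> (layers, N+1, N+1)
    attn = per_layer[-1] if use == "final" else per_layer.mean(axis=0)
    cls_to_patches = attn[0, 1:]          # CLS row, drop the CLS->CLS entry
    return cls_to_patches.reshape(grid, grid)

# Demo with random attention, rows normalized like real softmax output.
rng = np.random.default_rng(0)
layers = []
for _ in range(12):                       # ViT-B/16 has 12 layers
    a = rng.random((12, 197, 197))        # 12 heads, 196 patches + CLS
    layers.append(a / a.sum(axis=-1, keepdims=True))

final_map = cls_heatmap(layers, use="final")
mean_map = cls_heatmap(layers, use="mean")
```

Note that a plain mean over layers ignores how attention composes through the network; one commonly cited alternative that does use all layers is attention rollout (Abnar & Zuidema, 2020), which multiplies the per-layer attention matrices (with residual correction) instead of averaging them.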

https://preview.redd.it/p0ok6ltkdoig1.png?width=1385&format=png&auto=webp&s=3bcd9bdb01912d085a85ee452b36c115891a76be

submitted by /u/PositiveInformal9512
