Vision Transformers Need Registers — Fixing a Bug in DINOv2?
In this post we will discuss registers for vision transformers, a concept introduced in a research paper by Meta AI titled “Vision Transformers Need Registers”. The paper was written by authors who were also part of the DINOv2 release, a successful foundational computer vision model by Meta AI which we covered before in the following post — https://aipapersacademy.com/dinov2-from-meta-ai-finally-a-foundational-model-in-computer-vision/
Our agenda for this post will be:
Background: Visual Features — Essential background for understanding what this paper is about.
The Problem: Attention Map Artifacts — A phenomenon discovered and analyzed in this paper, found to be common in large foundational computer vision models such as DINOv2.
The Fix: Registers — The proposed solution to remove the attention map artifacts.
Results & Conclusion — We’ll review interesting results from the paper to understand whether the registers approach fixes DINOv2 and other models, or should only be used in certain cases.
If you prefer a video format then check out the following video:
Background — Visual Features
Let’s start with some short but essential background on visual features, which we also covered in the DINOv2 post and which we need here as well.
Why Do We Need Visual Features?
Say we have multiple tasks we want to solve. For example, given an image of a cat such as the one on the left in the example below, say we want a segmentation of that image, meaning categorizing related parts of the image, such as the segmented cat image on the upper right of the example below. Say we also want a depth estimation, where we estimate the…
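The core idea here is that one shared set of visual features can serve multiple downstream tasks. Below is a minimal NumPy sketch of that setup: a stand-in for a frozen backbone produces per-patch features, and two separate lightweight heads (segmentation and depth) consume the same features. All function names, dimensions, and the random "features" are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(image, dim=64):
    """Toy stand-in for a frozen backbone (e.g. a ViT):
    maps each 16x16 patch of the image to a feature vector.
    Illustrative only -- real features come from a trained model."""
    h, w, _ = image.shape
    num_patches = (h // 16) * (w // 16)
    return rng.standard_normal((num_patches, dim))

def segmentation_head(features, num_classes=10):
    """Linear head: per-patch class scores from the shared features."""
    w = rng.standard_normal((features.shape[1], num_classes))
    return features @ w  # shape: (num_patches, num_classes)

def depth_head(features):
    """Linear head: one depth value per patch from the same features."""
    w = rng.standard_normal((features.shape[1], 1))
    return features @ w  # shape: (num_patches, 1)

image = rng.random((224, 224, 3))
feats = extract_features(image)   # shared visual features
seg = segmentation_head(feats)    # task 1: segmentation
depth = depth_head(feats)         # task 2: depth estimation
print(feats.shape, seg.shape, depth.shape)
```

The point of the sketch is the structure: the expensive feature extraction happens once, and each task only needs a small head on top, which is exactly why the quality of the backbone's features matters so much.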