Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or
cropping.
These two methods incur significant losses in the amount of information and context present in an image.
There are many downstream applications in which global context matters as much as high-frequency details,
such as real-world satellite imagery; in such cases researchers have to make the uncomfortable choice
of which information to discard.
We introduce
Images have grown steadily larger over the past decade. For example, consider a video feed of a football game captured natively in 8K resolution. We would like to understand where the player in the middle of the screen is passing the ball. However, today's leading models cannot reason over the entire image in one pass.
Modern computer vision pipelines are limited by the memory of the systems they are trained on, resulting in models that only operate on small images. Computer vision practitioners limit the size of images in two less-than-ideal ways: down-sampling or cropping. While these simple operations produce powerful models when measured against typical computer vision benchmarks, the loss of high-frequency information or global context is limiting for many real-world tasks.
First, images are tokenized hierarchically (Nested Tokenization), and each region is independently featurized by a region encoder with a limited context window (Independent Region Encoding). Then, a lightweight context encoder incorporates context globally across this sequence of features (Context-Aware Encoding), and the result is passed to the task-specific decoders.
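The three stages above can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: the function names (`split_grid`, `nested_tokenize`, `encode_region`, `encode_context`) are hypothetical, and both encoders are simple averaging placeholders standing in for learned networks. The point is only the data flow: image to regions, regions to patches, per-region features computed in isolation, then a global pass over the sequence of region features.

```python
from typing import List

Grid = List[List[float]]  # a single-channel image as rows of pixel values


def split_grid(grid: Grid, size: int) -> List[Grid]:
    """Split a 2D grid into non-overlapping size x size tiles, in row-major order."""
    height, width = len(grid), len(grid[0])
    if height % size or width % size:
        raise ValueError("grid dimensions must be divisible by the tile size")
    return [
        [row[left:left + size] for row in grid[top:top + size]]
        for top in range(0, height, size)
        for left in range(0, width, size)
    ]


def nested_tokenize(image: Grid, region_size: int, patch_size: int) -> List[List[Grid]]:
    """Level 1: image -> regions. Level 2: each region -> patches."""
    return [split_grid(region, patch_size) for region in split_grid(image, region_size)]


def encode_region(patches: List[Grid]) -> List[float]:
    """Placeholder region encoder (assumption): one mean value per patch.

    It only ever sees the patches of a single region, mimicking a
    limited context window.
    """
    return [sum(map(sum, p)) / (len(p) * len(p[0])) for p in patches]


def encode_context(region_features: List[List[float]]) -> List[List[float]]:
    """Placeholder context encoder (assumption): append the global mean
    feature to every region's features so each one carries image-wide
    information."""
    flat = [v for feats in region_features for v in feats]
    global_mean = sum(flat) / len(flat)
    return [feats + [global_mean] for feats in region_features]


# An 8x8 toy image whose pixel value is its flat index.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]

regions = nested_tokenize(image, region_size=4, patch_size=2)
features = [encode_region(patches) for patches in regions]  # independent per region
contextual = encode_context(features)  # would be handed to a task-specific decoder
```

Because each region is encoded on its own, the cost of the region encoder depends on the region size rather than the full image size; only the much shorter sequence of region features is processed globally.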
The use of
This is best visualized through Figure 3, which demonstrates the effective receptive field of Swin-B and
Swin-B <
Critically, as inputs get larger, backbones such as Swin scale memory usage quadratically, whereas
@article{xTLargeImageModeling,
title={xT: Nested Tokenization for Larger Context in Large Images},
author={Gupta, Ritwik and Li, Shufan and Zhu, Tyler and Malik, Jitendra and Darrell, Trevor and Mangalam, Karttikeya},
journal={arXiv preprint arXiv:2403.01915},
year={2024}
}