Nested Tokenization for Larger Context in Large Images

1Berkeley AI Research, UC Berkeley, 2UCLA, 3Princeton University


Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. There are many downstream applications in which global context matters as much as high frequency details, such as in real-world satellite imagery; in such cases researchers have to make the uncomfortable choice of which information to discard. We introduce \(x\)T, a simple framework for vision transformers which effectively aggregates global context with local details and can model large images end-to-end on contemporary GPUs. We select a set of benchmark datasets across classic vision tasks which accurately reflect a vision model's ability to understand truly large images and incorporate fine details over large scales and assess our method's improvement on them. By introducing a nested tokenization scheme for large images in conjunction with long-sequence length models normally used for natural language processing, we are able to increase accuracy by up to 8.6% on challenging classification tasks and \(F_1\) score by 11.6 on context-dependent segmentation in large images.

Images are Getting Bigger


Images have been getting increasingly larger over the past decade. For example, consider a video feed of a football game which is captured natively in 8K resolution. We would like to understand where the player in the middle of the screen is passing the ball to. However, today's leading models would not be able to reason over the entire image in one pass.

Modern computer vision pipelines are limited by the memory of the systems they are trained on, resulting in models that only operate on small images. Computer vision practitioners limit the size of images in two less-than-ideal ways: down-sampling or cropping. While these simple operations produce powerful models when measured against typical computer vision benchmarks, the loss of high frequency information or global context is limiting for many real-world tasks.
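The trade-off between the two options is easy to make concrete. A minimal sketch (using naive strided subsampling rather than any particular resizing filter) of how each operation discards a different kind of information:

```python
import numpy as np

# A 4096x4096 image must fit a backbone that expects 512x512 inputs.
big = np.random.rand(4096, 4096, 3)

# Option 1: down-sample. The full field of view is kept, but
# high-frequency detail is discarded (here, naive 8x subsampling).
downsampled = big[::8, ::8]
assert downsampled.shape == (512, 512, 3)

# Option 2: crop. Full detail is kept in one window, but all global
# context outside that 512x512 region is discarded.
cropped = big[:512, :512]
assert cropped.shape == (512, 512, 3)
```

Either way, roughly 98% of the pixels' information never reaches the model in a single pass.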

Using \(x\)T to Model Large Images

Figure 1: Architecture for the \(x\)T framework.

\(x\)T is a framework that allows existing vision backbones to process large images in a memory-efficient and contextual manner. We achieve this through an iterative, two-stage design.

First, images are tokenized hierarchically (Nested Tokenization) before being independently featurized by a region encoder with a limited context window (Independent Region Encoding). Then, a lightweight context encoder incorporates context globally across this sequence of features (Context-Aware Encoding), which then gets passed to the task-specific decoders.
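The nested tokenization step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the image is first split into regions small enough for the region encoder's context window, and each region is then split into the patches that a transformer backbone would consume.

```python
import numpy as np

def nested_tokenize(image, region_size, patch_size):
    """Hierarchically tokenize an image: regions first, then patches.

    image: (H, W, C) array with H, W divisible by region_size, and
    region_size divisible by patch_size. Returns an array of shape
    (num_regions, patches_per_region, patch_size * patch_size * C).
    """
    H, W, C = image.shape
    assert H % region_size == 0 and W % region_size == 0
    assert region_size % patch_size == 0

    regions = []
    for i in range(0, H, region_size):
        for j in range(0, W, region_size):
            region = image[i:i + region_size, j:j + region_size]
            patches = []
            for y in range(0, region_size, patch_size):
                for x in range(0, region_size, patch_size):
                    patches.append(region[y:y + patch_size,
                                          x:x + patch_size].reshape(-1))
            regions.append(patches)
    return np.asarray(regions)

# A 512x512 RGB image -> 4 regions of 256x256, each with 256 16x16 patches.
tokens = nested_tokenize(np.zeros((512, 512, 3)), region_size=256, patch_size=16)
assert tokens.shape == (4, 256, 768)
```

Each row of `tokens` can then be featurized independently by the region encoder, and the resulting per-region features concatenated into one long sequence for the context encoder.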



Figure 2: Powerful vision models used with \(x\)T set a new frontier on downstream tasks.

The use of \(x\)T allows myopic, memory-hungry vision backbones to effectively "see" across the entire large image at once. On tasks such as classification (iNaturalist-Reptilia shown in the figure), \(x\)T can achieve higher accuracy with fewer parameters due to its ability to incorporate global context across local regions of the image.

Figure 3: \(x\)T increases the receptive field of vision backbones.

This is best visualized in Figure 3, which shows the effective receptive field of Swin-B and Swin-B <\(x\)T> XL as the input image grows. Swin-B alone cannot model an image larger than 2,800 x 2,800 pixels, while with \(x\)T the same image can be modeled properly.

Figure 4: \(x\)T keeps memory usage near-constant as image size increases.

Critically, as inputs get larger, backbones such as Swin scale memory usage quadratically, whereas \(x\)T memory usage stays near-constant per region. This enables entirely new classes of applications not possible before, such as the effective processing of images captured from large-format sensors such as satellites and microscopes.
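A back-of-envelope calculation makes the scaling gap concrete. Counting pairwise attention entries as a proxy for memory (a simplification that ignores feature dimensions and windowed-attention details), full-image attention grows quartically in the image side length, while region-wise attention with a fixed region size grows only with the number of regions:

```python
def attention_entries(side_px, patch=16):
    """Pairwise attention entries for full self-attention over all patches."""
    n_tokens = (side_px // patch) ** 2
    return n_tokens ** 2

def regionwise_entries(side_px, region=1024, patch=16):
    """Attention entries when each fixed-size region attends only to itself."""
    n_regions = (side_px // region) ** 2
    return n_regions * attention_entries(region, patch)

# Growing the image from 1024px to 4096px per side (16x the pixels):
full_growth = attention_entries(4096) // attention_entries(1024)
region_growth = regionwise_entries(4096) // regionwise_entries(1024)
assert full_growth == 256   # full attention: 256x the cost
assert region_growth == 16  # region-wise: 16x, i.e. linear in pixel count
```

Per region, the cost is constant; only the (cheap, lightweight) context encoder sees the full sequence, which is why large-format sensor data such as satellite and microscopy imagery becomes tractable.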


@article{gupta2024xt,
  title={xT: Nested Tokenization for Larger Context in Large Images},
  author={Gupta, Ritwik and Li, Shufan and Zhu, Tyler and Malik, Jitendra and Darrell, Trevor and Mangalam, Karttikeya},
  journal={arXiv preprint arXiv:2403.01915},
  year={2024}
}