EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction (paper, poster)

News

If you are interested in getting updates, please join our mailing list here.

[2024/03/19] Online demo of EfficientViT-SAM is available: https://evitsam.hanlab.ai/.
[2024/02/08] Tech report of EfficientViT-SAM is available: arxiv.
[2024/02/07] We released EfficientViT-SAM, the first accelerated SAM model that matches/outperforms SAM-ViT-H's zero-shot performance, delivering the SOTA performance-efficiency trade-off.
[2023/11/20] EfficientViT is available in the NVIDIA Jetson Generative AI Lab.
[2023/09/12] EfficientViT is highlighted by MIT home page and MIT News.
[2023/07/18] EfficientViT is accepted by ICCV 2023.

About EfficientViT Models

EfficientViT is a new family of ViT models for efficient high-resolution dense prediction vision tasks. The core building block of EfficientViT is a lightweight, multi-scale linear attention module that achieves global receptive field and multi-scale learning with only hardware-efficient operations, making EfficientViT TensorRT-friendly and suitable for GPU deployment.

Third-Party Implementation/Integration

NVIDIA Jetson Generative AI Lab
timm: link
X-AnyLabeling: link

Getting Started

conda create -n efficientvit python=3.10
conda activate efficientvit
conda install -c conda-forge mpi4py openmpi
pip install -r requirements.txt

EfficientViT Applications

Segment Anything

Datasets
Pretrained Models
Use in Pytorch
Evaluation
Visualization
Web Demo
Deployment using ONNX and TensorRT

Model	Resolution	COCO mAP	LVIS mAP	Params	MACs	Jetson Orin Latency (bs1)	A100 Throughput (bs16)	Checkpoint
EfficientViT-SAM-L0	512x512	45.7	41.8	34.8M	35G	8.2ms	762 images/s	link
EfficientViT-SAM-L1	512x512	46.2	42.1	47.7M	49G	10.2ms	638 images/s	link
EfficientViT-SAM-L2	512x512	46.6	42.7	61.3M	69G	12.9ms	538 images/s	link
EfficientViT-SAM-XL0	1024x1024	47.5	43.9	117.0M	185G	22.5ms	278 images/s	link
EfficientViT-SAM-XL1	1024x1024	47.8	44.4	203.3M	322G	37.2ms	182 images/s	link

Table1: Summary of All EfficientViT-SAM Variants. COCO mAP and LVIS mAP are measured using ViTDet's predicted bounding boxes as the prompt. End-to-end Jetson Orin latency and A100 throughput are measured with TensorRT and fp16.

Image Classification

Datasets
Pretrained Models
Use in Pytorch
Evaluation
Deployment
Training

Semantic Segmentation

Datasets
Pretrained Models
Use in Pytorch
Evaluation
Visualization
Deployment

Contact

Han Cai: hancai@mit.edu

TODO

ImageNet Pretrained models
Segmentation Pretrained models
ImageNet training code
EfficientViT L series, designed for cloud
EfficientViT for segment anything
EfficientViT for image generation
EfficientViT for CLIP
EfficientViT for super-resolution
Segmentation training code

Citation

If EfficientViT is useful or relevant to your research, please kindly recognize our contributions by citing our paper:

@article{cai2022efficientvit,
  title={Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition},
  author={Cai, Han and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2205.14756},
  year={2022}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction (paper, poster)

News

About EfficientViT Models

Third-Party Implementation/Integration

Getting Started

EfficientViT Applications

Segment Anything

Image Classification

Semantic Segmentation

Contact

TODO

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction (paper, poster)

News

About EfficientViT Models

Third-Party Implementation/Integration

Getting Started

EfficientViT Applications

Segment Anything

Image Classification

Semantic Segmentation

Contact

TODO

Citation