Description
Feature request
I’d like to request the addition of *SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference* to the 🤗 Transformers library.
Paper: https://arxiv.org/abs/2410.04417
Authors: Yuan Zhang, Chun-Kai Fan, Junpeng Ma, et al.
Code: No official repo yet (as of submission date)
SparseVLM is a training-free, text-guided visual token sparsification method. It prunes redundant visual tokens layer by layer inside the language decoder, scoring each visual token by the attention it receives from relevant text tokens, and adds a token-recycling step that aggregates the pruned tokens so their information is compressed rather than discarded. It works with existing VLMs like BLIP, Flamingo, and VideoBLIP, reducing FLOPs and latency by up to 60% while preserving accuracy.
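To make the core idea concrete, here is a minimal PyTorch sketch of text-guided pruning with recycling for a single decoder layer. This is my own simplification, not the authors' code (none is released yet): the function name, the fixed `keep_ratio`, the head-averaged attention, and the single weighted-average "recycled" token are all assumptions; the paper adapts the pruning ratio per layer and aggregates pruned tokens more carefully.

```python
import torch
import torch.nn.functional as F

def sparsify_visual_tokens(hidden_states, visual_mask, attn_weights, keep_ratio=0.5):
    """
    Toy sketch of text-guided visual-token pruning with recycling (batch size 1,
    all tensors assumed to be on the same device).
    hidden_states: (1, seq, dim) decoder hidden states (text + visual tokens)
    visual_mask:   (1, seq) bool, True where the token is a visual token
    attn_weights:  (1, heads, seq, seq) attention probabilities from one decoder layer
    keep_ratio:    fraction of visual tokens to keep (fixed here; adaptive in the paper)
    """
    scores = attn_weights.mean(dim=1)[0]                      # (seq, seq), head-averaged
    text_idx = (~visual_mask[0]).nonzero(as_tuple=True)[0]
    vis_idx = visual_mask[0].nonzero(as_tuple=True)[0]

    # Relevance of each visual token = attention it receives from the text tokens.
    relevance = scores[text_idx][:, vis_idx].mean(dim=0)      # (num_visual,)

    num_keep = max(1, int(keep_ratio * vis_idx.numel()))
    keep_local = relevance.topk(num_keep).indices
    drop_local = torch.ones_like(relevance, dtype=torch.bool)
    drop_local[keep_local] = False

    # Keep all text tokens plus the top-scoring visual tokens, preserving order.
    keep_seq = torch.ones(hidden_states.size(1), dtype=torch.bool,
                          device=hidden_states.device)
    keep_seq[vis_idx[drop_local]] = False
    kept = hidden_states[0, keep_seq]                         # (seq_kept, dim)

    # "Recycling": compress the pruned visual tokens into one weighted-average
    # token appended at the end, instead of discarding their information.
    pruned = hidden_states[0, vis_idx[drop_local]]
    if pruned.numel() > 0:
        w = F.softmax(relevance[drop_local], dim=0).unsqueeze(-1)
        recycled = (w * pruned).sum(dim=0, keepdim=True)
        kept = torch.cat([kept, recycled], dim=0)

    return kept.unsqueeze(0)                                  # (1, seq_kept(+1), dim)
```

In the actual method this step is applied progressively across decoder layers, so later layers operate on an increasingly small visual token set; the sketch only shows one pruning step.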
Motivation
Vision-language models like BLIP, Flamingo, and others can be computationally heavy at inference time because they process a large number of visual tokens. SparseVLM addresses this by pruning visual tokens without retraining or adding parameters, offering an efficient, plug-and-play way to speed up inference for image and video tasks.
Integrating this into 🤗 Transformers could make existing VLMs much more usable in real-time applications and low-resource environments. It also aligns well with Hugging Face's ongoing work on efficient inference (e.g., bitsandbytes, quantization, MobileLLM).
Your contribution
Yes! I’d be happy to contribute an initial implementation of SparseVLM, including the visual token selection mechanism, the sparsification and recycling logic, and a wrapper for models like BLIP or ViT (a rough sketch of the intended user-facing API is below).
Once the authors release a reference implementation or weights, I can also help align this implementation with their code to verify accuracy.
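As a strawman for discussion, this is roughly how I imagine the user-facing integration could look. Everything specific to SparseVLM here is hypothetical: `enable_visual_token_sparsification` and `keep_ratio` are proposed names, not existing Transformers APIs, and the exact entry point (method, config flag, or generation kwarg) is open to maintainer feedback.

```python
# Hypothetical API sketch -- SparseVLM-specific names are proposals, not existing APIs.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"
processor = AutoProcessor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

# Proposed opt-in switch: prune visual tokens at generation time, no retraining needed.
model.enable_visual_token_sparsification(keep_ratio=0.4)  # hypothetical method

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, text="Question: what is in the picture? Answer:",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```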