Description
Feature request
I’d like to request the addition of *SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference* to the 🤗 Transformers library.
Paper: https://arxiv.org/abs/2410.04417
Authors: Yuan Zhang, Chun-Kai Fan, Junpeng Ma, et al.
Code: No official repo yet (as of submission date)
SparseVLM is a training-free, text-guided visual token sparsification method. It prunes redundant visual tokens layer by layer inside the language decoder, scoring each visual token by the attention it receives from relevant text tokens, and adds a token-recycling step that aggregates the pruned tokens so their information is compressed rather than discarded. It works with existing VLMs like BLIP, Flamingo, and VideoBLIP, reducing FLOPs and latency by up to 60% while preserving accuracy.
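To make the core idea concrete, here is a minimal PyTorch sketch of text-guided pruning with recycling for a single decoder layer. This is my own simplification, not the authors' code (none is released yet): the function name, the fixed `keep_ratio`, the head-averaged attention, and the single weighted-average "recycled" token are all assumptions; the paper adapts the pruning ratio per layer and aggregates pruned tokens more carefully.

```python
import torch
import torch.nn.functional as F

def sparsify_visual_tokens(hidden_states, visual_mask, attn_weights, keep_ratio=0.5):
    """
    Toy sketch of text-guided visual-token pruning with recycling (batch size 1,
    all tensors assumed to be on the same device).
    hidden_states: (1, seq, dim) decoder hidden states (text + visual tokens)
    visual_mask:   (1, seq) bool, True where the token is a visual token
    attn_weights:  (1, heads, seq, seq) attention probabilities from one decoder layer
    keep_ratio:    fraction of visual tokens to keep (fixed here; adaptive in the paper)
    """
    scores = attn_weights.mean(dim=1)[0]                      # (seq, seq), head-averaged
    text_idx = (~visual_mask[0]).nonzero(as_tuple=True)[0]
    vis_idx = visual_mask[0].nonzero(as_tuple=True)[0]

    # Relevance of each visual token = attention it receives from the text tokens.
    relevance = scores[text_idx][:, vis_idx].mean(dim=0)      # (num_visual,)

    num_keep = max(1, int(keep_ratio * vis_idx.numel()))
    keep_local = relevance.topk(num_keep).indices
    drop_local = torch.ones_like(relevance, dtype=torch.bool)
    drop_local[keep_local] = False

    # Keep all text tokens plus the top-scoring visual tokens, preserving order.
    keep_seq = torch.ones(hidden_states.size(1), dtype=torch.bool,
                          device=hidden_states.device)
    keep_seq[vis_idx[drop_local]] = False
    kept = hidden_states[0, keep_seq]                         # (seq_kept, dim)

    # "Recycling": compress the pruned visual tokens into one weighted-average
    # token appended at the end, instead of discarding their information.
    pruned = hidden_states[0, vis_idx[drop_local]]
    if pruned.numel() > 0:
        w = F.softmax(relevance[drop_local], dim=0).unsqueeze(-1)
        recycled = (w * pruned).sum(dim=0, keepdim=True)
        kept = torch.cat([kept, recycled], dim=0)

    return kept.unsqueeze(0)                                  # (1, seq_kept(+1), dim)
```

In the actual method this step is applied progressively across decoder layers, so later layers operate on an increasingly small visual token set; the sketch only shows one pruning step.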
Motivation
Vision-language models like BLIP, Flamingo, and others can be computationally heavy at inference time because they process a large number of visual tokens. SparseVLM addresses this by pruning visual tokens without retraining or adding parameters, offering an efficient, plug-and-play way to speed up inference for image and video tasks.
Integrating this into 🤗 Transformers could make existing VLMs much more usable in real-time applications and low-resource environments. It also aligns well with Hugging Face's ongoing work on efficient inference (e.g., bitsandbytes, quantization, MobileLLM).
Your contribution
Yes! I’d be happy to contribute an initial implementation of SparseVLM, including the visual token selection mechanism, the sparsification and recycling logic, and a wrapper for models like BLIP or ViT (a rough sketch of the intended user-facing API is below).
Once the authors release a reference implementation or weights, I can also help align this implementation with their code to verify accuracy.
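As a strawman for discussion, this is roughly how I imagine the user-facing integration could look. Everything specific to SparseVLM here is hypothetical: `enable_visual_token_sparsification` and `keep_ratio` are proposed names, not existing Transformers APIs, and the exact entry point (method, config flag, or generation kwarg) is open to maintainer feedback.

```python
# Hypothetical API sketch -- SparseVLM-specific names are proposals, not existing APIs.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"
processor = AutoProcessor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

# Proposed opt-in switch: prune visual tokens at generation time, no retraining needed.
model.enable_visual_token_sparsification(keep_ratio=0.4)  # hypothetical method

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, text="Question: what is in the picture? Answer:",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```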