Description
Feature request
Hi 🤗 team and contributors,
I'm currently exploring ways to extend the `transformers` library to support temporal grounding: the task of identifying a [start, end] timestamp segment in a video given a natural language query.
While Hugging Face already supports pipelines like `video-classification`, `image-to-text`, and `zero-shot-image-classification`, it seems there is currently no pipeline or task definition for video moment retrieval / temporal grounding.
Motivation
As multimodal models become increasingly capable of understanding both vision and language (e.g., BLIP2, VideoChatGPT, TimeChat), there is a growing demand for models that can not only recognize what is happening in a video but also determine when it happens.
Temporal grounding, the task of identifying a relevant moment span [start, end] in a video given a natural language query, is a fundamental step in making video-language models temporally aware.
For example:
Given a query like "the person starts cooking", a temporal grounding model is expected to localize the clip where this action occurs --> [3.5, 8.9]
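To make the expected input/output concrete, here is a minimal, hypothetical baseline sketch built only from components that already exist in `transformers` (CLIP image-text similarity over pre-decoded frames). The checkpoint, the fixed window length, and the `locate_moment` helper are illustrative choices for discussion, not a proposed implementation:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def locate_moment(frames, query, fps=1.0, window=5):
    """frames: list of PIL images sampled at `fps` frames per second (decoded elsewhere)."""
    window = min(window, len(frames))
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one similarity score per frame
    # Pick the contiguous window of frames with the highest total similarity to the query.
    window_sums = scores.unfold(0, window, 1).sum(dim=1)
    start_idx = int(window_sums.argmax())
    return start_idx / fps, (start_idx + window) / fps  # [start, end] in seconds

# Usage: start, end = locate_moment(frames, "the person starts cooking")
```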
This capability is critical for a wide range of downstream tasks:
- Video Question Answering (When does X happen?)
- Video Summarization and Highlighting
- Instruction-following agents in videos
Your contribution
I'd love to know whether there is any ongoing work on this.
If not, I'd like to propose an initiative to explore what it might look like to support temporal grounding as a task within the `transformers` library, either through a new pipeline or through modular components that make it easier to build moment retrieval models with Hugging Face tools.
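Purely as a strawman for discussion (no such task exists in `transformers` today; the task string, model id, and output format below are made up), the pipeline usage could look roughly like this:

```python
from transformers import pipeline

# Hypothetical: the "video-temporal-grounding" task and the checkpoint below do not exist.
grounder = pipeline("video-temporal-grounding", model="some-org/moment-retrieval-model")
result = grounder("cooking_video.mp4", query="the person starts cooking")
# Illustrative output: {"start": 3.5, "end": 8.9, "score": 0.87}
```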
I'm also curious whether any other contributors might be interested in collaborating on this idea. I'd be very happy to contribute and work with others exploring similar directions.