Description
Feature request
Hi 🤗 team and contributors,
I'm currently exploring ways to extend the `transformers` library to support temporal grounding: the task of identifying a [start, end] timestamp segment in a video given a natural language query.
While Hugging Face already supports pipelines like `video-classification`, `image-to-text`, and `zero-shot-image-classification`, it seems there is currently no pipeline or task definition for video moment retrieval / temporal grounding.
Motivation
As multimodal models become increasingly capable of understanding both vision and language (e.g., BLIP2, VideoChatGPT, TimeChat), there is a growing demand for models that can not only recognize what is happening in a video but also determine when it happens.
Temporal grounding, the task of identifying a relevant moment span [start, end] in a video given a natural language query, is a fundamental step in making video-language models temporally aware.
For example:
Given a query like "the person starts cooking", a temporal grounding model is expected to localize the clip where this action occurs --> [3.5, 8.9]
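To make the expected input/output concrete, here is a minimal, hypothetical baseline sketch built only from components that already exist in `transformers` (CLIP image-text similarity over pre-decoded frames). The checkpoint, the fixed window length, and the `locate_moment` helper are illustrative choices for discussion, not a proposed implementation:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def locate_moment(frames, query, fps=1.0, window=5):
    """frames: list of PIL images sampled at `fps` frames per second (decoded elsewhere)."""
    window = min(window, len(frames))
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one similarity score per frame
    # Pick the contiguous window of frames with the highest total similarity to the query.
    window_sums = scores.unfold(0, window, 1).sum(dim=1)
    start_idx = int(window_sums.argmax())
    return start_idx / fps, (start_idx + window) / fps  # [start, end] in seconds

# Usage: start, end = locate_moment(frames, "the person starts cooking")
```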
This capability is critical for a wide range of downstream tasks:
- Video Question Answering (When does X happen?)
- Video Summarization and Highlighting
- Instruction-following agents in videos
Your contribution
I'd love to know whether there is any ongoing work on this.
If not, I'd like to propose an initiative to explore what it might look like to support temporal grounding as a task within the `transformers` library, either through a new pipeline or through modular components that make it easier to build moment retrieval models with Hugging Face tools.
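Purely as a strawman for discussion (no such task exists in `transformers` today; the task string, model id, and output format below are made up), the pipeline usage could look roughly like this:

```python
from transformers import pipeline

# Hypothetical: the "video-temporal-grounding" task and the checkpoint below do not exist.
grounder = pipeline("video-temporal-grounding", model="some-org/moment-retrieval-model")
result = grounder("cooking_video.mp4", query="the person starts cooking")
# Illustrative output: {"start": 3.5, "end": 8.9, "score": 0.87}
```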
I'm also curious whether any other contributors might be interested in collaborating on this idea. I'd be very happy to contribute and work with others exploring similar directions.