💡 Proposal: Add temporal-grounding pipeline for video-language tasks #38450

Open
@mreraser

Description

Feature request

Hi 🤗 team and contributors,

I'm currently exploring ways to extend the transformers library to support temporal grounding: the task of identifying a [start, end] timestamp segment in a video given a natural language query.

While Hugging Face already supports pipelines like video-classification, image-to-text, and zero-shot-image-classification, it seems there is currently no pipeline or task definition for video moment retrieval / temporal grounding tasks.

Motivation

As multimodal models become increasingly capable of understanding both vision and language (e.g., BLIP2, VideoChatGPT, TimeChat), there is a growing demand for models that can not only recognize what is happening in a video, but also when it happens.

Temporal grounding, the task of identifying a relevant moment span [start, end] in a video given a natural language query, is a fundamental step in making video-language models temporally aware.

For example:
Given a query like "the person starts cooking", a temporal grounding model is expected to localize the clip where this action occurs and return its span, e.g. [3.5, 8.9].

This capability is critical for a wide range of downstream tasks:

  • Video Question Answering (When does X happen?)
  • Video Summarization and Highlighting
  • Instruction-following agents in videos

Your contribution

I'd love to know whether there is any ongoing work on this.

If not, I'd like to propose an initiative to explore what it might look like to support temporal grounding as a task within the transformers library, either through a new pipeline or through modular components that make it easier to build moment retrieval models using Hugging Face tools.
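
To make the discussion concrete, here is a rough sketch of the input/output contract such a task might expose. Every name below (`MomentSpan`, `RawGrounder`, `temporal_grounding`, `dummy_model`, the `top_k` parameter, the file name) is hypothetical and nothing here exists in transformers today. The sketch also assumes timestamps are in seconds, and the dummy model returns made-up values that only mirror the [3.5, 8.9] example above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class MomentSpan:
    """Hypothetical output record: one grounded segment (assumed to be in seconds)."""

    start: float
    end: float
    score: float

    def __post_init__(self) -> None:
        # Reject spans that start before 0 or end before they start.
        if not 0.0 <= self.start <= self.end:
            raise ValueError(f"invalid span: [{self.start}, {self.end}]")


# Hypothetical model signature: (video_path, query) -> raw (start, end, score) tuples.
RawGrounder = Callable[[str, str], Sequence[Tuple[float, float, float]]]


def temporal_grounding(
    video: str, query: str, model: RawGrounder, top_k: int = 1
) -> List[MomentSpan]:
    """Wrap a model's raw predictions into validated spans, best score first."""
    spans = [MomentSpan(s, e, sc) for s, e, sc in model(video, query)]
    return sorted(spans, key=lambda m: m.score, reverse=True)[:top_k]


def dummy_model(video: str, query: str) -> Sequence[Tuple[float, float, float]]:
    # Stand-in for a real model: fixed, made-up predictions for illustration only.
    return [(0.0, 2.0, 0.10), (3.5, 8.9, 0.92)]


result = temporal_grounding("cooking.mp4", "the person starts cooking", dummy_model)
print(result)
```

Keeping the model behind a plain callable is only one possible design; it separates the output format (a list of scored spans) from whichever architecture produces the predictions, which is the part a shared pipeline or modular component would need to standardize.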

I'm also curious whether other contributors might be interested in collaborating on this idea. I'd be very happy to contribute and to work with others who are exploring similar directions.
