CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, W.K. Chan, Chong-Wah Ngo, Mike Zheng Shou, Nan Duan
Main: Language Grounding to Vision, Robotics, and Beyond Main-poster Paper
    Session 1: Language Grounding to Vision, Robotics, and Beyond (Virtual Poster)
    
Conference Room: Pier 7&8 
    Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)
    Global Time: July 10, Session 1 (15:00-16:30 UTC)
    
    
  
          Keywords:
          cross-modal application
        
        
        
        
        
  
  
  
    
            Abstract:
            This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are in equally high demand but less explored, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework built on top of existing VTG models to handle long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism speeds up inference by 2x on Ego4D-NLQ and 15x on MAD while maintaining SOTA results. The code has been released at https://github.com/houzhijian/CONE.
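
The sliding-window and query-guided selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names, the cosine-similarity scoring, and the max-over-frames pooling are all assumptions chosen for illustration: the sketch slices a long sequence of frame features into overlapping windows, scores each window against a query embedding, and keeps only the top-k windows for a finer (more expensive) grounding stage.

```python
import math


def sliding_windows(num_frames, window, stride):
    """Return (start, end) index pairs that cover [0, num_frames)."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window so the tail of the video is always covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_windows(frame_feats, query_feat, window, stride, top_k):
    """Coarse stage (hypothetical): rank windows by query relevance.

    A window's score is the best frame-to-query cosine similarity inside
    it; this pooling choice is an assumption, not taken from the paper.
    """
    scored = []
    for start, end in sliding_windows(len(frame_feats), window, stride):
        score = max(cosine(f, query_feat) for f in frame_feats[start:end])
        scored.append((score, (start, end)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [span for _, span in scored[:top_k]]


# Toy usage: 10 frames with 2-d features; frames 6 and 7 match the query.
frames = [[0.0, 1.0]] * 6 + [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 2
query = [1.0, 0.0]
candidates = select_windows(frames, query, window=4, stride=2, top_k=2)
```

Only the windows in `candidates` would then be passed to a base VTG model for fine-grained moment localization, which is where a speed-up of the kind reported in the abstract would come from under these assumptions.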
          