arxiv:2509.16972

The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA

Published on Sep 21, 2025
Abstract

SaSaSa2VA improves referring video object segmentation by addressing sparse frame sampling and reliance on a single [SEG] token, ranking first in the RVOS track of the 7th LSVOS Challenge.

Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a J&F of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and the ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in the Sa2VA repository: https://github.com/magic-research/Sa2VA.
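The abstract names test-time ensembling and selective averaging but does not spell out how predictions are combined. As a rough illustration only, the sketch below averages per-pixel foreground probabilities from several predictions of the same frame and binarizes the result. The function `ensemble_masks`, the `selected` argument, and the 0.5 threshold are hypothetical choices made for this example; they are not taken from the paper or the Sa2VA repository.

```python
from typing import List, Optional, Sequence

# One frame: rows of per-pixel foreground probabilities in [0, 1].
Mask = List[List[float]]


def ensemble_masks(
    predictions: Sequence[Mask],
    selected: Optional[Sequence[int]] = None,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Average the selected probability masks and binarize the result.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    # Use every prediction unless a subset of indices is given.
    indices = list(selected) if selected is not None else list(range(len(predictions)))
    if not indices:
        raise ValueError("selection must not be empty")
    chosen = [predictions[i] for i in indices]
    height, width = len(chosen[0]), len(chosen[0][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Mean foreground probability across the chosen predictions.
            mean = sum(mask[y][x] for mask in chosen) / len(chosen)
            row.append(1 if mean >= threshold else 0)
        result.append(row)
    return result


# Three toy 2x2 predictions for the same frame.
preds = [
    [[0.9, 0.2], [0.4, 0.6]],
    [[0.7, 0.4], [0.2, 0.8]],
    [[0.1, 0.3], [0.6, 0.7]],
]
combined = ensemble_masks(preds)                      # average all three
subset = ensemble_masks(preds, selected=[0, 1])       # average a chosen subset
```

Averaging probabilities before thresholding lets a confident prediction outweigh an uncertain one, which a per-pixel majority vote over binary masks would not capture.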


