Electronic International Standard Serial Number (EISSN)
1872-9681
Abstract
This paper presents Video Object Segmentation Enhanced with Segment Anything Model (VOS-E-SAM), a multi-stage architecture for semi-supervised video object segmentation (SVOS) built on the Segment Anything Model (SAM) foundation model and aimed at the challenge of mask degradation over time in long video sequences. Our approach refines the object masks produced by the XMem model by passing them to SAM, using various input combinations and low-level computer vision techniques to generate point prompts, in order to improve mask continuity and accuracy throughout the entire video. The primary challenge addressed is the fading or vanishing of object masks in long video sequences caused by changes in object appearance, occlusions, camera movements, and variations in approach; both the baseline SAM architecture and its newer high-quality variant are tested. Through rigorous experimentation with different prompt configurations, we identify the best-performing configuration of SAM inputs for mask refinement. Evaluations on benchmark long video datasets, such as LongDataset and LVOS, show that our approach significantly improves mask quality in single-object extended video sequences, as evidenced by percentage gains in Jaccard index (J) and contour accuracy (F) based metrics (mean, recall, and decay). Our results show marked improvements in mask persistence and accuracy, setting a new standard for the integration of foundation models in video segmentation and laying the groundwork for future research in this field.
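To make the described pipeline concrete, the sketch below shows one plausible way a degraded XMem mask could be refined with SAM point prompts: interior points are sampled from the binary mask via a distance transform and passed to SAM as foreground prompts, and SAM's highest-scoring output mask is kept. The `SamPredictor` calls follow the official `segment-anything` package API; the distance-transform sampling heuristic, the checkpoint filename, and the function names are illustrative assumptions, not the paper's confirmed method.

```python
# Minimal sketch: refining a propagated XMem mask with SAM point prompts.
# Assumes the official `segment-anything` package (facebookresearch/segment-anything)
# and OpenCV. The distance-transform point-sampling heuristic is an illustrative
# assumption, not necessarily the exact technique used in the paper.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def sample_point_prompts(mask: np.ndarray, n_points: int = 3) -> np.ndarray:
    """Pick interior points of a binary mask using the distance transform,
    so prompts land well inside the object rather than on its boundary."""
    dist = cv2.distanceTransform(mask.astype(np.uint8), cv2.DIST_L2, 5)
    # Simple heuristic: take the n deepest-interior pixels.
    flat_idx = np.argsort(dist.ravel())[::-1][:n_points]
    ys, xs = np.unravel_index(flat_idx, dist.shape)
    return np.stack([xs, ys], axis=1)  # SAM expects (x, y) coordinates

# Load SAM once (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def refine_mask(frame_rgb: np.ndarray, xmem_mask: np.ndarray) -> np.ndarray:
    """Refine a (possibly degraded) XMem mask for one frame with SAM."""
    predictor.set_image(frame_rgb)  # HxWx3 uint8 RGB frame
    points = sample_point_prompts(xmem_mask)
    labels = np.ones(len(points), dtype=int)  # all prompts are foreground
    masks, scores, _ = predictor.predict(
        point_coords=points, point_labels=labels, multimask_output=True
    )
    return masks[int(np.argmax(scores))]  # keep SAM's highest-scoring mask
```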
Classification
Subjects
Computer Science
Keywords
video object segmentation; foundation model; object segmentation; long-term videos; deep learning