GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

¹The University of Hong Kong, ²Shanghai AI Laboratory

✶ indicates equal contribution. † indicates Corresponding Author

TL;DR

Using only egocentric video as input, we achieve state-of-the-art performance, surpassing all 3D language models, without relying on point clouds.



Abstract

In recent years, 2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. However, their performance in 3D spatial comprehension, which is critical for embodied intelligence, remains limited. Recent advances have leveraged point clouds and multi-view images as inputs, yielding promising results. In contrast, we propose exploring a purely vision-based solution, inspired by human perception, that relies solely on visual cues for 3D spatial understanding. This paper empirically investigates the limitations of VLMs in 3D spatial understanding, revealing that their primary shortcoming lies in the lack of global-local correspondence between the scene and individual frames. To address this, we introduce GPT4Scene, a novel visual prompting paradigm for VLM training and inference that helps build this global-local relationship, significantly improving 3D spatial understanding of indoor scenes. Specifically, GPT4Scene constructs a 3D Bird's Eye View (BEV) image from the video and marks consistent object IDs across both the frames and the BEV image. The model then takes the concatenated BEV image and marked video frames as input. In zero-shot evaluations, GPT4Scene improves performance over closed-source VLMs such as GPT-4o. In addition, we prepare a processed training dataset consisting of 165K text annotations to fine-tune an open-source model, achieving state-of-the-art performance on 3D QA tasks with Qwen2-VL (GPT4Scene). Surprisingly, after training with the GPT4Scene paradigm, VLMs consistently improve at inference time even without visual prompts or a BEV image. This demonstrates that the proposed paradigm helps VLMs develop an intrinsic ability to understand 3D scenes, paving the way for a non-invasive approach to extending pretrained VLMs to 3D scene understanding.
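To make the input format concrete, below is a minimal sketch of how marked video frames and the BEV image could be packed into a single multi-image prompt for zero-shot evaluation with GPT-4o. It assumes the OpenAI Python SDK's chat-completions interface; the file names and prompt wording are illustrative, not the authors' exact setup.

import base64
from openai import OpenAI

def encode_image(path: str) -> dict:
    """Wrap a local image as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

client = OpenAI()
frames = [encode_image(f"frame_{i}_marked.jpg") for i in range(8)]  # frames with object-ID markers
bev = encode_image("bev_marked.jpg")                                # BEV image with the same IDs

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "These are 8 frames from an indoor scene, followed by a bird's-eye-view "
                     "image of the same scene. Objects carry the same numeric ID across all "
                     "images. Question: where is the chair relative to the table?"},
            *frames,
            bev,
        ],
    }],
)
print(response.choices[0].message.content)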

Overview

The overall architecture of GPT4Scene. It understands 3D scenes and performs tasks such as 3D question answering, dense captioning, and visual grounding using only video input. In contrast to point-cloud-based 3D LLMs, GPT4Scene takes input solely from the vision modality, with global information provided by a BEV image rendered from the 3D structure reconstructed from the video.

Pipeline and Model Architecture

The Framework of GPT4Scene. A scene video is processed by sampling frames, reconstructing a point cloud, and generating a BEV image. Object locations are detected from the point cloud and projected onto the video frames. The resulting frames and BEV image, enhanced with STO-Markers, are then used as inputs for VLM training and inference.
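The geometric steps above (rendering a BEV image from the reconstructed point cloud and projecting consistent object IDs onto each frame) can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the authors' implementation; it assumes a colored point cloud, per-object 3D centers, and per-frame camera intrinsics/extrinsics are already available from the reconstruction.

import numpy as np
import cv2

def render_bev(points, colors, bounds, size=512):
    """Rasterize a colored point cloud (N, 3) into a top-down (bird's-eye-view) image."""
    (xmin, xmax), (ymin, ymax) = bounds
    u = np.clip((points[:, 0] - xmin) / (xmax - xmin) * (size - 1), 0, size - 1).astype(int)
    v = np.clip((points[:, 1] - ymin) / (ymax - ymin) * (size - 1), 0, size - 1).astype(int)
    order = np.argsort(points[:, 2])            # paint higher points last so they stay visible
    bev = np.zeros((size, size, 3), dtype=np.uint8)
    bev[v[order], u[order]] = colors[order]
    return bev

def project_points(points_world, intrinsics, extrinsics):
    """Project 3D points (N, 3) into pixel coordinates with a pinhole camera model."""
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (extrinsics @ pts_h.T).T[:, :3]       # world -> camera coordinates (4x4 extrinsics)
    uv = (intrinsics @ cam.T).T
    return uv[:, :2] / uv[:, 2:3], cam[:, 2]    # pixel coordinates, depth

def draw_markers(frame, object_centers, object_ids, intrinsics, extrinsics):
    """Overlay consistent object-ID markers on one video frame."""
    h, w = frame.shape[:2]
    uv, depth = project_points(object_centers, intrinsics, extrinsics)
    for (u, v), z, oid in zip(uv, depth, object_ids):
        if z > 0 and 0 <= u < w and 0 <= v < h:  # only mark objects visible in this frame
            cv2.circle(frame, (int(u), int(v)), 8, (0, 0, 255), -1)
            cv2.putText(frame, str(oid), (int(u) + 8, int(v) - 8),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)
    return frame

The same IDs are drawn next to each object in the BEV image, which is what lets the VLM tie each frame-level observation to its global location in the scene.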

Experiments Results

3D Question Answering

Evaluation of 3D Question Answering on the ScanQA and SQA3D datasets. GPT-4o (GPT4Scene) in the zero-shot setting outperforms most 3D LLMs. Fine-tuned with GPT4Scene, Qwen2-VL achieves state-of-the-art performance. The base setting uses N = 8 frames at a resolution of 128 × 123, "HD" increases the resolution to 512 × 490, and "HDM" combines this resolution with N = 32 frames.
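For reference, the three evaluation settings described in the caption above can be summarized as a small configuration; the key names here are ours, not the paper's.

# Evaluation settings summarized from the caption above (key names are illustrative).
EVAL_SETTINGS = {
    "base": {"num_frames": 8,  "frame_resolution": (128, 123)},   # N = 8, low resolution
    "HD":   {"num_frames": 8,  "frame_resolution": (512, 490)},   # higher resolution, same N
    "HDM":  {"num_frames": 32, "frame_resolution": (512, 490)},   # HD resolution + more frames
}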

3D Dense Captioning & 3D Visual Grounding

Evaluation of 3D Dense Captioning and 3D Visual Grounding. Our results outperform all existing 3D-LLM-based models, demonstrating that indoor scenes can be understood using only the visual modality, without 3D point clouds.

BibTeX

@article{GPT4Scene,
  title={GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models},
  author={Zhangyang Qi and Zhixiong Zhang and Ye Fang and Jiaqi Wang and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2501.01428},
  year={2024}
}