Introduction
The central paradox of 3D scene understanding — the task of enabling machines to perceive, reason about, and interact with three-dimensional environments — is that while the internet provides an effectively unlimited supply of video data depicting real-world indoor scenes, existing annotated datasets remain bottlenecked at a scale of thousands of scenes collected through expensive, instrumented capture pipelines. ScanNet, the de facto benchmark for 3D perception, has stagnated at ~1,500 scenes since 2017. ARKitScenes, despite leveraging consumer-grade depth sensors, covers only single-room apartments captured under constrained protocols. This data scarcity fundamentally limits progress: models trained on small datasets overfit to domain-specific biases, fail to generalize across scene types, and cannot leverage the scale advantages that have driven breakthroughs in 2D vision and NLP.
SceneVerse++ (SV++) resolves this paradox through an automated data engine that transforms unlabeled internet Room Tour videos into fully annotated 3D scenes at scale. The system, accepted with high review scores at CVPR 2026 and developed by BIGAI (Beijing Institute for General Artificial Intelligence) in collaboration with Peking University, Tsinghua University, Beijing University of Posts and Telecommunications, and Beijing Institute of Technology (first author Yixin Chen, corresponding author Siyuan Huang, with Song-Chun Zhu among the authors), produces 6,687 real indoor scenes sourced from 8,217 YouTube videos, averaging 49 objects and 21 object categories per scene. Crucially, SV++ scenes are significantly larger than those of ScanNet or ARKitScenes — covering multi-floor layouts, large commercial spaces, and complex room configurations that reflect genuine real-world diversity.
This article provides a deep technical analysis of the SV++ pipeline, its experimental validation across four downstream tasks (3D object detection, instance segmentation, spatial VQA, and vision-language navigation), and the broader implications for data-driven 3D perception research.
Background: From Video to 3D Scenes
2.1 The Rise of Data Engine Paradigms
The concept of a data engine — an automated pipeline that converts raw, unannotated data into structured training signals — has become a defining methodology in modern AI research. Segment Anything (SAM) demonstrated that semi-automatic annotation at scale could produce foundational segmentation models capable of zero-shot transfer to novel object categories. The LLaVA family showed that image-caption pairs mined from the web could train capable vision-language models without curated instruction datasets. In the 3D domain, however, this paradigm faces unique challenges: videos must be geometrically consistent across frames, camera poses must be estimated without IMU or depth sensor instrumentation, and 2D annotations must be lifted into 3D space with spatial coherence across multiple views. SV++ is, to our knowledge, the first system to successfully close this loop end-to-end for real-world 3D scene understanding at this scale.
A key distinction from prior work is the treatment of label quality as a first-class design constraint. Earlier approaches to web-scale 3D data (e.g., scene-level reconstructions from YouTube8M) produced geometry only — no semantic labels, no instance masks, no structured relationships. SV++’s contribution is demonstrating that VLM-assisted annotation can achieve sufficient label fidelity to serve as pretraining signal for downstream tasks, closing the gap between geometric reconstruction and semantic understanding.
2.2 Dataset Scale Overview
| Dataset | # Scenes | Source | Capture Method | Avg Objects/Scene | Scene Complexity |
|---|---|---|---|---|---|
| ScanNet v2 | ~1,500 | Instrumented | Structured Light + IMU | ~30 | Single-room apartments |
| ARKitScenes | ~5,000 | Instrumented | LiDAR (iPad Pro) | ~25 | Single-floor residential |
| Replica / Synthetic | ~100+ | Synthetic | Renderer | Variable | Clean, artifact-free |
| SceneVerse++ | 6,687 | Internet Video | SfM from RGB | 49 | Multi-floor, large-area, diverse |
The scale advantage is substantial but not the full story. SV++ scenes are drawn from genuine human-curated content — home tours, hotel walkthroughs, museum visits, office space showcases — which introduces natural diversity in layout, lighting, object arrangement, and architectural style that synthetic or instrumented datasets cannot replicate. The average of 21 categories per scene also reflects richer semantic composition than prior benchmarks. Importantly, the scene area distribution has a much longer tail: SV++ includes multi-story houses, large open-plan offices, exhibition halls, and commercial spaces that exercise models on spatial reasoning at scales rarely seen in existing benchmarks.
Core Technology: Automated Data Generation Pipeline
The SV++ pipeline consists of five sequential stages, each addressing a specific subproblem in the video-to-3D transformation. The overall data flow is: video preprocessing and filtering → SfM 3D reconstruction → dense reconstruction and instance segmentation → spatial VQA data generation → VLN navigation data generation.
We examine each stage in detail.
3.1 Video Preprocessing and Filtering
Raw internet videos are unsuitable for 3D reconstruction without preprocessing. Room Tour videos exhibit characteristics that would catastrophically degrade SfM if fed directly: frequent cuts between rooms, handheld motion blur, human presenters occupying significant frame regions, outdoor establishing shots interspersed with interior footage, and periods where the camera remains nearly stationary (providing zero parallax). The pipeline applies three filtering operations in sequence:
Shot Boundary Detection: TransNetV2 segments each video into temporally coherent shots at frame-level granularity. This neural shot boundary detector operates on frame pairs, classifying whether a hard cut or gradual transition occurs between consecutive frames. The output is a set of shot boundaries that isolate continuous camera motion segments suitable for incremental SfM processing.
Quality Filtering: A multi-stage cascade filter removes undesirable content:
- Person removal: A person detector (likely based on a pretrained detector such as YOLO or RetinaNet) identifies frames where humans occupy more than a threshold fraction of the image area. Such frames are excluded because human presence both occludes scene geometry and introduces non-rigid motion that violates SfM’s rigid-scene assumption.
- Outdoor rejection: Sky/vegetation classifiers identify frames captured outside buildings or through windows showing exterior views. Outdoor frames provide no useful indoor geometry and introduce infinite-depth features that destabilize bundle adjustment.
- Exposure filtering: Frames with mean pixel intensity below a dark threshold or above a bright threshold are discarded as underexposed or overexposed — such frames produce unreliable feature detections.
- Parallax thresholding: Shots where the estimated camera motion falls below a minimum translation threshold are filtered out. Stationary cameras produce degenerate SfM problems where depth is unobservable.
This stage is critical — contaminated input propagates errors multiplicatively through all downstream modules. A single bad frame introduced into the SfM optimization can cause bundle adjustment divergence, corrupting the entire scene reconstruction.
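To make the cascade concrete, here is a minimal sketch of the exposure and person-area filters. The threshold values, function names, and detector output format (pixel-coordinate boxes) are illustrative assumptions, not values from the paper:

```python
# Minimal sketch of two quality filters from the cascade. Thresholds and the
# person-detector interface are assumptions for illustration.
import numpy as np

PERSON_AREA_MAX = 0.15                 # assumed: reject if people cover >15% of frame
DARK_THRESH, BRIGHT_THRESH = 30, 225   # assumed exposure bounds on [0, 255]

def frame_passes(frame_rgb, person_boxes):
    """Return True if a frame survives the exposure and person-area filters."""
    h, w, _ = frame_rgb.shape
    # Exposure filter: mean intensity must lie inside the usable band.
    mean_intensity = frame_rgb.mean()
    if not (DARK_THRESH < mean_intensity < BRIGHT_THRESH):
        return False
    # Person filter: total detected person area as a fraction of the image.
    person_area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in person_boxes)
    return person_area / (h * w) <= PERSON_AREA_MAX
```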
Keyframe Extraction: Rather than using uniform temporal sampling (which either wastes computation on redundant frames or misses critical viewpoints), the pipeline selects keyframes based on parallax maximization. For each candidate frame $t$, the pipeline estimates the essential matrix $E_{t,t-1}$ between frame $t$ and its predecessor, decomposes it to extract the relative translation $\mathbf{t}_{t,t-1}$, and accepts the frame only when $\|\mathbf{t}_{t,t-1}\| > \tau$ for a parallax threshold $\tau$. This adaptive strategy ensures every selected frame contributes meaningful epipolar constraints while reducing redundant computation by 3-5× compared to uniform sampling.
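A minimal OpenCV sketch of this selection logic follows. Because the essential matrix recovers translation only up to scale, the sketch thresholds the median inlier displacement as a parallax proxy for the $\|\mathbf{t}\|$ test; the ORB matcher and the pixel threshold are assumptions, not the paper's actual components:

```python
# Parallax-gated keyframe selection (sketch).
import cv2
import numpy as np

def is_keyframe(prev_gray, curr_gray, K, min_parallax_px=20.0):
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return False
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    if len(matches) < 8:  # need at least 8 correspondences for E estimation
        return False
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    if E is None:
        return False
    inliers = mask.ravel().astype(bool)
    # E's translation is scale-ambiguous, so threshold the median inlier
    # displacement as a proxy for sufficient parallax.
    disp = np.linalg.norm(pts1[inliers] - pts2[inliers], axis=1)
    return float(np.median(disp)) > min_parallax_px
```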
3.2 SfM 3D Reconstruction
Structure-from-Motion (SfM) jointly estimates camera intrinsics and extrinsics for all frames in a shot, together with a sparse 3D point cloud of triangulated feature positions. The optimization objective is the standard reprojection error:

$$\min_{\{P_i\},\,\{X_j\}} \sum_i \sum_{j \in \mathcal{V}(i)} \big\| \pi(P_i, X_j) - x_{ij} \big\|^2$$

where $\pi$ is the projection function, $X_j$ is the 3D position of landmark $j$, $x_{ij}$ is its observed 2D position in frame $i$, and $\mathcal{V}(i)$ is the set of landmarks visible in frame $i$.
SV++ uses COLMAP as the backbone SfM solver but introduces two optimizations specific to long internet videos that distinguish it from off-the-shelf usage:
Trajectory Pseudo-pixel Optimization: Long Room Tour videos often contain degenerate camera motions — pure rotation sequences (camera operator spinning in place), rapid pans (whip pans), or near-stationary periods. These produce poorly conditioned normal equations in bundle adjustment, where certain camera parameters become unobservable (the classic gauge freedom problem). The pipeline addresses this by constraining pseudo-pixel trajectories: rather than optimizing each camera pose independently, a smoothness regularizer penalizes large accelerations in the camera trajectory:

$$\mathcal{R}_{\text{smooth}} = \lambda \sum_i \big\| c_{i+1} - 2c_i + c_{i-1} \big\|^2$$

where $c_i$ is the camera center of frame $i$ and the second difference approximates acceleration. This regularization reduces drift in extended sequences (>500 frames) by preventing the optimizer from exploiting degenerate degrees of freedom. The regularization strength $\lambda$ is adaptively reduced as convergence progresses, allowing fine-grained pose refinement once a coarse trajectory is established.
Relative Image Similarity Weighting: Not all image pairs contribute equally reliable feature matches. Pairs with significant appearance change — caused by lighting transitions (walking from a dark hallway into a sunlit room), auto-exposure adjustments, or motion blur — tend to produce spurious matches even when feature descriptors appear confident. The pipeline computes a similarity score $s_{ij}$ between each image pair (using perceptual hash or deep embedding distance) and weights the corresponding reprojection terms:

$$w_{ij} = \exp\big(-\alpha\,(1 - s_{ij})\big)$$

where $\alpha$ controls the sharpness of weighting. This downweighting prevents a small number of outlier pairs from dominating the optimization.
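To make the two modifications concrete, here is a toy NumPy objective combining similarity-weighted reprojection with the second-difference smoothness penalty. The `project` callable, the pose parameterization, the per-frame similarity indexing, and the hyperparameter values are assumptions for illustration:

```python
# Toy bundle-adjustment objective with SV++-style weighting and smoothing.
import numpy as np

def sv_objective(poses, points, obs, sim, project, alpha=5.0, lam=1e-2):
    """poses: (F, 6) camera params with the first 3 entries as camera center;
    points: (L, 3) landmarks; obs: iterable of (frame i, landmark j, observed xy);
    sim: per-frame appearance-similarity scores on [0, 1] (assumed indexing)."""
    # Similarity-weighted reprojection: low-similarity pairs are downweighted.
    data = sum(
        np.exp(-alpha * (1.0 - sim[i])) *
        np.sum((project(poses[i], points[j]) - xy) ** 2)
        for i, j, xy in obs
    )
    # Smoothness: penalize second differences (accelerations) of camera centers.
    centers = poses[:, :3]
    accel = centers[2:] - 2.0 * centers[1:-1] + centers[:-2]
    return data + lam * np.sum(accel ** 2)
```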
The output of this stage is a sparse point cloud with registered camera poses stored as mesh.ply (geometry) and camera_info.json (per-frame intrinsics/extrinsics) for each reconstructed scene. Scenes where SfM fails to converge (insufficient parallax, too few valid keyframes) are automatically discarded — approximately 15-20% of input videos fail at this stage, highlighting the challenge of working with unconstrained internet video.
3.3 Dense Reconstruction and Instance Segmentation
With sparse geometry established, the pipeline proceeds to dense surface reconstruction and object-level semantic annotation — the stages that transform geometric point clouds into machine-learning-ready training data.
PriorDA for Depth Prediction: A monocular depth estimation model (PriorDA) predicts a dense depth map $D_t$ for each keyframe $I_t$. PriorDA employs a diffusion-based formulation that iteratively refines depth predictions from a noise distribution conditioned on the RGB image. The diffusion process provides better uncertainty handling than regression-based alternatives (such as MiDaS or DPT), particularly for the out-of-distribution frames common in internet video (unusual lighting, reflective surfaces, transparent objects). Throughput is approximately 71 seconds per scene, dominated by the iterative denoising steps required for each frame.
TSDF Fusion: Predicted depth maps are fused into a unified Truncated Signed Distance Function (TSDF) volume using the estimated camera poses from SfM. For each voxel $v$ observed in frame $i$, the TSDF update rule integrates measurements from multiple views:

$$D(v) \leftarrow \frac{W(v)\,D(v) + w_i\,\psi\big(d_i(v) - z_i(v)\big)}{W(v) + w_i}, \qquad W(v) \leftarrow W(v) + w_i$$

where $\psi$ is the truncated signed distance function, $d_i(v)$ is the depth measurement from frame $i$, $z_i(v)$ is the current voxel depth estimate along the camera ray, and $w_i$ is a confidence weight derived from the PriorDA prediction uncertainty. The TSDF representation handles sensor noise gracefully through weighted averaging and naturally produces watertight meshes via Marching Cubes extraction.
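The per-voxel update is a weighted running average; a minimal sketch, assuming a truncation band and variable names not specified in the paper:

```python
# Weighted running-average TSDF update for a single voxel (sketch).
import numpy as np

def tsdf_update(tsdf, weight, sdf_measurement, w_new, trunc=0.08):
    """tsdf, weight: current voxel state; sdf_measurement: d_i(v) - z_i(v);
    w_new: PriorDA-derived confidence; trunc: truncation band in meters (assumed)."""
    psi = np.clip(sdf_measurement, -trunc, trunc)  # truncate the signed distance
    new_tsdf = (weight * tsdf + w_new * psi) / (weight + w_new)
    return new_tsdf, weight + w_new
```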
CropFormer Instance Segmentation + 3D Aggregation: CropFormer performs 2D instance segmentation on each keyframe, producing per-pixel instance mask predictions $M_{k,t}$ for instance $k$ in frame $t$. These masks are projected into 3D via the depth maps: each pixel $(u, v)$ belonging to instance $k$ in frame $t$ maps to a 3D point $D_t(u, v)\,K^{-1}[u, v, 1]^\top$ in the camera coordinate system, then transformed to world coordinates via the SfM pose $T_t$.
Cross-view aggregation combines these per-view 3D points into consistent instance segments. The voting scheme assigns each 3D point to the instance receiving the most votes across all observing views, with tie-breaking favoring the view with highest estimated depth confidence. This stage runs at approximately 96 seconds per scene.
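A minimal sketch of the voting rule, assuming each 3D point has already collected (instance id, depth confidence) votes from its observing views; the data layout is an assumption:

```python
# Cross-view instance voting with confidence tie-breaking (sketch).
from collections import defaultdict

def assign_instances(votes_per_point):
    """votes_per_point: list over 3D points; each entry is a list of
    (instance_id, depth_confidence) votes from observing views."""
    labels = []
    for votes in votes_per_point:
        counts, best_conf = defaultdict(int), defaultdict(float)
        for inst, conf in votes:
            counts[inst] += 1
            best_conf[inst] = max(best_conf[inst], conf)
        # Most votes wins; ties favor the instance seen with highest confidence.
        labels.append(max(counts, key=lambda k: (counts[k], best_conf[k])))
    return labels
```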
A critical sub-module, SVPPProcessor, orchestrates three operations:
- Superpoint segmentation: Over-segmentation of the dense mesh into compact, approximately-planar superpoints using a graph-based algorithm.
- Instance label assignment: Each superpoint receives an instance label based on majority voting of its constituent points’ 2D segment assignments.
- Segment re-splitting (`resplit_segments`): A post-processing step that detects and splits superpoints straddling multiple true instances — a common failure mode when two adjacent objects share similar visual appearance but different semantic identity.
Automatic Semantic Annotation: DescribeAnything (a vision-language model designed for dense region captioning) combined with Qwen-VL generates semantic labels for each segmented 3D instance. The VLM receives cropped image patches of each instance from the most fronto-parallel viewing angle and produces a natural language description (e.g., “wooden dining table with four chairs around it”). A lightweight parser extracts the primary object category and attributes from this description. This VLM-based labeling step is where the pipeline achieves true scalability — no human annotator reviews any mask or label, eliminating the traditional bottleneck that has limited 3D dataset construction to instrumented capture settings.
3.4 Spatial VQA Data Generation
Spatial Visual Question Answering requires not just object detections but structured relationships between objects — distances, directions, spatial containment, and relative positions. Standard VQA datasets (VQAv2, GQA) operate on 2D images and cannot express genuinely 3D spatial relations. SV++ constructs this data through a two-stage procedure:
Scene Graph Construction: From the 3D instance annotations (positions $p_i$, bounding boxes $b_i$, semantic labels $c_i$), the pipeline extracts a directed graph $G = (V, E)$ where nodes represent instances and edges encode spatial relations; a construction sketch follows the attribute list below. Each edge $(i, j)$ carries attributes:
- Relation type: left-of, right-of, above, below, inside, near, behind
- Distance: $d_{ij} = \|p_i - p_j\|$ in meters
- Direction vector: $(p_j - p_i)/\|p_j - p_i\|$
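The sketch below derives these edges from instance centroids. The axis convention (x right, y depth, z up) and the 1 m "near" threshold are assumptions, not values from the paper:

```python
# Scene-graph edge extraction from instance centroids (sketch).
import numpy as np

NEAR_THRESH = 1.0  # assumed: instances within 1 m are "near"

def build_edges(positions):
    """positions: (N, 3) centroids -> directed edges
    (i, j, relation, distance_m, unit_direction)."""
    edges = []
    n = len(positions)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            delta = positions[j] - positions[i]
            dist = float(np.linalg.norm(delta))
            direction = delta / dist
            # Relation type from the dominant axis of the offset.
            axis = int(np.argmax(np.abs(delta)))
            if axis == 0:
                rel = "right-of" if delta[0] > 0 else "left-of"
            elif axis == 2:
                rel = "above" if delta[2] > 0 else "below"
            else:
                rel = "behind" if delta[1] > 0 else None  # vocabulary has no front relation
            if rel is not None:
                edges.append((i, j, rel, dist, direction))
            if dist < NEAR_THRESH:
                edges.append((i, j, "near", dist, direction))
    return edges
```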
Template-Based QA Generation: Seven categories of question templates are instantiated automatically from the scene graph:
- Existence: “Is there a [object] in the [room]?” → Boolean answer from existence check
- Count: “How many [objects] are visible?” → Integer count from node filter
- Spatial Relation: “What is to the left of the [object]?” → Object label from edge traversal
- Distance: “How far is the [object] from the [object]?” → Numerical answer from $d_{ij}$
- Direction: “In which direction is the [object] from the viewer?” → Cardinal direction from ray casting
- Room Size: “What is the approximate area of this room?” → Floor area from mesh geometry
- Object Attribute: “What color is the [object]?” → Attribute from VLM description
Each template is instantiated by substituting ground-truth values from the scene graph, ensuring answer correctness by construction. No manual QA annotation is performed. The result is millions of spatially-grounded QA pairs spanning diverse question types, scene layouts, and object configurations.
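As an illustration of by-construction correctness, here is a sketch of the Distance template instantiated from an edge list like the one above; the graph layout and answer phrasing are assumptions:

```python
# Distance-template instantiation from a scene graph (sketch).
def make_distance_qa(labels, edges):
    """labels: instance label per node; edges: (i, j, rel, dist, dir) tuples."""
    qa_pairs = []
    for i, j, rel, dist, _ in edges:
        if rel == "near":
            continue  # skip the redundant "near" edges to avoid duplicate QA
        question = f"How far is the {labels[i]} from the {labels[j]}?"
        answer = f"{dist:.1f} meters"  # read off d_ij: correct by construction
        qa_pairs.append((question, answer))
    return qa_pairs
```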
3.5 VLN Navigation Data Generation
Vision-Language Navigation (VLN) requires trajectories through environments paired with natural language instructions that describe how to follow them. The R2R (Room-to-Room) benchmark standardizes this task: given a language instruction and a sequence of egocentric observations from the starting location, the agent must navigate to the correct goal location. SV++ generates VLN-compatible training data through a three-stage pipeline:
Path Preprocessing: Valid navigation paths are extracted from the reconstructed mesh by identifying walkable surfaces (horizontal planes whose normal vectors lie within a small angular tolerance of vertical, above floor level thresholds) and computing shortest-path routes between sampled navigable points using A* search on the walkability graph. Paths passing through narrow passages (< 0.8 m width) or approaching obstacles (< 0.5 m clearance) are filtered out as unrealistically constrained.
Action Encoding: Each continuous path is discretized into a sequence of atomic actions compatible with the R2R simulator action space: “forward” (a fixed-length step in meters), “turn left” / “turn right” (a fixed angular increment), and “stop”. The discretization uses piecewise-linear approximation with error bounded below one step size, ensuring the discrete path faithfully represents the continuous trajectory.
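A minimal sketch of this discretization over 2D waypoints. The 0.25 m step and 30° turn increment are standard R2R-style values assumed for illustration, not confirmed by the paper:

```python
# Piecewise-linear path discretization into atomic actions (sketch).
import numpy as np

STEP = 0.25             # assumed forward step length in meters
TURN = np.radians(30)   # assumed turn increment

def discretize(waypoints):
    """waypoints: list of 2D (x, y) points -> list of atomic actions."""
    actions, heading = [], 0.0
    pos = np.asarray(waypoints[0], dtype=float)
    for target in (np.asarray(w, dtype=float) for w in waypoints[1:]):
        dx, dy = target - pos
        # Quantize the heading change to whole turn increments.
        delta = (np.arctan2(dy, dx) - heading + np.pi) % (2 * np.pi) - np.pi
        n_turns = int(round(delta / TURN))
        actions += ["turn_left" if n_turns > 0 else "turn_right"] * abs(n_turns)
        heading += n_turns * TURN
        # Quantize the segment length to whole forward steps.
        actions += ["forward"] * int(round(np.linalg.norm(target - pos) / STEP))
        pos = target  # residual error stays below one step / turn increment
    actions.append("stop")
    return actions
```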
Instruction Generation: A language model (conditioned on the action sequence and observed visual features along the path) generates natural language navigation instructions. Critically, the pipeline applies three augmentation strategies:
- Paraphrasing: Instructions are rewritten with varied syntactic structure (“turn left after the table” vs. “once you pass the table, make a left turn”)
- Elaboration: Secondary landmarks and distractor descriptions are added to increase instruction length and complexity
- Distractor insertion: References to objects not along the correct path are occasionally included, forcing agents to maintain visual grounding rather than pattern-matching keywords
These augmentations address a well-documented failure mode in VLN systems: exploitation of superficial linguistic correlations (e.g., associating “stop” with short paths regardless of visual context) rather than genuine vision-language grounding.
Experimental Results: Deep Analysis
We present results across four tasks, with emphasis on what each experiment reveals about data quality, model behavior, and the limitations of current evaluation paradigms. All experiments follow a consistent protocol: pretrain on SV++ (or baseline) data, finetune on target benchmark, evaluate on standard test sets.
4.1 3D Object Detection (SpatialLM)
SpatialLM is a 3D object detection framework with a distinctive architecture: it tokenizes raw point clouds and injects them directly into a language model (Qwen2-0.5B as the backbone) via special tokens <|point_start|> and <|point_end|>. The core class SpatialLMQwenForCausalLM inherits from Qwen2’s causal LM and extends the forward method to accept a point_clouds parameter — a tensor of shape $(B, N, 3)$ containing $N$ 3D points per sample in batch $B$. Point coordinates are embedded through learned position encodings and inserted into the token sequence at designated positions, allowing the language model to attend jointly over text and 3D geometry.
The model outputs predictions in SceneScript DSL format — structured text describing scene entities (Wall, Door, Window, Bbox) with normalized discrete coordinates. Inference configuration: temperature=0.6, top_k=10, max_new_tokens=4096.
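The injection mechanism can be sketched schematically in PyTorch. This is not SpatialLM's actual implementation — the encoder architecture and the `insert_at` layout are assumptions — but it illustrates splicing learned point embeddings into the token stream between the special markers:

```python
# Schematic point-token injection into a causal LM input sequence (sketch).
import torch
import torch.nn as nn

class PointTokenInjector(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # Toy per-point encoder standing in for SpatialLM's learned tokenizer.
        self.point_encoder = nn.Sequential(
            nn.Linear(3, hidden_size), nn.GELU(), nn.Linear(hidden_size, hidden_size)
        )

    def forward(self, text_embeds, point_clouds, insert_at):
        """text_embeds: (B, T, H) token embeddings; point_clouds: (B, N, 3);
        insert_at: position just after <|point_start|> (assumed layout)."""
        point_embeds = self.point_encoder(point_clouds)  # (B, N, H)
        # Splice geometry into the sequence so self-attention treats text
        # and 3D points as one joint stream.
        return torch.cat(
            [text_embeds[:, :insert_at], point_embeds, text_embeds[:, insert_at:]],
            dim=1,
        )
```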
| Pre-training Strategy | F1@0.25 | F1@0.5 | Notes |
|---|---|---|---|
| Scratch (ScanNet only) | 2.9 | — | Nearly non-functional baseline |
| Synthetic Data → ScanNet FT | 38.0 | — | Synthetic pretraining helps significantly |
| SV++ → ScanNet FT | 58.6 | — | +20.6 over synthetic; best overall |
| SV++ Zero-shot | 30.9 | — | Approaches synthetic+finetuned level |
The most striking result is the zero-shot performance of 30.9 F1@0.25, which approaches the synthetic-pretrained-and-finetuned baseline of 38.0 without using any ScanNet training data at all. This indicates that SV++ data transfers substantially better than synthetic data, likely due to distributional alignment: SV++ scenes originate from real internet video captures (realistic lighting, authentic textures, natural clutter), whereas synthetic data exhibits rendering artifacts (over-clean surfaces, uniform lighting, lack of wear) that create a domain gap with the real ScanNet evaluation scenes (captured via structured-light scanners).
The +20.6 absolute gain from SV++ pretraining over the next-best approach represents one of the largest reported improvements for 3D object detection attributable solely to data scaling. To contextualize: this improvement exceeds many architecture innovations proposed in recent years (novel attention mechanisms, specialized 3D backbones, complex loss functions), suggesting that for 3D perception, data quality and diversity may currently be the binding constraint rather than model architecture.
4.2 3D Instance Segmentation (Mask3D)
For Mask3D, a query-based 3D instance segmentation model that predicts object instances directly from point cloud features without proposal generation, the results reveal a starkly different picture:
| Pre-training Strategy | AP | AP@50 | AP@25 | Notes |
|---|---|---|---|---|
| Scratch (ScanNet only) | 22.8 | — | — | Baseline |
| SV++ → ScanNet FT | 23.6 | — | — | +0.8 marginal gain |
The modest improvement of +0.8 AP stands in sharp contrast to the +20.6 F1 gain observed for SpatialLM on the same pretraining data. This divergence is not a statistical anomaly — it exposes a fundamental architectural sensitivity with important implications for 3D perception system design.
Mask3D consumes precomputed 3D features: input point clouds are typically voxelized, encoded through a 3D backbone (such as SparseUNet or MinkowskiNet), and the resulting feature volumes are fed to transformer decoder queries. This means Mask3D inherits the distributional characteristics of whatever upstream preprocessing pipeline generated those features. When the pretraining pipeline (SV++: SfM sparse reconstruction → PriorDA monocular depth → TSDF fusion → voxelization) differs from the evaluation pipeline (ScanNet: structured-light depth fusion → voxelization), the feature distributions shift — not just in mean and variance but potentially in the manifold structure of the feature space itself.
SpatialLM avoids this problem entirely because it processes raw point clouds through its own learned tokenizer. The tokenizer adapts to whatever point cloud distribution it encounters during pretraining, making the model robust to upstream pipeline variation. This is the insight we formalize in Section 5.1.
4.3 Spatial VQA (VSI-Bench)
The spatial VQA experiments use the VSI-Bench benchmark, testing Qwen2.5-VL-3B with various pretraining data configurations. Unlike detection and segmentation, VQA evaluates joint vision-language understanding — the model must both perceive the scene (via images or 3D representations) and reason spatially (via language):
| Configuration | Overall Score | Relative Distance | Direction | Object Count | Room Size |
|---|---|---|---|---|---|
| Zero-shot (no 3D data) | 27.9 | — | — | — | — |
| SV++ only | 42.8 (+14.9) | ↑↑ Strongest gain | ↑↑ Strongest gain | Moderate (+) | Moderate (+) |
| ScanNet (SN) | Higher than ZS | Lower than SV++ | Lower than SV++ | ↑↑ Best | ↑↑ Best |
| SN + SV++ combined | Highest overall | High | High | High | High |
Several findings merit detailed analysis:
Universal spatial knowledge benefits most: The largest gains from SV++ pretraining appear in Relative Distance and Direction questions — capabilities that depend on generalizable spatial reasoning (understanding Euclidean geometry, perspective, and reference frames) rather than dataset-specific statistics. SV++’s diverse scene distribution forces models to learn transferable spatial concepts because no single scene layout can serve as a shortcut. A model trained on ScanNet-style apartments might learn “tables are usually 2-3 meters apart” as a heuristic; a model trained on SV++’s heterogeneous scenes (from cramped studios to exhibition halls) must learn genuine distance estimation instead.
Domain-specific knowledge favors in-domain data: For Object Count and Room Size, ScanNet (SN) and SN + SV++ outperform SV++ alone. These tasks correlate with dataset-specific priors (e.g., typical furniture counts in ScanNet-style apartments cluster tightly around specific values; room areas in ScanNet fall within a narrow band). Models exploit these statistical shortcuts — a form of domain overfitting where in-domain data improves performance on questions whose answers can be predicted from dataset-level statistics rather than per-instance perception.
Overfitting inflection point exists: As training progresses on combined SN+SV++ data, the authors observe that in-domain metrics (Object Count on ScanNet-like scenes) continue improving while out-of-domain metrics (Direction on SV++-style diverse scenes) plateau or decline after a certain number of training iterations. This is a classic signature of distribution mismatch: the model increasingly specializes in the in-domain distribution at the expense of generalizable representations.
4.4 Vision-Language Navigation (R2R)
VLN results on the R2R (Room-to-Room) benchmark demonstrate the most dramatic relative improvement — and the most revealing ablation patterns:
| Configuration | Success Rate (SR) | nDTW | SDTW | Improvement |
|---|---|---|---|---|
| No pretraining | 0.088 | — | — | Baseline (near-random) |
| SV++ → R2R Finetune | 0.228 | — | — | +159% |
| w/o Instruction Augmentation | 0.074 | — | — | Collapse (-67% vs. full) |
| w/o Trajectory Refinement | 0.177 | — | — | Significant drop (-22%) |
The 159% improvement in success rate is remarkable — it transforms a navigator that barely exceeds random chance into one that completes more than one in five trajectories correctly. However, the ablation studies reveal that this gain is highly fragile and contingent on specific design choices:
Removing instruction augmentation causes SR to collapse to 0.074, which is below the no-pretraining baseline of 0.088. This is a deeply concerning result: it suggests the model was not learning genuine vision-language grounding but rather exploiting linguistic patterns present in the augmented instructions (specific phrasing structures, keyword co-occurrence statistics) as shortcuts. Without augmentation, these patterns disappear and performance degrades below random. This finding has direct parallels in the VLN literature, where navigation models have been shown to achieve high success rates on familiar instruction templates while failing completely on paraphrased versions.
Removing trajectory refinement reduces SR to 0.177, confirming that path quality matters substantially. The trajectory refinement module filters out geometrically implausible paths — those that pass through furniture, require impossibly tight turns, or violate physical constraints — that would otherwise inject noise into the training signal. The drop indicates that roughly a third of SV++’s gains (0.051 of the 0.140 absolute SR improvement over the baseline) come from having cleaner training trajectories, not just more of them.
Key takeaway: Task-specific data processing (instruction diversity management, path validity verification) is not optional engineering detail — it is the primary determinant of whether scaled data translates into task performance. Naive data scaling without task-aligned curation can be actively harmful, as demonstrated by the sub-baseline performance when augmentation is removed.
Discussion: Critical Insights
5.1 Model Scalability Differences: SpatialLM vs Mask3D
The divergent responses of SpatialLM (+20.6 F1) and Mask3D (+0.8 AP) to identical pretraining data constitute the most important empirical finding in the SV++ paper, yet it risks being dismissed as a curiosity. We argue it reveals a fundamental property of 3D perception architectures with broad implications for model design.
Models that operate directly on raw sensor modalities — processing unstructured point clouds through learned tokenizers before feeding into transformer backbones — exhibit far greater data efficiency and cross-domain transfer than models that consume precomputed intermediate representations (voxel grids, TSDF meshes, hand-crafted 3D features). The reason is structural: learned tokenizers act as domain-adaptive front-ends that can adjust their internal representations to match whatever distribution they encounter during pretraining. Fixed preprocessing pipelines, by contrast, impose a static mapping that may be well-suited to one domain (structured-light scans) but poorly suited to another (monocular SfM reconstructions).
This principle extends beyond 3D vision. In NLP, tokenizers trained on diverse corpora (Byte-Pair Encoding on web text) enable cross-domain transfer; fixed vocabularies struggle. In speech, learned audio front-ends (wav2vec 2.0) outperform hand-crafted MFCC features for transfer learning. SV++ provides the 3D analog of this well-established pattern.
For practitioners, the implication is clear: when building perception systems intended to leverage heterogeneous data sources (internet video, synthetic renders, multiple sensor types, legacy datasets), architectural choices should minimize dependency on fixed upstream processing pipelines. The PQ3D framework included in SV++’s codebase exemplifies this direction: its CoordinateEncoder applies Fourier positional encoding (adapted from Mask3D) to learn spatial embeddings from coordinates; MaskHeadSegLevel fuses multi-scale predictions from voxel, multi-view, and point cloud feature sources (taking their average); and SVPPProcessor manages the end-to-end flow from raw mesh to instance predictions. The full model trains with hidden_size=768, num_queries=120, following a two-phase schedule: 200 epochs of pretraining on SV++ data, followed by 600 epochs of finetuning on target benchmarks.
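For reference, here is a minimal Fourier coordinate encoding in the spirit of the CoordinateEncoder described above; the band count and frequency ladder are assumptions, and the exact SV++/Mask3D variant may differ:

```python
# Fourier positional encoding over 3D coordinates (sketch).
import torch

def fourier_encode(xyz: torch.Tensor, num_bands: int = 6) -> torch.Tensor:
    """xyz: (N, 3) normalized coordinates -> (N, 3 * 2 * num_bands) features."""
    freqs = 2.0 ** torch.arange(num_bands, dtype=xyz.dtype)  # geometric ladder
    angles = xyz.unsqueeze(-1) * freqs * torch.pi            # (N, 3, num_bands)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(start_dim=1)
```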
5.2 Data Quality Over Raw Scale
A recurring theme across all four tasks — detection, segmentation, VQA, and navigation — is that data quality consistently outweighs raw quantity. SV++’s 6,687 scenes outperform synthetic datasets containing orders of magnitude more samples because each SV++ sample carries higher fidelity geometric and semantic information. But what does “quality” mean concretely in this context?
Quality comprises at least four dimensions:
- Distributional alignment: How closely the pretraining data distribution matches the evaluation distribution. SV++ scenes come from real video captures; ScanNet evaluation scenes come from real instrumented captures. Both are “real,” sharing texture statistics, lighting variability, and clutter patterns that synthetic data lacks.
- Annotation correctness: How accurately the labels reflect ground truth. VLM-based annotation introduces errors (misclassified objects, missed instances, confused categories), but the error rate is low enough that models learn signal rather than noise — and the errors themselves may serve as implicit noise augmentation that improves robustness.
- Geometric consistency: Whether the 3D structure is physically plausible. SfM reconstruction failures (misregistered cameras, flipped room layouts) would poison training data. The multi-stage filtering pipeline (parallax checks, reprojection error thresholds, mesh validity tests) enforces a minimum geometric quality bar.
- Task relevance: Whether the annotation schema supports the target task. Instance masks help detection and segmentation; scene graphs enable VQA; trajectories support VLN. Mismatched schemas (e.g., having bounding boxes but needing segmentation masks for Mask3D) waste data potential regardless of quantity.
The quality control mechanisms in the SV++ pipeline — parallax-based keyframe selection, appearance-weighted SfM, VLM-based semantic verification, trajectory refinement — collectively act as implicit curriculum design. They ensure that the data presented to models progressively increases in difficulty and reliability, analogous to how curriculum learning orders training examples from easy to hard. The 71-second PriorDA inference time and 96-second CropFormer processing time per scene are not overhead to be minimized — they are the mechanism by which quality is enforced.
5.3 Domain Bias and Overfitting Risk
The VQA experiments reveal a troubling pattern that deserves emphasis: models trained predominantly on SV++ data show strong improvements on universal spatial reasoning tasks (direction, distance estimation) but weaker or sometimes negative transfer on domain-specific queries (object count, room size estimation). This asymmetry has a clear interpretation.
SV++’s scene distribution — dominated by large, diverse, architecturally varied spaces sourced from global internet content — creates a statistical prior that differs substantially from ScanNet’s prior (compact North American/European apartments with standardized layouts). When asked “how many chairs are in this room?”, a model trained on SV++ learns a broad, high-variance distribution (anywhere from 0 to 20+, depending on scene type). A model trained on ScanNet learns a tight, low-variance distribution (typically 2-6). On a ScanNet test scene, the ScanNet-trained model’s narrower prior gives it an advantage — even though its prior is less generally correct.
The implication for benchmark design is significant: aggregate metrics conceal underlying distribution shifts. A model that appears superior on overall score may be systematically worse on certain question types, scene categories, or geographic/architectural styles. We recommend that future 3D perception evaluations report fully disaggregated metrics by question category, scene type, object class, and — where available — geographic/cultural origin. Only through disaggregation can researchers determine whether improvements reflect genuine capability gains or merely better alignment with benchmark-specific priors.
Furthermore, the observed inflection point — where continued training on combined data improves in-domain metrics while degrading out-of-domain ones — should inform early stopping and model selection strategies. Selecting models purely on validation set performance (drawn from the same distribution as test) may systematically favor overfitted models that perform poorly in deployment scenarios where the data distribution differs from the benchmark.
5.4 Implications for Autonomous Driving and VLA Systems
While SV++ targets indoor scene understanding, its methodological lessons transfer directly to autonomous driving and Vision-Language-Action (VLA) systems — domains where the reader community likely has direct professional interest.
Data engine architecture is universal: The five-stage SV++ pipeline (filter → reconstruct → annotate → augment → validate) applies with straightforward substitutions to driving scenarios:
- Replace Room Tour videos with dashcam recordings, surround-view sensor logs, or crowd-sourced driving videos
- Replace indoor SfM with automotive SLAM (visual-inertial odometry, LiDAR SLAM) or neural radiance field reconstruction
- Replace spatial QA with driving-relevant question types (traffic sign recognition, pedestrian intent prediction, maneuver feasibility)
- Replace VLN trajectories with driving trajectories paired with navigation instructions
The infrastructure investments in SV++ — particularly the quality filtering cascades and VLM-based annotation — reduce to a template that can be adapted to new domains without reinventing the methodological foundation.
Real-world data beats synthetic for perception pretraining — probably: The zero-shot transfer results strongly suggest that pretraining on real (even noisy, imperfectly reconstructed) data outperforms clean synthetic data when the target domain is also real. For autonomous driving, this implies that large-scale curated dashcam datasets (e.g., BDD100K at scale, Waymo Open Dataset extended with VLM annotations) may be more valuable for foundation model pretraining than high-fidelity simulation renders from CARLA or Unreal Engine. Simulation excels for safety-critical edge case generation and reinforcement learning, but for pretraining perception foundations, real-world data’s distributional authenticity appears to outweigh simulation’s cleanliness. This hypothesis warrants direct empirical testing.
VLN instruction fragility is a warning for VLA systems: The observation that removing instruction augmentation causes VLN performance to collapse below the unpretrained baseline mirrors documented challenges in VLA systems for autonomous driving. Driving instruction generators (converting planned trajectories to natural language like “turn left at the intersection”) risk creating exploitable linguistic shortcuts — models may learn to associate “intersection” with “turn” statistically rather than grounding the instruction in perceived scene geometry. SV++’s response — aggressive augmentation with paraphrasing, elaboration, and distractors — provides a concrete blueprint for building robust VLA language grounding. For practitioners developing VLA systems for vehicles or robots, instruction diversity management should be treated as a first-class component of the data pipeline, not an afterthought.
References
- Chen, Y., Huang, S., et al. “Lifting Unlabeled Internet-level Data for 3D Scene Understanding.” CVPR, 2026. arXiv:2604.01907
- SceneVerse++ Project Page: https://sv-pp.github.io/
- SceneVerse++ Dataset: HuggingFace
- SceneVerse++ Codebase: GitHub
- Dai, A., Ritchie, D., et al. “ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes.” CVPR, 2017.
- Kirillov, A., Mintun, E., et al. “Segment Anything.” ICCV, 2023.
- Liu, H., Li, C., et al. “Visual Instruction Tuning.” NeurIPS, 2023. (LLaVA)
- Anderson, P., Chang, A., et al. “Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments.” CVPR, 2018. (R2R)
- Schönberger, J.L., Frahm, J.-M. “Structure-from-Motion Revisited.” CVPR, 2016. (COLMAP)
- Souček, T., Lokoč, J. “TransNet V2: An Effective Deep Network Architecture for Fast Shot Transition Detection.” arXiv:2008.04838, 2020.