3D-Reconstruction

SceneVerse++: Lifting Unlabeled Internet Videos into 3D Scene Understanding Training Data

Introduction The central paradox of 3D scene understanding — the task of enabling machines to perceive, reason about, and interact with three-dimensional environments — is that while the internet provides an effectively unlimited supply of video data depicting real-world indoor scenes, existing annotated datasets remain bottlenecked at a scale of thousands of scenes collected through expensive, instrumented capture pipelines. ScanNet, the de facto benchmark for 3D perception, has stagnated at ~1,500 scenes since 2017. ARKitScenes, despite leveraging consumer-grade depth sensors, covers only single-room apartments captured under constrained protocols. This data scarcity fundamentally limits progress: models trained on small datasets overfit to domain-specific biases, fail to generalize across scene types, and cannot leverage the scale advantages that have driven breakthroughs in 2D vision and NLP. ...

VGGT: 几何重建作为世界模型的 reconstruct 维度

1. 动机：传统几何重建在什么地方失效一辆自动驾驶车驶入隧道。GNSS 信号在 50 米内衰减为噪声，IMU 漂移开始累积，前向 6 路相机持续以 10 Hz 输入。系统需要在 100 ms 内回答两个问题：相机相对于隧道结构的位姿是什么？前方 30 米处那个反射点距离车头多远？ ...

Depth Anything 3: Geometric Grounding for World Models

Figure from Depth Anything 3: Recovering the Visual Space from Any Views 几何地基：深度为何是世界模型的基石一个无法度量距离的世界模型，也无法预测后果。这不是比喻。当自动驾驶汽车决定刹车还是转向时，决策的核心依赖于一个几何量：与前方障碍物的距离。当机械臂伸手去拿咖啡杯时，运动轨迹必须考虑杯子相对于夹爪的深度。当小孩接球时，大脑持续估计球的距离和速度以计算拦截点。在每一个例子中，支配行动的物理推理都锚定在几何之上，而几何始于深度。 ...