World Model Series: Four Dimensions of World Representation on Xu'Blog

https://xuquant.com/posts/world-models/
Wed, 20 May 2026 10:00:00 +0800

X-World: XPeng's Controllable Ego-View Multi-Camera World Model and the Engineering of Production Driving World Models
https://xuquant.com/posts/world-models/xpeng-x-world/
Wed, 20 May 2026 10:00:00 +0800
An in-depth reading of XPeng's X-World: DiT-based latent video diffusion, two-stage training (Rectified Flow → DMD + Self-Forcing distillation), multi-channel action injection, and 7-camera view-temporal SA. A side-by-side comparison with Vista, DriveDreamer, GAIA-2, and Waymo WM traces the engineering path to a production-grade world model.

Autonomous Driving World Models × Action: The Six Paradigms on NAVSIM and Their Cross-Domain Dual
https://xuquant.com/posts/world-models/world-model-action-autonomous-driving/
Tue, 19 May 2026 10:00:00 +0800
The previous survey built the six paradigms of the world model × Action interface on robotics. This article is its AD dual: it carries the same theoretical skeleton over to autonomous driving, anchored on five 2026 H1 works (DriveLaW, DriveWorld-VLA, LaST-VLA, Latent-WAM, and Uni-World VLA) with NAVSIM scores in the 87-91 range. It analyzes which paradigm each of the five belongs to, the different trade-offs robotics and AD face under the same paradigm, and a critique of PDMS as an isomorphic metric.

From Predicting the Future to Driving Action: Architecture and Evaluation of Robot World Models
https://xuquant.com/posts/world-models/world-model-robot-learning/
Fri, 15 May 2026 10:00:00 +0800
Built around the joint NTU/UC Berkeley/Stanford survey World Model for Robot Learning, this article moves from the closed-loop motivation through a comparison of the six paradigms and the shift in evaluation to a critique of disentangled metrics, placing robot world models back within the orthogonal perspective of this series.

X-Cache: Inference Acceleration Infra for XPeng's Autonomous Driving World Model
https://xuquant.com/posts/world-models/xpeng-x-cache-world-model-inference-acceleration/
Sat, 28 Mar 2026 18:00:00 +0800
An in-depth reading of XPeng's X-Cache: cross-segment residual caching gives the world model a 2.7x inference speedup, with a 71% DiT block skip rate and almost no loss in image quality. A training-free inference optimization for autonomous driving.

VGGT: Geometric Reconstruction as the Reconstruct Dimension of World Models
https://xuquant.com/posts/world-models/vggt/
Sat, 21 Mar 2026 10:00:00 +0800
VGGT compresses multi-view 3D reconstruction into a single forward pass. This article re-checks the complexity derivation of alternating attention, tests the over-complete prediction strategy against the original numbers in Tables 3, 5, and 6 of the VGGT paper, and answers, from the angles of geometric prior transfer and representation philosophy, why a point map composed from depth and pose turns out more accurate than one predicted directly.

Wan2.2 and the Boundary of Video World Models
https://xuquant.com/posts/world-models/wan2.2-video-world-model-boundary/
Sat, 14 Mar 2026 10:00:00 +0800
Wan2.2 pushes video generation toward photorealistic world simulation, but where is the boundary between generating videos and understanding worlds? This article examines the architecture, training, and fundamental limits of video-based world models.

From 2D to 4D: The Ontological Question of Visual Representation
https://xuquant.com/posts/world-models/vision-2d-to-4d/
Sat, 07 Mar 2026 10:00:00 +0800
The ontological debate over 4D visual representation: is 4D = 3D + time, or 4D = multi-view + geometry? Why is 4D key to world models? What do spatial-temporal joint and decoupled designs mean for the geometry of the representation? This article discusses the philosophical side of world models; for the engineering implementation, see 4D Vision Encoder for Autonomous Driving.

Driving JEPA Survey: Applying the V-JEPA Family to Autonomous Driving
https://xuquant.com/posts/world-models/driving-jepa/
Sat, 21 Feb 2026 10:00:00 +0800
A survey of transferring the V-JEPA family to autonomous driving benchmarks: a comparison of fine-tuning results for driving-specific variants such as causal future masking, motion-aware mask, and temporal-coherent mask, and the fundamental mismatch in masking assumptions between driving and general video self-supervision.

Depth Anything 3: Geometric Grounding for World Models
https://xuquant.com/posts/world-models/depth-anything-3/
Sat, 07 Feb 2026 10:00:00 +0800
Depth Anything 3 unifies monocular depth, multi-view reconstruction, pose estimation, and novel view synthesis under a single depth-ray representation. This article analyzes why minimal representation matters for world models and what depth estimation reveals about the geometric foundations of physical understanding.

LeJEPA: When JEPA No Longer Needs Heuristics
https://xuquant.com/posts/world-models/lejepa/
Sat, 07 Feb 2026 10:00:00 +0800
LeJEPA pulls JEPA back from an engineering artifact that depends on a string of heuristics (stop-gradient, teacher-student, EMA) into a provably optimal theoretical framework: SIGReg uses random slicing to align the embedding distribution to an isotropic Gaussian, with a single hyperparameter, linear complexity, and about 50 lines of code. This article places it within the methodological lineage of collapse prevention in JEPA and explains why it is a direction LeCun personally endorsed in a 2025 interview.

DINOv3: The Scaling Impasse of Self-Supervised Vision Foundation Models and the Gram Anchoring Breakthrough
https://xuquant.com/posts/world-models/dinov3/
Sat, 24 Jan 2026 10:00:00 +0800
A dissection of DINOv3's core contributions: how Gram anchoring solves the fundamental problem of dense feature degradation in large-scale self-supervised training, the training engineering behind a 7B-parameter SSL model, and what its breakthroughs in depth estimation and 3D matching mean.

V-JEPA 2.1: When Self-Supervised Vision Learns to See Every Pixel
https://xuquant.com/posts/world-models/vjepa-2.1/
Sat, 10 Jan 2026 10:00:00 +0800
A deep analysis of V-JEPA 2.1's architectural innovations (dense predictive loss, deep self-supervision, multi-modal tokenizer, and scaling), tracing the path from collapsed context tokens to dense features that encode spatial structure, and the connection to depth estimation as geometric grounding.