<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>World Model Series: Four Dimensions of World Representation on Xu'Blog</title><link>https://xuquant.com/posts/world-models/</link><description>Recent content in World Model Series: Four Dimensions of World Representation on Xu'Blog</description><image><title>Xu'Blog</title><url>https://xuquant.com/og-default.png</url><link>https://xuquant.com/og-default.png</link></image><generator>Hugo -- 0.152.2</generator><language>zh</language><lastBuildDate>Wed, 20 May 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://xuquant.com/posts/world-models/index.xml" rel="self" type="application/rss+xml"/><item><title>X-World: XPeng's Controllable Ego-View Multi-Camera World Model and the Engineering of Production Driving World Models</title><link>https://xuquant.com/posts/world-models/xpeng-x-world/</link><pubDate>Wed, 20 May 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/xpeng-x-world/</guid><description>An in-depth look at XPeng's X-World: DiT-based latent video diffusion + two-stage training (Rectified Flow → DMD + Self-Forcing distillation) + multi-channel action injection + 7-camera view-temporal SA. A side-by-side comparison with Vista / DriveDreamer / GAIA-2 / Waymo WM traces the engineering path to a production-grade world model.</description></item><item><title>Autonomous Driving World Models × Action: The Six Paradigms Applied to NAVSIM and Their Cross-Domain Duality</title><link>https://xuquant.com/posts/world-models/world-model-action-autonomous-driving/</link><pubDate>Tue, 19 May 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/world-model-action-autonomous-driving/</guid><description>The previous survey built the six paradigms of the world model × action interface on robotics scenarios. This article is its AD dual: it carries the same theoretical skeleton over to autonomous driving, anchored on five 2026 H1 works scoring in the 87-91 range on NAVSIM (DriveLaW, DriveWorld-VLA, LaST-VLA, Latent-WAM, Uni-World VLA), and analyzes the paradigm each work belongs to, the different trade-offs robotics and AD face under the same paradigm, and a critique of PDMS as an isomorphic metric.</description></item><item><title>From Predicting the Future to Driving Action: Architecture and Evaluation of Robot World Models</title><link>https://xuquant.com/posts/world-models/world-model-robot-learning/</link><pubDate>Fri, 15 May 2026 10:00:00 
+0800</pubDate><guid>https://xuquant.com/posts/world-models/world-model-robot-learning/</guid><description>Built around the joint NTU/UC Berkeley/Stanford survey World Model for Robot Learning, this article moves from the closed-loop motivation, a comparison of the six paradigms, and the shift in evaluation to a critique concerning disentangled metrics, placing robot world models back within this series' orthogonal perspective.</description></item><item><title>X-Cache: Inference Acceleration Infra for XPeng's Autonomous Driving World Model</title><link>https://xuquant.com/posts/world-models/xpeng-x-cache-world-model-inference-acceleration/</link><pubDate>Sat, 28 Mar 2026 18:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/xpeng-x-cache-world-model-inference-acceleration/</guid><description>An in-depth look at XPeng's X-Cache: cross-segment residual caching delivers a 2.7x inference speedup for the world model, with a 71% DiT block skip rate and almost no loss in visual quality, as a training-free inference optimization for autonomous driving.</description></item><item><title>VGGT: Geometric Reconstruction as the reconstruct Dimension of World Models</title><link>https://xuquant.com/posts/world-models/vggt/</link><pubDate>Sat, 21 Mar 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/vggt/</guid><description>VGGT compresses multi-view 3D reconstruction into a single forward pass. This article re-checks the complexity derivation of alternating attention, tests the over-complete prediction strategy against the raw numbers in Tables 3/5/6 of the VGGT paper, and answers, from the two angles of geometric prior transfer and representation philosophy, why point maps composed from depth and pose turn out more accurate than those predicted directly.</description></item><item><title>Wan2.2 and the Boundary of Video World Models</title><link>https://xuquant.com/posts/world-models/wan2.2-video-world-model-boundary/</link><pubDate>Sat, 14 Mar 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/wan2.2-video-world-model-boundary/</guid><description>Wan2.2 pushes video generation toward photorealistic world simulation, but where is the boundary between generating videos and understanding worlds? 
This article examines the architecture, training, and fundamental limits of video-based world models.</description></item><item><title>From 2D to 4D: The Ontological Question of Visual Representation</title><link>https://xuquant.com/posts/world-models/vision-2d-to-4d/</link><pubDate>Sat, 07 Mar 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/vision-2d-to-4d/</guid><description>The ontological debate over 4D visual representation: is 4D = 3D + time, or 4D = multi-view + geometry? Why is 4D key to world models? What do spatial-temporal joint and decoupled designs mean for representation geometry? This article discusses the philosophical side of world models; for the engineering implementation, see 4D Vision Encoder for Autonomous Driving.</description></item><item><title>Driving JEPA Survey: Applying the V-JEPA Family of Methods to Autonomous Driving</title><link>https://xuquant.com/posts/world-models/driving-jepa/</link><pubDate>Sat, 21 Feb 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/driving-jepa/</guid><description>A survey of transferring the V-JEPA family to autonomous driving benchmarks: a comparison of fine-tuning results for driving-specific variants such as causal future masking, motion-aware masks, and temporal-coherent masks, plus the fundamental mismatch in masking assumptions between driving and general video self-supervision.</description></item><item><title>Depth Anything 3: Geometric Grounding for World Models</title><link>https://xuquant.com/posts/world-models/depth-anything-3/</link><pubDate>Sat, 07 Feb 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/depth-anything-3/</guid><description>Depth Anything 3 unifies monocular depth, multi-view reconstruction, pose estimation, and novel view synthesis under a single depth-ray representation. 
This article analyzes why minimal representation matters for world models and what depth estimation reveals about the geometric foundations of physical understanding.</description></item><item><title>LeJEPA: When JEPA No Longer Needs Heuristics</title><link>https://xuquant.com/posts/world-models/lejepa/</link><pubDate>Sat, 07 Feb 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/lejepa/</guid><description>LeJEPA pulls JEPA back from an engineering artifact that relies on a string of heuristics such as stop-gradient, teacher-student, and EMA into a provably optimal theoretical framework: SIGReg uses random slicing to align the embedding distribution to an isotropic Gaussian, with a single hyperparameter, linear complexity, and about 50 lines of code. This article places that result within the methodological lineage of JEPA collapse prevention and explains why it is the direction LeCun personally endorsed in a 2025 interview.</description></item><item><title>DINOv3: The Scaling Impasse of Self-Supervised Vision Foundation Models and How Gram Anchoring Breaks It</title><link>https://xuquant.com/posts/world-models/dinov3/</link><pubDate>Sat, 24 Jan 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/dinov3/</guid><description>An analysis of DINOv3's core contributions: how Gram anchoring solves the fundamental problem of dense feature degradation in large-scale self-supervised training, the training engineering behind a 7B-parameter SSL model, and what its breakthroughs in depth estimation and 3D matching mean.</description></item><item><title>V-JEPA 2.1: When Self-Supervised Vision Learns to See Every Pixel</title><link>https://xuquant.com/posts/world-models/vjepa-2.1/</link><pubDate>Sat, 10 Jan 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/vjepa-2.1/</guid><description>A deep analysis of V-JEPA 2.1&amp;#39;s architectural innovations — dense predictive loss, deep self-supervision, multi-modal tokenizer, and scaling — tracing the path from collapsed context tokens to dense features that encode spatial structure, and the connection to depth estimation as geometric grounding.</description></item></channel></rss>