Xu'Blog

https://xuquant.com/
Recent content on Xu'Blog
https://xuquant.com/og-default.png
Hugo -- 0.152.2
Thu, 28 May 2026 22:30:00 +0800

Qwen-VLA Explained: T2A Decompression Prior, Flow-Matching PPO, and Cross-Embodiment Zero-Shot
https://xuquant.com/posts/foundation-models/qwen-vla/
Thu, 28 May 2026 22:30:00 +0800
Qwen-VLA (arXiv:2605.30280), released by the Qwen Team on 2026-05-28, joins a Qwen3.5-4B multimodal backbone with a 1.15B single-stream DiT flow-matching action expert into a unified embodied policy. The most interesting part is not the numbers but T2A: freeze the VLM, mask out the images, learn an action prior from text and the embodiment prompt alone, and only then add images, specialization, and RL in separate stages. This post follows the paper through the architecture, the four-stage recipe, the five-axis T2A ablation, the log-prob trick in flow-matching PPO, what the 26.6% zero-shot number on DOMINO actually means, and a few remaining reservations.

A Negative Result for Adding a Geometric Backbone to VLA: A Three-Architecture GR00T × VGGT Comparison
https://xuquant.com/posts/foundation-models/vla-geometric-fusion-three-architectures/
Thu, 28 May 2026 20:00:00 +0800
A team from NVIDIA, MIT, and UT Austin (arXiv:2605.24642) combines GR00T-N1.5 (a manipulation VLA) with VGGT (a geometric foundation model) and runs a controlled comparison of three geometry-injection architectures: Early Fusion, Late Fusion, and Spatial Forcing. The headline is a negative result: under standard finetuning, none of the three geometric VLAs significantly beats the GR00T baseline. Still, several findings in the ablation chain (don't unfreeze the LLM; probe gains do not imply task gains; mid-training matters more than the choice of architecture; start the gate near zero) bear directly on engineering decisions for production AD VLAs.

Understanding KL Divergence in Depth: Four Views
https://xuquant.com/posts/mathematics/probability/kl-divergence-four-views/
Thu, 28 May 2026 08:00:00 +0800
KL divergence shows up everywhere in ML (cross-entropy, ELBO, Information Bottleneck, RLHF, SAC), yet the question of why it has this particular form tends to stay stuck at the level of the formula. This post takes KL apart from four complementary views: coding length, likelihood ratio, information geometry (Bregman), and mode-seeking vs mass-covering, with each view explaining one of its properties. It then ties the four views back to cross-entropy, ELBO, IB, SAC, and RLHF to see which view's language each application uses.

HiF-VLA: Using Codec Byproducts as Temporal Memory for VLA
https://xuquant.com/posts/foundation-models/hif-vla-codec-motion-temporal-memory/
Wed, 27 May 2026 22:00:00 +0800
HiF-VLA (arXiv:2512.09928, CVPR 2026) extends OpenVLA with motion vectors extracted from the byproducts of MPEG-4 encoding: it predicts future motion in the forward direction and, in the backward direction, uses past motion to modulate the action stream through AdaLN. This post walks through the paper and the motion_layers/ code, covering the choice of representation, the actual code-level dimensions of the Hindsight Encoder, the AdaLN modulation in the Joint Expert, the latency breakdown in Table 3, and several points the paper leaves underexplained.

Production VLA: 8 Engineering Judgments + 4 Counterexamples
https://xuquant.com/posts/autonomous-driving/production-vla-engineering-tradeoffs/
Tue, 26 May 2026 10:00:00 +0800
The trade-off logic behind 8 concrete choices in a production VLA, including VLM training, trajectory-head selection, cross-attention signals, a variable-length stream membank, isomorphic feature distillation, the SFT-AFT-RL three-stage mix, and single-board deployment; plus 4 "tried it, didn't help" counterexamples that mark the boundary of the search space.

Entropy and Information Theory: From -log p to Deep Learning
https://xuquant.com/posts/mathematics/probability/entropy-and-information/
Mon, 25 May 2026 20:00:00 +0800
Derives the necessity of -log p from an axiomatic standpoint, works through entropy, mutual information, KL divergence, and the maximum-entropy principle in turn, and then returns to the forms that keep recurring in deep learning: cross-entropy loss, ELBO, the information bottleneck, and maximum-entropy reinforcement learning.

Affordance vs Symbolic Perception in AD: Where the Binary Framing Goes Wrong
https://xuquant.com/posts/autonomous-driving/affordance-vs-symbolic-perception/
Sun, 24 May 2026 11:00:00 +0800
The AD community treats affordance vs symbolic as a dichotomy, but "symbolic" refers to both structured perception outputs and language outputs, benchmark rankings are inconsistent, and Wayve, Tesla, and NIO / XPeng / Li Auto all take hybrid positions in practice. The spectrum itself is a misframing; what actually determines a production VLA is a handful of independent engineering axes.

Dense Latent Predictive Supervision in AD VLA: Why Pixels Are Not Optimal
https://xuquant.com/posts/autonomous-driving/dense-latent-predictive-supervision/
Sun, 24 May 2026 10:00:00 +0800
AD VLAs supervise a 2B+ parameter backbone with a sparse trajectory loss (12 waypoints × 2D = 24 scalars), an information-theoretic ratio of roughly 10⁻¹⁰. This supervision deficit is the core reason NAVSIM scores have stalled in the 87-93 range. DriveVLA-W0 fills the gap with pixel-level future image prediction, which is the right direction but not the best route. V-JEPA-style latent predictive supervision is friendlier on three counts: capacity, inference cost, and isomorphism with evaluation.

3D Visual Representations for Autonomous-Driving VLA: From Capability Limits to Engineering Injection
https://xuquant.com/posts/autonomous-driving/3d-vision-injection-for-ad-vla/
Fri, 22 May 2026 10:00:00 +0800
Engineering decisions for 3D injection at the vision tower when deploying an autonomous-driving VLA system. Starting from the geometric capabilities driving actually needs, the post moves through a topological analysis of the latent space, three sources of geometric priors, and five injection paths, up to the hard constraint of the on-vehicle inference budget, and arrives at a set of actionable decision criteria.

ATLAS: An Action Vocabulary for Visual Reasoning
https://xuquant.com/posts/foundation-models/atlas-one-word-visual-reasoning/
Thu, 21 May 2026 20:00:00 +0800
A reading of "ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both", which compresses intermediate visual operations such as drawing auxiliary lines, boxing regions, pointing with arrows, and adding text annotations into trainable functional tokens.

X-World: XPeng's Controllable Ego-View Multi-Camera World Model, and the Engineering of a Production Driving World Model
https://xuquant.com/posts/world-models/xpeng-x-world/
Wed, 20 May 2026 10:00:00 +0800
An in-depth look at XPeng's X-World: DiT-based latent video diffusion, two-stage training (Rectified Flow → DMD + Self-Forcing distillation), multi-channel action injection, and 7-camera view-temporal SA. A side-by-side comparison with Vista, DriveDreamer, GAIA-2, and Waymo WM traces the engineering path to a production-grade world model.

Autonomous-Driving World Models × Action: The Six Paradigms on NAVSIM and Their Cross-Domain Dual
https://xuquant.com/posts/world-models/world-model-action-autonomous-driving/
Tue, 19 May 2026 10:00:00 +0800
The previous survey established six paradigms for the world model × action interface in robotics settings. This post is its AD counterpart: it carries the same theoretical skeleton over to autonomous driving, anchored on five 2026 H1 works scoring in the 87-91 range on NAVSIM (DriveLaW, DriveWorld-VLA, LaST-VLA, Latent-WAM, and Uni-World VLA). It analyzes which paradigm each belongs to, how the trade-offs differ between robotics and AD within the same paradigm, and offers a critique of PDMS as an isomorphic metric.

Resume
https://xuquant.com/resume/
Mon, 18 May 2026 10:00:00 +0800
Algorithm Engineer · Autonomous Driving · VLA / World Models

Polar Express: Doubling the Speed of Muon's Matrix Orthogonalization with Chebyshev Approximation
https://xuquant.com/posts/mathematics/matrix/polar-express/
Mon, 18 May 2026 09:00:00 +0800
Newton-Schulz iteration has a hidden flaw inside the Muon optimizer: it barely moves during the first dozen or so steps. The ICLR 2026 Honorable Mention paper The Polar Express fixes this with interval-optimal polynomials plus Chebyshev equioscillation approximation, and reports consistent val-loss improvements on GPT-2. Starting from the pain points of Newton-Schulz, this post compares three solutions (Jordan's heuristic, You's six-step method, and Polar Express) and details the application of the Remez algorithm to odd quintics, the convergence proof for the interval-composed polynomials, and the engineering trade-offs in bfloat16.

Why Large Diffusion Models Don't Memorize Their Training Data: Implicit Regularization Across Two Timescales
https://xuquant.com/posts/mathematics/diffusion/why-diffusion-dont-memorize/
Mon, 18 May 2026 09:00:00 +0800
A NeurIPS 2025 Best Paper (Bonnaire et al. 2025) gives a clean answer: diffusion-model training has two separated timescales, a generalization window τ_gen and a memorization window τ_mem. τ_mem is proportional to the dataset size n (with a measured slope of about 300K steps per sample), which means the larger the dataset, the longer the safe training window automatically becomes. The underlying mechanism is the spectral bias of neural-network gradient flow: the low-frequency population score is learned first, while the high-frequency spikes of the empirical score are only caught up with after many more steps. Starting from the empirical concerns raised by Carlini 2023, this post details the experimental phenomena behind the two timescales, the derivation of the n-linear scaling law, the spectral analysis of Random Feature networks, and the implications for training practice.

4D Vision Encoder for Autonomous Driving: A Unified Examination Through the Information Bottleneck
https://xuquant.com/posts/autonomous-driving/4d-vision-encoder-for-autonomous-driving/
Sun, 17 May 2026 10:00:00 +0800
Places 9 4D visual-encoding schemes, including AR1 Tri-plane, Flex, MEM, Memory VLA, BEV/OCC, V-JEPA, DA3, and VGGT, in a single information-bottleneck coordinate system; derives five necessary conditions for an ideal 4D encoder from the four-part structure of Y (perception / prediction / planning / reasoning); and lays out an evaluation path for the 4V→7V upgrade on Qwen3.5.

From Predicting the Future to Driving Action: Architecture and Evaluation of Robot World Models
https://xuquant.com/posts/world-models/world-model-robot-learning/
Fri, 15 May 2026 10:00:00 +0800
Built around the joint NTU / UC Berkeley / Stanford survey World Model for Robot Learning, this post covers the closed-loop motivation, a comparison of the six paradigms, the shift in evaluation, and a critique concerning disentangled metrics, placing robot world models back within this series' orthogonal perspectives.

Navigation Information Injection in the VLA Setting: From Prompt to Diffusion Condition
https://xuquant.com/posts/autonomous-driving/diffusion-planner-navigation-injection/
Thu, 14 May 2026 10:00:00 +0800
Commercial navigation SDKs generally output rich signals such as maneuver chains, auxiliary action semantics, lane-level guidance, and route waypoints, but the public datasets nuPlan, nuScenes, and NAVSIM never captured these fields at collection time, so VLA research that consumes full navigation input can currently only be done on private datasets. Taking "how can a VLA efficiently consume the navigation information the industry already provides" as its thread, this post dissects four layers of injection mechanisms (prompt encoding, adapter alignment, diffusion conditioning, and unified spatial tokens) and discusses the VLN continuous-interaction paradigm, covering recent 2024-2026 work including SpaceDrive, SSR, DiffusionPlanner, GoalFlow, ONR/MAT, and SGDrive.

Score Matching, GANs, and the Unification of Generative Models
https://xuquant.com/posts/mathematics/probability/score-matching-gan/
Mon, 11 May 2026 09:00:00 +0800
From Hyvarinen score matching to denoising score matching, and from GAN adversarial training to score functions, this post builds a unified understanding of VAEs, GANs, and diffusion models under a distribution-matching framework.

Temporal Memory Mechanisms in VLMs: From Video Compression to Long- and Short-Term Memory Fusion
https://xuquant.com/posts/autonomous-driving/vlm-temporal-memory-mechanisms/
Sat, 09 May 2026 06:00:00 +0800
A systematic survey of mainstream approaches to temporal modeling in VLMs: the Nvidia Flex encoder, the LlamaFactory video-processing pipeline, Qwen's spatiotemporal compression, the spatiotemporally separable attention of Pi 0.7 MEM, and Memory VLA. It also details a zero-parameter retrofit of MEM based on the Qwen3-VL engineering implementation.

Optimal Transport and the Wasserstein Distance: From Monge to Kantorovich
https://xuquant.com/posts/mathematics/probability/optimal-transport-wasserstein/
Wed, 06 May 2026 09:00:00 +0800
From Monge's transport problem to the Kantorovich relaxation, this post derives the definition and dual form of the Wasserstein distance and explains why the W distance is better suited than KL divergence for measuring differences between distributions.

Code as Perception: Why a Model That Can "Read Code" Holds the Key to STEM Problems
https://xuquant.com/posts/foundation-models/codepercept-perception-bottleneck/
Sat, 02 May 2026 10:00:00 +0800
An in-depth reading of the CVPR 2026 paper CodePercept, which uses systematic scaling experiments to argue that perception, not reasoning, is the real bottleneck in STEM visual reasoning, and proposes a dual-channel paradigm with executable code as the perception medium. The 8B model beats a 72B baseline by 6.2%, and the 32B model surpasses GPT5-Thinking on STEM2Code-Eval.

Variational Autoencoders: From ELBO to Reparameterization
https://xuquant.com/posts/mathematics/probability/vae-elbo/
Sat, 02 May 2026 09:00:00 +0800
Starting from the inference problem in generative models, this post derives two equivalent forms of the ELBO, explains why the reparameterization trick is necessary, and analyzes the information bottleneck and posterior collapse in VAEs.

ReflectDrive-2: Li Auto's Discrete-Diffusion End-to-End Driving with Joint RL Optimization
https://xuquant.com/posts/autonomous-driving/reflectdrive-2-discrete-diffusion-end-to-end-driving/
Sat, 25 Apr 2026 18:00:00 +0800
An in-depth reading of Li Auto's ReflectDrive-2: discrete diffusion for end-to-end planning, a three-stage "decide, draft, reflect" pipeline with AutoEdit local correction, joint RL optimization that amplifies the AutoEdit gain 6×, 91.0 PDMS with camera-only input (NAVSIM v1 navtest), and 31.8 ms per frame on Thor.

Flow Matching and Consistency Models: A New Unification of Generative Paradigms
https://xuquant.com/posts/mathematics/diffusion/flow-matching-consistency/
Sat, 25 Apr 2026 09:00:00 +0800
From the stochastic paths of diffusion models to the deterministic optimal-transport paths of Flow Matching, and on to the single-step distillation of consistency models, this post builds a unified framework for generative models from the ODE perspective.

Unifying SDE and ODE Views of Diffusion Models: From Stochastic Differential Equations to Deterministic Sampling
https://xuquant.com/posts/mathematics/diffusion/sde-ode-unified/
Wed, 22 Apr 2026 09:00:00 +0800
Derives the continuous SDE limit from a discrete Markov chain, gives a rigorous derivation of the probability-flow ODE, and explains the geometric meaning of the score function and the equivalence with Langevin dynamics.

Kaiming's Methodology: From ResNet to iMF, the Research Path of Someone Who Keeps Asking About Essentials
https://xuquant.com/posts/foundation-models/kaiming-he-cvpr2026-five-papers-flow-matching-breakthrough/
Sat, 18 Apr 2026 18:00:00 +0800
A close reading of Kaiming He's CVPR 2026 work with iMF (Improved Mean Flow, arXiv:2512.02012) as the main thread, placed back in the ten-year lineage of ResNet / MoCo / MAE / SiT to draw out four recurring strands of methodological DNA: simplicity taken to the extreme, changing the problem's assumptions, strong priors with few assumptions, and decoupling method from task. Closely linked to the mathematics/diffusion series.

The Variational Foundations of Diffusion Models: From ELBO to Denoising
https://xuquant.com/posts/mathematics/diffusion/ddpm-variational/
Sat, 18 Apr 2026 09:00:00 +0800
Derives the DDPM variational lower bound from the ELBO, explains the physical meaning of its three-term decomposition, proves the equivalence of predicting noise and predicting data, and builds a variational understanding of diffusion training.

Compression Under a Rotation Constraint: From RoPE to DeepSeek MLA
https://xuquant.com/posts/mathematics/position-encoding/mla-from-rope/
Sat, 11 Apr 2026 09:00:00 +0800
The incompatibility between RoPE and low-rank compression is the core driver of MLA's design: from the mathematical proof that rotation matrices break low-rank structure to the engineering solution of decoupled RoPE.

DeepSeek's Thinking with Visual Primitives: Teaching Multimodal Models to "Reason by Pointing"
https://xuquant.com/posts/foundation-models/deepseek-thinking-with-visual-primitives/
Sat, 04 Apr 2026 20:00:00 +0800
A reading of the "Thinking with Visual Primitives" technical report from DeepSeek with Peking University and Tsinghua University, which interleaves coordinates and bounding boxes into the CoT as chain-of-thought primitives, attempting to use structured spatial symbols to mitigate referential drift during reasoning. This post lays out the method's mechanics and critically examines its ontological proposal that "modality is ontology".

The Base-β Analogy for RoPE and Length Extrapolation
https://xuquant.com/posts/mathematics/position-encoding/rope-ntk-extrapolation/
Sat, 04 Apr 2026 09:00:00 +0800
Treats RoPE's rotation angles as the digits of a base-β number, giving a unified understanding of length-extrapolation methods such as NTK-Aware and YaRN and revealing the fundamental trade-off between resolution and range.

X-Cache: Inference-Acceleration Infra for XPeng's Autonomous-Driving World Model
https://xuquant.com/posts/world-models/xpeng-x-cache-world-model-inference-acceleration/
Sat, 28 Mar 2026 18:00:00 +0800
An in-depth reading of XPeng's X-Cache: cross-segment residual caching delivers a 2.7× inference speedup for the world model, with a 71% DiT block skip rate and almost no loss in image quality, as a training-free inference optimization for autonomous driving.

The Geometric Essence of Rotary Position Embedding: From Complex Numbers to Rotation Matrices
https://xuquant.com/posts/mathematics/position-encoding/rope-geometry/
Sat, 28 Mar 2026 09:00:00 +0800
Starting from the geometric intuition that complex multiplication is rotation, this post derives RoPE's block-diagonal rotation-matrix construction and explains the core property that the inner product depends only on relative position.

SceneVerse++: Lifting Unlabeled Internet Videos into 3D Scene Understanding Training Data
https://xuquant.com/posts/foundation-models/sceneverse-plus-data-engine-for-3d-scene-understanding/
Sat, 21 Mar 2026 18:00:00 +0800
Deep analysis of CVPR 2026 SceneVerse++: how to build the largest-scale real-world 3D scene dataset from unlabeled internet videos, covering detection, segmentation, spatial VQA, and vision-language navigation.

VGGT: Geometric Reconstruction as the Reconstruct Dimension of World Models
https://xuquant.com/posts/world-models/vggt/
Sat, 21 Mar 2026 10:00:00 +0800
VGGT compresses multi-view 3D reconstruction into a single forward pass. This post rechecks the complexity derivation for alternating attention, tests the overcomplete prediction strategy against the raw numbers in Tables 3, 5, and 6 of the VGGT paper, and answers, from the two angles of geometric-prior transfer and representation philosophy, why a point map composed from depth and pose turns out more accurate than one predicted directly.

Wan2.2 and the Boundary of Video World Models
https://xuquant.com/posts/world-models/wan2.2-video-world-model-boundary/
Sat, 14 Mar 2026 10:00:00 +0800
Wan2.2 pushes video generation toward photorealistic world simulation, but where is the boundary between generating videos and understanding worlds? This article examines the architecture, training, and fundamental limits of video-based world models.

The Muon Optimizer: Gradient Updates Driven by Matrix Orthogonalization
https://xuquant.com/posts/mathematics/matrix/muon-optimizer/
Sat, 14 Mar 2026 09:00:00 +0800
From the momentum method to the orthogonalization of matrix momentum, this post derives the convergence of Newton-Schulz iteration, explains the engineering compromise of streaming power iteration, and covers Muon's 2x speedup in Kimi K2 training.

Qwen3.5 vs Qwen3: A Deep Architectural Comparison
https://xuquant.com/posts/foundation-models/qwen3-vs-qwen3-5-architecture/
Sat, 07 Mar 2026 14:00:00 +0800
An in-depth comparison of the architectural differences between Qwen3.5 and Qwen3: hybrid attention, a joint multimodal training strategy, highly sparse MoE, and partial RoPE, tracing the evolution along the three dimensions of attention, vision, and MoE.

From 2D to 4D: The Ontology of Visual Representation
https://xuquant.com/posts/world-models/vision-2d-to-4d/
Sat, 07 Mar 2026 10:00:00 +0800
The ontological question of 4D visual representation: is 4D = 3D + time, or 4D = multi-view + geometry? Why is 4D key to world models? What do spatial-temporal joint and decoupled designs mean for the geometry of the representation? This post is a discussion on the philosophical side of world models; for the engineering implementation, see 4D Vision Encoder for Autonomous Driving.

Spectral Norm, Condition Number, and the Optimization Landscape
https://xuquant.com/posts/mathematics/matrix/spectral-norm/
Sat, 07 Mar 2026 09:00:00 +0800
The spectral norm is a matrix's maximum stretch factor, and the condition number determines the convergence speed of gradient descent: from the geometry of the optimization landscape to spectral normalization in practice.

Singular Value Decomposition and Low-Rank Approximation: From Matrix Compression to LoRA Fine-Tuning
https://xuquant.com/posts/mathematics/matrix/svd-low-rank/
Sat, 28 Feb 2026 09:00:00 +0800
Starting from the geometric intuition of the SVD, this post derives the Eckart-Young low-rank approximation theorem and explains the matrix theory behind LoRA fine-tuning: why a factorization whose rank is far smaller than d still works.

A Driving JEPA Survey: Applying V-JEPA-Family Methods to Autonomous Driving
https://xuquant.com/posts/world-models/driving-jepa/
Sat, 21 Feb 2026 10:00:00 +0800
A survey of transferring the V-JEPA family to autonomous-driving benchmarks: a comparison of fine-tuning results for driving-specific variants such as causal future masking, motion-aware masks, and temporal-coherent masks, and the fundamental mismatch in masking assumptions between driving and general video self-supervision.

Depth Anything 3: Geometric Grounding for World Models
https://xuquant.com/posts/world-models/depth-anything-3/
Sat, 07 Feb 2026 10:00:00 +0800
Depth Anything 3 unifies monocular depth, multi-view reconstruction, pose estimation, and novel view synthesis under a single depth-ray representation. This article analyzes why minimal representation matters for world models and what depth estimation reveals about the geometric foundations of physical understanding.

LeJEPA: When JEPA No Longer Needs Heuristics
https://xuquant.com/posts/world-models/lejepa/
Sat, 07 Feb 2026 10:00:00 +0800
LeJEPA pulls JEPA back from an engineering artifact that relies on a string of heuristics (stop-gradient, teacher-student, EMA) into a provably optimal theoretical framework: SIGReg aligns the embedding distribution to an isotropic Gaussian through random slicing, with a single hyperparameter, linear complexity, and about 50 lines of code. This post places it within the methodological lineage of JEPA collapse prevention and explains why it is a direction LeCun personally endorsed in 2025 interviews.

DINOv3: The Scaling Impasse of Self-Supervised Vision Foundation Models and the Gram Anchoring Breakthrough
https://xuquant.com/posts/world-models/dinov3/
Sat, 24 Jan 2026 10:00:00 +0800
An analysis of DINOv3's core contributions: how Gram anchoring solves the fundamental problem of dense-feature degradation in large-scale self-supervised training, the training engineering behind a 7B-parameter SSL model, and what its breakthroughs in depth estimation and 3D matching mean.

V-JEPA 2.1: When Self-Supervised Vision Learns to See Every Pixel
https://xuquant.com/posts/world-models/vjepa-2.1/
Sat, 10 Jan 2026 10:00:00 +0800
A deep analysis of V-JEPA 2.1's architectural innovations (dense predictive loss, deep self-supervision, multi-modal tokenizer, and scaling), tracing the path from collapsed context tokens to dense features that encode spatial structure, and the connection to depth estimation as geometric grounding.

CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery
https://xuquant.com/posts/foundation-models/coral-autonomous-multi-agent-evolution/
Sat, 22 Nov 2025 10:00:00 +0800
How delegating the key decisions of evolutionary search to autonomous agents, rather than fixed heuristic rules, yields faster convergence and stronger results on mathematical-optimization and systems-optimization tasks.

Diffusion Models and Autonomous-Driving Planning: From the Mathematics of Denoising to Trajectory Generation
https://xuquant.com/posts/autonomous-driving/diffusion-for-driving/
Sat, 08 Nov 2025 10:00:00 +0800
An in-depth walkthrough of diffusion-model principles for autonomous driving: from DDPM's variational inference to the straight-line coupling of Flow Matching, and from the conditional control of Classifier-Free Guidance to the truncated speedup of Truncated Diffusion, aiming to understand the "why" of each step rather than just the "how".

ReconVLA: Visually Grounding a VLA with Gaze-Crop Reconstruction
https://xuquant.com/posts/foundation-models/reconvla-gaze-crop-implicit-grounding/
Mon, 27 Oct 2025 22:00:00 +0800
OpenHelix's ReconVLA (arXiv:2508.10333) attaches a 3-layer DiT after an OpenVLA-style backbone and uses VAE-latent reconstruction of the gaze crop as auxiliary supervision, anchoring the VLA's attention on the target object. This post reads the paper against the open-source code, including engineering details the paper does not emphasize and several questions it leaves unanswered: the recon-on/off ablation is missing, and the "implicit grounding" in fact depends on offline YOLO bboxes for its training supervision.

InSpatio-World: Real-Time 4D World Simulation via Spatiotemporal Autoregressive Modeling
https://xuquant.com/posts/foundation-models/inspatio-world-4d-simulator/
Sat, 25 Oct 2025 10:00:00 +0800
An in-depth technical analysis of InSpatio-World: a 1.3-billion-parameter real-time 4D world simulator that combines an implicit spatiotemporal cache with explicit geometric constraints to perform novel view synthesis from monocular video at 24 FPS.

Reinforcement Learning for End-to-End Autonomous Driving: From Offline DPO to Iterative Self-Improvement
https://xuquant.com/posts/autonomous-driving/basic_rl/
Sat, 20 Sep 2025 10:00:00 +0800
A comprehensive analysis of applying reinforcement learning to end-to-end autonomous-driving systems, covering the metric caching mechanism, DPO under different action representations, and strategies for breaking through the sampling ceiling of iterative self-improvement pipelines.

Multi-Head Latent Attention: An Engineering View of DeepSeek V2/V3
https://xuquant.com/posts/foundation-models/deepseek_series1_mla/
Sat, 13 Sep 2025 10:00:00 +0800
Analyzes MLA from the standpoint of actual DeepSeek V2/V3 deployment: KV cache compression ratio, inference throughput, an engineering comparison with GQA/MQA, and the real gains under long context. For the mathematical derivation of MLA, see the companion article.

Alpamayo: A Reasoning-Action Aligned VLA System for Autonomous Driving
https://xuquant.com/posts/autonomous-driving/nvidia_vla/
Sat, 30 Aug 2025 10:00:00 +0800
An in-depth technical analysis of the Nvidia Alpamayo VLA autonomous-driving system, which uses Cosmos-Reason as its VLM backbone, covering tri-plane visual encoding, ego-shortcut avoidance, the "change-factor" (变化因) dataset paradigm, and reasoning-action alignment achieved through reinforcement learning.

Policy Optimization for End-to-End Autonomous Driving: From REINFORCE to GRPO
https://xuquant.com/posts/autonomous-driving/rl-policy-optimization-e2e-driving/
Sat, 09 Aug 2025 10:00:00 +0800
A systematic derivation of policy-optimization methods for end-to-end autonomous driving, from REINFORCE to PPO to GRPO, covering advantage estimation, the differences between LLM and driving sampling, multi-objective loss design, and the role of noise in diffusion-model exploration.

End-to-End Autonomous Driving: From Modular Decoders to VLA Architectures
https://xuquant.com/posts/autonomous-driving/e2e-autonomous-driving-evolution/
Sat, 19 Jul 2025 10:00:00 +0800
A technical survey of the evolution of end-to-end autonomous-driving architectures, covering planner decoder choices (AR vs Diffusion vs Flow Matching), VLA integration strategies, and engineering best practices for data infrastructure, training optimization, and evaluation systems.

Trajectory Tokenization for Autoregressive Planning: Clustering, Matching, and the AR+Diffusion Paradigm
https://xuquant.com/posts/autonomous-driving/ar-trajectory-tokenization/
Sat, 28 Jun 2025 10:00:00 +0800
An in-depth look at trajectory tokenization for autoregressive driving planners: from state discretization based on k-means clustering, to token matching and reconstruction, to the AR+Diffusion paradigm and GRPO-based reinforcement-learning post-training.

Why Generative Planning? The Non-Convexity Argument Against Regression in Autonomous Driving
https://xuquant.com/posts/autonomous-driving/generative-planning-nonconvex/
Sat, 07 Jun 2025 10:00:00 +0800
A first-principles analysis of why regression-based planners fail in autonomous driving: the feasible region is non-convex, MSE averages modes onto obstacles, GMMs are a patch rather than a solution, and generative methods are necessary.

Series
https://xuquant.com/series/
Mon, 01 Jan 0001 00:00:00 +0000
Index of article series

Popular Posts
https://xuquant.com/popular/
Mon, 01 Jan 0001 00:00:00 +0000
Ranking of popular posts

Knowledge Graph
https://xuquant.com/graph/
Mon, 01 Jan 0001 00:00:00 +0000
Interactive knowledge graph of all site articles and core concepts