Paper-Reading on Xu'Blog

https://xuquant.com/tags/paper-reading/
Recent content in Paper-Reading on Xu'Blog
Thu, 28 May 2026 22:30:00 +0800

Reading Qwen-VLA: the T2A Decompression Prior, Flow-Matching PPO, and Cross-Embodiment Zero-Shot
https://xuquant.com/posts/foundation-models/qwen-vla/
Thu, 28 May 2026 22:30:00 +0800
Qwen-VLA (arXiv:2605.30280), released by the Qwen Team on 2026-05-28, joins a Qwen3.5-4B multimodal backbone with a 1.15B single-stream DiT flow-matching action expert into a unified embodied policy. The most interesting part is not the numbers but T2A: freeze the VLM, mask out the images, and learn an action prior from text and the embodiment prompt alone, then add images, specialize, and apply RL in separate stages. This post follows the paper through the architecture, the four-stage recipe, the five-dimension T2A ablation, the log-prob trick in flow-matching PPO, what the 26.6% DOMINO zero-shot number actually means, and a few remaining reservations.

ATLAS: An Action Vocabulary for Visual Reasoning
https://xuquant.com/posts/foundation-models/atlas-one-word-visual-reasoning/
Thu, 21 May 2026 20:00:00 +0800
A reading of ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both, which compresses intermediate visual operations such as drawing auxiliary lines, boxing regions, pointing with arrows, and adding text annotations into trainable functional tokens.

X-World: XPeng's Controllable Ego-View Multi-Camera World Model, Engineering a Production Driving World Model
https://xuquant.com/posts/world-models/xpeng-x-world/
Wed, 20 May 2026 10:00:00 +0800
An in-depth reading of XPeng X-World: DiT-based latent video diffusion + two-stage training (Rectified Flow → DMD + Self-Forcing distillation) + multi-channel action injection + 7-camera view-temporal SA. A side-by-side comparison with Vista / DriveDreamer / GAIA-2 / Waymo WM traces the engineering path to a production-grade world model.

Code as Perception: Why "Reading Code" Is the Key for Large Models to Crack STEM Problems
https://xuquant.com/posts/foundation-models/codepercept-perception-bottleneck/
Sat, 02 May 2026 10:00:00 +0800
An in-depth reading of the CVPR 2026 paper CodePercept: systematic scaling experiments argue that perception, not reasoning, is the real bottleneck in STEM visual reasoning. The paper proposes a dual-channel paradigm with executable code as the perception medium; the 8B model beats the 72B baseline by 6.2%, and the 32B model surpasses GPT5-Thinking on STEM2Code-Eval.

ReflectDrive-2: Li Auto's Discrete-Diffusion End-to-End Driving with Joint RL Optimization
https://xuquant.com/posts/autonomous-driving/reflectdrive-2-discrete-diffusion-end-to-end-driving/
Sat, 25 Apr 2026 18:00:00 +0800
An in-depth reading of Li Auto's ReflectDrive-2: discrete diffusion for end-to-end planning, a three-stage "decide, draft, reflect" pipeline with AutoEdit local correction, joint RL optimization that amplifies the AutoEdit gain 6x, 91.0 PDMS with camera-only input (NAVSIM v1 navtest), and 31.8 ms/frame on Thor.

Kaiming's Methodology: From ResNet to iMF, the Research Path of Someone Who Keeps Asking First-Principles Questions
https://xuquant.com/posts/foundation-models/kaiming-he-cvpr2026-five-papers-flow-matching-breakthrough/
Sat, 18 Apr 2026 18:00:00 +0800
A close reading of Kaiming He's CVPR 2026 work with iMF (Improved Mean Flow, arXiv:2512.02012) as the main thread, placed back into the decade-long arc of ResNet / MoCo / MAE / SiT. It draws out four recurring strands of methodological DNA: simplicity taken to the extreme, changing the problem's assumptions, strong priors with few assumptions, and decoupling method from task. Links closely to the mathematics/diffusion series.

DeepSeek Thinking with Visual Primitives: Teaching Multimodal LLMs to "Reason by Pointing"
https://xuquant.com/posts/foundation-models/deepseek-thinking-with-visual-primitives/
Sat, 04 Apr 2026 20:00:00 +0800
A reading of the "Thinking with Visual Primitives" technical report from DeepSeek with Peking University and Tsinghua University: coordinates and bounding boxes are interleaved into the CoT as chain-of-thought primitives, an attempt to use structured spatial symbols to mitigate referential drift during reasoning. This post lays out the mechanism and critically examines the report's ontological proposal that "modality is ontology".

SceneVerse++: Lifting Unlabeled Internet Videos into 3D Scene Understanding Training Data
https://xuquant.com/posts/foundation-models/sceneverse-plus-data-engine-for-3d-scene-understanding/
Sat, 21 Mar 2026 18:00:00 +0800
Deep analysis of CVPR 2026 SceneVerse++: how to build the largest-scale real-world 3D scene dataset from unlabeled internet videos, covering detection, segmentation, spatial VQA, and vision-language navigation.

VGGT: Geometric Reconstruction as the "Reconstruct" Dimension of World Models
https://xuquant.com/posts/world-models/vggt/
Sat, 21 Mar 2026 10:00:00 +0800
VGGT compresses multi-view 3D reconstruction into a single forward pass. This post re-checks the complexity derivation of alternating attention, tests the over-complete prediction strategy against the raw numbers in Tables 3, 5, and 6 of the VGGT paper, and answers, from the two angles of geometric-prior transfer and representation philosophy, why a point map composed from depth and pose is more accurate than one predicted directly.

Depth Anything 3: Geometric Grounding for World Models
https://xuquant.com/posts/world-models/depth-anything-3/
Sat, 07 Feb 2026 10:00:00 +0800
Depth Anything 3 unifies monocular depth, multi-view reconstruction, pose estimation, and novel view synthesis under a single depth-ray representation. This article analyzes why minimal representation matters for world models and what depth estimation reveals about the geometric foundations of physical understanding.

LeJEPA: When JEPA No Longer Needs Heuristics
https://xuquant.com/posts/world-models/lejepa/
Sat, 07 Feb 2026 10:00:00 +0800
LeJEPA pulls JEPA back from an engineering artifact that depends on a string of heuristics (stop-gradient, teacher-student, EMA) into a provably optimal theoretical framework: SIGReg aligns the embedding distribution to an isotropic Gaussian via random slicing, with a single hyperparameter, linear complexity, and about 50 lines of code. This post places that within the methodological lineage of collapse prevention in JEPA and explains why it is the direction LeCun personally endorsed in 2025 interviews.

DINOv3: The Scaling Impasse of Self-Supervised Vision Foundation Models and How Gram Anchoring Breaks It
https://xuquant.com/posts/world-models/dinov3/
Sat, 24 Jan 2026 10:00:00 +0800
A breakdown of DINOv3's core contributions: how Gram anchoring addresses the fundamental problem of dense-feature degradation in large-scale self-supervised training, the training engineering behind a 7B-parameter SSL model, and what its breakthroughs in depth estimation and 3D matching mean.

V-JEPA 2.1: When Self-Supervised Vision Learns to See Every Pixel
https://xuquant.com/posts/world-models/vjepa-2.1/
Sat, 10 Jan 2026 10:00:00 +0800
A deep analysis of V-JEPA 2.1's architectural innovations (dense predictive loss, deep self-supervision, multi-modal tokenizer, and scaling), tracing the path from collapsed context tokens to dense features that encode spatial structure, and the connection to depth estimation as geometric grounding.

CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery
https://xuquant.com/posts/foundation-models/coral-autonomous-multi-agent-evolution/
Sat, 22 Nov 2025 10:00:00 +0800
How delegating the key decisions of evolutionary search to autonomous agents, rather than to fixed heuristic rules, yields faster convergence and stronger results on mathematical optimization and systems optimization tasks.

InSpatio-World: Real-Time 4D World Simulation via Spatiotemporal Autoregressive Modeling
https://xuquant.com/posts/foundation-models/inspatio-world-4d-simulator/
Sat, 25 Oct 2025 10:00:00 +0800
An in-depth technical analysis of InSpatio-World: a 1.3-billion-parameter real-time 4D world simulator that combines an implicit spatiotemporal cache with explicit geometric constraints to perform novel view synthesis from monocular video at 24 FPS.

Alpamayo: A Reasoning-Action-Aligned VLA System for Autonomous Driving
https://xuquant.com/posts/autonomous-driving/nvidia_vla/
Sat, 30 Aug 2025 10:00:00 +0800
An in-depth technical analysis of the Nvidia Alpamayo VLA autonomous driving system, built on Cosmos-Reason as the VLM backbone. It covers triplane visual encoding, ego-shortcut avoidance, the "cause of change" (变化因) dataset paradigm, and reasoning-action alignment achieved through reinforcement learning.