<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Paper-Reading on Xu'Blog</title><link>https://xuquant.com/tags/paper-reading/</link><description>Recent content in Paper-Reading on Xu'Blog</description><image><title>Xu'Blog</title><url>https://xuquant.com/og-default.png</url><link>https://xuquant.com/og-default.png</link></image><generator>Hugo -- 0.152.2</generator><language>zh</language><lastBuildDate>Thu, 28 May 2026 22:30:00 +0800</lastBuildDate><atom:link href="https://xuquant.com/tags/paper-reading/index.xml" rel="self" type="application/rss+xml"/><item><title>Reading Qwen-VLA: The T2A Decompression Prior, Flow-Matching PPO, and Cross-Embodiment Zero-Shot</title><link>https://xuquant.com/posts/foundation-models/qwen-vla/</link><pubDate>Thu, 28 May 2026 22:30:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/qwen-vla/</guid><description>Qwen-VLA (arXiv:2605.30280), released by the Qwen Team on 2026-05-28, combines a Qwen3.5-4B multimodal backbone with a 1.15B single-stream DiT flow-matching action expert into a unified embodied policy. The most interesting part is not the numbers but T2A: freeze the VLM, mask out the images, learn an action prior from text and the embodiment prompt alone, then add images, specialization, and RL in separate stages. This post follows the paper through the architecture, the four-stage recipe, the five-dimension T2A ablation, the log-prob trick in flow-matching PPO, what the 26.6% DOMINO zero-shot number actually means, and a few remaining reservations.</description></item><item><title>ATLAS: An Action Vocabulary for Visual Reasoning</title><link>https://xuquant.com/posts/foundation-models/atlas-one-word-visual-reasoning/</link><pubDate>Thu, 21 May 2026 20:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/atlas-one-word-visual-reasoning/</guid><description>A reading of ATLAS: Agentic or Latent Visual Reasoning?
One Word is Enough for Both: the paper compresses intermediate visual operations such as drawing auxiliary lines, boxing regions, pointing with arrows, and adding text annotations into trainable functional tokens.</description></item><item><title>X-World: XPeng's Controllable Ego-View Multi-Camera World Model and the Engineering of Production Driving World Models</title><link>https://xuquant.com/posts/world-models/xpeng-x-world/</link><pubDate>Wed, 20 May 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/xpeng-x-world/</guid><description>An in-depth reading of XPeng's X-World: DiT-based latent video diffusion + two-stage training (Rectified Flow → DMD + Self-Forcing distillation) + multi-channel action injection + 7-camera view-temporal SA. A side-by-side comparison with Vista / DriveDreamer / GAIA-2 / Waymo WM traces the engineering path to a production-grade world model.</description></item><item><title>Code as Perception: Why Models That Can Read Code Hold the Key to STEM Problems</title><link>https://xuquant.com/posts/foundation-models/codepercept-perception-bottleneck/</link><pubDate>Sat, 02 May 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/codepercept-perception-bottleneck/</guid><description>An in-depth reading of the CVPR 2026 paper CodePercept: systematic scaling experiments argue that perception, not reasoning, is the real bottleneck in STEM visual reasoning, and the paper proposes a dual-channel paradigm that uses executable code as the perception medium. The 8B model beats the 72B baseline by 6.2%, and the 32B model surpasses GPT5-Thinking on STEM2Code-Eval.</description></item><item><title>ReflectDrive-2: Li Auto's Discrete-Diffusion End-to-End Driving with Joint RL Optimization</title><link>https://xuquant.com/posts/autonomous-driving/reflectdrive-2-discrete-diffusion-end-to-end-driving/</link><pubDate>Sat, 25 Apr 2026 18:00:00 +0800</pubDate><guid>https://xuquant.com/posts/autonomous-driving/reflectdrive-2-discrete-diffusion-end-to-end-driving/</guid><description>An in-depth reading of Li Auto's ReflectDrive-2: discrete diffusion for end-to-end planning, a three-stage decide-draft-reflect pipeline paired with AutoEdit local correction, joint RL optimization that amplifies the AutoEdit gain 6x, 91.0 PDMS with camera-only input (NAVSIM v1 navtest), and 31.8 ms per frame on Thor.</description></item><item><title>Kaiming's Methodology: From ResNet to iMF, the Research Path of a First-Principles Questioner</title><link>https://xuquant.com/posts/foundation-models/kaiming-he-cvpr2026-five-papers-flow-matching-breakthrough/</link><pubDate>Sat, 18 Apr 2026 18:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/kaiming-he-cvpr2026-five-papers-flow-matching-breakthrough/</guid><description>A close reading of Kaiming He's CVPR 2026 work with iMF (Improved Mean Flow, arXiv:2512.02012) as the main thread, placing it back in the ResNet / MoCo / MAE
/ SiT lineage of the past decade to draw out four recurring strands of methodological DNA: simplicity taken to the extreme, changing the problem's assumptions, strong priors with few assumptions, and decoupling method from task. Closely linked to the mathematics/diffusion series.</description></item><item><title>DeepSeek Thinking with Visual Primitives: Teaching Multimodal Models to Reason by Pointing</title><link>https://xuquant.com/posts/foundation-models/deepseek-thinking-with-visual-primitives/</link><pubDate>Sat, 04 Apr 2026 20:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/deepseek-thinking-with-visual-primitives/</guid><description>A reading of the Thinking with Visual Primitives technical report from DeepSeek with Peking University and Tsinghua University: coordinates and bounding boxes are interleaved into the CoT as chain-of-thought primitives, in an attempt to use structured spatial symbols to mitigate referential drift during reasoning. This post lays out the method and critically examines its ontological proposal that modality is ontology.</description></item><item><title>SceneVerse++: Lifting Unlabeled Internet Videos into 3D Scene Understanding Training Data</title><link>https://xuquant.com/posts/foundation-models/sceneverse-plus-data-engine-for-3d-scene-understanding/</link><pubDate>Sat, 21 Mar 2026 18:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/sceneverse-plus-data-engine-for-3d-scene-understanding/</guid><description>Deep analysis of CVPR 2026 SceneVerse++: how to build the largest-scale real-world 3D scene dataset from unlabeled internet videos, covering detection, segmentation, spatial VQA, and vision-language navigation.</description></item><item><title>VGGT: Geometric Reconstruction as the Reconstruct Dimension of World Models</title><link>https://xuquant.com/posts/world-models/vggt/</link><pubDate>Sat, 21 Mar 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/vggt/</guid><description>VGGT compresses multi-view 3D reconstruction into a single forward pass. This post rechecks the complexity derivation of alternating attention, tests the overcomplete prediction strategy against the raw numbers in Table 3/5/6 of the VGGT paper, and uses geometric prior transfer and the philosophy of representation to answer one question: why is a point map composed from depth and pose more accurate than one predicted directly?</description></item><item><title>Depth Anything 3: Geometric Grounding for World Models</title><link>https://xuquant.com/posts/world-models/depth-anything-3/</link><pubDate>Sat, 07 Feb 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/depth-anything-3/</guid><description>Depth Anything 3 unifies monocular depth, multi-view reconstruction, pose estimation, and novel view synthesis under a single
depth-ray representation. This article analyzes why minimal representation matters for world models and what depth estimation reveals about the geometric foundations of physical understanding.</description></item><item><title>LeJEPA: When JEPA No Longer Needs Heuristics</title><link>https://xuquant.com/posts/world-models/lejepa/</link><pubDate>Sat, 07 Feb 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/lejepa/</guid><description>LeJEPA pulls JEPA back from an engineering artifact that depends on a stack of heuristics (stop-gradient, teacher-student, EMA) to a provably optimal theoretical framework: SIGReg aligns the embedding distribution to an isotropic Gaussian via random slicing, with a single hyperparameter, linear complexity, and about 50 lines of code. This post places the result within the lineage of JEPA anti-collapse methods and explains why it is the direction LeCun personally endorsed in 2025 interviews.</description></item><item><title>DINOv3: The Scaling Impasse of Self-Supervised Vision Foundation Models and How Gram Anchoring Breaks It</title><link>https://xuquant.com/posts/world-models/dinov3/</link><pubDate>Sat, 24 Jan 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/dinov3/</guid><description>A breakdown of the core contributions of DINOv3: how Gram anchoring solves the fundamental problem of dense feature degradation in large-scale self-supervised training, the training engineering behind a 7B-parameter SSL model, and what its breakthroughs in depth estimation and 3D matching mean.</description></item><item><title>V-JEPA 2.1: When Self-Supervised Vision Learns to See Every Pixel</title><link>https://xuquant.com/posts/world-models/vjepa-2.1/</link><pubDate>Sat, 10 Jan 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/world-models/vjepa-2.1/</guid><description>A deep analysis of V-JEPA 2.1&amp;#39;s architectural innovations (dense predictive loss, deep self-supervision, multi-modal tokenizer, and scaling), tracing the path from collapsed context tokens to dense features that encode spatial structure, and the connection to depth estimation as geometric grounding.</description></item><item><title>CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery</title><link>https://xuquant.com/posts/foundation-models/coral-autonomous-multi-agent-evolution/</link><pubDate>Sat, 22 Nov 2025 10:00:00
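+0800</pubDate><guid>https://xuquant.com/posts/foundation-models/coral-autonomous-multi-agent-evolution/</guid><description>How delegating the key decisions of evolutionary search to autonomous agents, rather than to fixed heuristic rules, yields faster convergence and stronger results on mathematical optimization and systems optimization tasks.</description></item><item><title>InSpatio-World: Real-Time 4D World Simulation via Spatiotemporal Autoregressive Modeling</title><link>https://xuquant.com/posts/foundation-models/inspatio-world-4d-simulator/</link><pubDate>Sat, 25 Oct 2025 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/inspatio-world-4d-simulator/</guid><description>An in-depth technical analysis of InSpatio-World: a 1.3-billion-parameter real-time 4D world simulator that combines an implicit spatiotemporal cache with explicit geometric constraints to perform novel view synthesis from monocular video at 24 FPS.</description></item><item><title>Alpamayo: A Reasoning-Action Aligned VLA System for Autonomous Driving</title><link>https://xuquant.com/posts/autonomous-driving/nvidia_vla/</link><pubDate>Sat, 30 Aug 2025 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/autonomous-driving/nvidia_vla/</guid><description>An in-depth technical analysis of the Nvidia Alpamayo VLA autonomous driving system, built on Cosmos-Reason as the VLM backbone, covering triplane visual encoding, avoidance of ego-vehicle shortcuts, the variation-factor dataset paradigm, and reasoning-action alignment achieved through reinforcement learning.</description></item></channel></rss>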