<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>VLM on Xu'Blog</title><link>https://xuquant.com/tags/vlm/</link><description>Recent content in VLM on Xu'Blog</description><image><title>Xu'Blog</title><url>https://xuquant.com/og-default.png</url><link>https://xuquant.com/og-default.png</link></image><generator>Hugo -- 0.152.2</generator><language>zh</language><lastBuildDate>Thu, 21 May 2026 20:00:00 +0800</lastBuildDate><atom:link href="https://xuquant.com/tags/vlm/index.xml" rel="self" type="application/rss+xml"/><item><title>ATLAS: An Action Vocabulary for Visual Reasoning</title><link>https://xuquant.com/posts/foundation-models/atlas-one-word-visual-reasoning/</link><pubDate>Thu, 21 May 2026 20:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/atlas-one-word-visual-reasoning/</guid><description>A reading of ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both, which compresses intermediate visual operations such as drawing auxiliary lines, boxing regions, pointing with arrows, and adding text annotations into trainable functional tokens.</description></item><item><title>Temporal Memory Mechanisms in VLMs: From Video Compression to Long- and Short-Term Memory Fusion</title><link>https://xuquant.com/posts/autonomous-driving/vlm-temporal-memory-mechanisms/</link><pubDate>Sat, 09 May 2026 06:00:00 +0800</pubDate><guid>https://xuquant.com/posts/autonomous-driving/vlm-temporal-memory-mechanisms/</guid><description>A systematic survey of the mainstream approaches to temporal modeling in VLMs: the Nvidia Flex encoder, the LlamaFactory video processing pipeline, Qwen spatiotemporal compression, Pi 0.7 MEM space-time separable attention, and Memory VLA, followed by a detailed walkthrough of a zero-parameter MEM retrofit based on the Qwen3-VL engineering implementation.</description></item><item><title>Code as Perception: Why a Model That Can "Read Code" Is the Key to Solving STEM Problems</title><link>https://xuquant.com/posts/foundation-models/codepercept-perception-bottleneck/</link><pubDate>Sat, 02 May 2026 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/codepercept-perception-bottleneck/</guid><description>An in-depth reading of the CVPR 2026 paper CodePercept: systematic scaling experiments argue that perception, not reasoning, is the real bottleneck in STEM visual reasoning. The paper proposes a dual-channel paradigm that uses executable code as the perception medium; the 8B model beats the 72B baseline by 6.2%, and on STEM2Code-Eval the 32B model surpasses 
GPT5-Thinking.</description></item><item><title>DeepSeek Thinking with Visual Primitives: Teaching Multimodal LLMs to "Reason by Pointing"</title><link>https://xuquant.com/posts/foundation-models/deepseek-thinking-with-visual-primitives/</link><pubDate>Sat, 04 Apr 2026 20:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/deepseek-thinking-with-visual-primitives/</guid><description>A reading of the "Thinking with Visual Primitives" technical report from DeepSeek with Peking University and Tsinghua University: coordinates and bounding boxes are interleaved into the CoT as chain-of-thought primitives, in an attempt to use structured spatial symbols to mitigate referential drift during reasoning. This post summarizes the method and critically examines its ontological proposal that "modality is ontology".</description></item><item><title>SceneVerse++: Lifting Unlabeled Internet Videos into 3D Scene Understanding Training Data</title><link>https://xuquant.com/posts/foundation-models/sceneverse-plus-data-engine-for-3d-scene-understanding/</link><pubDate>Sat, 21 Mar 2026 18:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/sceneverse-plus-data-engine-for-3d-scene-understanding/</guid><description>Deep analysis of CVPR 2026 SceneVerse++: how to build the largest-scale real-world 3D scene dataset from unlabeled internet videos, covering detection, segmentation, spatial VQA, and vision-language navigation.</description></item></channel></rss>