VLM on Xu'Blog

Recent content in VLM on Xu'Blog
https://xuquant.com/tags/vlm/
Last updated: Thu, 21 May 2026 20:00:00 +0800

ATLAS: An Action Vocabulary for Visual Reasoning
https://xuquant.com/posts/foundation-models/atlas-one-word-visual-reasoning/
Thu, 21 May 2026 20:00:00 +0800
A reading of ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both, which compresses intermediate visual operations such as drawing auxiliary lines, boxing regions, pointing with arrows, and adding text annotations into trainable functional tokens.

VLM Temporal Memory Mechanisms: From Video Compression to Long- and Short-Term Memory Fusion
https://xuquant.com/posts/autonomous-driving/vlm-temporal-memory-mechanisms/
Sat, 09 May 2026 06:00:00 +0800
A systematic survey of the mainstream approaches to temporal modeling in VLMs: the Nvidia Flex encoder, the LlamaFactory video processing pipeline, Qwen spatiotemporal compression, Pi 0.7 MEM space-time separable attention, and Memory VLA. It then walks through a zero-parameter adaptation of MEM in detail, based on the Qwen3-VL engineering implementation.

Code as Perception: When "Understanding Code" Is the Key for Large Models to Crack STEM Problems
https://xuquant.com/posts/foundation-models/codepercept-perception-bottleneck/
Sat, 02 May 2026 10:00:00 +0800
An in-depth reading of the CVPR 2026 paper CodePercept. Through systematic scaling experiments, the paper argues that perception, not reasoning, is the real bottleneck in STEM visual reasoning, and it proposes a dual-channel paradigm that uses executable code as the medium of perception. The 8B model beats the 72B baseline by 6.2%, and the 32B model surpasses GPT5-Thinking on STEM2Code-Eval.

DeepSeek Thinking with Visual Primitives: Teaching Multimodal Large Models to "Reason by Pointing"
https://xuquant.com/posts/foundation-models/deepseek-thinking-with-visual-primitives/
Sat, 04 Apr 2026 20:00:00 +0800
A reading of the "Thinking with Visual Primitives" technical report from DeepSeek together with Peking University and Tsinghua University. Coordinates and bounding boxes are interleaved into the CoT as chain-of-thought primitives, in an attempt to use structured spatial symbols to mitigate referential drift during reasoning. The post summarizes the method and critically examines its ontological proposal that "modality is ontology".

SceneVerse++: Lifting Unlabeled Internet Videos into 3D Scene Understanding Training Data
https://xuquant.com/posts/foundation-models/sceneverse-plus-data-engine-for-3d-scene-understanding/
Sat, 21 Mar 2026 18:00:00 +0800
An in-depth analysis of the CVPR 2026 paper SceneVerse++: how to build the largest-scale real-world 3D scene dataset from unlabeled internet videos, covering detection, segmentation, spatial VQA, and vision-language navigation.