Perception on Xu'Blog

https://xuquant.com/tags/perception/
Recent content in Perception on Xu'Blog
Sun, 17 May 2026 10:00:00 +0800

4D Vision Encoder for Autonomous Driving: A Unified Review from an Information-Bottleneck Perspective
https://xuquant.com/posts/autonomous-driving/4d-vision-encoder-for-autonomous-driving/
Sun, 17 May 2026 10:00:00 +0800
Places nine 4D visual encoding schemes, including AR1 Tri-plane, Flex, MEM, Memory VLA, BEV/OCC, V-JEPA, DA3, and VGGT, in a single information-bottleneck coordinate system. From the four-part structure of Y (perception/prediction/planning/reasoning), it derives five necessary conditions for an ideal 4D encoder and lays out an evaluation path for the 4V→7V upgrade on Qwen3.5.

VLM Temporal Memory Mechanisms: From Video Compression to Long- and Short-Term Memory Fusion
https://xuquant.com/posts/autonomous-driving/vlm-temporal-memory-mechanisms/
Sat, 09 May 2026 06:00:00 +0800
A systematic survey of the mainstream approaches to temporal modeling in VLMs: the Nvidia Flex encoder, the LlamaFactory video processing pipeline, Qwen spatiotemporal compression, Pi 0.7 MEM space-time separable attention, and Memory VLA. It also walks through MEM's zero-parameter modification scheme in detail, based on the Qwen3-VL engineering implementation.

Code as Perception: Why "Reading Code" Is the Key for Large Models to Crack STEM Problems
https://xuquant.com/posts/foundation-models/codepercept-perception-bottleneck/
Sat, 02 May 2026 10:00:00 +0800
An in-depth reading of the CVPR 2026 paper CodePercept. Through systematic scaling experiments, the paper argues that perception, not reasoning, is the real bottleneck in STEM visual reasoning, and proposes a dual-channel paradigm that uses executable code as the medium of perception. The 8B model outperforms the 72B baseline by 6.2%, and the 32B model surpasses GPT5-Thinking on STEM2Code-Eval.

Alpamayo: A Reasoning-Action Aligned VLA System for Autonomous Driving
https://xuquant.com/posts/autonomous-driving/nvidia_vla/
Sat, 30 Aug 2025 10:00:00 +0800
An in-depth technical analysis of Nvidia's Alpamayo VLA autonomous driving system, which uses Cosmos-Reason as its VLM backbone. Covers tri-plane visual encoding, ego-vehicle shortcut avoidance, the variation-factor dataset paradigm, and reasoning-action alignment achieved through reinforcement learning.