Self-Supervised on Xu'Blog

Recent content in Self-Supervised on Xu'Blog
https://xuquant.com/tags/self-supervised/

Dense Latent Predictive Supervision in AD VLA: Why Pixels Are Not Optimal
https://xuquant.com/posts/autonomous-driving/dense-latent-predictive-supervision/
Sun, 24 May 2026 10:00:00 +0800
AD VLA supervises a 2B+-parameter backbone with a sparse trajectory loss (12 waypoints × 2D = 24 scalars), an information-theoretic ratio of ~10⁻¹⁰; this supervision deficit is the core reason NAVSIM scores have stalled in the 87-93 range. DriveVLA-W0 fills the gap with pixel-level future image prediction, which is the right direction but not the optimal route. V-JEPA-style latent predictive supervision is friendlier on three counts: capacity, inference cost, and isomorphism with evaluation.

Kaiming's Methodology: From ResNet to iMF, the Research Path of a Seeker of Essentials
https://xuquant.com/posts/foundation-models/kaiming-he-cvpr2026-five-papers-flow-matching-breakthrough/
Sat, 18 Apr 2026 18:00:00 +0800
A deep read of Kaiming He's CVPR 2026 work with iMF (Improved Mean Flow, arXiv:2512.02012) as the main thread, placed back in the ten-year lineage of ResNet / MoCo / MAE / SiT. It draws out four recurring strands of methodological DNA: simplicity taken to the extreme, changing the problem's assumptions, strong priors with few assumptions, and decoupling method from task. Strongly linked to the mathematics/diffusion series.

LeJEPA: When JEPA No Longer Needs Heuristics
https://xuquant.com/posts/world-models/lejepa/
Sat, 07 Feb 2026 10:00:00 +0800
LeJEPA pulls JEPA back from an engineering artifact that depends on a string of heuristics (stop-gradient, teacher-student, EMA) into a provably optimal theoretical framework: SIGReg uses random slicing to align the embedding distribution to an isotropic Gaussian, with a single hyperparameter, linear complexity, and about 50 lines of code. This post places it within the methodological lineage of JEPA collapse prevention and explains why it is the direction LeCun personally endorsed in a 2025 interview.

DINOv3: The Scaling Impasse of Self-Supervised Vision Foundation Models and the Gram Anchoring Breakthrough
https://xuquant.com/posts/world-models/dinov3/
Sat, 24 Jan 2026 10:00:00 +0800
An analysis of DINOv3's core contributions: how Gram anchoring solves the fundamental problem of dense feature degradation in large-scale self-supervised training, the training engineering behind a 7B-parameter SSL model, and what its breakthroughs in depth estimation and 3D matching mean.

V-JEPA 2.1: When Self-Supervised Vision Learns to See Every Pixel
https://xuquant.com/posts/world-models/vjepa-2.1/
Sat, 10 Jan 2026 10:00:00 +0800
A deep analysis of V-JEPA 2.1's architectural innovations (dense predictive loss, deep self-supervision, multi-modal tokenizer, and scaling), tracing the path from collapsed context tokens to dense features that encode spatial structure, and the connection to depth estimation as geometric grounding.

ReconVLA: Visually Grounding VLA with Gaze-Crop Reconstruction
https://xuquant.com/posts/foundation-models/reconvla-gaze-crop-implicit-grounding/
Mon, 27 Oct 2025 22:00:00 +0800
OpenHelix's ReconVLA (arXiv:2508.10333) attaches a 3-layer DiT after an OpenVLA-style backbone and uses VAE-latent reconstruction of the gaze crop as auxiliary supervision, anchoring the VLA's attention on the target object. This post reads the paper against the open-source code, covering engineering details the paper does not emphasize and several questions it leaves unanswered: the recon-on/off ablation is missing, and the 'implicit grounding' in fact depends on offline YOLO bboxes for its training supervision.