Deep technical analyses of foundation model architectures — from attention mechanism innovations (MLA, GQA, hybrid attention) to MoE sparsity, multimodal reasoning, and generative paradigms like Flow Matching.

Architecture & Attention

| Article | Topic |
| --- | --- |
| Multi-Head Latent Attention | DeepSeek-V2’s KV cache compression via latent attention |
| Qwen3.5 vs Qwen3 | Hybrid attention, joint multimodal training, and high-sparsity MoE |

Multimodal & Reasoning

| Article | Topic |
| --- | --- |
| DeepSeek Visual Primitives | Thinking with visual primitives in multimodal LLMs |
| ATLAS One-Word Visual Reasoning | Functional tokens as compact visual operations for VLM reasoning |
| SceneVerse++ Data Engine | Lifting internet videos into 3D scene understanding |
| Kaiming He CVPR 2026 | Flow Matching paradigm breakthroughs |
| InSpatio-World 4D Simulator | 1.3B-parameter real-time 4D world simulator: spatiotemporal autoregression + implicit caching + 24 FPS novel-view synthesis |

Agents & Frameworks

| Article | Topic |
| --- | --- |
| CORAL Multi-Agent Evolution | Open-ended discovery via LLM-driven evolutionary search |

Perception & Reasoning Bottleneck

| Article | Topic |
| --- | --- |
| Code as Perception | Treating LLMs' ability to “understand code” as the key to mastering STEM reasoning |

Embodied VLA

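| Article | Topic |
| --- | --- |
| Qwen-VLA | T2A decompression prior + flow-matching PPO + Qwen3.5-4B cross-embodiment generalist embodied policy |
| HiF-VLA | H.264 codec motion vectors as temporal memory; forward prediction + AdaLN-modulated action flow |
| VLA × VGGT Geometry Injection | Negative results across a three-architecture comparison (Early / Late / Spatial-Forcing); mid-training is the real lever |
| ReconVLA | Gaze-crop VAE-latent reconstruction as implicit visual grounding for VLA |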