Jinchao Li

Thinking with Visual Primitives (Repost)

2026-05-01T00:00:00+08:00

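Author: PaperAgent

Link: https://zhuanlan.zhihu.com/p/2033494636023559146

Source: Zhihu

Copyright belongs to the author. For commercial reposting, contact the author for authorization; for non-commercial reposting, credit the source.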

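DeepSeek V4 was released last week, but disappointingly it still had no multimodal support. Today (as usual, the release lands on a holiday) DeepSeek filled that gap by open-sourcing its latest multimodal technique and paper: Thinking with Visual Primitives.

DeepSeek, Peking University, and Tsinghua University propose a "visual primitive reasoning" framework that promotes bounding boxes and coordinate points to "minimal units of thought", addressing the Reference Gap that MLLMs face in complex spatial reasoning. The model, built on DeepSeek-V4-Flash, uses only about 90 visual KV cache entries yet performs on par with GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash.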

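1. From the Perception Gap to the Reference Gap: Redefining the Problem

Chain-of-Thought (CoT) reasoning in today's multimodal large language models (MLLMs) takes place almost entirely in language space. Frontier models have addressed the Perception Gap (the model "cannot see clearly") through strategies such as high-resolution cropping and dynamic tiling, yet they still frequently suffer logical collapse on dense counting, topological navigation, and multi-step spatial inference.

The DeepSeek team argues that a more fundamental bottleneck lies behind this, the Reference Gap:

Natural language is inherently fuzzy and continuous, whereas visual space is precise and discrete. When a model describes "the second red object on the left" in words, it has already lost the precise spatial anchor. The reasoning chain becomes detached from the entities in the image, and cascading hallucinations follow.

How do humans solve this problem? When counting a dense pile of objects or working through a maze, we instinctively point with a finger, anchoring an abstract semantic concept to a concrete physical coordinate and greatly reducing the load on working memory.

Inspired by this, the paper proposes Thinking with Visual Primitives: bounding boxes and points are promoted to "minimal units of thought" on the same level as language tokens and are interleaved directly into the model's reasoning trajectory. The model no longer "speaks first and points afterwards"; it points while it reasons.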

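Figure 1: KV cache entries versus the average score over 7 benchmarks for each model at 800×800 resolution.

Figure 1 shows how efficient this paradigm is: for an 800×800 input, the model keeps only about 90 visual entries in the KV cache (about 361 tokens in total), far fewer than GPT-5.4 (~740), Claude-Sonnet-4.6 (~870), and Gemini-3-Flash (~1100), while reaching an average score of 77.2% on counting and spatial reasoning tasks, ahead of every model compared.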

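2. Architecture and Training Pipeline: Balancing Efficiency and Specialized Capability

2.1 Architecture

The model follows the standard LLaVA-style architecture. The language backbone is DeepSeek-V4-Flash (an MoE model with 284B total parameters and 13B activated parameters), and the vision encoder is the in-house DeepSeek-ViT, which supports inputs of arbitrary resolution.

Aggressive compression is at the core of the architecture:

14×14 patch embedding: the image is split into base patches;
3×3 spatial compression: every 9 adjacent patches are compressed along the channel dimension into 1 token;
Compressed Sparse Attention (CSA): the visual tokens are compressed further at the LLM's KV cache layer.

Take a 756×756 image as an example: the original 571,536 pixels → 2,916 patch tokens after the ViT (54×54 patches) → 324 tokens passed to the LLM after 3×3 compression → only 81 visual KV entries retained after CSA. From raw pixels to KV cache, the overall compression ratio is 7,056:1.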
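The arithmetic in the 756×756 example can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, the 4:1 CSA ratio is inferred from the quoted 324 → 81 step, and floor division for sizes that are not multiples of 14 is an assumption rather than something the paper states.

```python
def visual_token_counts(side, patch=14, spatial=3, csa_ratio=4):
    """Return (patch_tokens, llm_tokens, kv_entries) for a square image.

    csa_ratio=4 is an assumption inferred from the 324 -> 81 step in the text;
    floor division for non-divisible sizes is also an assumption.
    """
    patches_per_side = side // patch                # 14x14 patch embedding
    patch_tokens = patches_per_side ** 2
    merged_per_side = patches_per_side // spatial   # 3x3 spatial compression
    llm_tokens = merged_per_side ** 2
    kv_entries = llm_tokens // csa_ratio            # CSA compression in the KV cache
    return patch_tokens, llm_tokens, kv_entries


patch_tokens, llm_tokens, kv_entries = visual_token_counts(756)
pixels = 756 * 756
print(pixels, patch_tokens, llm_tokens, kv_entries, pixels // kv_entries)
# 571536 2916 324 81 7056
```

Under the same assumptions, `visual_token_counts(800)` gives 361 LLM tokens and 90 KV entries, which lines up with the "about 361" and "about 90" figures quoted for Figure 1.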

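2.2 Five-Stage Post-Training Pipeline

The paper adopts a "train specialists first, then merge" paradigm:

Pretraining: pretrain on trillions of multimodal tokens, giving the model the basic ability to output visual primitives;
Specialized SFT: build cold-start data separately for boxes (FTwG) and points (FTwP) and fine-tune independently, avoiding conflicts between the two modalities;
Specialized RL: apply GRPO reinforcement learning to each of the two specialist models, using three reward models for format, quality, and accuracy;
Unified RFT: use the two specialist models to generate rejection-sampling data and train a single unified model on it;
On-Policy Distillation: distill the specialists' output distributions into the unified model via reverse KL divergence, closing the remaining performance gap.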
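To make the last stage concrete, here is a minimal sketch of the quantity being minimized: the reverse KL divergence KL(student ‖ teacher) for a single next-token distribution. The distributions and the function name are invented for illustration; the paper's actual training loop is not reproduced here.

```python
import math


def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for one categorical distribution.

    The expectation is taken under the student, which is what makes the
    objective "reverse" and lets it be estimated from on-policy samples.
    """
    return sum(
        s * math.log((s + eps) / (t + eps))
        for s, t in zip(student, teacher)
        if s > 0
    )


teacher = [0.7, 0.2, 0.1]        # hypothetical specialist distribution
student = [0.5, 0.3, 0.2]        # hypothetical unified-model distribution

print(reverse_kl(teacher, teacher))   # 0.0 when the student matches the teacher
print(reverse_kl(student, teacher))   # positive otherwise
```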

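3. Cold-Start Data Construction: Fine-Grained Design for Four Reasoning Scenarios

To teach the model to "think with primitives", the team did not rely on simple instruction tuning. Instead, it built chain-of-thought cold-start data with explicit visual grounding for four types of tasks.

3.1 Counting

MLLMs fail at counting in dense scenes essentially because they cannot establish a one-to-one correspondence between language numerals and visual entities.

Figure: two complete reasoning cases, a football team photo and a photo of a group of bears.

Coarse-grained counting: the model first analyzes intent, then grounds in batch (boxing all candidate objects at once), and finally tallies the total. Figure 3 shows a head count on a team photo, where the model boxes all 25 people in one pass and then verifies row by row.
Fine-grained counting: attribute-constrained questions (such as "how many bears are on the ground") are built from GQA scene graphs; the model must enumerate and verify candidates one by one, excluding negative samples that do not satisfy the attribute.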

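3.2 Spatial Reasoning and General VQA

The data is built from GQA and CLEVR. In CLEVR synthetic scenes, the model must perform multi-hop logical reasoning (for example, "is there a purple rubber object the same size as the gray metal sphere?"). At every reasoning step, each object mentioned must be anchored to image coordinates via <|ref|>...<|/ref|><|box|>...<|/box|>, which prevents semantic drift.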
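A grounded trace of this kind is easy to post-process. The sketch below pulls (label, box) pairs out of a reasoning string. The box payload format `[[x1, y1, x2, y2]]` is an assumption (mirroring the `[[x,y]]` point format), and the example trace and coordinates are invented.

```python
import re

# Assumed box payload: [[x1, y1, x2, y2]]; the article only shows the tag names.
GROUNDED = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"
    r"<\|box\|>\[\[(?P<coords>[^\]]*)\]\]<\|/box\|>"
)


def extract_boxes(trace):
    """Return (label, [x1, y1, x2, y2]) pairs found in a reasoning trace."""
    results = []
    for match in GROUNDED.finditer(trace):
        coords = [int(v) for v in match.group("coords").split(",")]
        results.append((match.group("label"), coords))
    return results


# Hypothetical trace; the coordinates are made up for illustration.
trace = (
    "The <|ref|>gray metal sphere<|/ref|><|box|>[[120, 80, 200, 160]]<|/box|> is small, "
    "and so is the <|ref|>purple rubber cube<|/ref|><|box|>[[310, 95, 380, 170]]<|/box|>."
)
print(extract_boxes(trace))
# [('gray metal sphere', [120, 80, 200, 160]), ('purple rubber cube', [310, 95, 380, 170])]
```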

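Figure: a complete chain of thought for multi-attribute verification in a CLEVR scene.

3.3 Maze Navigation

This is the extreme test of topological reasoning ability. Pure-language CoT struggles to describe the connectivity of irregular paths.

The team uses the DFS, Prim, and Kruskal algorithms to generate three maze topologies (rectangular, circular, and hexagonal), and also constructs unsolvable mazes (by deliberately placing a wall midway along the path). The model's chain of thought records the coordinates of every exploration step as <|point|>[[x,y]]<|/point|>, producing a DFS trajectory that resembles human trial-and-error with backtracking.

Figure: the complete exploration and backtracking process from start to goal in a hexagonal maze.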
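The "try, hit a dead end, backtrack" trajectory can be illustrated with a plain depth-first search on a small rectangular grid. This is a sketch under stated assumptions: the grid encoding, the neighbor order, and the helper names are invented, and only the `<|point|>[[x,y]]<|/point|>` serialization comes from the article.

```python
def dfs_trace(grid, start, goal):
    """Depth-first search on a grid maze ('#' marks a wall).

    Returns (path, trace): path is the list of (x, y) cells from start to goal,
    or None if the maze is unsolvable; trace logs every visited cell in order,
    including the cells stepped back onto while backtracking.
    """
    height, width = len(grid), len(grid[0])
    visited = {start}
    path, trace = [start], [start]
    while path:
        x, y = path[-1]
        if (x, y) == goal:
            return path, trace
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            nx, ny = x + dx, y + dy
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and grid[ny][nx] != "#" and (nx, ny) not in visited:
                visited.add((nx, ny))
                path.append((nx, ny))
                trace.append((nx, ny))
                break
        else:
            path.pop()                    # dead end: backtrack one cell
            if path:
                trace.append(path[-1])    # record the step back
    return None, trace


def as_points(trace):
    """Serialize a trace in the point format quoted in the article."""
    return "".join(f"<|point|>[[{x},{y}]]<|/point|>" for x, y in trace)


maze = [
    "..#",
    ".##",
    "...",
]
path, trace = dfs_trace(maze, (0, 0), (2, 2))
print(path)    # [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
print(trace)   # [(0, 0), (1, 0), (0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
print(as_points(trace[:2]))  # <|point|>[[0,0]]<|/point|><|point|>[[1,0]]<|/point|>
```

The trace is longer than the path because it keeps the dead-end visit to (1, 0) and the step back, which is the part a human reader would use to follow the exploration.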

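3.4 Path Tracing

Among tangled Bézier curves, the model must trace a specified line to its endpoint. The difficulty is disambiguation at crossings: when two lines intersect, the model must decide which way to continue from local geometric continuity rather than from a color shortcut. The chain of thought records the path as a coordinate sequence with adaptive density: sparse sampling on straight segments, dense sampling at bends and crossings.

Figure: tracing the magenta curve from the crown icon to its endpoint.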

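4. Reward Model Design: Letting Reinforcement Learning "Understand" Visual Reasoning

In the Specialized RL stage, the paper designs a fine-grained accuracy RM for each task:

Task	Core reward-model logic
Counting	Exponentially decaying reward based on relative error: $R = \alpha \cdot \exp(-\beta \cdot \text{relative error})$
Spatial reasoning / VQA	LLM-based GRM that scores the chain of thought and the final answer separately, then averages the two
Maze navigation	Weighted sum of four terms: causal exploration progress (truncated at the first wall collision), exploration completeness (for unsolvable mazes), wall-collision penalty, and validity of the final path
Path tracing	Bidirectional trajectory alignment (predicted points → ground-truth line, ground-truth points → predicted line), endpoint accuracy, and a trajectory continuity penalty (no skipping points)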
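The counting row of the table can be sketched in a few lines. The exact normalization of the relative error is not recoverable from the article, so the `|predicted - target| / target` form, the `max(target, 1)` guard, and the default values of `alpha` and `beta` below are all assumptions.

```python
import math


def counting_reward(predicted, target, alpha=1.0, beta=1.0):
    """Exponentially decaying reward on the relative counting error.

    Assumptions: relative error is |predicted - target| / target, with
    max(target, 1) guarding against division by zero; alpha and beta
    defaults are placeholders, not values from the paper.
    """
    relative_error = abs(predicted - target) / max(target, 1)
    return alpha * math.exp(-beta * relative_error)


for guess in (25, 24, 20, 10):
    print(guess, round(counting_reward(guess, 25), 3))
```

An exact count gets the full reward `alpha`, and the reward falls off smoothly as the count drifts, so a near miss is still rewarded more than a wild guess.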

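The results in Table 1 are persuasive:

Counting: 89.2% on Pixmo-Count, ahead of all competitors; RA@10 of 74.1% on CountQA, second only to Gemini-3-Flash;
Spatial reasoning: 98.7% on DS_Spatial_Reasoning, ahead of Claude at 97.2% and Qwen3-VL at 96.8%;
Topological reasoning: this is a blind spot for all frontier models. The model scores 66.9% on DS_Maze_Navigation (the next best is only 50.6%) and 56.7% on DS_Path_Tracing (the next best is only 46.5%), a clear gap over the field.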

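5. Qualitative Analysis: How Visual Primitives Reshape the Reasoning Experience

Through a large number of cases, the paper shows that visual primitives are not only an internal reasoning tool but are also externalized as an interpretable "attention trajectory".

5.1 Bounding Boxes as Primitives

The model exhibits strong emergent synergies:

World-knowledge fusion: shown a photo of the Golden Gate Bridge, the model boxes the main body of the bridge, links it to San Francisco, and then answers "is there an NBA team nearby?" (the Golden State Warriors);

Counterfactual reasoning: for the question "which side of the scale is heavier?", the model boxes the objects and pans on the left and right and uses visual evidence (the tilt angle) to overturn the intuition suggested by appearance;

Actionable advice: for the question "how do I make a latte?", the model boxes the coffee machine, steam wand, milk pitcher, coffee beans, and cup, and gives operating steps with spatial coordinates.

5.2 Points as Primitives

In maze navigation and path tracing, the point sequences output by the model form a visualized reasoning path. A human can follow these coordinates to reconstruct the model's "train of thought": when it tried a branch, when it found a dead end, and when it backtracked. Pure-language CoT cannot provide this kind of interpretability.

Figure: circular maze navigation and multi-curve tracing.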

https://github.com/deepseek-ai/Thinking-with-Visual-Primitives/blob/main/Thinking_with_Visual_Primitives.pdf

DeepSeek-V4 and Manifold Tearing (Repost)

2026-04-25T00:00:00+08:00

Single Token Geometry: DeepSeek V4 and Manifold Tearing

Deep Manifold

Apr 25, 2026

I was preparing Single Token Geometry: Data Complexity, the second entry in this series, when DeepSeek V4 dropped. I set it aside immediately.

What caught my attention wasn’t the benchmark numbers, impressive as they are. It was a paragraph buried in Section 4.2.3 of the technical report:

“We identified that the occurrence of spikes is consistently tied to outliers in the MoE layers, and the routing mechanism itself appears to exacerbate the emergence of these outliers.”

And then, with unusual candor:

“Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now, we are sharing them openly to foster further exploration by the community.”

I appreciate DeepSeek’s transparency here. They found three empirical fixes that work. They admitted they don’t fully understand why. And they published anyway. That intellectual honesty is rare, and it’s exactly the kind of opening that geometric thinking is built for.

This post is my attempt to supply part of what they left open: a geometric account of what training instability is in a system like DeepSeek V4, and why their mitigations work.

The thesis is simple and I want to state it plainly upfront:

Loss spikes in MoE training are not optimization failures. They are manifold tears.

What Is a Manifold Tear?

Before we can call something a tear, we need to be precise about what a manifold is and what it requires.

A manifold is a space that is locally Euclidean. The canonical example is the surface of the Earth: globally curved and non-Euclidean, but zoom in far enough on any point and it looks flat, like a patch of ℝ². More formally, a topological manifold requires that every point has a neighborhood homeomorphic to ℝⁿ. A smooth manifold further requires that the transition maps between overlapping neighborhoods, called charts, are differentiable. A Riemannian manifold additionally equips the space with a metric tensor, giving you notions of distance and curvature.

The key requirement at the foundation of all of this is continuity. Not just continuity: smoothness. The whole apparatus of differential geometry, and crucially the whole apparatus of gradient-based optimization, assumes that the space being traversed is smooth enough to support derivatives everywhere.

A manifold tear is a violation of this assumption. Precisely: a tear occurs when a map between manifold regions fails to be continuous, that is, when a point has no well-defined image, or when nearby points are sent to regions that are not nearby in the target space. A tear is not a large deformation. A large deformation of a manifold is still a manifold. A tear is a discontinuity: the manifold’s local Euclidean structure breaks down at the torn point.

In the context of a transformer’s residual stream, we can define this operationally:

A manifold tear in a transformer’s residual stream is a discontinuity in the layer-to-layer transport map, induced by a discrete routing decision that is inconsistent with the local geometry of the representation manifold at that point.

Now let’s see exactly how that happens in DeepSeek V4 and how the three mitigations each address a different stage of the same failure cascade.

The Geometry of the Residual Stream

Think of a single token passing through a transformer. At each layer, its hidden state is a point in ℝᵈ, a high-dimensional vector. As the token passes through successive layers, this point traces a trajectory through representation space. The claim that this trajectory lives on a manifold is implicit in how we train: gradient descent assumes a smooth loss landscape, backpropagation assumes differentiable operations everywhere, and the expressive power of deep networks comes precisely from learning smooth, structured transformations of this space.

In a standard transformer, every token passes through the same FFN at each layer. The map from layer l to layer l+1 is the same for all tokens. The manifold deforms, but it deforms continuously — the same function applied everywhere.

In a Mixture-of-Experts (MoE) transformer like DeepSeek V4, this changes fundamentally. At each MoE layer, a router examines the token’s hidden state and routes it to one or more experts: a discrete Top-k selection from hundreds of possible FFNs. The composite map from hidden state to new hidden state is now:

hidden state → routing decision → expert application → new hidden state

The routing decision is discontinuous by construction. It is a discrete selector. A token sitting at position x in representation space gets routed to expert E₁. A token at position x + ε, infinitesimally close, might get routed to expert E₂. These two experts were trained on different regions of the manifold. Their outputs are not guaranteed to be close to each other.
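A one-dimensional toy makes the jump visible. Everything below is invented for illustration (two linear "experts", a two-score router, scalar hidden states); it is not DeepSeek V4's router, only a minimal instance of top-1 selection producing a discontinuous layer map.

```python
def route_top1(x, router_weights):
    """Return the index of the expert with the highest router score for scalar x."""
    scores = [w * x + b for w, b in router_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def moe_layer(x, router_weights, experts):
    """Apply the single expert selected by the router."""
    return experts[route_top1(x, router_weights)](x)


# Hypothetical setup: expert 0 wins for x > 0, expert 1 wins for x < 0.
router_weights = [(1.0, 0.0), (-1.0, 0.0)]
experts = [lambda x: 2.0 * x + 1.0, lambda x: -3.0 * x - 1.0]

for eps in (1e-2, 1e-4, 1e-6):
    left = moe_layer(-eps, router_weights, experts)
    right = moe_layer(eps, router_weights, experts)
    print(eps, abs(right - left))   # the gap stays near 2.0 as eps shrinks
```

Each expert is continuous on its own; the jump comes entirely from the selector switching between them at the routing boundary.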

This means the layer-to-layer transport map has discontinuities at routing boundaries. The manifold is already torn, structurally, at every expert boundary. This is not a bug — it is the price of the MoE architecture’s efficiency. But it is a latent geometric fragility that becomes catastrophic under the right conditions.

The Failure Cascade

The paper’s observation, that spikes are tied to outliers in MoE layers, and that routing exacerbates them, describes a specific failure cascade. Geometrically, it unfolds in three stages:

Stage 1: Local curvature spike. An activation in an MoE expert grows very large. The SwiGLU gate or linear component produces an extreme value. This is not yet a tear — it is a region of very high curvature on the representation manifold. The local geometry is still technically defined, but it is poorly conditioned. Small changes in input produce large changes in output. The optimizer’s gradient step, calibrated for normal curvature, begins to overshoot.

Stage 2: Chart inconsistency. The router, updating synchronously with the backbone, now operates on a shifted manifold. At step t, the backbone parameters θₜ define a representation space Mₜ. But the router, which just updated, is making routing decisions as if the manifold were Mₜ₋₁. A token at position x on Mₜ gets routed as if it were on a manifold that no longer exists. This is a chart inconsistency: the coordinate chart (the routing assignment) no longer matches the local geometry of the point being mapped. The token is being sent to the wrong expert for where it actually lives in the representation space.

Stage 3: Tear amplification. The misrouted token enters an expert that wasn’t trained on its region of the manifold. The expert produces an outlier output — a point far from where the token should be in the next layer’s representation space. This discontinuous jump is the tear. And if the residual mapping matrices are expansive (spectral norm > 1), the tear grows as it propagates through subsequent layers. By layer L, a token that was slightly misrouted has been thrown far from its correct manifold region. The loss spike is the accumulated geometric damage becoming visible in the training objective.

This is the cascade: high curvature → chart inconsistency → routing boundary discontinuity → tear amplification → loss spike.

DeepSeek V4’s three mitigations each interrupt this cascade at a different stage.

Mitigation 1: SwiGLU Clamping, Bounding Local Curvature

This is the earliest intervention, attacking Stage 1 before the cascade begins.

The SwiGLU activation computes:

SwiGLU(x, g) = x · σ(g) · g

When either the linear component x or the gate g becomes very large, the local curvature of the loss surface spikes. Mathematically, curvature is related to the second derivative of the map — and when activations are extreme, second derivatives become extreme. The optimizer’s gradient step assumes the loss landscape is locally well-approximated by its first-order Taylor expansion. High curvature violates this assumption. The step overshoots. The next gradient is wildly miscalibrated. The cascade is initiated.

DeepSeek V4 clamps the linear component to [−10, 10] and caps the gate at 10. Geometrically, this is a curvature bound: it enforces that no single point in the activation space can develop extreme local curvature. The manifold remains smooth enough at every point that gradient steps stay valid.
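As a scalar sketch of the clamp, using the formula exactly as written above: the linear input is limited to [−10, 10] and the gate is capped from above at 10. Reading "caps the gate at 10" as an upper bound only is an assumption, as are the function names; real implementations operate on tensors.

```python
import math


def sigmoid(g):
    """Numerically stable logistic function."""
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)


def clamped_swiglu(x, g, bound=10.0):
    """x * sigmoid(g) * g with x clamped to [-bound, bound] and g capped at bound."""
    x = max(-bound, min(bound, x))   # clamp the linear component
    g = min(g, bound)                # cap the gate (upper bound only: an assumption)
    return x * sigmoid(g) * g


print(clamped_swiglu(1.0, 2.0))      # ordinary activations pass through unchanged
print(clamped_swiglu(1e6, 1e6))      # an outlier is held below bound * bound = 100
```

With both inputs bounded, the output magnitude cannot exceed about 100 in this sketch, which is the "no singular points" property described above.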

Think of it as a chart boundary condition. A chart in differential geometry is only valid within a bounded region — you cannot use a single flat map to cover the entire Earth. SwiGLU clamping is enforcing that activations stay within the region where the local chart (the linear approximation used by the optimizer) remains valid. Step outside that region, and the chart breaks down. The clamp keeps you inside.

The paper notes this works without compromising performance — which makes geometric sense. Clamping doesn’t restrict the manifold’s expressivity; it restricts the curvature of any single point. The manifold can still be highly complex and nonlinear globally. It just can’t have singular points locally.

Mitigation 2: Anticipatory Routing, Preventing Chart Inconsistency

Where SwiGLU clamping addresses the precondition, Anticipatory Routing attacks Stage 2 directly: the moment of chart inconsistency.

Recall the problem: at training step t, the backbone parameters θₜ define the current representation manifold Mₜ. If the router also updates to θₜ simultaneously, it is now making routing decisions based on a manifold that is being constructed in real time. The routing chart and the geometric chart are out of sync.

Anticipatory Routing enforces a temporal consistency condition: at step t, routing decisions are made using the historical parameters θₜ₋Δₜ. The router operates on the snapshot of the manifold that produced the current token representations. In differential geometry terms, this is analogous to a connection — a rule for parallel-transporting objects along the manifold in a consistent way. You don’t route a vector using the geometry at the destination; you route it using the geometry at the source.

The implementation is elegant: the data for step t is fetched in advance at step t − Δt, and routing indices are precomputed and cached. The router never sees the manifold mid-update. The discrete chart (routing assignment) and the continuous chart (representation geometry) remain synchronized.

The dynamic activation is particularly revealing. The system detects loss spikes and then activates Anticipatory Routing for a stabilization period before reverting to standard training. This is the system monitoring for chart inconsistency and applying the consistency condition on demand — a feedback loop that treats geometric misalignment as a detectable, correctable event rather than an inevitable one.
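The core idea, routing step t with parameters from step t − Δt, can be sketched with a small snapshot buffer. This is a toy under loud assumptions: the class, the scalar tokens, and the snapshot mechanism are invented for illustration, whereas the report describes prefetching data and caching precomputed routing indices.

```python
from collections import deque


class AnticipatoryRouter:
    """Toy top-1 router that scores tokens with weights from `lag` updates ago."""

    def __init__(self, weights, lag):
        # Snapshot buffer; index 0 always holds the weights from `lag` updates ago.
        self.history = deque([list(weights)] * (lag + 1), maxlen=lag + 1)

    def route(self, token):
        stale = self.history[0]                    # historical parameters
        scores = [w * token for w in stale]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, new_weights):
        self.history.append(list(new_weights))     # the oldest snapshot drops out


router = AnticipatoryRouter([1.0, -1.0], lag=2)
decisions = [router.route(0.5)]
for _ in range(3):
    router.update([-1.0, 1.0])                     # the live router flips its preference
    decisions.append(router.route(0.5))
print(decisions)   # [0, 0, 0, 1]: the flip only shows up after the lag has elapsed
```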

Mitigation 3: mHC, Bounding Tear Propagation

Manifold-Constrained Hyper-Connections (mHC) is the deepest of the three interventions, and the one most explicitly geometric in the paper’s own framing: it’s right there in the name.

Standard Hyper-Connections expand the residual stream width by a factor of n_hc, introducing a residual mapping matrix B_l ∈ ℝⁿ×ⁿ (with n = n_hc) at each layer. The update rule is:

X_{l+1} = B_l X_l + C_l F_l(A_l X_l)

The paper found that naive HC exhibited numerical instability when stacked, precisely because B_l is unconstrained. An unconstrained matrix can have spectral norm > 1. The spectral norm is the largest singular value: the maximum factor by which the matrix can stretch a vector. If ‖B_l‖₂ > 1, the residual mapping is expansive: it stretches the representation space at each layer. A small perturbation, say, a token slightly misrouted at layer 5, gets amplified at layer 6, further amplified at layer 7, and so on. By the time it reaches layer L, the tear has been stretched across a large region of representation space. This is tear amplification.

The Birkhoff polytope constraint fixes this by requiring B_l to be a doubly stochastic matrix:

B_l ∈ M := {M ∈ ℝⁿ×ⁿ | M·1ₙ = 1ₙ, 1ₙᵀ·M = 1ₙᵀ, M ≥ 0}

A doubly stochastic matrix has spectral norm bounded by 1. This makes B_l non-expansive: for any two hidden states x and y,

‖B_l·x − B_l·y‖ ≤ ‖x − y‖

The residual mapping cannot increase the distance between any two points. It is a Lipschitz-1 map on the residual stream. A tear introduced at any layer cannot grow as it propagates through subsequent layers. The manifold can be torn (mHC does not remove the routing boundary discontinuities), but it cannot be stretched apart. The damage is contained.

The Sinkhorn-Knopp algorithm that enforces this constraint is itself geometrically beautiful: it alternates between two convex constraint sets (row-normalized matrices and column-normalized matrices) until it converges to their intersection, the Birkhoff polytope. You are iteratively pulling B_l onto the set of doubly stochastic matrices. The constraint set M is itself a well-studied convex polytope, and Sinkhorn-Knopp is a convergent algorithm for reaching a point on it.
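A pure-Python sketch of the alternating normalization, followed by a spot check of the non-expansive inequality on one random pair of vectors. The matrix size, iteration count, and random values are arbitrary choices for illustration; this is not the mHC implementation, which parameterizes and normalizes B_l inside the network.

```python
import math
import random


def sinkhorn_knopp(matrix, iterations=500):
    """Alternately normalize rows and columns of a strictly positive square matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        for row in m:                                  # make every row sum to 1
            total = sum(row)
            row[:] = [v / total for v in row]
        for j in range(n):                             # make every column sum to 1
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


rng = random.Random(0)
raw = [[rng.uniform(0.1, 5.0) for _ in range(4)] for _ in range(4)]
B = sinkhorn_knopp(raw)

row_sums = [sum(row) for row in B]
col_sums = [sum(B[i][j] for i in range(4)) for j in range(4)]

x = [rng.gauss(0, 1) for _ in range(4)]
y = [rng.gauss(0, 1) for _ in range(4)]
diff = [a - b for a, b in zip(x, y)]

# Non-expansive check: ||B x - B y|| <= ||x - y|| (small tolerance for rounding).
print(norm(matvec(B, diff)) <= norm(diff) + 1e-9)
```

The inequality holds because a matrix whose rows and columns all sum to 1 with non-negative entries has spectral norm at most 1, which is the property the text relies on.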

Additionally, the input and output mappings A_l and C_l are constrained to be non-negative and bounded via Sigmoid functions. This prevents signal cancellation, another form of geometric pathology where opposing contributions annihilate each other, creating artificial zeros in the representation space that have no geometric meaning.

The Unified Picture

The three mitigations form a layered geometric defense, each acting at a different stage of the failure cascade:

Stage	Geometric Event	Mitigation	Mechanism
1	High local curvature	SwiGLU Clamping	Bounds activation magnitude; keeps optimizer’s local chart valid
2	Chart inconsistency	Anticipatory Routing	Synchronizes routing chart with geometric chart via temporal consistency
3	Tear amplification	mHC (Birkhoff)	Non-expansive residual mapping; Lipschitz-1 bound across layers

None of the three individually suffices. SwiGLU clamping reduces the probability of a high-curvature precondition but cannot prevent all routing misalignment. Anticipatory Routing prevents chart inconsistency during normal training but is only activated reactively. mHC contains the damage if a tear occurs but does not prevent it from forming.

Together, they form a cascade interrupt: clamping attacks the precondition, routing consistency attacks the event, and Lipschitz bounding attacks the aftermath. This is why the paper could say, with some confidence, that training was stabilized — even without a complete theoretical account.

What the Paper Leaves Open

The authors write:

“Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now…”

The geometric framing offered here suggests what that theoretical understanding might look like: a formal account of how routing boundary discontinuities propagate through residual streams, how spectral norm bounds on residual mappings contain this propagation, and how temporal consistency conditions in discrete selectors can be enforced as a parallel transport rule.

This is not a complete theory. It is a vocabulary, a set of geometric concepts precise enough to ask the right questions. The manifold is torn by construction in every MoE layer. The question is whether that tear propagates catastrophically or stays bounded. DeepSeek V4’s three mitigations are three answers to that question, each operating at a different scale.

Single Token Geometry is the project of taking seriously the idea that what happens to one token, passing through one forward pass, is already geometrically rich enough to explain phenomena we currently attribute to optimization dynamics. Loss spikes are one example.

Manifold Tearing Is Not DeepSeek’s Problem Alone

It is worth being precise about scope. Manifold tearing is not unique to DeepSeek, or to MoE architectures — though MoE does make it worse, by introducing routing boundaries that are discontinuous by construction. The pathology is more general.

We noticed and discussed this in our 2024 paper, Deep Manifold Part 1: Anatomy of Neural Network Manifold — specifically Section 3.6, Learning Transformation:

“This explains the zig-zag pattern observed in the loss curve during the slow decline stage of almost all foundation model training, suggesting that training is struggling to converge — a hidden bottleneck identified in our analysis. A model is considered to have converged effectively when the standard deviation of its loss values stabilizes at less than 5.”

That zig-zag pattern is not noise. It is not a quirk of the optimizer or the learning rate schedule. High loss deviation is an indication of manifold tearing — the loss surface is not smooth but discontinuous, and the optimizer is crossing those discontinuities rather than descending through them. The standard deviation of the loss is, in this framing, a roughness measure of the representation manifold. When it stabilizes, the manifold has settled into a geometry the optimizer can navigate. When it spikes, the manifold is tearing.
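The roughness measure described here is simply a windowed standard deviation of the logged loss. A minimal sketch, with an invented loss curve and an arbitrary window size:

```python
import statistics


def rolling_loss_std(losses, window):
    """Population standard deviation of the loss over each sliding window."""
    return [
        statistics.pstdev(losses[i - window:i])
        for i in range(window, len(losses) + 1)
    ]


# Hypothetical loss curves: a smooth decline, and the same curve with one spike.
smooth = [3.0 - 0.01 * t for t in range(20)]
spiky = smooth[:10] + [9.0] + smooth[11:]

print(max(rolling_loss_std(smooth, window=5)))
print(max(rolling_loss_std(spiky, window=5)))   # much larger around the spike
```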

This means manifold tearing is visible in almost every foundation model training run ever published. It has been attributed to many things — learning rate warmup, batch size scheduling, data curriculum, optimizer hyperparameters. These attributions are not wrong. But they are proximate causes. They describe the conditions under which tearing is more or less likely. They do not identify the root cause.

The root cause, beyond model architecture, is data.

Specifically: data complexity. The representation manifold that a model learns is not chosen by the architect; it is induced by the training data. A dataset with discontinuous structure (sharp distributional boundaries between domains, conflicting label geometries, or extreme token frequency imbalances) induces a representation manifold with discontinuities baked in from the first forward pass. The model is not tearing a smooth manifold. It is trying to learn a manifold that was never smooth to begin with. The routing boundaries in MoE are a second-order effect; the data boundaries are primary.

This is the argument of the next article in this series.

Single Token Geometry: Data Complexity will ask what it means, geometrically, for data to be complicit in manifold tearing: why certain data compositions make smooth representation manifolds impossible, and what that implies for how we should think about dataset curation, not as an engineering convenience, but as a geometric necessity.

Manifold tearing will continue to appear throughout this series. DeepSeek V4 gave us a precise, honest, and unusually well-documented instance of it. The mitigations they found are real and they work. But they are defenses against a pathology whose origin sits upstream of the architecture — in the data the model is asked to learn from, and in the geometry that data imposes on the representation space before training even begins.

Python Basics

2026-04-24T00:00:00+08:00

Lists

arr = [1, 2, 3]

# Common Operations
arr.index(1)      # Index of first occurrence (ValueError if missing)
arr.append(1)     # Add to end
arr.insert(0, 10) # Insert 10 at index 0 (the front)
arr.remove(3)     # Remove first occurrence of value (ValueError if missing)
arr.pop()         # Remove & return last element
arr.sort()        # In-place ascending sort (Timsort: O(n log n))
arr.sort(reverse=True)  # In-place descending sort (high to low)
arr.reverse()     # In-place reverse
arr.copy()        # Return shallow copy

# List Slicing
arr = [1, 2, 3]
# Generic slice syntax: arr[start:stop:step]
arr[-1]    # Last item
arr[::-1]  # Reversed copy of the list
arr[1:]    # Everything from index 1 onward
arr[:3]    # First three elements

# Sublists (aka slicing) are half-open: start inclusive, stop exclusive
arr[1:2]   # [2]
# Similar to for-loop ranges, the stop index is non-inclusive
# But slicing past the end raises no out-of-bounds error
arr[0:10]  # [1, 2, 3]

# Custom sort (e.g., by length of string)
arr = ["bob", "alice", "jane", "doe"]
arr.sort(key=lambda x: len(x))
print(arr)  # ['bob', 'doe', 'jane', 'alice']

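# 2-D lists
arr = [[0] * 4 for _ in range(4)]
print(arr)
print(arr[0][0], arr[3][3])

# This won't work: all four rows would be the same list object
# arr = [[0] * 4] * 4

Intervals in Python are generally half-open (left-closed, right-open), for example in range and slicing.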
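The reason `[[0] * 4] * 4` fails is aliasing: the outer multiplication copies references to one inner list. A short demonstration:

```python
aliased = [[0] * 4] * 4             # four references to the same row
aliased[0][0] = 1
print(aliased[3][0])                # 1: every "row" changed

independent = [[0] * 4 for _ in range(4)]
independent[0][0] = 1
print(independent[3][0])            # 0: rows are separate lists

print(aliased[0] is aliased[1], independent[0] is independent[1])  # True False
```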

Tuples

# Tuples are immutable lists
t = (1, 2, 3, 1)

# Essential Operations
t.count(1)      # Count occurrences of value
t.index(2)      # Find first index of value

# Useful Patterns
x, y = (1, 2)   # Tuple unpacking
coords = [(1,2), (3,4)]  # Tuple in collections

Sets

s = {1,2,3}

# Common Operations
s.add(4)             # Add element
s.remove(4)          # Remove (raises error if missing)
s.discard(4)         # Remove (no error if missing)
s.pop()              # Remove and return arbitrary element

# Set Operations
a.union(b)           # Elements in a OR b
a.intersection(b)    # Elements in a AND b
a.difference(b)      # Elements in a but NOT in b
a.symmetric_difference(b)  # Elements in a OR b but NOT both
a.issubset(b)        # True if all elements of a are in b
a.issuperset(b)      # True if all elements of b are in a
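
The set operations above use `a` and `b` without defining them. A small sketch with concrete values (chosen here purely for illustration) shows each result, using the equivalent operator forms.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b                  # same as a.union(b)
intersection = a & b           # same as a.intersection(b)
difference = a - b             # same as a.difference(b)
symmetric = a ^ b              # same as a.symmetric_difference(b)
is_subset = {3, 4} <= a        # same as {3, 4}.issubset(a)
is_superset = a >= {3, 4}      # same as a.issuperset({3, 4})

print(union, intersection, difference, symmetric, is_subset, is_superset)
```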

Strings

s = "hello world"

# Essential Methods
s.split()            # Split on whitespace
s.split(',')         # Split on comma
s.strip()            # Remove leading/trailing whitespace
s.lower()            # Convert to lowercase
s.upper()            # Convert to uppercase
s.isalnum()          # Check if alphanumeric
s.isalpha()          # Check if alphabetic
s.isdigit()          # Check if all digits
s.find('sub')        # Index of substring (-1 if not found)
s.count('sub')       # Count occurrences
s.replace('old', 'new')  # Replace all occurrences

# ASCII Conversion
ord('a')             # Char to ASCII (97)
chr(97)              # ASCII to char ('a')

# Valid numeric strings can be converted
print(int("123") + int("123"))  # 246

# And numbers can be converted to strings
print(str(123) + str(123))  # 123123

Queues

# Queues (double ended queue)
from collections import deque

# Perfect for BFS - O(1) operations on both ends
d = deque()
d.append(1)          # Add right
d.appendleft(2)      # Add left
d.pop()              # Remove right
d.popleft()          # Remove left
d.extend([1,2,3])    # Extend right
d.extendleft([1,2,3])# Extend left
d.rotate(n)          # Rotate n steps right (negative for left)

Heaps

import heapq

# MinHeap Operations - All O(log n) except heapify
nums = [3,1,4,1,5]
heapq.heapify(nums)          # Convert to heap in-place: O(n)
heapq.heappush(nums, 2)      # Add element: O(log n)
smallest = heapq.heappop(nums)  # Remove smallest: O(log n)

# MaxHeap Trick: Multiply by -1
nums = [-x for x in nums]    # Convert to maxheap: O(n)
heapq.heapify(nums)          # O(n)
largest = -heapq.heappop(nums)  # Get largest: O(log n)

# Advanced Operations
k_largest = heapq.nlargest(k, nums)    # O(n * log k)
k_smallest = heapq.nsmallest(k, nums)  # O(n * log k)

# Custom Priority Queue
heap = []
heapq.heappush(heap, (priority, item))  # Sort by priority

# Under the hood are arrays
minHeap = []
heapq.heappush(minHeap, 3)
heapq.heappush(minHeap, 2)
heapq.heappush(minHeap, 4)

# Min is always at index 0
print(minHeap[0])  # 2

while len(minHeap):
    print(heapq.heappop(minHeap))
# 2
# 3
# 4

# No max heaps by default, work around is
# to use min heap and multiply by -1 when push & pop.
maxHeap = []
heapq.heappush(maxHeap, -3)
heapq.heappush(maxHeap, -2)
heapq.heappush(maxHeap, -4)

# Max is always at index 0
print(-1 * maxHeap[0])  # 4

while len(maxHeap):
    print(-1 * heapq.heappop(maxHeap))
# 4
# 3
# 2

# Build heap from initial values
arr = [2, 1, 8, 4, 5]
heapq.heapify(arr)
while arr:
    print(heapq.heappop(arr))
# 1
# 2
# 4
# 5
# 8
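
The "Custom Priority Queue" line above pushes `(priority, item)` tuples with undefined names. A minimal sketch of that pattern follows; `drain_by_priority` and the task names are hypothetical. It adds an insertion counter as a tie-breaker so that items with equal priority are never compared directly.

```python
import heapq

def drain_by_priority(tasks):
    """Return task names ordered by ascending priority number.

    `tasks` is an iterable of (priority, name) pairs (an assumed input shape).
    """
    heap = []
    for order, (priority, name) in enumerate(tasks):
        # `order` breaks ties, so equal priorities pop in insertion order
        # and the names themselves are never compared.
        heapq.heappush(heap, (priority, order, name))
    result = []
    while heap:
        _, _, name = heapq.heappop(heap)
        result.append(name)
    return result

ordered = drain_by_priority([(2, "write"), (1, "plan"), (2, "review"), (0, "triage")])
print(ordered)
```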

Built-in Functions

# Iteration Helpers
enumerate(lst)        # Index + value pairs
zip(lst1, lst2)      # Parallel iteration
map(fn, lst)         # Apply function to all elements
filter(fn, lst)      # Keep elements where fn returns True
any(lst)             # True if any element is True
all(lst)             # True if all elements are True

# Binary Search
import bisect

bisect.bisect(lst, x)     # Rightmost insertion point (alias of bisect_right)
bisect.bisect_left(lst, x)# Find leftmost insertion point
bisect.insort(lst, x)     # Insert maintaining sort

# Type Conversion
int('42')            # String to int
str(42)              # Int to string
list('abc')          # String to list
''.join(['a','b'])   # List to string
set([1,2,2])         # List to set

# Math
abs(-5)              # Absolute value
pow(2, 3)            # Power
round(3.14159, 2)    # Round to decimals
divmod(10, 3)        # (3, 1) - returns (quotient, remainder)

# Binary representation
bin(10)              # '0b1010'
format(10, 'b')      # '1010' (without prefix)
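
The difference between the `bisect` functions listed above only shows up when the list contains duplicates. A short sketch, with a made-up `scores` list:

```python
import bisect

scores = [10, 20, 20, 20, 30]

left = bisect.bisect_left(scores, 20)    # first index i with scores[i] >= 20
right = bisect.bisect_right(scores, 20)  # first index i with scores[i] > 20
same_as_bisect = bisect.bisect(scores, 20) == right
count_of_20 = right - left               # number of occurrences of 20

bisect.insort(scores, 25)                # insert while keeping the list sorted

print(left, right, same_as_bisect, count_of_20, scores)
```

Note that `insort` still costs O(n) per call because the list has to shift elements; only the search part is O(log n).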

Math

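For small positive operands, int(a / b) and a // b give the same result, but they can differ when the result is negative:

int(a / b) truncates toward zero. Use it when you always want truncation, keeping in mind that a / b is float division and can lose precision for very large integers.
// rounds down (floor). Use a // b when you always want floor division.

The modulo operation a % b follows the same floor rule: a = (a // b) * b + (a % b), so a % b = a - (a // b) * b.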

# Truncate toward zero
print(int(3 / 2))  # 1
print(int(-3 / 2))  # -1

# Round down (floor)
print(3 // 2)  # 1
print(-3 // 2)  # -2
# floor(-1.5) = -2

# Modulo
print(10 % 3)  # 1
print(-10 % 3)  # 2
# -10 - (-10 // 3) * 3 = -10 - (-4) * 3 = 2
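
As a sketch of the rules above across all sign combinations, the snippet below compares floor division with an integer-only truncating division. The helper `trunc_div` is a hypothetical name, not a standard function; it avoids the float round-trip of `int(a / b)`.

```python
def trunc_div(a: int, b: int) -> int:
    """Integer division rounding toward zero, without going through floats."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

pairs = [(7, 2), (-7, 2), (7, -2), (-7, -2)]
floor_results = [a // b for a, b in pairs]
trunc_results = [trunc_div(a, b) for a, b in pairs]

# The identity a == (a // b) * b + (a % b) holds for every sign combination.
identity_holds = all(a == (a // b) * b + (a % b) for a, b in pairs)

# int(a / b) goes through a float and loses precision for large integers.
big = 10**18 + 1
float_path_is_exact = int(big / 1) == big
int_path_is_exact = trunc_div(big, 1) == big

print(floor_results, trunc_results, identity_holds)
print(float_path_is_exact, int_path_is_exact)
```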

import math

# Constants
math.pi       # 3.141592653589793
math.e        # 2.718281828459045

# Common Functions
math.ceil(2.3)        # 3 - Smallest integer greater than or equal to x
math.floor(2.3)       # 2 - Largest integer less than or equal to x
math.gcd(a, b)        # Greatest common divisor
math.log(x, base)     # Logarithm with specified base
math.sqrt(x)          # Square root
math.pow(x, y)        # x^y (prefer x ** y for integers)

# Trigonometry
math.degrees(rad)     # Convert radians to degrees
math.radians(deg)     # Convert degrees to radians

Algorithm Basics

2026-04-24T00:00:00+08:00

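Framework Overview

Overall Framework

Data structures (insert, delete, search, update)
- Arrays (contiguous storage)
  - Dynamic arrays
  - Strings
  - Hash tables
  - …
- Linked lists (linked storage)
  - Singly/doubly linked lists
  - Trees
  - …
Algorithms (exhaustive search)
- How to avoid missing cases
  - Backtracking
  - Dynamic programming
  - DFS
  - BFS
  - …
- How to avoid redundant work
  - Binary search
  - Sliding window
  - Greedy
  - …

Traversing Each Data Structure

Array traversal, a linear iterative structure: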

from typing import List

def traverse(arr: List[int]) -> None:
    for i in range(len(arr)):
        # Iteratively visit arr[i]
        pass  # placeholder for the per-element work

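Linked-list traversal, which supports both iterative and recursive structure:

# Basic singly linked list node
class ListNode:
    def __init__(self, val):
        self.val = val
        self.next = None

def traverse(head: ListNode) -> None:
    p = head
    while p is not None:
        # Iteratively visit p.val
        p = p.next

def traverse(head: ListNode) -> None:
    if head is None:  # base case: reached the end of the list
        return
    # Recursively visit head.val
    traverse(head.next)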

Binary-tree traversal, a typical non-linear recursive structure:

# Basic binary tree node
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def traverse(root: TreeNode):
    if root is None:  # base case: empty subtree
        return
    traverse(root.left)
    traverse(root.right)
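
Where the visit is placed relative to the two recursive calls determines the traversal order. The sketch below collects all three orders in one pass; the three output lists and the small example tree are assumptions made for illustration.

```python
from typing import List, Optional

class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def traverse(root: Optional[TreeNode], pre: List[int], ino: List[int], post: List[int]) -> None:
    if root is None:
        return
    pre.append(root.val)               # before both children: preorder
    traverse(root.left, pre, ino, post)
    ino.append(root.val)               # between the children: inorder
    traverse(root.right, pre, ino, post)
    post.append(root.val)              # after both children: postorder

#     1
#    / \
#   2   3
#  /
# 4
root = TreeNode(1, TreeNode(2, TreeNode(4)), TreeNode(3))
pre, ino, post = [], [], []
traverse(root, pre, ino, post)
print(pre, ino, post)
```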

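Algorithm Complexity

Master Theorem

Suppose we have the recurrence:

$T(N) = aT(N/b) + f(N), \quad f(N) = \Theta(N^{\log_b(a)} \log^k(N)), \ k \ge 0$

where $N$ is the problem size, $a$ is the number of subproblems, $N/b$ is the size of each subproblem (assuming the subproblems are roughly equal in size), and $f(N)$ is the work done outside the recursive calls.

Then the complexity is:

$T(N) = O(N^{\log_b(a)} \log^{(k+1)}N)$

If $f(N)$ is instead polynomially smaller than $N^{\log_b(a)}$ (as in the binary tree traversal row of the table, where $f(N) = O(1)$ and $N^{\log_b(a)} = N$), the recursion dominates and $T(N) = O(N^{\log_b(a)})$.

Common Algorithm Complexities

Algorithm	Recurrence	Complexity
Binary search	$T(N) = T(N/2) + O(1)$	$O(\log(N))$
Binary tree traversal	$T(N) = 2T(N/2) + O(1)$	$O(N)$
Merge sort	$T(N) = 2T(N/2) + O(N)$	$O(N\log(N))$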
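
As a worked check of the formula against two rows of the table (a sketch; the values of $a$, $b$, and $k$ are read off from the recurrences):

$$
\text{Merge sort: } a = 2,\ b = 2,\ N^{\log_2 2} = N,\ f(N) = N = N \log^0 N \Rightarrow k = 0,\quad T(N) = O(N \log^{1} N) = O(N \log N)
$$

$$
\text{Binary search: } a = 1,\ b = 2,\ N^{\log_2 1} = 1,\ f(N) = 1 = 1 \cdot \log^0 N \Rightarrow k = 0,\quad T(N) = O(\log^{1} N) = O(\log N)
$$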

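Two Pointers

Two pointer variables move in a coordinated way over a linear structure such as an array or linked list, avoiding nested loops and reducing some $O(N^2)$ algorithms to $O(N)$. The main variants are:

Same-direction pointers (fast and slow): the fast pointer moves ahead and the slow pointer follows. Common in sliding windows (deduplication) and linked-list operations (finding the midpoint, detecting a cycle, finding the cycle entry).
Opposite-direction pointers (converging): the pointers move from both ends toward the middle. Common for pair sums in sorted arrays, palindrome checks, reversing an array, and merging arrays.
Outward-moving pointers: the pointers expand from the middle toward both ends. Common for palindromic substrings, such as the longest palindromic substring.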
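
As an illustration of the converging variant, here is a minimal sketch that finds a pair with a given sum in a sorted array. The function name `pair_with_sum` is hypothetical; the technique relies on the input being sorted in ascending order.

```python
from typing import List, Optional, Tuple

def pair_with_sum(sorted_nums: List[int], target: int) -> Optional[Tuple[int, int]]:
    """Return indices (left, right) with sorted_nums[left] + sorted_nums[right] == target."""
    left, right = 0, len(sorted_nums) - 1
    while left < right:
        total = sorted_nums[left] + sorted_nums[right]
        if total == target:
            return left, right
        if total < target:
            left += 1    # need a larger sum: move the small end up
        else:
            right -= 1   # need a smaller sum: move the large end down
    return None

found = pair_with_sum([1, 2, 4, 7, 11, 15], 15)
missing = pair_with_sum([1, 2, 3], 7)
print(found, missing)
```

Each step discards one end for good, so the loop runs at most N - 1 times, compared with the N(N-1)/2 pairs a nested loop would check.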

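Algorithm Complexity

Typically the time complexity is $O(N)$ (determined by how many times the innermost loop body runs) and the space complexity is $O(1)$.

Use Cases

Sliding window (90%)
Required time complexity of $O(N)$ (80% are two pointers)
In-place operation required: only swaps allowed, no extra space (80%)
Keywords subarray / substring (50%)
Keyword palindrome (50%)

Code Template

Initialize pointers: set the starting points of left and right according to the direction
Loop control: use while or for to drive the movement (for example, right expands and left shrinks)
State update: maintain the current window or pair state, and branch on the conditions
Record results: update the answer (on equality, or when the condition holds)
Edge cases: empty array, single element, skipping duplicates, and so on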

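# Generic two-pointer framework (for arrays/lists)
# need_to_move_left is a placeholder for the problem-specific shrink condition
def two_pointers(arr):
    n = len(arr)
    if n == 0:
        return 0  # or another default value
    result = 0  # the answer being maintained

    # Step 1: initialize the pointers
    left = 0                    # left pointer / slow pointer
    # right = 0 or n - 1, depending on the direction

    # Step 2: choose the loop structure for the variant
    for right in range(n):      # same direction: fast/slow pointers; sliding window
    # while left < right:       # opposite direction: converging pointers (common on sorted arrays)
    # while left < n:           # other loop conditions

        # Step 3: after extending or moving the right pointer, process the current window/state
        # ... update state

        # Step 4: decide whether the left pointer must shrink (sliding-window style)
        while left <= right and need_to_move_left(arr, left, right):
            # ... update or record the result
            left += 1

        # Or: move the two pointers by condition (converging style)
        # if condition:
        #     left += 1
        # else:
        #     right -= 1

    return result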

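Example Problems

88. Merge Sorted Array

"""
📖 Description: You are given two integer arrays `nums1` and `nums2`, both sorted in non-decreasing order, and two integers `m` and `n` giving the number of elements in `nums1` and `nums2`.
    Merge `nums2` into `nums1` so that the merged array is also sorted in non-decreasing order.
🧪 Example: Input: nums1 = [1,2,3,0,0,0], m = 3, nums2 = [2,5,6], n = 3; Output: [1,2,2,3,5,6]
💡 Key point: work from back to front so values can be written in place.
"""

def merge(nums1: List[int], m: int, nums2: List[int], n: int) -> None:
    """
    Do not return anything, modify nums1 in-place instead.
    """
    # Backward two pointers: filling from the back never overwrites unread values
    p1, p2 = m - 1, n - 1  # same direction, but moving from back to front
    tail = m + n - 1  # state to maintain: the index currently being filled
    while True:
        if p1 < 0 or p2 < 0:
            break
        if nums1[p1] <= nums2[p2]:
            nums1[tail] = nums2[p2]
            p2 -= 1
            tail -= 1
        else:
            nums1[tail] = nums1[p1]
            p1 -= 1
            tail -= 1
    # One array always runs out first. If nums2 still has elements, copy them over;
    # any remaining elements of nums1 are already in place.
    if p2 >= 0:
        nums1[: p2 + 1] = nums2[: p2 + 1]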

def merge(nums1: List[int], nums2: List[int]) -> List[int]:
    """ Merge with two pointers, not in place.
    🧪 Example: Input: nums1 = [1,2,3], nums2 = [2,5,6]; Output: [1,2,2,3,5,6]
    """
    m, n = len(nums1), len(nums2)
    new_list = []
    i, j = 0, 0
    # Only move i and j while merging; avoid operations such as list1.pop(0),
    # because pop(0) is O(n) and also shifts the indices
    while True:
        if i >= m or j >= n:
            break
        if nums1[i] < nums2[j]:
            new_list.append(nums1[i])
            i += 1
        else:
            new_list.append(nums2[j])
            j += 1
    # Append whatever remains to new_list
    while i < m:
        new_list.append(nums1[i])
        i += 1
    while j < n:
        new_list.append(nums2[j])
        j += 1
    return new_list

21. Merge Two Sorted Lists

# Definition for singly-linked list.
# class ListNode:
#     def __init__(self, val=0, next=None):
#         self.val = val
#         self.next = next
def mergeTwoLists(list1: ListNode, list2: ListNode) -> ListNode:
    dummy = ListNode()  # dummy head node: its only job is to give p a starting point to keep appending nodes to
    p = dummy  # p points at the tail of the merged list
    p1 = list1
    p2 = list2

    while p1 is not None and p2 is not None:
        # Compare the nodes at p1 and p2
        # and attach the smaller one after p
        if p1.val > p2.val:
            p.next = p2
            p2 = p2.next
        else:
            p.next = p1
            p1 = p1.next
        # p keeps moving forward
        p = p.next

    if p1 is not None:
        p.next = p1

    if p2 is not None:
        p.next = p2

    return dummy.next  # note: return the head of the list (dummy.next), not the pointer p

Dummy head node

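5. Longest Palindromic Substring

"""
📖 Description: Given a string `s`, find the longest palindromic substring in `s`.
🧪 Examples: Input s = "babad"; Output "bab" or "aba". Input: s = "cbbd"; Output: "bb".
💡 Key points:
- Both odd-length and even-length palindromes must be considered
- Expand around the center
- This problem can also be solved with dynamic programming:
    - State: dp[i][j] indicates whether s[i:j+1] is a palindrome
    - Initialization: dp = [[False for _ in range(size)] for _ in range(size)]
    - Transition: dp[i][j] = dp[i+1][j-1] and s[i] == s[j] (for lengths 1 and 2, s[i] == s[j] alone is enough)
"""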
def longestPalindrome(s: str) -> str:
    n = len(s)
    if n <= 1:
        return s
    max_s, max_len = "", 0
    for i in range(n):
        if n - 1 - i < (max_len - 1) / 2:
            break  # early termination: no longer palindrome can be centered this far right
        # Odd-length palindromes: expand outward from center i
        left, right = i, i
        while True:
            if left < 0 or right > n - 1:
                break
            if s[left] == s[right]:
                left -= 1
                right += 1
            else:
                break  # note every condition that breaks out of the loop
        cur_len = right - left - 1
        if cur_len > max_len:
            max_s = s[left + 1 : right]
            max_len = cur_len
        # Even-length palindromes: the center is between i and i + 1
        left, right = i, i + 1
        while True:
            if left < 0 or right > n - 1:
                break
            if s[left] == s[right]:
                left -= 1
                right += 1
            else:
                break
        cur_len = right - left - 1
        if cur_len > max_len:
            max_s = s[left + 1 : right]
            max_len = cur_len
    return max_s
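
The dynamic-programming alternative mentioned in the docstring can be sketched as follows. This is an illustrative version, not the author's code: the function name `longest_palindrome_dp` is an assumption, and it uses the interval transition `dp[i][j] = (s[i] == s[j]) and dp[i+1][j-1]`, iterating `i` downward so the inner interval is always computed first.

```python
def longest_palindrome_dp(s: str) -> str:
    n = len(s)
    if n <= 1:
        return s
    # dp[i][j] is True when s[i:j+1] is a palindrome.
    dp = [[False] * n for _ in range(n)]
    best_start, best_len = 0, 1
    # i goes from right to left so that dp[i + 1][j - 1] is already filled in.
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            # Lengths 1 and 2 have no inner interval to consult.
            if s[i] == s[j] and (j - i < 2 or dp[i + 1][j - 1]):
                dp[i][j] = True
                if j - i + 1 > best_len:
                    best_start, best_len = i, j - i + 1
    return s[best_start:best_start + best_len]

print(longest_palindrome_dp("cbbd"))
print(longest_palindrome_dp("babad"))
```

This costs O(N^2) time and O(N^2) space, whereas center expansion needs only O(1) extra space, which is one reason the expansion approach is usually preferred for this problem.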

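930. Binary Subarrays With Sum

"""
📖 Description: Given a binary array `nums` and an integer `goal`, return the number of non-empty subarrays whose sum equals `goal`.
🧪 Example:
    Input: nums = [1,0,1,0,1], goal = 2
    Output: 4
    Explanation: four subarrays qualify: [1,0,1], [1,0,1,0], [0,1,0,1], [1,0,1] (the two [1,0,1] occur at different positions)
💡 Key points:
    1. Prefix sums plus a hash table work, similar to Two Sum
    2. A sliding window also works, because the elements are non-negative (only 0 and 1)
"""

# Method 1: prefix sums + hash table, time O(N), space O(N)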
def numSubarraysWithSum(nums: List[int], goal: int) -> int:
    from collections import defaultdict
    prefix_sum = defaultdict(int)
    prefix_sum[0] = 1  # one prefix has sum 0 (the empty prefix)
    cur_sum = 0
    count = 0
    for num in nums:
        cur_sum += num
        # Count earlier prefixes whose sum equals cur_sum - goal
        count += prefix_sum[cur_sum - goal]
        prefix_sum[cur_sum] += 1
    return count

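# Method 2: sliding window, time O(N), space O(1)
# The elements are non-negative, so a sliding window applies
# atMost(goal) returns the number of subarrays with sum <= goal
# answer = atMost(goal) - atMost(goal - 1)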
def numSubarraysWithSum(nums: List[int], goal: int) -> int:
    def atMost(goal):
        if goal < 0:
            return 0
        left = 0
        cur_sum = 0
        count = 0
        for right in range(len(nums)):
            cur_sum += nums[right]
            while cur_sum > goal:
                cur_sum -= nums[left]
                left += 1
            count += right - left + 1  # number of valid subarrays ending at right
        return count

    return atMost(goal) - atMost(goal - 1)

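Sliding Window

A sliding window is a form of fast/slow two pointers: the two pointers move in the same direction, one ahead of the other, and the elements between them form the window. The technique mainly solves subarray problems, such as finding the longest or shortest subarray that satisfies some condition.

Unlike a naive two-pointer scan (nested loops, $O(N^2)$), each element enters and leaves the window (a queue) at most once, because left and right only ever increase, so the complexity is $O(N)$. The key decision is when to advance left.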

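# Sliding window template
def sliding_window(s: str):
    # Record the window contents in a suitable data structure, adapted to the problem
    # For example, use a map (dict) to count occurrences of elements in the window
    # If only the window sum is needed, a single int is enough
    window = set()

    left, right = 0, 0
    while right < len(s):
        # c is the character entering the window
        c = s[right]
        window.add(c)
        # Grow the window
        right += 1
        # Update the window data as needed
        ...

        # Decide whether the left side of the window must shrink
        # (window_needs_shrink is a placeholder for the problem-specific condition)
        while left < right and window_needs_shrink(s, left, right):
            # Remove s[left] from the window
            window.remove(s[left])
            # Shrink the window
            left += 1
            # Update the window data as needed
            ...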

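With this framework, for any substring/subarray problem you only need to answer three questions:

When should right move to grow the window? When a character enters the window, what data must be updated?
When should the window stop growing and left start moving to shrink it? When a character leaves the window, what data must be updated?
When should the result be updated?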
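
To make the three questions concrete, here is a sketch for a shortest-subarray problem (in the style of LeetCode 209, which is not part of the original notes). It assumes all numbers are positive, so the window sum shrinks whenever left advances; the function name is illustrative.

```python
from typing import List

def min_subarray_len(target: int, nums: List[int]) -> int:
    """Length of the shortest contiguous subarray with sum >= target, or 0 if none.

    Assumes every number in nums is positive.
    """
    left = 0
    window_sum = 0
    best = len(nums) + 1  # sentinel: longer than any real window
    for right, value in enumerate(nums):
        # Question 1: right always advances; entering values are added to the sum.
        window_sum += value
        # Question 2: shrink while the window is still valid (sum >= target).
        while window_sum >= target:
            # Question 3: every valid window is a candidate answer.
            best = min(best, right - left + 1)
            window_sum -= nums[left]
            left += 1
    return 0 if best > len(nums) else best

print(min_subarray_len(7, [2, 3, 1, 2, 4, 3]))
```

Here the result is updated inside the shrink loop because the goal is the shortest window; for longest-window problems the update usually happens after the shrink loop instead.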

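Example Problems

3. Longest Substring Without Repeating Characters

"""
📖 Description: Given a string s, find the length of the longest substring without repeating characters.
🧪 Example:
    Input: s = "abcabcbb"
    Output: 3
    Explanation: the longest substring without repeating characters is "abc", so the length is 3. Note that "bca" and "cab" are also valid answers.
💡 Key point: in a sliding window the pointers only ever move forward
"""
# Incorrect usage
# This version does not guarantee that each element is processed once (l advances, but r is reset for every l);
# it is a brute-force nested loop with O(N^2) worst-case complexity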
def lengthOfLongestSubstring(s: str) -> int:
    N = len(s)
    ans = 0
    for l in range(N):
        substr = {s[l]}
        for r in range(l + 1, N):
            if s[r] not in substr:
                substr.add(s[r])
            else:
                break
        ans = max(ans, len(substr))
    return ans

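# Correct usage 1
# right only increases, so the pointers make O(N) moves in total (23 ms);
# note that the slice and the `in` test each cost O(window length)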
def lengthOfLongestSubstring(s: str) -> int:
    N = len(s)
    right, ans = 0, 0
    for left in range(N):
        substr = s[left:right]
        while right < N:
            if s[right] in substr:
                break
            else:
                right += 1
                substr = s[left:right]
        ans = max(ans, len(substr))
    return ans

# Correct usage 2
# window tracks the count of each character; drop s[left] until win[s[right]] <= 1. O(N), 76 ms
def lengthOfLongestSubstring(s: str) -> int:
    win, ans = dict(), 0
    left, right = 0, 0
    while right < len(s):
        win[s[right]] = win.get(s[right], 0) + 1
        # Advance the left pointer until win[s[right]] <= 1
        while win[s[right]] > 1:
            win[s[left]] -= 1
            left += 1
        ans = max(ans, right - left + 1)
        right += 1
    return ans

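567. Permutation in String

"""
📖 Description: Given two strings s1 and s2, write a function that decides whether s2 contains a permutation of s1. Return true if so, otherwise false.
🧪 Example:
    Input: s1 = "abb" s2 = "eidbabooo"
    Output: true
    Explanation: s2 contains one permutation of s1 ("bab").
💡 Key points:
"""

# Incorrect solution
# Brute force: rebuilds the Counter for every window, time O(M * N) with N = len(s1) and M = len(s2)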
def checkInclusion(s1: str, s2: str) -> bool:
    from collections import Counter

    N, M = len(s1), len(s2)
    ref = Counter(s1)
    # The window length is always N
    for i in range(M):
        substr = s2[i: i + N]
        win = Counter(substr)
        if win == ref:
            return True
    return False

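# Correct solution
# No need to recount every time: only update the counts of the characters entering and leaving the window.
# Time O(M) for a fixed alphabet, with M = len(s2)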
def checkInclusion(s1: str, s2: str) -> bool:
    N, M = len(s1), len(s2)
    if N > M:
        return False

    ref = {}
    for c in s1:
        ref[c] = ref.get(c, 0) + 1

    window = {}
    for i in range(N):
        window[s2[i]] = window.get(s2[i], 0) + 1

    right = N
    while right < M:
        # print(window)
        if window == ref:
            return True
        new = s2[right]
        old = s2[right - N]
        # Maintain the sliding window
        window[new] = window.get(new, 0) + 1
        window[old] = window.get(old, 0) - 1
        if window[old] == 0:
            del window[old]
        right += 1

    if window == ref:
        return True
    return False

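Searching

Searching is one of the most basic operations, and the most common form is binary search: finding the index of a value query directly in a sorted array. Typical approaches:

Two pointers: compare array[mid] with query (mid = low + (high-low)//2) and update the pointers low and high accordingly. Termination conditions:
- (1) query is found (array[mid] == query)
- (2) the pointers cross (low > high)
Recursion: split into left and right subarrays; if array[mid] does not equal query, keep searching the left or right subarray until query is found or the subarray is empty.

The key is case analysis. Closed intervals on both ends are recommended: reason carefully about array[low:mid], array[mid], and array[mid+1:high+1], and check whether each of the three pieces can be empty.

Algorithm Complexity

Time $O(\log(N))$: only one side is searched each time, so there is a single subproblem. Space $O(1)$ for the iterative version.

Use Cases

The array is already sorted (30-40% are binary search)
The interviewer asks for an algorithm faster than $O(N)$ (99%)
Find a split position in the array such that the left part satisfies a condition and the right part does not (100%)
Find the maximum/minimum value that satisfies a condition (90%)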
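
The last two use cases (finding a split position, or the largest/smallest value that satisfies a condition) share one shape: binary search over a monotone predicate. A minimal sketch, assuming the predicate is False for a prefix of the range and True for the rest; `first_true` is a hypothetical helper name.

```python
from typing import Callable

def first_true(low: int, high: int, pred: Callable[[int], bool]) -> int:
    """Smallest i in the closed interval [low, high] with pred(i) True.

    Returns high + 1 when pred is False everywhere.
    pred must be monotone: False ... False True ... True.
    """
    while low <= high:
        mid = low + (high - low) // 2
        if pred(mid):
            high = mid - 1   # mid works; look for an earlier True
        else:
            low = mid + 1    # mid fails; the answer is to the right
    return low

arr = [1, 3, 3, 5, 8]
split = first_true(0, len(arr) - 1, lambda i: arr[i] >= 3)   # first index with value >= 3
none = first_true(0, len(arr) - 1, lambda i: arr[i] >= 9)    # no such index
isqrt_30 = first_true(0, 30, lambda r: r * r > 30) - 1       # largest r with r * r <= 30
print(split, none, isqrt_30)
```

"Largest value satisfying a condition" is handled by searching for the first value that violates it and subtracting one, as in the integer square root line.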

Code Template

def hash_search(arr, query):
    # Hash lookup, for unsorted arrays
    seen = {}
    for i, val in enumerate(arr):
        complement = query - val
        if complement in seen:
            return [seen[complement], i]  # as in Two Sum
        seen[val] = i
    return -1

def binary_search(array, query):
    """ Two pointers. [low, high] is split into:
        (1) [low, mid - 1]
        (2) [mid]
        (3) [mid + 1, high]
    Returns the index of query, or -1 if it is absent.
    """
    low, high = 0, len(array) - 1  # closed interval [low, high]
    while low <= high:
        mid = low + (high - low) // 2  # overflow-safe form (matters in fixed-width languages)
        val = array[mid]
        # array[low:mid], array[mid], array[mid+1:high+1]
        if val == query:
            return mid
        if val < query:
            low = mid + 1
        else:
            high = mid - 1
    return -1

def binary_search_recur(array, low, high, query):
    """ Recursive version. [low, high] is split into:
        (1) [low, mid - 1]
        (2) [mid]
        (3) [mid + 1, high]
    """
    if low > high:
        return -1
    mid = low + (high - low) // 2   # This mid will not break integer range
    if query < array[mid]:
        return binary_search_recur(array, low, mid - 1, query)  # Go search in the left subarray
    if query > array[mid]:
        return binary_search_recur(array, mid + 1, high, query)  # Go search in the right subarray
    return mid  # `array[mid] = query`, stop recurrence

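Example Problems

33. Search in Rotated Sorted Array

"""
📖 Description: Given a rotated sorted array `nums` and an integer `target`, return the index of `target` if it exists in `nums`, otherwise return `-1`.
🧪 Example: Input: `nums = [2,3,4,5,6,7,0,1]`, `target = 0`; Output: the index of `target` is `6`.
💡 Key points:
1. The array is not sorted as a whole, but it is locally sorted. In the sorted part the leftmost value is at most the rightmost value; in the unsorted part the leftmost value is greater than the rightmost.
2. Whether the target lies in the sorted part is easy to check with `nums[left_] <= target and target <= nums[right_]`; if that fails, the target is on the other side.

"""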
def search(nums: List[int], target: int) -> int:
    low, high = 0, len(nums) - 1
    while low <= high:
        mid = low + (high - low) // 2
        val = nums[mid]
        # print(mid, nums[low:mid], nums[mid], nums[mid+1:high+1])
        if val == target:
            return mid
        if low < mid and nums[low] <= nums[mid - 1]:
            # Left part is sorted: first check whether target lies in it
            if nums[low] <= target and target <= nums[mid - 1]:
                high = mid - 1
            else:
                low = mid + 1
        elif mid < high:
            # Right part is sorted: first check whether target lies in it
            if nums[mid + 1] <= target and target <= nums[high]:
                low = mid + 1
            else:
                high = mid - 1
        else:
            return -1
    return -1

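658. Find K Closest Elements

"""
📖 Description: Given a sorted array `arr` and two integers `k` and `x`, find the `k` numbers closest to `x`. The result must be sorted in ascending order.
🧪 Example: Input: `arr = [1,2,3,4,5]`, `k = 4`, `x = 3`; Output: `[1,2,3,4]`.
💡 Key points:
1. Think in reverse: remove the `n - k` outermost elements, deciding each time whether to drop the leftmost or the rightmost.
2. The result must stay sorted, so two pointers can locate the best contiguous window.
"""

def findClosestElements(arr: List[int], k: int, x: int) -> List[int]:
    # Elimination (two pointers)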
    N = len(arr)
    remove_nums = N - k
    left, right = 0, N - 1
    while remove_nums:
        # Note the meaning of the equals sign: the problem prefers the smaller value on ties,
        # so on a tie we drop the right end
        if x - arr[left] <= arr[right] - x:
            right -= 1
        else:
            left += 1
        remove_nums -= 1
    return arr[left:left + k]

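215. Kth Largest Element in an Array

"""
📖 Description:
    Given an integer array nums and an integer k, return the kth largest element in the array.
    Note that it is the kth largest element in sorted order, not the kth distinct element.
    You must design and implement an algorithm with O(n) time complexity.
🧪 Examples:
    Input: [3,2,1,5,6,4], k = 2
    Output: 5

    Input: [3,2,3,1,2,4,5,5,6], k = 4
    Output: 4
💡 Key points:
"""
import random
from typing import List

class Solution:
    def partition(self, nums: List[int], left: int, right: int) -> int:
        """
        Pick a random pivot from the subarray [left, right]
        Rearrange the subarray [left, right] around pivot
        Afterwards, elements <= pivot are to the left of pivot and elements >= pivot are to its right
        Return the index of pivot in the rearranged nums
        In particular, if every element of the subarray equals pivot, the returned index is near the middle of the subarray, which avoids degenerate splits
        """

        # 1. Pick a random pivot from the subarray [left, right]
        i = random.randint(left, right)
        pivot = nums[i]
        # Swap pivot with the first element so it stays out of the way during partitioning, which simplifies the logic
        nums[i], nums[left] = nums[left], nums[i]

        # 2. Scan the subarray [left + 1, right] with converging two pointers
        # Loop invariant: during the loop the subarray always looks like this
        # [ pivot | <=pivot | unvisited | >=pivot ]
        #   ^                 ^       ^         ^
        #   left              i       j         right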

        i, j = left + 1, right
        while True:
            while i <= j and nums[i] < pivot:
                i += 1
            # Now nums[i] >= pivot (or i has passed j)

            while i <= j and nums[j] > pivot:
                j -= 1
            # Now nums[j] <= pivot (or j has passed i)

            if i >= j:
                break

            # Restore the loop invariant
            nums[i], nums[j] = nums[j], nums[i]
            i += 1
            j -= 1

        # After the loop
        # [ pivot | <=pivot | >=pivot ]
        #   ^             ^   ^     ^
        #   left          j   i     right

        # 3. Swap pivot with nums[j] to complete the partition
        # Why swap with j?
        # Swapping with i could hit i = right + 1, which is out of bounds
        # Also, if nums[i] > pivot, the swap would move a value greater than pivot to the far left, which is not a valid partition
        # Swapping with j is safe even when j = left
        nums[left], nums[j] = nums[j], nums[left]

        # After the swap
        # [ <=pivot | pivot | >=pivot ]
        #               ^
        #               j

        # Return the index of pivot
        return j

    def findKthLargest(self, nums: List[int], k: int) -> int:
        n = len(nums)
        target_index = n - k  # the kth largest element sits at index n - k in ascending order
        left, right = 0, n - 1  # closed interval
        while True:
            i = self.partition(nums, left, right)
            if i == target_index:
                # Found the kth largest element
                return nums[i]
            if i > target_index:
                # The kth largest element is in [left, i - 1]
                right = i - 1
            else:
                # The kth largest element is in [i + 1, right]
                left = i + 1

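Sorting

Algorithm Complexity

Time complexity:

Quick sort: expected $O(N\log(N))$, worst case $O(N^2)$
Merge sort: $O(N\log(N))$ in every case

Space complexity:

Quick sort: expected $O(\log(N))$ for the recursion stack (no auxiliary array)
Merge sort: $O(N)$

Use Cases

Data must be sorted before further processing (binary search, merging intervals, and so on) (90%)
Quick select / Top K problems (a quick sort variant) (80%)
Counting inversions (a merge sort variant) (90%)
Merging multiple sorted arrays/lists (the merge sort idea) (80%)
Use merge sort when a stable sort is required (100%)

Code Template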

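# Quick sort (requires `import random`)
def quick_sort(nums: List[int], left: int, right: int) -> None:
    """In-place sort. Average time O(NlogN), worst case O(N^2); no auxiliary array, expected O(logN) recursion stack."""
    if left >= right:
        return
    # Pick a random pivot to make the worst case unlikely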
    pivot_idx = random.randint(left, right)
    nums[left], nums[pivot_idx] = nums[pivot_idx], nums[left]
    pivot = nums[left]

    i, j = left + 1, right
    while True:
        while i <= j and nums[i] < pivot:
            i += 1
        while i <= j and nums[j] > pivot:
            j -= 1
        if i >= j:
            break
        nums[i], nums[j] = nums[j], nums[i]
        i += 1
        j -= 1
    nums[left], nums[j] = nums[j], nums[left]

    quick_sort(nums, left, j - 1)
    quick_sort(nums, j + 1, right)

# Merge sort
def merge_sort(nums: List[int], left: int, right: int) -> None:
    """Stable sort, time O(NlogN), space O(N)"""
    if left >= right:
        return
    mid = left + (right - left) // 2
    merge_sort(nums, left, mid)
    merge_sort(nums, mid + 1, right)
    merge(nums, left, mid, right)

def merge(nums: List[int], left: int, mid: int, right: int) -> None:
    temp = []
    i, j = left, mid + 1
    while i <= mid and j <= right:
        if nums[i] <= nums[j]:
            temp.append(nums[i])
            i += 1
        else:
            temp.append(nums[j])
            j += 1
    while i <= mid:
        temp.append(nums[i])
        i += 1
    while j <= right:
        temp.append(nums[j])
        j += 1
    nums[left:right+1] = temp

# Heap sort
def heap_sort(nums: List[int]) -> None:
    """In-place sort, time O(NlogN), no auxiliary array"""
    n = len(nums)
    # Build a max heap
    for i in range(n // 2 - 1, -1, -1):
        heapify(nums, n, i)
    # Repeatedly move the heap top to the end of the unsorted part
    for i in range(n - 1, 0, -1):
        nums[0], nums[i] = nums[i], nums[0]
        heapify(nums, i, 0)

def heapify(nums: List[int], n: int, i: int) -> None:
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and nums[left] > nums[largest]:
        largest = left
    if right < n and nums[right] > nums[largest]:
        largest = right
    if largest != i:
        nums[i], nums[largest] = nums[largest], nums[i]
        heapify(nums, n, largest)

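Example Problems

148. Sort List

"""
📖 Description: Given the head `head` of a linked list, sort it in ascending order and return the sorted list.
🧪 Example:
    Input: head = [4,2,1,3]
    Output: [1,2,3,4]
💡 Key points:
    1. The problem asks for O(NlogN) time and constant space; the top-down recursion used here meets the time bound but uses O(logN) stack space
    2. Use merge sort: find the midpoint -> sort each half recursively -> merge
"""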
def sortList(head: ListNode) -> ListNode:
    if not head or not head.next:
        return head

    # Find the midpoint (fast/slow pointers)
    slow, fast = head, head.next
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next

    # Split the list in two
    mid = slow.next
    slow.next = None

    # Sort each half recursively
    left = sortList(head)
    right = sortList(mid)

    # Merge the two sorted lists
    dummy = ListNode(0)
    p = dummy
    while left and right:
        if left.val < right.val:
            p.next = left
            left = left.next
        else:
            p.next = right
            right = right.next
        p = p.next
    p.next = left if left else right
    return dummy.next

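剑指 Offer 51. Reverse Pairs in an Array

"""
📖 Description: Two numbers in an array form a reverse pair (inversion) if the earlier number is greater than the later one.
    Given an array, return the total number of reverse pairs in it.
🧪 Example:
    Input: [7,5,6,4]
    Output: 5
    Explanation: the reverse pairs are (7,5), (7,6), (7,4), (5,4), (6,4)
💡 Key points:
    1. Count the reverse pairs during the merge step of merge sort
    2. When a left element > the current right element, every remaining left element forms a reverse pair with that right element
"""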
def reversePairs(nums: List[int]) -> int:
    def merge_sort(nums, left, right):
        if left >= right:
            return 0
        mid = left + (right - left) // 2
        count = merge_sort(nums, left, mid) + merge_sort(nums, mid + 1, right)

        # Merge and count reverse pairs
        temp = []
        i, j = left, mid + 1
        while i <= mid and j <= right:
            if nums[i] <= nums[j]:
                temp.append(nums[i])
                i += 1
            else:
                temp.append(nums[j])
                count += mid - i + 1  # key step: all remaining left elements form reverse pairs with nums[j]
                j += 1
        while i <= mid:
            temp.append(nums[i])
            i += 1
        while j <= right:
            temp.append(nums[j])
            j += 1
        nums[left:right+1] = temp
        return count

    return merge_sort(nums, 0, len(nums) - 1)

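Dynamic Programming

The four elements of dynamic programming:

State: the definition of the recursion
Function (transition): how the recursion decomposes
Initialization: the base case of the recursion
Answer: the top-level call of the recursion

Common types of dynamic programming:

Knapsack: given n items and their sizes, can some of them be chosen to exactly fill a knapsack of size m
- Usually a 2-D state array dp[i][j], meaning whether the first i items can fill size j exactly (or, in optimization variants, the best value of the first i items at capacity j)
Interval: the problem mentions subarray / substring, and larger intervals usually depend on smaller ones
- dp[i][j] is the optimal value / feasibility / number of ways for the range from i to j of the array/string
Matching: the match value of two strings usually depends on the match values of their prefixes
- dp[i][j] is the state (max/min/sum/or) for the first i characters of the first string and the first j characters of the second
Chain: given a chaining rule, find the length of the longest chain
- dp[i] is the length of the longest chain ending at the element with index i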
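
For the knapsack type, the feasibility question ("can some items exactly fill size m?") can be sketched as below. The state meaning used here, `dp[i][j]` = "the first i items can fill size j exactly", is one common choice and is an assumption of this sketch, as is the function name `can_fill`.

```python
from typing import List

def can_fill(sizes: List[int], m: int) -> bool:
    """Return True if some subset of `sizes` sums to exactly m (each item used at most once)."""
    n = len(sizes)
    # dp[i][j]: can the first i items fill size j exactly?
    dp = [[False] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = True  # size 0 is always reachable by choosing nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = dp[i - 1][j]  # skip item i
            if j >= sizes[i - 1] and dp[i - 1][j - sizes[i - 1]]:
                dp[i][j] = True      # take item i
    return dp[n][m]

print(can_fill([3, 4, 8, 5], 9))
print(can_fill([3, 4, 8, 5], 10))
```

The four elements map directly onto the code: the state is `dp[i][j]`, the transition is skip-or-take, the initialization is `dp[i][0] = True`, and the answer is `dp[n][m]`.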

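Algorithm Complexity

Time complexity: O(number of states * cost per state)

Space complexity: O(number of states)

Use Cases

Counting the number of ways (90%)
Finding an optimum (80%)
Deciding feasibility (80%)

Cases where it does not apply:

Listing all concrete solutions (99% accurate)
Unordered input data (60%~70% accurate, knapsack problems excepted)
The brute-force algorithm is already polynomial time (80% accurate)

Code Template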

# 1-D dynamic programming template
def dp_1d(n):
    # 1. Define the state array
    dp = [0] * (n + 1)
    # 2. Initialize
    dp[0] = ...  # base case
    # 3. State transition
    for i in range(1, n + 1):
        dp[i] = ...  # transition equation
    # 4. Return the answer
    return dp[n]

# 2-D dynamic programming template
def dp_2d(m, n):
    # 1. Define the state array
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # 2. Initialize
    for i in range(m + 1):
        dp[i][0] = ...
    for j in range(n + 1):
        dp[0][j] = ...
    # 3. State transition
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = ...  # transition equation
    # 4. Return the answer
    return dp[m][n]

# Knapsack template (0-1 knapsack)
def knapsack(n, W, weights, values):
    # dp[i][j] is the maximum value using the first i items with capacity j
    dp = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, W + 1):
            if j >= weights[i-1]:
                dp[i][j] = max(dp[i-1][j], dp[i-1][j-weights[i-1]] + values[i-1])
            else:
                dp[i][j] = dp[i-1][j]
    return dp[n][W]

# Space-optimized version (1-D array)
def knapsack_optimized(n, W, weights, values):
    dp = [0] * (W + 1)
    for i in range(n):
        for j in range(W, weights[i] - 1, -1):  # iterate in reverse so each item is used at most once
            dp[j] = max(dp[j], dp[j - weights[i]] + values[i])
    return dp[W]

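Example Problems

322. Coin Change

"""
📖 Description: You are given an integer array `coins` of distinct denominations and an integer `amount` representing a total.
    Return the **fewest** number of coins needed to make up that amount. If no combination of coins can make it, return `-1`.
🧪 Example:
    Input: coins = [1, 2, 5], amount = 11
    Output: 3
    Explanation: 11 = 5 + 5 + 1
💡 Key points:
    1. This is an unbounded knapsack problem: each coin can be used any number of times
    2. dp[i] is the fewest coins needed to make amount i
"""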
def coinChange(coins: List[int], amount: int) -> int:
    dp = [float('inf')] * (amount + 1)
    dp[0] = 0
    for i in range(1, amount + 1):
        for coin in coins:
            if i >= coin:
                dp[i] = min(dp[i], dp[i - coin] + 1)
    return dp[amount] if dp[amount] != float('inf') else -1

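300. Longest Increasing Subsequence

"""
📖 Description: Given an integer array `nums`, find the length of its longest strictly increasing subsequence.
🧪 Example:
    Input: nums = [10,9,2,5,3,7,101,18]
    Output: 4
    Explanation: the longest increasing subsequence is [2,3,7,101], with length 4
💡 Key points:
    1. dp[i] is the length of the longest increasing subsequence ending at nums[i]
    2. Time O(N^2); binary search improves it to O(NlogN)
"""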
def lengthOfLIS(nums: List[int]) -> int:
    n = len(nums)
    dp = [1] * n
    for i in range(n):
        for j in range(i):
            if nums[i] > nums[j]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp)

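# Binary-search version, O(NlogN)
def lengthOfLIS(nums: List[int]) -> int:
    tails = []  # tails[i] is the smallest tail of any increasing subsequence of length i + 1
    for num in nums:
        # Binary search for the first position with tails[pos] >= num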
        left, right = 0, len(tails)
        while left < right:
            mid = left + (right - left) // 2
            if tails[mid] < num:
                left = mid + 1
            else:
                right = mid
        if left == len(tails):
            tails.append(num)
        else:
            tails[left] = num
    return len(tails)

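1143. Longest Common Subsequence

"""
📖 Description: Given two strings `text1` and `text2`, return the length of their longest common subsequence.
🧪 Example:
    Input: text1 = "abcde", text2 = "ace"
    Output: 3
    Explanation: the longest common subsequence is "ace"
💡 Key points:
    1. dp[i][j] is the LCS length of the first i characters of text1 and the first j characters of text2
    2. This is matching-type dynamic programming
"""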
def longestCommonSubsequence(text1: str, text2: str) -> int:
    m, n = len(text1), len(text2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text1[i-1] == text2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

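Greedy Algorithms

A greedy algorithm makes the best available choice in the current state at every step, in the hope that this leads to a globally optimal result.

Use Cases

Interval scheduling problems (sort by end time) (90%)
Jump game problems (80%)
Candy distribution / assignment problems (70%)
Stock trading problems (one transaction or many) (80%)
Huffman coding (100%)

Example Problems

55. Jump Game

"""
📖 Description: You are given an array of non-negative integers `nums` and start at the first index. Each element is the maximum jump length from that position.
    Decide whether you can reach the last index.
🧪 Example:
    Input: nums = [2,3,1,1,4]
    Output: true
    Explanation: jump 1 step from index 0 to index 1, then 3 steps from index 1 to the last index.
💡 Key points:
    1. Greedily maintain the farthest reachable position
    2. If the current position is beyond the farthest reachable position, you cannot continue
"""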
def canJump(nums: List[int]) -> bool:
    max_reach = 0
    for i, jump in enumerate(nums):
        if i > max_reach:  # the current position is unreachable
            return False
        max_reach = max(max_reach, i + jump)
        if max_reach >= len(nums) - 1:
            return True
    return True

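435. Non-overlapping Intervals

"""
📖 Description: Given a collection of intervals, find the minimum number of intervals to remove so that the rest do not overlap.
🧪 Example:
    Input: intervals = [[1,2],[2,3],[3,4],[1,3]]
    Output: 1
    Explanation: after removing [1,3], the remaining intervals do not overlap.
💡 Key points:
    1. Sort by end time and greedily keep the interval that ends earliest
    2. Equivalent to finding the maximum number of non-overlapping intervals that can be kept
"""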
def eraseOverlapIntervals(intervals: List[List[int]]) -> int:
    if not intervals:
        return 0
    intervals.sort(key=lambda x: x[1])  # sort by end time
    count = 1  # number of intervals kept
    end = intervals[0][1]
    for i in range(1, len(intervals)):
        if intervals[i][0] >= end:  # no overlap with the last kept interval
            count += 1
            end = intervals[i][1]
    return len(intervals) - count

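455. Assign Cookies

"""
📖 Description: Suppose you are a great parent and want to give your children some cookies, with at most one cookie per child.
    Each child `i` has a greed factor `g[i]`, and each cookie `j` has a size `s[j]`.
    Cookie `j` can be given to child `i` only when `s[j] >= g[i]`.
    The goal is to satisfy as many children as possible.
🧪 Example:
    Input: g = [1,2,3], s = [1,1]
    Output: 1
    Explanation: there are three children and two cookies, and the greed factors are 1, 2, 3. Although there are two cookies, only the child with greed factor 1 can be satisfied.
💡 Key points:
    1. Sort, then match greedily with two pointers
    2. Small cookies go to children with small appetites first
"""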
def findContentChildren(g: List[int], s: List[int]) -> int:
    g.sort()
    s.sort()
    i, j = 0, 0
    while i < len(g) and j < len(s):
        if s[j] >= g[i]:
            i += 1  # one more child is satisfied
        j += 1  # move on to the next cookie
    return i

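Breadth-First Search (BFS)

Complexity

Time complexity: $O(n + m)$, where n is the number of vertices and m is the number of edges

Space complexity: $O(n)$

Use cases

Topological sort (100%)
Problems that mention connected components (100%)
Level-order traversal (100%)
Shortest path in a simple graph with unweighted edges (100%)
Given a transformation rule, the minimum number of steps from an initial state to a target state (100%)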
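As an illustration of the topological sort use case, here is a minimal sketch of the BFS-based approach (commonly known as Kahn's algorithm). The function name `topo_sort` and the adjacency-dict input format are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List

def topo_sort(graph: Dict[str, List[str]]) -> List[str]:
    """Return a topological order of `graph`, or raise ValueError on a cycle."""
    indegree = {node: 0 for node in graph}
    for node in graph:
        for nxt in graph[node]:
            indegree[nxt] = indegree.get(nxt, 0) + 1

    # Start from every node that has no prerequisites.
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:  # all prerequisites already emitted
                queue.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("graph has a cycle")
    return order

print(topo_sort({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}))
# ['a', 'b', 'c', 'd']
```

Every vertex and edge is processed once, which matches the $O(n + m)$ bound stated above.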

Code templates

from collections import deque

# Basic BFS template (get_neighbors is a placeholder that the caller must supply)
def bfs(start, target):
    queue = deque([start])
    visited = set([start])
    step = 0

    while queue:
        size = len(queue)
        for _ in range(size):  # process one level at a time
            cur = queue.popleft()
            if cur == target:
                return step
            for next_node in get_neighbors(cur):
                if next_node not in visited:
                    visited.add(next_node)
                    queue.append(next_node)
        step += 1
    return -1

# Grid BFS template ('#' marks a blocked cell)
def bfs_grid(grid, start, end):
    m, n = len(grid), len(grid[0])
    queue = deque([(start[0], start[1])])
    visited = set([(start[0], start[1])])
    directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    step = 0

    while queue:
        size = len(queue)
        for _ in range(size):
            x, y = queue.popleft()
            if (x, y) == end:
                return step
            for dx, dy in directions:
                nx, ny = x + dx, y + dy
                if 0 <= nx < m and 0 <= ny < n and (nx, ny) not in visited and grid[nx][ny] != '#':
                    visited.add((nx, ny))
                    queue.append((nx, ny))
        step += 1
    return -1
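The basic template leaves `get_neighbors` undefined. A minimal runnable instance, assuming a small hand-written adjacency dict (`GRAPH` is hypothetical sample data), could look like this:

```python
from collections import deque

# Hypothetical sample graph: directed edges stored as adjacency lists.
GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def get_neighbors(node):
    return GRAPH.get(node, [])

def bfs(start, target):
    """Return the number of edges on a shortest path, or -1 if unreachable."""
    queue = deque([start])
    visited = {start}
    step = 0
    while queue:
        for _ in range(len(queue)):  # one full level per outer iteration
            cur = queue.popleft()
            if cur == target:
                return step
            for nxt in get_neighbors(cur):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
        step += 1
    return -1

print(bfs("A", "E"))  # 3, e.g. A -> B -> D -> E
print(bfs("E", "A"))  # -1, E has no outgoing edges
```

Marking a node as visited when it is enqueued, rather than when it is dequeued, keeps each node in the queue at most once.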

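Examples

102. Binary Tree Level Order Traversal

"""
📖 Description: Given the root node `root` of a binary tree, return the level-order traversal of its node values (level by level, from left to right).
🧪 Example:
    Input: root = [3,9,20,null,null,15,7]
    Output: [[3],[9,20],[15,7]]
💡 Key points:
    1. Use a queue for BFS
    2. Record the number of nodes in each level
"""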
def levelOrder(root: TreeNode) -> List[List[int]]:
    if not root:
        return []
    from collections import deque
    queue = deque([root])
    result = []
    while queue:
        level = []
        size = len(queue)
        for _ in range(size):
            node = queue.popleft()
            level.append(node.val)
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
        result.append(level)
    return result

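127. Word Ladder

"""
📖 Description: Given two words `beginWord` and `endWord` and a dictionary `wordList`,
    find the length of the shortest transformation sequence from `beginWord` to `endWord`.
    Transformation rule: only one letter may change per step, and every resulting word must be in `wordList`.
🧪 Example:
    Input: beginWord = "hit", endWord = "cog", wordList = ["hot","dot","dog","lot","log","cog"]
    Output: 5
    Explanation: "hit" -> "hot" -> "dot" -> "dog" -> "cog"
💡 Key points:
    1. BFS finds the shortest path
    2. Bidirectional BFS can improve efficiency
"""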
def ladderLength(beginWord: str, endWord: str, wordList: List[str]) -> int:
    word_set = set(wordList)
    if endWord not in word_set:  # O(1) set lookup instead of scanning the list
        return 0

    from collections import deque
    queue = deque([(beginWord, 1)])
    visited = set([beginWord])

    while queue:
        word, level = queue.popleft()
        if word == endWord:
            return level
        # try every single-letter change
        for i in range(len(word)):
            for c in 'abcdefghijklmnopqrstuvwxyz':
                new_word = word[:i] + c + word[i+1:]
                if new_word in word_set and new_word not in visited:
                    visited.add(new_word)
                    queue.append((new_word, level + 1))
    return 0
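The bidirectional BFS mentioned in the key points can be sketched as follows. This is one possible formulation, not the only one; the function name `ladder_length_bidirectional` is an assumption for this example. It grows a frontier from each end and always expands the smaller one.

```python
import string
from typing import List

def ladder_length_bidirectional(begin_word: str, end_word: str,
                                word_list: List[str]) -> int:
    word_set = set(word_list)
    if end_word not in word_set:
        return 0
    front, back = {begin_word}, {end_word}
    visited = {begin_word, end_word}
    length = 1  # number of words on the path built so far
    while front and back:
        if len(front) > len(back):  # always expand the smaller frontier
            front, back = back, front
        next_front = set()
        for word in front:
            for i in range(len(word)):
                for c in string.ascii_lowercase:
                    cand = word[:i] + c + word[i + 1:]
                    if cand in back:  # the two searches have met
                        return length + 1
                    if cand in word_set and cand not in visited:
                        visited.add(cand)
                        next_front.add(cand)
        front = next_front
        length += 1
    return 0

words = ["hot", "dot", "dog", "lot", "log", "cog"]
print(ladder_length_bidirectional("hit", "cog", words))  # 5
```

Expanding from both ends tends to keep the frontiers smaller than a single search that has to reach the full depth on its own.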

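200. Number of Islands

"""
📖 Description: Given a 2D grid made of '1' (land) and '0' (water), count the number of islands in the grid.
🧪 Example:
    Input: grid = [
      ["1","1","1","1","0"],
      ["1","1","0","1","0"],
      ["1","1","0","0","0"],
      ["0","0","0","0","0"]
    ]
    Output: 1
💡 Key points:
    1. Scan the grid; on each '1', run BFS/DFS to mark the whole island
    2. Set visited cells to '0' to avoid visiting them again
"""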
def numIslands(grid: List[List[str]]) -> int:
    if not grid:
        return 0
    m, n = len(grid), len(grid[0])
    count = 0

    def bfs(i, j):
        from collections import deque
        queue = deque([(i, j)])
        grid[i][j] = '0'
        directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
        while queue:
            x, y = queue.popleft()
            for dx, dy in directions:
                nx, ny = x + dx, y + dy
                if 0 <= nx < m and 0 <= ny < n and grid[nx][ny] == '1':
                    grid[nx][ny] = '0'
                    queue.append((nx, ny))

    for i in range(m):
        for j in range(n):
            if grid[i][j] == '1':
                bfs(i, j)
                count += 1
    return count

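Depth-First Search (DFS)

DFS is implemented with recursion or an explicit stack. It follows one path to the end before backtracking, and is commonly used to enumerate all solutions.

Complexity

Time complexity: O(number of solutions * time to construct each solution)

Tree traversal: $O(N)$
Permutation problems: $O(N! * N)$
Combination problems: $O(2^N * N)$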
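The solution counts behind these bounds can be checked for a small N with the standard library. The sketch below uses `itertools` only to count; N = 4 is an arbitrary choice for illustration.

```python
from itertools import combinations, permutations
from math import factorial

n = 4
items = list(range(n))

# Number of full permutations of n distinct items: n!
num_perms = sum(1 for _ in permutations(items))

# Number of subsets (combinations of every size 0..n): 2^n
num_subsets = sum(1 for k in range(n + 1) for _ in combinations(items, k))

print(num_perms, factorial(n))  # 24 24
print(num_subsets, 2 ** n)      # 16 16
```

The extra factor of N in both bounds accounts for copying each solution of length up to N into the result list.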

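Use cases

Finding all solutions that satisfy a condition (99%)
Binary tree problems (90%)
Combination problems (95%)
- Problem model: find all "combinations" that satisfy a condition
- Test: the order of elements within a combination does not matter
Permutation problems (95%)
- Problem model: find all "permutations" that satisfy a condition
- Test: the order of elements within a permutation matters

When not to use DFS:

Connected component problems (prefer BFS; recursive DFS can overflow the call stack on large inputs)
Topological sort (prefer BFS; recursive DFS can overflow the call stack on large inputs)
Any problem that BFS can solve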
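The stack overflow concern applies to recursive DFS specifically. When DFS order is still wanted, an explicit stack avoids deep recursion. The sketch below counts connected components this way; the function name `count_components` and the 100,000-node chain are assumptions chosen to simulate a very deep graph.

```python
from typing import Dict, List

def count_components(adj: Dict[int, List[int]]) -> int:
    """Count connected components with an iterative (explicit stack) DFS."""
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return components

# A single chain 0 - 1 - 2 - ... - 99999: one component, very deep.
n = 100_000
chain = {i: [] for i in range(n)}
for i in range(n - 1):
    chain[i].append(i + 1)
    chain[i + 1].append(i)

print(count_components(chain))  # 1
```

A recursive traversal of the same chain would need about 100,000 nested calls, far beyond CPython's default recursion limit of 1000.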

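Code templates

# Basic DFS template (backtracking). Pseudocode: every name below is a placeholder.
result = []
def dfs(params):
    if exit_condition:  # recursion exit
        result.append(current_solution)
        return
    for choice in all_possible_choices:
        make_choice(choice)
        dfs(params)
        undo_choice(choice)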

# Combination template
def combine(n: int, k: int) -> List[List[int]]:
    result = []
    def backtrack(start, path):
        if len(path) == k:
            result.append(path[:])
            return
        for i in range(start, n + 1):
            path.append(i)
            backtrack(i + 1, path)
            path.pop()
    backtrack(1, [])
    return result

# Permutation template
def permute(nums: List[int]) -> List[List[int]]:
    result = []
    def backtrack(path, used):
        if len(path) == len(nums):
            result.append(path[:])
            return
        for i in range(len(nums)):
            if used[i]:
                continue
            used[i] = True
            path.append(nums[i])
            backtrack(path, used)
            path.pop()
            used[i] = False
    backtrack([], [False] * len(nums))
    return result

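Examples

46. Permutations

"""
📖 Description: Given an array `nums` of distinct integers, return all possible permutations.
🧪 Example:
    Input: nums = [1,2,3]
    Output: [[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]]
💡 Key points:
    1. Permutation problem: use a `used` array to track which elements are already in the path
    2. Remember to undo the choice when backtracking
"""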
def permute(nums: List[int]) -> List[List[int]]:
    result = []

    def backtrack(path, used):
        if len(path) == len(nums):
            result.append(path[:])
            return
        for i in range(len(nums)):
            if used[i]:
                continue
            used[i] = True
            path.append(nums[i])
            backtrack(path, used)
            path.pop()
            used[i] = False

    backtrack([], [False] * len(nums))
    return result

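78. Subsets

"""
📖 Description: Given an integer array `nums` of unique elements, return all possible subsets (the power set).
🧪 Example:
    Input: nums = [1,2,3]
    Output: [[],[1],[2],[1,2],[3],[1,3],[2,3],[1,2,3]]
💡 Key points:
    1. Combination problem: each element is either chosen or not
    2. Every node of the search tree is part of the answer
"""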
def subsets(nums: List[int]) -> List[List[int]]:
    result = []

    def backtrack(start, path):
        result.append(path[:])  # every node of the search tree is a subset
        for i in range(start, len(nums)):
            path.append(nums[i])
            backtrack(i + 1, path)
            path.pop()

    backtrack(0, [])
    return result
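The "chosen or not" view in the key points can also be coded directly, as an alternative to the start-index loop above. In this sketch (the name `subsets_pick_or_skip` is an assumption for illustration) each recursion level decides the fate of exactly one element, so answers are collected only at the leaves.

```python
from typing import List

def subsets_pick_or_skip(nums: List[int]) -> List[List[int]]:
    result: List[List[int]] = []
    path: List[int] = []

    def dfs(i: int) -> None:
        if i == len(nums):  # a decision has been made for every element
            result.append(path[:])
            return
        dfs(i + 1)            # branch 1: skip nums[i]
        path.append(nums[i])  # branch 2: pick nums[i]
        dfs(i + 1)
        path.pop()            # undo the choice

    dfs(0)
    return result

print(len(subsets_pick_or_skip([1, 2, 3])))  # 8
```

Both formulations produce the same 2^N subsets; only the order in which they are emitted differs.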

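22. Generate Parentheses

"""
📖 Description: Given `n` pairs of parentheses, write a function that generates all combinations of well-formed parentheses.
🧪 Example:
    Input: n = 3
    Output: ["((()))","(()())","(())()","()(())","()()()"]
💡 Key points:
    1. A '(' may be added only while the count of '(' is less than n; a ')' may be added only while the count of ')' is less than the count of '('
    2. Pruning: branches that violate these conditions are never explored
"""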
def generateParenthesis(n: int) -> List[str]:
    result = []

    def backtrack(s, left, right):
        if len(s) == 2 * n:
            result.append(s)
            return
        if left < n:
            backtrack(s + '(', left + 1, right)
        if right < left:
            backtrack(s + ')', left, right + 1)

    backtrack('', 0, 0)
    return result
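One way to sanity-check this kind of generator is to compare its output size with the Catalan numbers, which count well-formed strings of n pairs of parentheses. The sketch below restates the backtracking function under the assumed name `generate_parenthesis` so that the block runs on its own; `catalan` is a helper added for the check.

```python
from math import comb
from typing import List

def generate_parenthesis(n: int) -> List[str]:
    result: List[str] = []

    def backtrack(s: str, left: int, right: int) -> None:
        if len(s) == 2 * n:
            result.append(s)
            return
        if left < n:       # still allowed to open a parenthesis
            backtrack(s + "(", left + 1, right)
        if right < left:   # only close what has been opened
            backtrack(s + ")", left, right + 1)

    backtrack("", 0, 0)
    return result

def catalan(n: int) -> int:
    """n-th Catalan number: C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

for n in range(1, 7):
    assert len(generate_parenthesis(n)) == catalan(n)

print([catalan(n) for n in range(1, 7)])  # [1, 2, 5, 14, 42, 132]
```

Because the two pruning conditions never let an invalid prefix form, every leaf reached is a valid answer and no post-filtering is needed.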

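79. Word Search

"""
📖 Description: Given an `m x n` grid of characters `board` and a string `word`,
    return `true` if `word` exists in the grid; otherwise return `false`.
    The word must be built from horizontally or vertically adjacent cells, and a cell may be used at most once.
🧪 Example:
    Input: board = [["A","B","C","E"],["S","F","C","S"],["A","D","E","E"]], word = "ABCCED"
    Output: true
💡 Key points:
    1. DFS + backtracking; remember to mark visited cells
    2. Return true as soon as one path is found
"""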
def exist(board: List[List[str]], word: str) -> bool:
    m, n = len(board), len(board[0])

    def dfs(i, j, k):
        if k == len(word):
            return True
        if i < 0 or i >= m or j < 0 or j >= n or board[i][j] != word[k]:
            return False

        temp = board[i][j]
        board[i][j] = '#'  # mark as visited
        found = (dfs(i+1, j, k+1) or dfs(i-1, j, k+1) or
                 dfs(i, j+1, k+1) or dfs(i, j-1, k+1))
        board[i][j] = temp  # restore the cell when backtracking
        return found

    for i in range(m):
        for j in range(n):
            if dfs(i, j, 0):
                return True
    return False

References

Hello, Notion!

2026-04-23T00:00:00+08:00

Text and Typography

Headings

H1 - heading

H2 - heading

H3 - heading

H4 - heading

Paragraph

Quisque egestas convallis ipsum, ut sollicitudin risus tincidunt a. Maecenas interdum malesuada egestas. Duis consectetur porta risus, sit amet vulputate urna facilisis ac. Phasellus semper dui non purus ultrices sodales. Aliquam ante lorem, ornare a feugiat ac, finibus nec mauris. Vivamus ut tristique nisi. Sed vel leo vulputate, efficitur risus non, posuere mi. Nullam tincidunt bibendum rutrum. Proin commodo ornare sapien. Vivamus interdum diam sed sapien blandit, sit amet aliquam risus mattis. Nullam arcu turpis, mollis quis laoreet at, placerat id nibh. Suspendisse venenatis eros eros.

Lists

Ordered list

Firstly
Secondly
Thirdly

Unordered list

Chapter
- Section
  - Paragraph

ToDo list

Job
- Step 1
- Step 2
- Step 3

Description list

Sun: the star around which the earth orbits
Moon: the natural satellite of the earth, visible by reflected light from the sun

Block Quote

This line shows the block quote.

Prompts

An example showing the tip type prompt.

An example showing the info type prompt.

An example showing the warn type prompt.

An example showing the danger type prompt.

Tables

Company	Contact	Country
Alfreds Futterkiste	Maria Anders	Germany
Island Trading	Helen Bennett	UK
Magazzini Alimentari Riuniti	Giovanni Rovelli	Italy

Links

http://127.0.0.1:4000

Footnote

Clicking the hook will locate the footnote¹, and here is another footnote².

Inline code

This is an example of Inline Code.

Filepath

Here is the /path/to/the/filename.

Code blocks

Code blocks support syntax highlighting, one-click copying, collapsing-expanding, and dark/light theme switching.

Common

A plain code block with no language specified:

This is a common code snippet, without syntax highlight and line number.

Specific Language

Specify the language after the opening backticks (e.g. ```bash) to enable syntax highlighting and line numbers.

if [ $? -ne 0 ]; then
  echo "The command was not successful.";
  #do the needful / exit
fi;

Specific filename

Add {: file='filename'} right after the closing backticks to display a filename label on the code block.

@import
  "colors/light-typography",
  "colors/dark-typography";

Jupyter notebook

Embed a Jupyter notebook directly into the post using the jupyter_notebook tag.

Mathematics

The mathematics is powered by MathJax:

\[\begin{equation} \sum_{n=1}^\infty 1/n^2 = \frac{\pi^2}{6} \label{eq:series} \end{equation}\]

We can reference the equation as \eqref{eq:series}.

When $a \ne 0$, there are two solutions to $ax^2 + bx + c = 0$ and they are

\[x = {-b \pm \sqrt{b^2-4ac} \over 2a}\]

Mermaid SVG

 gantt
  title  Adding GANTT diagram functionality to mermaid
  apple :a, 2017-07-20, 1w
  banana :crit, b, 2017-07-23, 1d
  cherry :active, c, after b a, 1d

Mindmap

View markmap.js.org for configuration details.

Images

Default (with caption)

Full screen width and center alignment

Left aligned

Float to left

Praesent maximus aliquam sapien. Sed vel neque in dolor pulvinar auctor. Maecenas pharetra, sem sit amet interdum posuere, tellus lacus eleifend magna, ac lobortis felis ipsum id sapien. Proin ornare rutrum metus, ac convallis diam volutpat sit amet. Phasellus volutpat, elit sit amet tincidunt mollis, felis mi scelerisque mauris, ut facilisis leo magna accumsan sapien. In rutrum vehicula nisl eget tempor. Nullam maximus ullamcorper libero non maximus. Integer ultricies velit id convallis varius. Praesent eu nisl eu urna finibus ultrices id nec ex. Mauris ac mattis quam. Fusce aliquam est nec sapien bibendum, vitae malesuada ligula condimentum.

Float to right

Dark/Light mode & Shadow

The image below will toggle dark/light mode based on theme preference, notice it has shadows.

Video

Reverse Footnote

The footnote source ↩︎
The 2nd footnote source ↩︎
