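Some Lessons from Extending LLaMA's Context Length

Sequence length is the maximum length of text an LLM can process. Longer is naturally better:

Stronger memory. More turns of conversation history can be concatenated into the prompt, which reduces forgetting.
A better experience in long-text scenarios, such as document question answering and novel continuation.

LLaMA, currently the most popular open-source LLM, has a context length of 2048 in its first version and 4096 in its second. By comparison, ChatGPT and GPT4 already support 16k, and Claude even supports 100k, which shows how far LLaMA still has to go. This post introduces three cost-effective ways to extend the context on top of rotary position embedding (RoPE), and closes with my own practical experience.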

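Linear Interpolation

Kaiokendev's blog post [1] describes this method, which coincides with a paper from Meta [2]. The main idea is to compress the target length into the original length. As the figure below shows, LLaMA-1 was pretrained with a length of 2048; suppose we want to extend it to 4096:

Method 1: stretch directly to 4096 at inference time. This tests the extrapolation ability of the positional encoding (that is, the ability to train on short text and run inference on long text [2]), and RoPE's extrapolation is fairly mediocre [2]. Since all training sequences were no longer than 2048, attention scores for positions beyond 2048 blow up, and perplexity rises sharply.
Method 2: continue training the original model at length 4096. A brief detour to another model, MPT-30B, whose official blog [3] says: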

As mentioned earlier, MPT-30B was trained with a long context window of 8k tokens (vs. 2k for LLaMa and Falcon) and can handle arbitrarily long context windows via ALiBi or with fine-tuning. To build 8k support into MPT-30B efficiently, we first pre-trained on 1T tokens using sequences that were 2k tokens long, and continued training for an additional 50B tokens using sequences that were 8k tokens long.

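MPT-30B uses ALiBi positional encoding, which extrapolates better than RoPE. It was trained on 1T tokens at a length of 2k and then on a further 50B tokens at a length of 8k, and that is with the more extrapolation-friendly ALiBi. LLaMA-1 was pretrained on more than 1T tokens, so keeping quality from degrading on length-4096 samples would require training on a large number of tokens, which is costly.
Method 3: the alternative is to compress the 4096 positions into the 2048 range by linear interpolation. Continued pretraining on only a small amount of length-4096 data is then enough to reach good results.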

Figure from the paper [2]
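Method 3 can be written down in a few lines. This is only a sketch in standard RoPE notation; the symbols $m$ (position index), $\theta_i$ (frequency of the $i$-th dimension pair), $b$ (base), $d$ (head dimension), $L$ (original length) and $L'$ (target length) are my own labels, not taken from the figure.

```latex
% Standard RoPE: dimension pair i at position m is rotated by the angle
\phi(m, i) = m\,\theta_i, \qquad \theta_i = b^{-2i/d}, \quad i = 0, \dots, d/2 - 1 .

% Linear interpolation rescales the position index before rotating:
\phi'(m, i) = \frac{L}{L'}\, m\,\theta_i, \qquad 0 \le m < L' .

% Example with L = 2048 and L' = 4096: the last position m = 4095 is mapped to
\frac{2048}{4096} \cdot 4095 = 2047.5 < 2048 ,
% so every rotation angle stays inside the range seen during pretraining.
```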

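Implementation

Implementing linear interpolation is quite simple. It needs only a tiny change to the original RoPE, namely adding the scale parameter shown in the figure below.

From [7], scaled_rope/LlamaLinearScaledRotaryEmbedding.py

Results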
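As a rough illustration of that change, here is a minimal pure-Python sketch of RoPE angles with a scale factor. It is not the code from [7]; the function name `rope_angles` and its parameters are hypothetical, and dividing the position index by `scale` is my reading of the linear interpolation idea described above.

```python
def rope_angles(seq_len, dim, base=10000.0, scale=1.0):
    """Return angles[pos][i] = (pos / scale) * base ** (-2 * i / dim)."""
    # One inverse frequency per pair of hidden dimensions, as in standard RoPE.
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    # Linear interpolation: divide every position index by `scale`.
    return [[(pos / scale) * f for f in inv_freq] for pos in range(seq_len)]


# Hypothetical toy setup: head dimension 8, pretraining length 2048.
original = rope_angles(2048, dim=8)               # positions 0..2047
stretched = rope_angles(4096, dim=8, scale=2.0)   # positions 0..4095, squeezed

# Every even stretched position lands exactly on an original position,
# and the largest effective position is 4095 / 2 = 2047.5.
same_angles = stretched[4094] == original[2047]
largest_position = max(row[0] for row in stretched)
print(same_angles, largest_position)
```

With `scale=1.0` the function reduces to plain RoPE, so the only new piece is the division of the position index.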

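Meta's paper [2] provides thorough experiments and a mathematical derivation. For concrete code, I recommend a post from lmsys.org (the team behind Vicuna) [4]: How Long Can Open-Source LLMs Truly Promise on Context Length? They compare commercial models, open-source models that claim long-context support, and "Vicuna + linear interpolation", and draw several conclusions:

Commercial models hold up very well on long text, while the open-source models that claim long-context support perform poorly on it.
As the text gets longer and approaches the length limit, the quality of "Vicuna + linear interpolation" drops more noticeably. This may be because the training data contains short texts.

From [4]

While I was writing this post, the ChatGLM team released ChatGLM2-6B-32K, which also uses interpolation. They also released LongBench, a Chinese and English long-text benchmark, on which ChatGLM2-6B-32K performs strongly. Note, however, that the benchmark's evaluation is itself carried out with ChatGLM2-6B.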

NTK Interpolation

NTK interpolation was proposed in a Reddit post [5], which applies Neural Tangent Kernel (NTK) theory to this problem.

if you apply Neural Tangent Kernel (NTK) theory to this problem, it becomes clear that simply interpolating the RoPE's fourier space "linearly" is very sub-optimal, as it prevents the network to distinguish the order and positions of tokens that are very close by. Borrowing from NTK literature, scaling down the fourier features too much will eventually even prevent succesful finetunes (this is corroborated by the recent paper by Meta that suggests an upper bound of ~600x)

Instead of the simple linear interpolation scheme, I've tried to design a nonlinear interpolation scheme using tools from NTK literature. Basically this interpolation scheme changes the base of the RoPE instead of the scale, which intuitively changes the "spinning" speed which each of the RoPE's dimension vectors compared to the next. Because it does not scale the fourier features directly, all the positions are perfectly distinguishable from eachother, even when taken to the extreme (eg. streched 1million times, which is effectively a context size of 2 Billion)

In the post, the author also uses a clock analogy to explain how linear interpolation and NTK interpolation differ:

RoPE behaves like a clock. Your 12 hours wall clock is basically a RoPE of dimension 3 with a base of 60. So for each second, the minute hand turns 1/60th of a minute, and for each minute, the hour hand turns 1/60th.

Now if you slowed down time by a factor of 4x, that is a linear RoPE scaling used in SuperHOT. Unfortunately now it is really really hard to distinguish each second, because now the seconds hand barely moves each second. So if someone gave you two different times, which is only different by a single second, you won't be able to distinguish them from afar (let's say the NNs have myopia because that's basically what NTK predicts)

Now NTK-Aware RoPE scaling does not slow down the seconds. One second is still one second, but it slows down minutes by a factor of let's say 1.5, and the hours by a factor of 2. This way you can fit 90 minutes in a hour, and fit 24 hours in half a day. So now you basically have a clock that can measure 129.6k seconds instead of 43.2k seconds.

Because you don't need a precise measurement of the hour hand when looking at the time, scaling the hours more compared to seconds is crucial. You don't want to lose the precision of the seconds hand, but you can afford to lose precision on the minutes hand and even more on the hours hand.

Then, it's just a matter of deriving the base change formula in order to obtain such a scaling. (where less precise dimensions are scaled more and more)
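The numbers in the clock analogy can be checked with a few lines of arithmetic. This sketch only reproduces the figures quoted above; the variable names are mine.

```python
# Plain 12-hour clock: 60 seconds per minute, 60 minutes per hour, 12 hours.
plain_capacity = 60 * 60 * 12

# NTK-style clock from the quote: seconds are untouched, minutes are slowed
# by 1.5x (90 minutes fit in an "hour"), hours by 2x (24 hours fit in half a day).
seconds_per_minute = 60
minutes_per_hour = int(60 * 1.5)   # 90
hours_per_dial = 12 * 2            # 24
ntk_capacity = seconds_per_minute * minutes_per_hour * hours_per_dial

# Linear 4x scaling also enlarges the range, but the seconds hand now moves
# only a quarter of a tick per second, which is the precision loss described.
linear_capacity = 4 * plain_capacity
linear_seconds_hand_step = (1 / 60) / 4   # fraction of a full turn per second

print(plain_capacity, ntk_capacity, linear_capacity)
```

The NTK-style clock triples the measurable range (43.2k to 129.6k seconds) while the seconds hand keeps its full 1/60 turn per second.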

Implementation

The NTK implementation is even simpler: just modify the base variable according to the hyperparameter alpha:

From [7], scaled-rope/scaled_rope/LlamaNTKScaledRotaryEmbedding.py

Results
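A minimal sketch of the base change, assuming the commonly used formula base * alpha ** (dim / (dim - 2)). The helper names `ntk_scaled_base` and `inv_freq` are hypothetical, and the head dimension 128 and alpha 4 are illustrative values only.

```python
def ntk_scaled_base(base, dim, alpha):
    """Enlarge the RoPE base according to the hyperparameter alpha."""
    return base * alpha ** (dim / (dim - 2))


def inv_freq(dim, base):
    """Inverse frequencies of standard RoPE, one per pair of dimensions."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


dim = 128                                   # assumed head dimension
plain = inv_freq(dim, 10000.0)
ntk = inv_freq(dim, ntk_scaled_base(10000.0, dim, alpha=4.0))

# How much each dimension pair is slowed down by the base change.
slowdown = [p / n for p, n in zip(plain, ntk)]
print(slowdown[0], slowdown[-1])
```

Under this formula the fastest-rotating pair is left untouched (slowdown 1) and the slowest pair is slowed by alpha, with a smooth progression in between, which matches the "scale the hours more than the seconds" intuition from the clock analogy.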

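On results, the post also compares the PPL of NTK interpolation and linear interpolation. With no finetuning for either, NTK interpolation achieves lower PPL.

From [5]

Dynamic Interpolation

Dynamic interpolation also comes from a Reddit post [6]. Its starting point is simple: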

My idea was to use the exact position values for the first 2k context (after all, why mess with a good thing?) and then re-calculate the position vector for every new sequence length as the model generates token by token.

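This approach can be combined with either of the previous two methods, and [7] provides a detailed implementation.

Results

The author compares it against the previous two methods and shows that dynamic interpolation has an advantage in lowering PPL.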
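To make the idea concrete, here is a hedged sketch of how the scaling could depend on the current sequence length. It is not the code from [7]: the function names are hypothetical, and the dynamic NTK formula is an assumption modeled on common implementations.

```python
def dynamic_linear_scale(seq_len, max_position_embeddings=2048):
    """Linear variant: no scaling inside the original window, then grow with length."""
    return max(1.0, seq_len / max_position_embeddings)


def dynamic_ntk_base(seq_len, dim, base=10000.0,
                     max_position_embeddings=2048, alpha=1.0):
    """NTK variant: keep the original base until the window is exceeded."""
    if seq_len <= max_position_embeddings:
        return base
    # Assumed formula: the factor is 1 at the window boundary and grows with seq_len.
    factor = alpha * seq_len / max_position_embeddings - (alpha - 1.0)
    return base * factor ** (dim / (dim - 2))


# The scale (or base) is recomputed for every new sequence length.
for seq_len in (1024, 2048, 3072, 4096):
    print(seq_len,
          dynamic_linear_scale(seq_len),
          round(dynamic_ntk_base(seq_len, dim=128)))
```

Within the first 2048 tokens both functions return the unmodified values, so the model sees exactly the positions it was trained on.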

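Practical Experience

In my own experiments I mainly evaluated with the method used in LongChat [4]. Here are some takeaway tips; discussion is welcome.

Linear interpolation has complete theoretical support and extensive experimental evidence. In my experiments, "linear interpolation + finetune" gave the best results.
The NTK interpolation experiments compare the no-finetune setting. In my experiments, "NTK interpolation + finetune" is clearly better than NTK interpolation alone, but it converges more slowly than "linear interpolation + finetune".
The dynamic interpolation experiments are likewise compared without finetuning, and so far I have not tried this method. Someone in the Reddit comments raised a very good question: with this approach the text length changes during token-by-token inference, so the kv-cache cannot be used, which has a large impact on performance.

Finally, extending LLaMA does not have to start from RoPE (for example, LongLLaMA [8]), but "linear interpolation + finetune" is undoubtedly a very cost-effective approach, and I recommend trying it!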

2023.07.31