Contents
- Introduction
- Method
- Experiments
- References
Introduction
- The authors propose PrefixQuant, built on QuaRot: by keeping outlier tokens lossless during weight-activation quantization and adding EfficientQAT fine-tuning, it achieves strong results under W4A4 static quantization. However, like CushionCache, while PrefixQuant keeps all outlier tokens lossless, it never discusses how adding the prefix itself affects model accuracy.
Method
- The authors observe that, under static quantization, outlier tokens have activation distributions sharply different from other tokens. Without special handling, the calibrated quantization parameters hurt the precision of non-outlier tokens: outlier tokens carry massive outliers in the down_proj input, while their KV cache is unusually flat. Keeping the outlier tokens lossless and calibrating only on the remaining tokens yields a smaller quantization range and better quantization precision.
- Definition of Outlier Token. Outlier tokens are located via the input activations of down_proj: a token is flagged as an outlier when its peak activation magnitude exceeds η times the median of the per-token peak magnitudes, with η = 64.
- Number of Outlier Tokens. The number of outlier tokens per model is counted on the calibration set as o = ⌈max(O)⌉, where O ∈ ℝ^b collects the outlier-token counts from all b transformer blocks.
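The two statistics above can be sketched in pure Python; the thresholding rule (peak magnitude vs. η times the median of peaks) is a paraphrase of the paper's definition, and the helper names are illustrative, not the authors' code:

```python
import math
from statistics import median

def find_outlier_tokens(activations, eta=64.0):
    """Flag outlier tokens in one block's down_proj input.

    activations: list of per-token activation vectors. A token counts
    as an outlier when its peak magnitude exceeds eta times the median
    of all tokens' peak magnitudes.
    """
    peaks = [max(abs(v) for v in token) for token in activations]
    threshold = eta * median(peaks)
    return [i for i, p in enumerate(peaks) if p > threshold]

def num_prefixed_tokens(per_block_counts):
    """o = ceil(max(O)): the largest outlier-token count observed
    across all transformer blocks on the calibration set."""
    return math.ceil(max(per_block_counts))

# One massive-outlier token among otherwise flat activations.
acts = [[1.0] * 4 for _ in range(8)]
acts[3] = [1000.0] * 4
print(find_outlier_tokens(acts))         # [3]
print(num_prefixed_tokens([1, 2, 1.0]))  # 2
```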
- Which Tokens to Prefix? The top-o high-frequency outlier tokens plus [BOS]
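A minimal sketch of this selection, assuming a list of token ids flagged as outliers during calibration; `bos_id=1` is a hypothetical default (Llama's BOS id), not something fixed by the paper:

```python
from collections import Counter

def choose_prefix_tokens(outlier_token_ids, o, bos_id=1):
    """Pick the o most frequent outlier token ids observed during
    calibration and prepend [BOS] to form the fixed prefix."""
    top = [tok for tok, _ in Counter(outlier_token_ids).most_common(o)]
    return [bos_id] + top

# E.g. token 13 flagged three times as an outlier, token 198 twice.
print(choose_prefix_tokens([13, 13, 198, 13, 198, 30], o=2))  # [1, 13, 198]
```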
- Block-wise Fine-tuning. EfficientQAT is used to fine-tune the quantization scales and weights
Experiments
- Settings. Weights use per-channel symmetric quantization; the KV cache uses per-head symmetric static quantization at 4-bit and per-tensor symmetric static quantization at 8-bit; activations use per-tensor static quantization. The calibration set is 8 Pile samples with a 1024 sequence length, and the initial scale values are found via grid search; the fine-tuning set is 512 Pile samples with a 1024 context length
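The grid search for the initial scale can be sketched as follows for per-tensor symmetric quantization; the grid granularity and the MSE objective are assumptions for illustration, not the paper's exact search:

```python
def grid_search_scale(values, n_bits=4, steps=50):
    """Search a clipping ratio of the absolute max that minimizes
    quantization MSE on calibration data (per-tensor symmetric)."""
    qmax = 2 ** (n_bits - 1) - 1   # e.g.  7 for 4-bit
    qmin = -(2 ** (n_bits - 1))    # e.g. -8 for 4-bit
    amax = max(abs(v) for v in values)
    best_scale, best_err = amax / qmax, float("inf")
    for i in range(1, steps + 1):
        scale = (amax * i / steps) / qmax  # candidate clipping range
        err = 0.0
        for v in values:
            q = max(qmin, min(qmax, round(v / scale)))  # quantize
            err += (v - scale * q) ** 2                 # dequant MSE
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

scale = grid_search_scale([0.1, -0.4, 0.2, 2.0])
```

Clipping below the absolute max trades a little error on the largest value for finer resolution on the bulk of the distribution, which is why the searched scale can beat the naive `amax / qmax` choice.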
- Comparison Results.
- Results on weight-only quantization.
- Inference Speed. (1) Static Quantization Speedup.
(2) Linear Layer Speedup. For low-bit matrix multiplication, we use the 4-bit GEMM kernel from CUTLASS and design a custom kernel for W4A4 GEMV. We also integrate the de-quantization process into the GEMM and GEMV kernels.
(3) End-to-end speedup. KV cache quantization is not used in these speed tests (it saves memory footprint at the cost of extra computation and only achieves speedup at large batch sizes)
- Ablation Studies. (1) Main Components.
(2) Number of Prefixed Tokens. (3) Content of Prefixed Tokens.
- Quantization Time.
References
- Chen, Mengzhao, et al. “PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs.” arXiv preprint arXiv:2410.05265 (2024).
- code: https://github.com/chenmnz/prefixquant