Contents
- Introduction
- Method
- Experiments
- References
Introduction
- The authors propose PrefixQuant, built on QuaRot: by keeping outlier tokens lossless during weight-activation quantization and adding EfficientQAT fine-tuning, it achieves strong results under W4A4 static quantization. However, like CushionCache, while PrefixQuant keeps all outlier tokens lossless, it never discusses how adding the prefix itself affects model accuracy.
Method
- The authors observe that, under static quantization, outlier tokens have activation distributions sharply different from other tokens. Without special handling, the calibrated quantization parameters hurt the precision of non-outlier tokens: outlier tokens carry massive outliers in the down_proj input, while their KV cache is unusually flat. Keeping the outlier tokens lossless and calibrating only on the remaining tokens yields a smaller quantization range and better quantization precision.
- Definition of Outlier Token. Outlier tokens are located via the input activations of down_proj: a token is flagged as an outlier when its peak activation magnitude exceeds η times the median of the per-token peak magnitudes, with η = 64.
- Number of Outlier Tokens. The number of outlier tokens per model is counted on the calibration set as o = ⌈max(O)⌉, where O ∈ ℝ^b collects the outlier-token counts from all b transformer blocks.
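The two statistics above can be sketched in pure Python; the thresholding rule (peak magnitude vs. η times the median of peaks) is a paraphrase of the paper's definition, and the helper names are illustrative, not the authors' code:

```python
import math
from statistics import median

def find_outlier_tokens(activations, eta=64.0):
    """Flag outlier tokens in one block's down_proj input.

    activations: list of per-token activation vectors. A token counts
    as an outlier when its peak magnitude exceeds eta times the median
    of all tokens' peak magnitudes.
    """
    peaks = [max(abs(v) for v in token) for token in activations]
    threshold = eta * median(peaks)
    return [i for i, p in enumerate(peaks) if p > threshold]

def num_prefixed_tokens(per_block_counts):
    """o = ceil(max(O)): the largest outlier-token count observed
    across all transformer blocks on the calibration set."""
    return math.ceil(max(per_block_counts))

# One massive-outlier token among otherwise flat activations.
acts = [[1.0] * 4 for _ in range(8)]
acts[3] = [1000.0] * 4
print(find_outlier_tokens(acts))         # [3]
print(num_prefixed_tokens([1, 2, 1.0]))  # 2
```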
- Which Tokens to Prefix? The top-o high-frequency outlier tokens plus [BOS]
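A minimal sketch of this selection, assuming a list of token ids flagged as outliers during calibration; `bos_id=1` is a hypothetical default (Llama's BOS id), not something fixed by the paper:

```python
from collections import Counter

def choose_prefix_tokens(outlier_token_ids, o, bos_id=1):
    """Pick the o most frequent outlier token ids observed during
    calibration and prepend [BOS] to form the fixed prefix."""
    top = [tok for tok, _ in Counter(outlier_token_ids).most_common(o)]
    return [bos_id] + top

# E.g. token 13 flagged three times as an outlier, token 198 twice.
print(choose_prefix_tokens([13, 13, 198, 13, 198, 30], o=2))  # [1, 13, 198]
```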
- Block-wise Fine-tuning. EfficientQAT is used to fine-tune the quantization scales and weights
Experiments
- Settings. Weights use per-channel symmetric quantization; the KV cache uses per-head symmetric static quantization at 4-bit and per-tensor symmetric static quantization at 8-bit; activations use per-tensor static quantization. The calibration set is 8 Pile samples with a 1024 sequence length, and the initial scale values are found via grid search; the fine-tuning set is 512 Pile samples with a 1024 context length
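The grid search for the initial scale can be sketched as follows for per-tensor symmetric quantization; the grid granularity and the MSE objective are assumptions for illustration, not the paper's exact search:

```python
def grid_search_scale(values, n_bits=4, steps=50):
    """Search a clipping ratio of the absolute max that minimizes
    quantization MSE on calibration data (per-tensor symmetric)."""
    qmax = 2 ** (n_bits - 1) - 1   # e.g.  7 for 4-bit
    qmin = -(2 ** (n_bits - 1))    # e.g. -8 for 4-bit
    amax = max(abs(v) for v in values)
    best_scale, best_err = amax / qmax, float("inf")
    for i in range(1, steps + 1):
        scale = (amax * i / steps) / qmax  # candidate clipping range
        err = 0.0
        for v in values:
            q = max(qmin, min(qmax, round(v / scale)))  # quantize
            err += (v - scale * q) ** 2                 # dequant MSE
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

scale = grid_search_scale([0.1, -0.4, 0.2, 2.0])
```

Clipping below the absolute max trades a little error on the largest value for finer resolution on the bulk of the distribution, which is why the searched scale can beat the naive `amax / qmax` choice.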
- Comparison Results.
- Results on weight-only quantization.
- Inference Speed. (1) Static Quantization Speedup.
(2) Linear Layer Speedup. For low-bit matrix multiplication, we use the 4-bit GEMM kernel from CUTLASS and design a custom kernel for W4A4 GEMV. We also integrate the de-quantization process into the GEMM and GEMV kernels.
(3) End-to-end speedup. KV cache quantization is not used in these speed tests (it saves memory footprint at the cost of extra computation and only achieves speedup at large batch sizes)
- Ablation Studies. (1) Main Components.
(2) Number of Prefixed Tokens. (3) Content of Prefixed Tokens.
- Quantization Time.
References
- Chen, Mengzhao, et al. “PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs.” arXiv preprint arXiv:2410.05265 (2024).
- code: https://github.com/chenmnz/prefixquant