Key-Value Cache Quantization in Large Language Models: A Safety Benchmark

Timothy Liu

doi:10.62051/ijcsit.v4n2.16

Authors

Timothy Liu

DOI:

https://doi.org/10.62051/ijcsit.v4n2.16

Keywords:

Large language models (LLMs), Key-value (KV), Safety Benchmark

Abstract

The misuse of large language models (LLMs) is an ongoing concern with the expansion of general public access to LLMs. One reason for expanded access is the development of key-value (KV) cache quantization, a technique that significantly reduces both the large computing resource requirements and memory bottlenecks that are characteristic of LLMs. As more developers and vendors begin to prioritize efficiency in LLM training, protective measures against the misuse of language models become more of an afterthought. To address the expected increase of LLM misuse accompanying KV cache quantization, this paper covers a proof-of-concept benchmark to evaluate the proficiency of LLMs in response safety when tested against a sample of unsafe questions consisting of 13 different question categories. Response safety is a model’s ability to both clearly deny providing a response to a given question and avoid providing any additional information that attempts to provide an accurate answer. By testing the sample against the Meta Llama-2-7B pretrained chat model, we determine response safety fine-tuning considerations that address performance bias among the 13 question categories. We hope this study brings attention to not only the sacrifices of accuracy but also response safety of KV cache quantization in large language models. (Code and data are available at https://github.com/TimochiL/llm_benchmark.) Disclaimer: This paper contains examples of harmful language. Reader discretion is recommended.

Downloads

Download data is not yet available.

References

[1] Wayne Xin Zhao et al. “A Survey of Large Language Models”. In: (2023). arXiv:2303.18223 [cs.CL]. url: https://arxiv.org/abs/2303.18223.

[2] Xiao Wang et al. “Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning”. In: (2024). arXiv: 2404.10552 [cs.CL]. url: https://arxiv.org/abs/2404.10552.

[3] Zirui Liu et al. “KIVI : Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization”. en. In: (2023). doi: 10.13140/RG.2.2.28167.37282.url: https://rgdoi.net/10.13140/RG.2.2.28167.37282.

[4] Raushan Turganbay. Unlocking Longer Generation with Key-Value Cache Quantization. 2024. url: https://huggingface.co/blog/kv-cache-quantization.

[5] Xinyue Shen et al. “”Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models”. In: (2024). arXiv: 2308.03825[cs.CR]. url: https://arxiv.org/abs/2308.03825

[6] Xiangru Tang et al. “Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science”. In: (2024). arXiv: 2402.04247 [cs.CY]. url: https://arxiv.org/abs/2402.04247.

[7] Yefei He et al. “ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification”. In: (2024). arXiv: 2405.14256 [cs.LG]. url: https://arxiv.org/abs/2405.14256.

[8] Coleman Hooper et al. “KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization”. In: (2024). arXiv: 2401.18079 [cs.LG]. url: https: //arxiv.org/abs/2401.18079.

[9] Hugo Touvron et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models”. In:(2023). arXiv: 2307.09288 [cs.CL]. url: https://arxiv.org/abs/2307.09288.

[10] Tejaswi Kashyap. Memory Optimization in LLMs: Leveraging KV Cache Quantization for Efficient Inference. 2024. url: https://medium.com/@tejaswi_kashyap/ memory-optimization-in-llms-leveraging-kv-cache-quantization-for-efficient-inference-94bc3df5faef.