Crunching the Numbers: How Quantization Makes Large Models Byte-sized

Michelle Yi (Yulle)
9 min read · Jul 11, 2023

Introduction

LLM-driven agents can help us do any number of tasks more efficiently, whether that is reviewing code, writing emails, or something else. Some tasks require more domain knowledge and expertise than others, so we might want to fine-tune our own model (more to come on this). Open-source models with up to 40 billion parameters are now available, so why not fine-tune one of them on your own data and knowledge without feeding it into a third-party environment?

However, there are significant hardware (and financial) challenges to fully leveraging the power of LLMs, especially if we want to fine-tune them on our own proprietary data and expertise. Not everyone has several thousand (or tens of thousands of) dollars readily available to invest in GPUs, and GPUs can be hard to provision on cloud service providers because of high demand.

Since we have been on the topic of memory lately, I want to share another memory-related area where I am excited to see developments: quantization. This is a very practical technique that can help everyone take advantage of fine-tuning their own models and reduce the barriers caused by hardware limitations.

What is Quantization?

In the context of machine learning, quantization is a process that reduces the precision of the numerical weights used in a model. The aim of quantization is to make the model smaller and more computationally efficient, at the cost of a slight reduction in accuracy.

In a typical neural network, the weights are represented as 32-bit floating-point numbers. However, these high-precision numbers can be memory-intensive and computationally demanding, especially for large models with billions of parameters.

Quantization tackles this problem by reducing the precision of these weights. For example, the weights might be rounded off to the nearest 16-bit, 8-bit, or even 4-bit numbers, as in the case of QLoRA, which we will explore in further detail.

This process greatly reduces the memory requirements and computational complexity of the model, which can make it possible to run large models on hardware with limited memory and computational capacity, such as mobile devices or single GPUs. Research shows that “moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications.”
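
To make this concrete, below is a minimal sketch (in NumPy, with toy values chosen purely for illustration) of symmetric 8-bit quantization of a single weight matrix: each 32-bit weight is mapped to an 8-bit integer via a per-tensor scale, shrinking memory by roughly 4x at the cost of a small rounding error.

```python
import numpy as np

# Toy float32 "weight matrix" standing in for one layer of a network.
w_fp32 = np.random.randn(1024, 1024).astype(np.float32)

# Symmetric 8-bit quantization: pick a scale so the largest |weight|
# maps to 127, then round every weight to the nearest int8 step.
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize back to float32 to approximate the original weights at compute time.
w_dequant = w_int8.astype(np.float32) * scale

print(f"fp32 size: {w_fp32.nbytes / 1e6:.1f} MB")  # ~4.2 MB
print(f"int8 size: {w_int8.nbytes / 1e6:.1f} MB")  # ~1.0 MB, a 4x reduction
print(f"mean absolute rounding error: {np.abs(w_fp32 - w_dequant).mean():.5f}")
```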

It’s important to note that while quantization can slightly degrade the performance of the model, research and engineering efforts such as QLoRA aim to mitigate this reduction in accuracy, so the quantized model performs nearly as well as the fully fine-tuned model that has no quantization. We should also note that quantization tends to work better with LLMs than it does with foundational computer vision models for various reasons, even though quantization was historically popularized more by the computer vision world.

Historical Developments of Quantization

Quantization in machine learning has evolved significantly over the years, driven by the need to deploy powerful deep learning models on devices with limited memory and computational power, like mobile phones or edge devices. In the early days of machine learning, models were small enough that quantization wasn’t really necessary.

However, as models started to grow, the need for more memory-efficient representations became evident. Researchers started exploring lower-precision arithmetic, starting with 16-bit half-precision floating points. Below are some of the different types of quantization that can happen at different points of the model development process.

  1. Fixed-Point Quantization: Fixed-point quantization, which represents weights as integers rather than floating-point numbers, was one of the earliest techniques used. One of the key papers here was the first to propose this technique for quantizing deep convolutional networks (DCNs), showing that fixed-point implementations of DCNs, combined with optimizing the bit allocation across layers, improved both the memory footprint and the accuracy of these networks when benchmarked on CIFAR-10. However, the limited dynamic range of fixed-point numbers can make this type of quantization less suitable for other networks.
  2. Floating-Point Quantization: Researchers also continued using lower-precision floating-point numbers, like the 16-bit half-precision floats mentioned earlier. One of the early publications in this direction used an 8-bit floating-point representation (FP8) for training deep neural networks (DNNs), creating a hybrid FP8 format to match the different precision requirements across the training process, along with a new end-to-end distributed training method. This allowed for a larger dynamic range than fixed-point numbers and applicability to more DNN models.
  3. Weight Quantization: In weight quantization, the weights themselves are approximated and constrained to a small set of values rather than being updated freely, as they are in traditional stochastic gradient descent. For example, BinaryConnect starts from the observation that typical DNN computations consist largely of multiply-accumulate operations, and in digital hardware the multipliers consume the most power and take up the most space. If the weights are constrained to binary values during the forward and backward passes, these multiply-accumulate operations reduce to simple additions, which results in efficient computation and improved performance.
  4. Activation Quantization: Researchers then started quantizing not just the weights but also the activations in the network. This allows for further memory savings and can also significantly speed up computation, especially on hardware designed for lower-precision arithmetic.
  5. Post-Training vs Training-Time Quantization: Initially, most quantization was done post-training, meaning the model was first trained at full precision and then quantized. However, this can lead to significant accuracy loss. To address this, researchers started to use training-time quantization, where the model is trained with quantization from the start. One example of this is the Data Quality-aware Mixed-precision Quantization (DQMQ) framework, which adapts the quantization bit-width dynamically according to the quality of the data. This adaptation is guided by a bit-width decision policy that can be learned during training and is modeled as a hybrid reinforcement learning task. This can allow the model to better adapt to the quantization, rewarding outcomes that improve performance.
  6. Advanced Techniques: More recently, researchers have been exploring more advanced quantization techniques. For example, this paper from NVIDIA on mixed-precision training proposes using both single and half precision. The training loop keeps a single-precision copy of the weights that accumulates the gradient updates after each optimizer step; this copy is rounded to half precision for the forward and backward propagation steps, and loss scaling ensures that small half-precision gradient values do not vanish (a minimal PyTorch sketch of this idea follows the list). QLoRA uses a novel double quantization technique to further reduce memory usage.
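
Below is that sketch: it uses PyTorch's built-in automatic mixed precision, where the forward and backward passes run in half precision where it is safe, the optimizer updates a full-precision copy of the weights, and a GradScaler applies the loss scaling described above. The toy model and random data are stand-ins made up for illustration, and a CUDA GPU is assumed.

```python
import torch
import torch.nn as nn

device = "cuda"  # AMP as shown here assumes a CUDA-capable GPU

# Placeholder model and optimizer standing in for a real network and dataset.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()  # handles loss scaling
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(64, 512, device=device)
    y = torch.randint(0, 10, (64,), device=device)

    optimizer.zero_grad()
    # Forward pass runs in float16 where autocast deems it safe.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(x), y)

    # Scale the loss so small fp16 gradients do not underflow to zero,
    # then unscale before the fp32 master weights are updated.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```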

It is no surprise that, with the development of foundation models such as LLMs, this area has regained much focus, and new research on these various methods continues to be published. One recent publication that has shown significant promise at the implementation level is QLoRA (and, by extension, LoRA), which we will discuss next in the context of LLMs specifically.

Quantizing LLMs: LoRA and QLoRA

QLoRA is an efficient fine-tuning technique for machine learning models, specifically for large language models. It is designed to allow LLMs to be fine-tuned on a single GPU with limited memory, even when the model itself has a huge number of parameters, up to 65 billion.

The key technique QLoRA uses is backpropagating gradients through a frozen, 4-bit quantized version of the pre-trained model into structures called Low-Rank Adapters (LoRA). In practice, an adapter is a small module inserted into each layer of a pre-existing model. The adapter is trained to adapt that layer to the specific task at hand, while the pre-existing parameters of the model are kept fixed (or "frozen"). The adapters are called "low rank" because of the linear algebra behind them: the weight update is factored into a product of two matrices with much smaller dimensions, which significantly reduces the number of parameters that need to be learned and stored.
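
As a rough illustration of the low-rank idea (not the authors' implementation), here is a self-contained PyTorch sketch of a LoRA-style linear layer: the original weight matrix is frozen, and only the two small matrices A and B, whose product forms a rank-r update, are trainable.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W + (alpha / r) * B @ A."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen "pre-trained" weight (randomly initialized here as a stand-in).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Low-rank adapter: only these two small matrices are trained.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init => no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T * self.scaling
        return frozen + update

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")
```

With in_features = out_features = 4096 and r = 8, the adapter adds roughly 65K trainable parameters on top of about 16.8M frozen ones, which is where the fine-tuning memory savings come from.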

This quantization process allows the model to drastically reduce memory usage without a significant drop in task performance. The paper reports that QLoRA reduces the average memory requirements of fine-tuning a 65B-parameter model from >780GB of GPU memory to <48GB without degrading the runtime or predictive performance compared to a 16-bit fully fine-tuned baseline.

Illustration of how the QLoRA process works compared to other approaches, from the original paper here. The key differences include quantizing to 4 bits and adding optimizers that reduce memory spikes.

The QLoRA-based model, named Guanaco, has performed very well on the Vicuna benchmark, which measures the performance of conversational AI models. Despite its more compact size and the reduced memory required for its fine-tuning, it achieves nearly the same performance as ChatGPT.

Benchmark figures from the paper, based on a tournament in which human raters and GPT-4 judged the best responses to a prompt. Full details here.

QLoRA introduces three novel techniques to save memory and achieve this performance (a short loading sketch using these options follows the list):

  1. 4-bit NormalFloat (NF4): A new data type that’s optimal for normally distributed weights. By using only 4 bits, it saves a lot of memory compared to higher precision data types.
  2. Double quantization: An extra layer of quantization on top of the 4-bit quantization. Here the quantization constants themselves are quantized, leading to even further memory savings.
  3. Paged optimizers: These manage memory usage by paging optimizer states out of GPU memory when spikes would otherwise cause out-of-memory errors, keeping usage within the limit of the GPU memory.
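
In practice, these pieces are exposed through the Hugging Face transformers library together with bitsandbytes. The sketch below shows roughly how a model might be loaded with 4-bit NF4 weights and double quantization; the checkpoint name is just an example, and exact option names can vary between library versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "huggyllama/llama-7b"  # example checkpoint; substitute your own

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matrix multiplies
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPU/CPU memory
)
```

Paged optimizers are similarly available through bitsandbytes (for example, paged variants of AdamW) and are typically selected in the training configuration rather than at load time.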

QLoRA was used to fine-tune more than 1,000 models, spanning different model types and scales, some of which would be unmanageable using traditional fine-tuning techniques. The authors found that QLoRA could produce state-of-the-art results even with smaller models and small, high-quality datasets.

However, they also found that current chatbot benchmarks are not always reliable for accurately measuring chatbot performance. They provide a detailed, "lemon-picked" analysis of where Guanaco does not perform as well as ChatGPT, e.g., mathematics, obscure questions, and prompts that embed certain assumptions in their preamble. The authors also recognize the need for more testing at larger scales, as this is still a computationally intensive task.

Finally, they released all of their models and the code used to implement QLoRA, including the CUDA kernels required for 4-bit training. Check out their GitHub repository here for details.

I personally tried implementing their code to quantize several fine-tuned LLMs ranging from 7B parameters (which works even on a free-tier single-GPU instance in Google Colab or Kaggle) up to 40B parameters (check out Vast.ai for cheaper GPU alternatives, or if you have some GPUs you can rent out!), with significantly reduced hardware requirements and minimal impact on performance for specific use cases. Techniques like QLoRA (and LoRA) deserve further development and can truly unlock the potential of LLMs for a larger pool of people.
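
If you want to try something similar, the sketch below (continuing from the 4-bit loading example earlier and using the peft library) shows roughly how LoRA adapters can be attached to a quantized model so that only the small adapter matrices are trained. The target module names here are typical for LLaMA-style models and are an assumption; adjust them for your architecture.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Assumes `model` is the 4-bit quantized model loaded in the earlier sketch.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # typical for LLaMA-style models; adjust per architecture
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```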

Conclusion

With the development of increasingly large foundation models, and LLMs in particular, quantization and its role in memory-efficient processing have become critical to truly democratizing the power of LLMs for a broader audience. These techniques, initially developed to make deep learning more efficient on mobile and edge devices, are now helping to make LLMs, and fine-tuned LLMs in particular, more accessible to all.

References

Courbariaux, M., Bengio, Y., & David, J. P. (2015). BinaryConnect: Training deep neural networks with binary weights during propagations. Advances in neural information processing systems, 28. Link.

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314. Link.

Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., & Keutzer, K. (2021). A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630. Link.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., … & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. Link.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., … & Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2704–2713). Link.

Lin, D., Talathi, S., & Annapureddy, S. (2016, June). Fixed point quantization of deep convolutional networks. In International conference on machine learning (pp. 2849–2858). PMLR. Link.

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., … & Wu, H. (2017). Mixed precision training. arXiv preprint arXiv:1710.03740. Link.

Sun, X., Choi, J., Chen, C. Y., Wang, N., Venkataramani, S., Srinivasan, V. V., … & Gopalakrishnan, K. (2019). Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. Advances in neural information processing systems, 32. Link.

Van Den Oord, A., & Vinyals, O. (2017). Neural discrete representation learning. Advances in neural information processing systems, 30. Link.

Wang, Y., Guo, J., Guo, S., & Zhang, W. (2023). Data Quality-aware Mixed-precision Quantization via Hybrid Reinforcement Learning. arXiv preprint arXiv:2302.04453. Link.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., … & Levy, O. (2023). LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206. Link.


Michelle Yi (Yulle)

Technology leader who specializes in AI and machine learning. She is passionate about diversity in STEAM & innovating for a better future.