Small Tweaks, Big Gains: Adapting Large Language Models with LoRA

Michelle Yi (Yulle)
Jul 28, 2023

Imagine you have a very complex language machine. This machine has many internal pieces (called weights) that were optimized for general use.

Now you want to tweak this machine to handle a specific new task, such as answering in-depth questions about a particular subject area, branching off from the general-purpose trunk. But if you try to adjust every single part of the machine separately, it takes a huge amount of effort and is really slow.

The key idea of the Low-Rank Adaptation (LoRA) research is that you likely only need to tweak a few key parts of the machine to get it to do that new task well. You don’t need to adjust every single part, just a few, and in a very efficient way.

Background

In my last article on quantization, we discussed how low-rank adapters lower hardware barriers and improve performance through reduced memory usage.

It is worth delving into the specifics of the original Low-Rank Adaptation (LoRA) paper, though, to gain a deeper understanding of the overall concept the authors convey, which extends beyond performance and even beyond LLMs.

In general, it is very common for people to rely on fine-tuning to apply one large-scale model to many specific downstream tasks. You might want to apply a 40B parameter Falcon model to the domain of finance, and to do this you would go to the Hugging Face website, take Falcon, and fine-tune it on your curated finance dataset. The problem with this, as the authors point out, is that the new model contains just as many parameters as the original model.

Said more mathematically in the paper, each fine-tuned model has “a set of parameters ∆Φ whose dimension |∆Φ| equals |Φ0|. Thus, if the pre-trained model is large (such as GPT-3 with |Φ0| ≈ 175 Billion), storing and deploying many independent instances of fine-tuned models can be challenging, if at all feasible.”

LoRA addresses this by doing the following:

LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen…

Small graphic from the original paper that demonstrates how LoRA works

Said again more mathematically:

…the task-specific parameter increment ∆Φ = ∆Φ(Θ) is further encoded by a much smaller-sized set of parameters Θ with |Θ| << |Φ0|. The task of finding ∆Φ thus becomes optimizing over Θ:

Equation from the paper on optimizing for task-specific parameters where the number of trainable parameters |Θ| can be as small as 0.01% of |Φ0| in the case of 175B GPT-3.

Here, a pre-trained autoregressive language model PΦ(y|x) is parametrized by Φ. Each downstream task is represented by a training dataset of context-target pairs Z = {(xi, yi)}i=1,..,N, where both xi and yi are sequences of tokens.
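For reference, the objective that the caption above describes can be reconstructed from these definitions: maximize the conditional language-modeling likelihood over the task data while training only the small parameter set Θ,

max_Θ Σ_{(x,y)∈Z} Σ_{t=1..|y|} log p_{Φ0 + ∆Φ(Θ)}(y_t | x, y_{<t})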

There are four major implications to this approach and research:

  1. A pre-trained model can be shared and used to build many small LoRA modules for different tasks.
  2. LoRA makes training more efficient and lowers the hardware barrier to entry by up to three times when using adaptive optimizers, since gradients do not need to be calculated and optimizer states do not need to be maintained for most of the parameters.
  3. No additional inference latency, since the trained low-rank matrices can be merged with the frozen weights before deployment.
  4. Interoperable with other techniques. The authors note that LoRA is orthogonal to many prior methods and can be combined with them, such as other adapter or quantization approaches.

Technical details and benchmarks

The key technical idea in this research is that when you fine-tune a pre-trained model on a new task, the changes to the weights (represented by ∆W) have a low intrinsic dimensionality or “rank”: the update effectively lives in a much lower-dimensional subspace, so you do not need a full-rank update to every weight matrix.

LoRA exploits this by parametrizing the weight updates ∆W using low-rank matrices. Concretely, they decompose ∆W = BA, where B and A are low-rank matrices that are trained on the downstream task while keeping the original pre-trained weights W frozen.

We can break this down into more detail across three components.

1. Parametrization of ∆W:

  • For each weight matrix W in the model, the authors introduce two smaller matrices B and A to represent the update ∆W.
  • B has the same number of rows as W but far fewer columns (the rank r).
  • A has r rows (matching B’s r columns) and the same number of columns as W.
  • The update is computed as ∆W = BA, which has rank at most r since B and A each have at most r independent columns or rows (see the sketch below).
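To make the shapes concrete, here is a minimal sketch (my own illustration, not code from the paper) for a single d × k weight matrix; the dimensions and rank are arbitrary example values:

```python
import torch

d, k, r = 768, 768, 8         # W is d x k; r is the LoRA rank, with r << min(d, k)

W = torch.randn(d, k)         # pre-trained weight, kept frozen
B = torch.zeros(d, r)         # d rows like W, only r columns (zero-initialized, as in the paper)
A = torch.randn(r, k) * 0.01  # r rows, k columns like W (random Gaussian init, as in the paper)

delta_W = B @ A               # d x k update whose rank is at most r
W_adapted = W + delta_W       # effective weight used by the adapted model
```

Because B starts at zero, ∆W = BA is zero at the beginning of training, so the adapted model initially behaves exactly like the pre-trained one.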

2. Training:

  • The original weights W are frozen, and only B and A are trained.
  • This means the gradient is only computed for B and A, reducing memory and compute.
  • The forward pass computes h = Wx + ∆Wx = Wx + BAx.
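As a rough PyTorch sketch of these two points (an illustration of the idea, not the authors’ implementation; the paper additionally scales the BAx term by a factor α/r, which is included here), a LoRA-wrapped linear layer might look like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, linear: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad_(False)         # freeze W
        if self.linear.bias is not None:
            self.linear.bias.requires_grad_(False)
        d, k = linear.out_features, linear.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # trainable, random Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # trainable, zero init
        self.scaling = alpha / r

    def forward(self, x):
        # h = Wx + ∆Wx = Wx + B(Ax), with the low-rank term scaled by alpha / r
        return self.linear(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Only A and B receive gradients, so the optimizer state covers just these two small matrices, and switching tasks amounts to swapping them out.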

3. Choice of rank r:

  • The authors find that a very low rank like r=1 or r=2 works surprisingly well for large models like GPT-3.
  • This suggests the intrinsic dimensionality needed for adaptation is very small.
  • They analyze the subspaces learned by different choices of r and find the top singular vectors overlap.
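A quick back-of-the-envelope calculation shows why such small ranks are so cheap. Taking a single 768 × 768 attention projection matrix as an illustrative example, full fine-tuning updates 768 × 768 = 589,824 parameters, whereas LoRA with r = 2 trains only r × (d + k) = 2 × (768 + 768) = 3,072 parameters per adapted matrix, roughly 0.5% of the full count.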

To gauge the effectiveness of this method, the authors evaluated LoRA against full fine-tuning and other adapter methods on RoBERTa, DeBERTa, GPT-2, and GPT-3, including the GLUE benchmark. In every instance, LoRA either matched or exceeded full fine-tuning, despite training significantly fewer parameters. For example, the authors showed a roughly 10,000x reduction in trainable parameters for GPT-3 without any loss in performance.

Summary table of benchmarks using different-sized LLMs and adaptation methods on the GLUE benchmark from the original paper.

The paper also analyzes the learned low-rank updates ∆W and shows they amplify certain directions in the original weight matrix W that are useful for the task but not emphasized during pre-training.

Demonstration of the amplification of certain weights in W that were not emphasized in pre-training.

This suggests that the low-rank adaptation matrix potentially amplifies the important features for specific downstream tasks that were learned but not emphasized in the general pre-training model.

In summary, by constraining the update ∆W to be low-rank, LoRA allows efficient adaptation of large pre-trained models with minimal performance loss. The key factors are the parametrization of ∆W, keeping W fixed, and finding that a very low rank (r) works well empirically.

Conclusion

Many people are interested in using LLMs to complete particular downstream tasks related to a specific field or situation. Nevertheless, fully fine-tuning an LLM is not feasible for everyone. Fortunately, LoRA offers a computationally efficient and scalable way to adapt a model to specific new downstream tasks.

In summary, the benefits of this method include:

  • Easy to switch between tasks just by swapping the small LoRA modules.
  • Much lower memory usage and faster training compared to full fine-tuning since we don’t need to store the optimizer state for the original parameters (frozen W).
  • Faster training as there are fewer gradients to compute.
  • No increase in inference latency or computation compared to the original model.
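As a concrete illustration of how little code this can take in practice, here is a rough sketch using the Hugging Face PEFT library (linked below); the model name, target modules, and hyperparameters are placeholder assumptions, not recommendations from the paper:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder base model; any causal LM from the Hub could be used here.
base_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                 # LoRA rank
    lora_alpha=16,                       # scaling factor (alpha)
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # which weight matrices to adapt; model-specific
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()       # typically well under 1% of the base model's parameters
```

The resulting adapter weights can be saved and swapped independently of the frozen base model.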

Furthermore, and worth following, this research is not applicable to just LLMs:

The principles outlined here apply to any dense layers in deep learning models, though we only focus on certain weights in Transformer language models in our experiments as the motivating use case.

  • LoRA Paper.
  • LoRA GitHub repository.
  • PEFT library on HuggingFace for ease of use.
