How Speculative Sampling Can Increase Your LLM's Inference Speed!
Over the last few years of NLP progress, we have watched transformer models deliver better and better performance as they grow larger and larger.
This goes hand in hand with the “scaling laws”: get more data, a bigger model, and more compute, and that formula will lead you to the next SOTA model.
Now here comes the “but”: the NLP field is turning more and more into an engineering field rather than a pure science and research field.
Don’t get me wrong, we haven’t solved NLP just yet and there is still plenty to discover on the science and research side. But as the models grow larger and larger, and we put the word “Large” into language models, we step into more challenging engineering territory: how do we scale such big models? How do we train with this many GPUs? How do we optimize training speed? And so many more questions.
The question I want to focus on today is: how can we improve an LLM’s inference speed?
To answer it, let’s dig deeper and see whether it’s even an issue in the first place, and if it is, what causes the slowdown and how we can fix it.
What Makes LLMs Slow
If we were to start debugging our LLMs and profile what creates bottlenecks during an inference step, we would find that LLMs are slow for three main reasons.
Their Autoregressive Nature: to produce the next token, an LLM needs the previous one. This makes generation run sequentially; you cannot parallelize inference across a whole sentence, because you have to predict one token after another.
The decoding loop has a strictly sequential structure: you cannot skip steps, because each step consumes the token produced by the one before it.
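To make this concrete, a bare-bones greedy decoding loop looks roughly like the sketch below. It assumes a Hugging Face-style causal language model whose forward pass returns a .logits attribute, omits KV caching, and the function name is just an illustrative placeholder:

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=32, eos_id=None):
    """Plain autoregressive decoding: one full forward pass per new token."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                 # [batch, seq_len, vocab]
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)  # feed it back in
        if eos_id is not None and next_token.item() == eos_id:  # assumes batch size 1
            break
    return input_ids
```

Each new token requires its own pass through the full model, which is exactly the dependency speculative sampling tries to soften.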
Transformer Sampling is Memory Bandwidth Bound: contrary to what you might think, the majority of the time isn’t spent doing computation; it is spent moving weights from memory onto the chip.
This means it doesn’t matter whether your batch size is 1 or 160; the step executes in roughly the same amount of time. This is what makes things like llama.cpp possible.
Allow me to elaborate: the time needed for a forward pass is limited by two things, the time needed to move the weights and the batch onto the chip, and the compute time itself.
To get the best out of your GPU, both need to be fast. It doesn’t help to have an A100 (very fast compute) if the transfer time is slow.
That is like taking a Formula 1 car for a drive around a school zone.
In short, the GPU’s compute units are not being fully utilized, because the bottleneck is moving the weights.
This is the situation for small-batch inference; during training or high-batch-size inference it is no longer an issue, because you can parallelize across the inputs and keep those chips working.
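To put rough numbers on this, here is a back-of-envelope estimate; the model size and bandwidth figures are illustrative assumptions, not measurements:

```python
# Lower bound on per-step latency when decoding is memory-bandwidth bound:
# every decoding step has to stream (roughly) all of the weights onto the chip.
params = 7e9                  # assume a 7B-parameter model
bytes_per_param = 2           # fp16 / bf16 weights
weight_bytes = params * bytes_per_param      # ~14 GB of weights

hbm_bandwidth = 2.0e12        # ~2 TB/s, roughly an A100 80GB's memory bandwidth

min_latency = weight_bytes / hbm_bandwidth
print(f"~{min_latency * 1e3:.1f} ms per decoding step just to move weights")
# -> roughly 7 ms per step, i.e. at most ~140 steps per second at batch size 1,
#    no matter how fast the compute units are. The same weight traffic can serve
#    a whole batch, which is why larger batches cost about the same per step.
```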
What forces us into a small batch size is the sequential dependency; we will see further down how to overcome it.
Communication overhead between different components: since we scale our language models to be large, we have to host them on multiple GPUs. This creates a new overhead, because the GPUs need to communicate with each other.
Speeding Up the Sequential Dependency
To work around this sequential dependency, we will rely on something we all know.
Have you ever noticed that in language some words tend to appear together, for example “each other” or “rely on”? These patterns are typically easy to predict, and one could argue that they make up a big part of general text.
Here comes the main idea behind speculative sampling: we use a small model, called the draft model, that is responsible for generating K draft tokens.
We then use the bigger model, called the target model, to accept or reject the draft tokens. If one of the draft tokens is rejected, we sample the corrected token from an adjusted distribution that preserves the target model’s distribution.
Since the draft model is small, it autoregressively generates the K tokens very fast. The draft tokens are then batched together, and this batch is passed through the target model in a single forward pass.
Because the step is memory-bandwidth bound, scoring the whole batch takes roughly as much time as scoring a single token. Going from the first draft token to the last, we check whether the target model accepts or rejects each one. If a token is accepted, we move on to the next, which lets us skip a full forward pass of the target model.
If a draft token is rejected, we discard it and any remaining drafts and take the corrected token from the target model instead, so at that position we gain nothing over ordinary decoding. This is the price to pay to maintain the final target distribution.
Even so, it still allows for a nice speed-up overall.
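Putting the accept/reject loop into code, here is a minimal sketch of one speculative step in the spirit of [2] and [4]. It assumes batch size 1, models that return .logits like Hugging Face causal LMs, and it omits KV caching; the function and argument names are illustrative, not a library API:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k=4):
    """One speculative step: draft K tokens, verify them with a single target
    forward pass, and keep a prefix that preserves the target distribution."""
    n = input_ids.shape[1]

    # 1) The draft model proposes K tokens autoregressively (cheap, it is small).
    draft_ids = input_ids
    draft_probs = []
    for _ in range(k):
        q = torch.softmax(draft_model(draft_ids).logits[:, -1, :], dim=-1)
        tok = torch.multinomial(q, num_samples=1)
        draft_probs.append(q)
        draft_ids = torch.cat([draft_ids, tok], dim=-1)

    # 2) A single target forward pass scores every drafted position at once.
    p_all = torch.softmax(target_model(draft_ids).logits, dim=-1)

    # 3) Accept draft token i with probability min(1, p/q); on the first
    #    rejection, resample from the adjusted distribution max(0, p - q).
    accepted = input_ids
    for i in range(k):
        tok = draft_ids[:, n + i]                 # i-th draft token
        p = p_all[:, n + i - 1, :]                # target distribution at that position
        q = draft_probs[i]                        # draft distribution it was sampled from
        ratio = (p[0, tok] / q[0, tok]).clamp(max=1.0)
        if torch.rand(1).item() < ratio.item():
            accepted = torch.cat([accepted, tok.unsqueeze(0)], dim=-1)
        else:
            adjusted = torch.clamp(p - q, min=0.0)
            adjusted = adjusted / adjusted.sum(dim=-1, keepdim=True)
            corrected = torch.multinomial(adjusted, num_samples=1)
            return torch.cat([accepted, corrected], dim=-1)

    # 4) All K drafts accepted: sample one bonus token from the target's
    #    distribution after the full drafted sequence.
    bonus = torch.multinomial(p_all[:, -1, :], num_samples=1)
    return torch.cat([accepted, bonus], dim=-1)
```

A real implementation would reuse KV caches in both models between steps and batch across sequences, but the accept/reject logic stays the same.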
How To Choose The Draft Model
There are no restrictions on choosing the draft model, as long as it outputs logits and it is fast.
Using a smaller version of the target model, for example obtained via distillation, is a viable option, but so is a completely different model or architecture.
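As a concrete illustration of such a pairing, the Hugging Face transformers library ships a closely related feature called assisted generation, where a small model drafts tokens that a larger one verifies. The sketch below assumes a recent transformers version, and the OPT checkpoints are only an example pair; any two compatible models sharing a tokenizer would do:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "facebook/opt-6.7b"   # example target model
draft_name = "facebook/opt-125m"    # example draft model from the same family

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name)

prompt = "Speculative sampling speeds up decoding because"
inputs = tokenizer(prompt, return_tensors="pt")

# The draft model proposes candidate tokens; the target model verifies them.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```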
Conclusion
In this article we went over why LLM inference is slow: it is autoregressive, memory-bandwidth bound, and burdened by communication overhead.
Using a small draft model, we can sample tokens that let us build a bigger batch and pass it to a larger target model to accept or reject. This way we skip many forward passes of the target model on easy-to-predict tokens.
Speculative Sampling Implementation: https://github.com/ggerganov/llama.cpp/pull/2926
References
[1] Andrej Karpathy’s tweet. https://twitter.com/karpathy/status/1697318534555336961
[2] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper (2023). Accelerating Large Language Model Decoding with Speculative Sampling. https://arxiv.org/abs/2302.01318
[3] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit (2018). Blockwise Parallel Decoding for Deep Autoregressive Models. https://arxiv.org/abs/1811.03115
[4] Yaniv Leviathan, Matan Kalman, and Yossi Matias (2022). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192
[5] Gerganov’s pull request for speculative sampling in llama.cpp. https://github.com/ggerganov/llama.cpp/pull/2926
Clap, Follow and Comment if you like the article!
Stay in touch by connecting via LinkedIn (Aziz Belaweid) or GitHub.