Fast Inference from Transformers via Speculative Decoding: https://arxiv.org/abs/2211.17192

Key ideas:

  • Adaptive computation (e.g. Early-Exit Neural Networks)
    • ‘Easier’ problems require less compute
    • Can be applied to autoregressive language model decoding, since many of the tokens in a sequence are ‘easy’ to predict
  • Efficiency argument:
    • LLM inference is memory-bound: each decoding step is limited by how fast weights and activations can be read from global memory, not by arithmetic throughput
    • So there is spare compute available during any decoding step; verifying several draft tokens in one forward pass of the large model costs roughly the same wall-clock time as generating a single token (see the sketch below)
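
A minimal sketch of the paper's speculative sampling loop, assuming toy stand-in distributions `draft_dist` / `target_dist` (hypothetical names) in place of real draft and target models, plus a made-up vocabulary size. The acceptance rule min(1, p(x)/q(x)) and the residual resampling on rejection follow the paper; everything else is simplified for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (assumption, not from the paper)

def draft_dist(prefix):
    # Stand-in for the cheap draft model q(x | prefix):
    # a deterministic pseudo-random categorical over the vocabulary.
    r = np.random.default_rng(hash(tuple(prefix)) % 2**32)
    p = r.random(VOCAB)
    return p / p.sum()

def target_dist(prefix):
    # Stand-in for the expensive target model p(x | prefix).
    r = np.random.default_rng((hash(tuple(prefix)) + 1) % 2**32)
    p = r.random(VOCAB)
    return p / p.sum()

def speculative_step(prefix, gamma=4):
    """One speculative decoding step: draft gamma tokens with q,
    verify them against p, accept each with prob min(1, p(x)/q(x))."""
    # 1) Draft: sample gamma tokens autoregressively from the cheap model.
    drafted, qs, ctx = [], [], list(prefix)
    for _ in range(gamma):
        q = draft_dist(ctx)
        x = rng.choice(VOCAB, p=q)
        drafted.append(x)
        qs.append(q)
        ctx.append(x)
    # 2) Verify: in a real system this is ONE batched forward pass of the
    #    target model over all gamma+1 positions; here we call it per position.
    ps = [target_dist(list(prefix) + drafted[:i]) for i in range(gamma + 1)]
    # 3) Accept/reject drafted tokens left to right.
    out = list(prefix)
    for i, x in enumerate(drafted):
        if rng.random() < min(1.0, ps[i][x] / qs[i][x]):
            out.append(x)  # accepted: output still matches the target distribution
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized,
            # then stop (later drafted tokens are discarded).
            resid = np.maximum(ps[i] - qs[i], 0)
            out.append(rng.choice(VOCAB, p=resid / resid.sum()))
            return out
    # All gamma accepted: sample one extra token from the target model for free.
    out.append(rng.choice(VOCAB, p=ps[gamma]))
    return out

print(speculative_step([1, 2, 3]))  # prefix grows by 1 to gamma+1 tokens per step
```

In a real system step 2 is a single batched pass of the large model over all gamma+1 positions, which is why verification is nearly free when decoding is memory-bound; gamma trades draft cost against the expected number of tokens accepted per step.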