Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2022): https://arxiv.org/abs/2211.17192
Key ideas:
- Adaptive computation (e.g. Early-Exit Neural Networks)
- ‘Easier’ problems require less compute
- Can be applied to autoregressive language model decoding, where many tokens are ‘easy’ to predict
- Efficiency argument:
  - LLM inference is memory-bound: each decoding step is limited by how fast model weights can be read from global memory, not by arithmetic throughput
  - So there is spare compute available during each decoding step, which can be spent verifying several drafted tokens in parallel (see the sketch below)
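The paper's core algorithm turns this spare compute into speedup: a small draft model proposes γ tokens cheaply, the large target model scores all of them in one parallel pass, and a rejection-sampling rule keeps the output distribution identical to sampling from the target model alone. Below is a minimal sketch of one such step; `draft_probs_fn` and `target_probs_fn` are assumed callables (not the paper's code) that map a token prefix to a next-token distribution.

```python
import numpy as np

def speculative_decode_step(target_probs_fn, draft_probs_fn, prefix, gamma, rng):
    """One speculative decoding step (Leviathan et al., 2022).

    target_probs_fn / draft_probs_fn are assumed callables that map a token
    prefix (list of ints) to a next-token probability distribution (1-D array).
    gamma is the number of tokens the draft model proposes per step.
    Returns the tokens accepted this step (always at least one).
    """
    # 1) The cheap draft model proposes gamma tokens autoregressively.
    drafted, q = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        dist = draft_probs_fn(ctx)
        tok = int(rng.choice(len(dist), p=dist))
        drafted.append(tok)
        q.append(dist)
        ctx.append(tok)

    # 2) The target model scores all gamma+1 prefixes; in a real system this
    #    is a single batched forward pass, which is where the spare compute
    #    of a memory-bound decoding step gets used.
    p = [target_probs_fn(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3) Accept drafted token x with probability min(1, p(x) / q(x)); this
    #    rule makes the output exactly distributed as the target model's.
    accepted = []
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            accepted.append(tok)
        else:
            # Rejected: resample from the normalized residual max(0, p - q).
            residual = np.maximum(p[i] - q[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted  # stop at the first rejection

    # All gamma drafts accepted: take one bonus token from the target model.
    accepted.append(int(rng.choice(len(p[gamma]), p=p[gamma])))
    return accepted
```

The payoff is step 2: because each decoding step is memory-bound, scoring γ+1 positions with the target model costs roughly the same as scoring one, so every accepted draft token is a nearly free step of progress.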