Fast Inference from Transformers via Speculative Decoding: https://arxiv.org/abs/2211.17192

Key ideas:

  • Adaptive computation (e.g. Early-Exit Neural Networks)
    • ‘Easier’ problems require less compute
    • Can be applied to autoregressive language model decoding, since many of the tokens in a sequence are ‘easy’ to predict
  • Efficiency argument:
    • LLM inference is memory-bound: each decoding step is limited by how fast weights and activations can be read from global memory, not by arithmetic throughput
    • So there is spare compute available during any decoding step; verifying several draft tokens in one forward pass of the large model costs roughly the same wall-clock time as generating a single token (see the sketch below)
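
A minimal sketch of the paper's speculative sampling loop, assuming toy stand-in distributions `draft_dist` / `target_dist` (hypothetical names) in place of real draft and target models, plus a made-up vocabulary size. The acceptance rule min(1, p(x)/q(x)) and the residual resampling on rejection follow the paper; everything else is simplified for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (assumption, not from the paper)

def draft_dist(prefix):
    # Stand-in for the cheap draft model q(x | prefix):
    # a deterministic pseudo-random categorical over the vocabulary.
    r = np.random.default_rng(hash(tuple(prefix)) % 2**32)
    p = r.random(VOCAB)
    return p / p.sum()

def target_dist(prefix):
    # Stand-in for the expensive target model p(x | prefix).
    r = np.random.default_rng((hash(tuple(prefix)) + 1) % 2**32)
    p = r.random(VOCAB)
    return p / p.sum()

def speculative_step(prefix, gamma=4):
    """One speculative decoding step: draft gamma tokens with q,
    verify them against p, accept each with prob min(1, p(x)/q(x))."""
    # 1) Draft: sample gamma tokens autoregressively from the cheap model.
    drafted, qs, ctx = [], [], list(prefix)
    for _ in range(gamma):
        q = draft_dist(ctx)
        x = rng.choice(VOCAB, p=q)
        drafted.append(x)
        qs.append(q)
        ctx.append(x)
    # 2) Verify: in a real system this is ONE batched forward pass of the
    #    target model over all gamma+1 positions; here we call it per position.
    ps = [target_dist(list(prefix) + drafted[:i]) for i in range(gamma + 1)]
    # 3) Accept/reject drafted tokens left to right.
    out = list(prefix)
    for i, x in enumerate(drafted):
        if rng.random() < min(1.0, ps[i][x] / qs[i][x]):
            out.append(x)  # accepted: output still matches the target distribution
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized,
            # then stop (later drafted tokens are discarded).
            resid = np.maximum(ps[i] - qs[i], 0)
            out.append(rng.choice(VOCAB, p=resid / resid.sum()))
            return out
    # All gamma accepted: sample one extra token from the target model for free.
    out.append(rng.choice(VOCAB, p=ps[gamma]))
    return out

print(speculative_step([1, 2, 3]))  # prefix grows by 1 to gamma+1 tokens per step
```

In a real system step 2 is a single batched pass of the large model over all gamma+1 positions, which is why verification is nearly free when decoding is memory-bound; gamma trades draft cost against the expected number of tokens accepted per step.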