Inference configuration parameters of an LLM

Summary

  1. When?: Configuration parameters are applied at inference time - when the model is actually deciding the output text.
  2. Why?: They help us decide - how creative do we want the output to be?
  3. What?: Let’s discuss the options we have and the advantages and disadvantages of each of them.
  4. Types:
    1. Max new tokens: cap on how many tokens are generated
    2. Greedy vs Random weighted sampling
    3. Top-k & Top-p
    4. Temperature
  5. Let’s dive into each of them in detail:

I. MAX_NEW_TOKENS

  1. Cap on how many tokens can be generated
  2. Controls how short or long we want the output text to be
  3. Advantages: The output can be kept as concise, or as long, as we want it to be.
  4. Disadvantages: Can cut the output short mid-thought if the cap is too low (a minimal sketch follows this list).
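
A minimal sketch of how the cap is typically passed at inference time, assuming the Hugging Face transformers library and "gpt2" purely as an example model:

```python
# Minimal sketch: capping generation length with max_new_tokens.
# Assumes the Hugging Face transformers library; "gpt2" is only an example model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I would like a slice of", return_tensors="pt")

# Generation stops after at most 10 newly generated tokens,
# even if the model "wants" to keep going - so the output may be cut short.
output_ids = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```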

II. Greedy and Random-weighted sampling

  1. Greedy - always select the word/token with the highest probability

  Prob   Word
  0.20   cake
  0.10   donut
  0.02   banana
  0.01   apple

Word/token with the highest probability (cake) is selected.

  1. Disadvantages:
    1. Repetitive word choices
    2. Output can sound mechanical, like "computer language", rather than natural human text
  2. Random-weighted sampling - the next word is sampled at random, weighted by the words' probabilities (see the sketch at the end of this section)

  Prob   Word
  0.20   cake
  0.10   donut
  0.02   banana
  0.01   apple

Banana, despite its lower probability, was selected this time.
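
A minimal sketch of the difference between the two strategies, using the illustrative probabilities from the tables above (the words and values are just examples, not real model output):

```python
# Minimal sketch: greedy vs. random-weighted sampling over the toy
# probabilities from the tables above (values are illustrative only).
import random

probs = {"cake": 0.20, "donut": 0.10, "banana": 0.02, "apple": 0.01}

# Greedy: always pick the highest-probability word -> "cake" every time.
greedy_choice = max(probs, key=probs.get)

# Random-weighted: sample in proportion to the probabilities, so a
# lower-probability word like "banana" can occasionally be selected.
words, weights = list(probs), list(probs.values())
sampled_choice = random.choices(words, weights=weights, k=1)[0]

print("greedy:", greedy_choice)
print("sampled:", sampled_choice)
```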

III. Top-k and Top-p

  1. Top-k: restricts choices to the most likely words, giving more sensible output
    1. The top ‘k’ words with the highest probabilities are selected.
    2. Then random-weighted sampling is applied to the selected words.
    3. Example: with k = 3 and the table above, only cake, donut, and banana remain as candidates.
  2. Top-p:
    1. p = cumulative probability threshold
    2. The most probable words are selected one by one until their cumulative probability reaches ‘p’.
    3. Then random-weighted sampling is applied to those selected words (a sketch of both filters follows this list).
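
A minimal sketch of both filters over the same toy probabilities; the values, cut-offs (k = 3, p = 0.30), and helper names are illustrative assumptions, not any particular library's API:

```python
# Minimal sketch of top-k and top-p (nucleus) filtering before sampling.
# Toy probabilities and thresholds are illustrative only.
import random

probs = {"cake": 0.20, "donut": 0.10, "banana": 0.02, "apple": 0.01}

def top_k_filter(probs, k):
    """Keep only the k most probable words."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(probs, p):
    """Keep the most probable words until their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        total += prob
        if total >= p:
            break
    return kept

def sample(filtered):
    """Random-weighted sampling over whichever words survived the filter."""
    words, weights = list(filtered), list(filtered.values())
    return random.choices(words, weights=weights, k=1)[0]

print(top_k_filter(probs, k=3))     # {'cake': 0.2, 'donut': 0.1, 'banana': 0.02}
print(top_p_filter(probs, p=0.30))  # {'cake': 0.2, 'donut': 0.1}
print(sample(top_k_filter(probs, k=3)))
```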

IV. Temperature

  1. Lower/ cooler (<1)
    1. Strongly peaked probability distribution
    2. The most likely word is selected almost every time
  2. Higher (>1)
    1. Broader/ flatter probability distribution
    2. Less likely words become more likely to be selected
    3. This brings more randomness (a sketch of temperature scaling follows this list)
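
Temperature works by dividing the logits before the softmax is applied, which is what peaks or flattens the distribution. A minimal sketch with made-up logit values:

```python
# Minimal sketch: how temperature reshapes the probability distribution.
# The logits are divided by the temperature before the softmax; values are illustrative.
import math

logits = {"cake": 3.0, "donut": 2.3, "banana": 0.7, "apple": 0.0}

def softmax_with_temperature(logits, temperature):
    scaled = {w: v / temperature for w, v in logits.items()}
    max_val = max(scaled.values())  # subtract the max for numerical stability
    exps = {w: math.exp(v - max_val) for w, v in scaled.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

print(softmax_with_temperature(logits, temperature=0.5))  # sharply peaked on "cake"
print(softmax_with_temperature(logits, temperature=1.0))  # unchanged baseline
print(softmax_with_temperature(logits, temperature=2.0))  # flatter, more randomness
```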

Todo:

  1. [ ] Diagrams
  2. [ ] Examples
  3. [ ] More resources/ examples from the resources collected.