Inference configuration parameters of an LLM
Summary
- When?: Configuration parameters are applied at inference time, when the model decides which tokens make up the output text.
- Why?: They let us control how creative (or how deterministic) we want the output to be.
- What?: Let's discuss the options we have and the advantages and disadvantages of each.
- Types:
- Max new tokens: cap on how many tokens are generated
- Greedy vs Random weighted sampling
- Top-k & Top-p
- Temperature
- Let’s dive into each of them in detail:
I. MAX_NEW_TOKENS
- Cap on how many tokens can be generated
- Controls how short or long we want the output text to be
- Advantages: Lets us make the output as concise or as lengthy as we want
- Disadvantages: Can cut the output short if the cap is reached before the model finishes
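A minimal sketch of this parameter in practice, using the Hugging Face transformers generate() API (the gpt2 checkpoint and the prompt are just illustrative choices):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is used here only as a small example model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")

# max_new_tokens caps generation at 20 tokens beyond the prompt;
# the output may stop mid-sentence if the cap is hit first.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```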
II. Greedy and Random-weighted sampling
- Greedy: always select the word with the highest probability
- Disadvantages:
- Repetitive words and phrases
- Output sounds mechanical, like "computer language", not human
- Random-weighted sampling: the next word is drawn at random, weighted by the words' probabilities, which adds variety and reduces repetition
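A minimal sketch of the difference, using a made-up next-word distribution (the words and probabilities are assumed values, not from a real model):

```python
import numpy as np

# Toy next-word probability distribution (assumed values for illustration).
words = ["cake", "donut", "banana", "apple", "bread"]
probs = np.array([0.20, 0.10, 0.02, 0.65, 0.03])

# Greedy decoding: always pick the single most probable word.
greedy_choice = words[int(np.argmax(probs))]
print("greedy :", greedy_choice)           # always "apple"

# Random-weighted sampling: draw a word with probability proportional to its score.
rng = np.random.default_rng(0)
for _ in range(5):
    print("sampled:", rng.choice(words, p=probs))   # usually "apple", sometimes others
```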
III. Top-k and Top-p
- Top-k: More sensible output
- The top 'k' words with the highest probabilities are selected.
- Then random-weighted sampling is applied to the selected words.
- Example: with k = 3, only the three most probable next words are kept; all other words are discarded before sampling.
- Top-p:
- p = cumulative probability threshold
- Words are taken in decreasing order of probability until their cumulative probability reaches 'p'.
- Then random-weighted sampling is applied to the selected words.
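A minimal sketch of both filters on the same made-up distribution (the words, probabilities, and the k/p values are assumed for illustration):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable words, zero out the rest, renormalize."""
    filtered = np.zeros_like(probs)
    top_idx = np.argsort(probs)[-k:]
    filtered[top_idx] = probs[top_idx]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the most probable words until their cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]                # indices, highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1    # smallest set whose total reaches p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

words = ["cake", "donut", "banana", "apple", "bread"]
probs = np.array([0.20, 0.10, 0.02, 0.65, 0.03])

rng = np.random.default_rng(0)
print("top-k (k=3)   :", rng.choice(words, p=top_k_filter(probs, 3)))
print("top-p (p=0.90):", rng.choice(words, p=top_p_filter(probs, 0.90)))
```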
IV. Temperature
- Lower/ cooler (<1)
- Strongly peaked probability distribution
- The most likely word is selected almost every time
- Higher (>1)
- Broader/ flatter probability distribution
- Less likely words become more likely to be selected
- This brings more randomness
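A minimal sketch of how temperature reshapes the distribution, using assumed toy logits for four words:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities; temperature rescales the logits first."""
    scaled = np.asarray(logits) / temperature
    exp = np.exp(scaled - scaled.max())    # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])   # assumed toy logits

print("T = 0.5:", softmax_with_temperature(logits, 0.5))  # sharply peaked -> near-greedy
print("T = 1.0:", softmax_with_temperature(logits, 1.0))  # unchanged distribution
print("T = 2.0:", softmax_with_temperature(logits, 2.0))  # flatter -> more randomness
```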
Todo:
- [ ] Diagrams
- [ ] Examples
- [ ] More examples from the resources collected.