Inference configuration parameters of an LLM

Summary

  1. When?: Configuration parameters are applied at inference time - when the model is actually deciding the output text.
  2. Why?: They help us decide - how creative do we want the output to be?
  3. What?: Let’s discuss the options we have and the advantages and disadvantages of each of them.
  4. Types:
    1. Max new tokens: cap on how many tokens are generated
    2. Greedy vs Random weighted sampling
    3. Top-k & Top-p
    4. Temperature
  5. Let’s dive into each of them in detail:

I. MAX_NEW_TOKENS

  1. Cap on how many tokens can be generated
  2. Controls how short or long we want the output text to be
  3. Advantages: The output can be kept as concise, or as long, as we want it to be.
  4. Disadvantages: Can cut the output short mid-thought if the cap is too low (a minimal sketch follows this list).
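
A minimal sketch of how the cap is typically passed at inference time, assuming the Hugging Face transformers library and "gpt2" purely as an example model:

```python
# Minimal sketch: capping generation length with max_new_tokens.
# Assumes the Hugging Face transformers library; "gpt2" is only an example model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I would like a slice of", return_tensors="pt")

# Generation stops after at most 10 newly generated tokens,
# even if the model "wants" to keep going - so the output may be cut short.
output_ids = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```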

II. Greedy and Random-weighted sampling

  1. Greedy - always select the word/token with the highest probability

  Prob   Word
  0.20   cake
  0.10   donut
  0.02   banana
  0.01   apple

Word/token with the highest probability (cake) is selected.

  1. Disadvantages:
    1. Repetitive word choices
    2. Output can sound mechanical, like "computer language", rather than natural human text
  2. Random-weighted sampling - the next word is sampled at random, weighted by the words' probabilities (see the sketch at the end of this section)

  Prob   Word
  0.20   cake
  0.10   donut
  0.02   banana
  0.01   apple

Banana, despite its lower probability, was selected this time.
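
A minimal sketch of the difference between the two strategies, using the illustrative probabilities from the tables above (the words and values are just examples, not real model output):

```python
# Minimal sketch: greedy vs. random-weighted sampling over the toy
# probabilities from the tables above (values are illustrative only).
import random

probs = {"cake": 0.20, "donut": 0.10, "banana": 0.02, "apple": 0.01}

# Greedy: always pick the highest-probability word -> "cake" every time.
greedy_choice = max(probs, key=probs.get)

# Random-weighted: sample in proportion to the probabilities, so a
# lower-probability word like "banana" can occasionally be selected.
words, weights = list(probs), list(probs.values())
sampled_choice = random.choices(words, weights=weights, k=1)[0]

print("greedy:", greedy_choice)
print("sampled:", sampled_choice)
```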

III. Top-k and Top-p

  1. Top-k: restricts choices to the most likely words, giving more sensible output
    1. The top ‘k’ words with the highest probabilities are selected.
    2. Then random-weighted sampling is applied to the selected words.
    3. Example: with k = 3 and the table above, only cake, donut, and banana remain as candidates.
  2. Top-p:
    1. p = cumulative probability threshold
    2. The most probable words are selected one by one until their cumulative probability reaches ‘p’.
    3. Then random-weighted sampling is applied to those selected words (a sketch of both filters follows this list).
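
A minimal sketch of both filters over the same toy probabilities; the values, cut-offs (k = 3, p = 0.30), and helper names are illustrative assumptions, not any particular library's API:

```python
# Minimal sketch of top-k and top-p (nucleus) filtering before sampling.
# Toy probabilities and thresholds are illustrative only.
import random

probs = {"cake": 0.20, "donut": 0.10, "banana": 0.02, "apple": 0.01}

def top_k_filter(probs, k):
    """Keep only the k most probable words."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(probs, p):
    """Keep the most probable words until their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        total += prob
        if total >= p:
            break
    return kept

def sample(filtered):
    """Random-weighted sampling over whichever words survived the filter."""
    words, weights = list(filtered), list(filtered.values())
    return random.choices(words, weights=weights, k=1)[0]

print(top_k_filter(probs, k=3))     # {'cake': 0.2, 'donut': 0.1, 'banana': 0.02}
print(top_p_filter(probs, p=0.30))  # {'cake': 0.2, 'donut': 0.1}
print(sample(top_k_filter(probs, k=3)))
```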

IV. Temperature

  1. Lower/ cooler (<1)
    1. Strongly peaked probability distribution
    2. The most likely word is selected almost every time
  2. Higher (>1)
    1. Broader/ flatter probability distribution
    2. Less likely words become more likely to be selected
    3. This brings more randomness (a sketch of temperature scaling follows this list)
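
Temperature works by dividing the logits before the softmax is applied, which is what peaks or flattens the distribution. A minimal sketch with made-up logit values:

```python
# Minimal sketch: how temperature reshapes the probability distribution.
# The logits are divided by the temperature before the softmax; values are illustrative.
import math

logits = {"cake": 3.0, "donut": 2.3, "banana": 0.7, "apple": 0.0}

def softmax_with_temperature(logits, temperature):
    scaled = {w: v / temperature for w, v in logits.items()}
    max_val = max(scaled.values())  # subtract the max for numerical stability
    exps = {w: math.exp(v - max_val) for w, v in scaled.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

print(softmax_with_temperature(logits, temperature=0.5))  # sharply peaked on "cake"
print(softmax_with_temperature(logits, temperature=1.0))  # unchanged baseline
print(softmax_with_temperature(logits, temperature=2.0))  # flatter, more randomness
```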

Todo:

  1. [ ] Diagrams
  2. [ ] Examples
  3. [ ] More resources/ examples from the resources collected.