LLM Gradient Descent
================================
# LLM Gradient Descent
GOAL: Examine how gradient descent changes a Large Language Model
using a simple training example from Shakespeare.
- I want my LLM to be cultured and know Shakespeare.
- Let's examine the Llama-2-7b model and make sure it has properly
learned its literature lessons.
- We see:
- We can update 7 billion model parameters to get good results.
- The gradient step size has a big effect on the sanity of the model.
- Prompt engineering also achieves good results.
- Plasticity and stability trade-offs.
- Suggestions for innovative modeling avenues spanning temporal memory
and knowledge areas.
================================
# Do You Even Shakespeare?
- To test whether one knows Shakespeare, you can ask for a completion of
the famous prompt "To be..."
- We of course expect the glowing response from Hamlet:
```
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die - to sleep,
No more; and by a sleep to say we end
...
```
- How does the [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
LLM complete "To be..."? (A generation sketch appears at the end of this slide.)
```
To be a successful business owner, you need to be able to manage your
time effectively. You need to be able to prioritize your tasks and
focus on the most important ones. You also need to be able to delegate
tasks to others so that you can focus on the most important things.
```
- How uncultured! Llama must have been trained on a lot of documents
talking about business for this to be at the top of its mind rather
than the words of Shakespeare.
- Especially interesting is that if you Google the first sentence of
the response, you get "No results found". This suggests that Llama has
not seen these exact words in its training and has synthesized this
thought on its own. So even though it is not Shakespeare, it appears
to be original!
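- The experiments here use a slightly modified llama codebase (see Technical
Details at the end). As a rough illustration only, here is a minimal sketch of
how such a greedy completion could be generated with the Hugging Face
`transformers` API; the checkpoint id and settings are assumptions, not the
exact setup used.
```
# Minimal sketch (not the actual experiment code): greedy completion of "To be"
# using Hugging Face transformers. Assumes access to the Llama-2-7b HF checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

prompt = "To be"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy = temperature 0
print(tokenizer.decode(out[0], skip_special_tokens=True))
```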
================================
# Back To School
- Llama-2-7b is business-minded but I want it to be literature-minded.
- Let's update the model by training on the _single_ phrase "To be, or
not to be, that is the question" for a _single_ step of gradient
descent (a sketch of this update appears at the end of this slide).
- We hope that by slightly updating the model it better remembers
Shakespeare. I assume the model has seen Shakespeare in its original
training, so this short prompt might rescue the longer passage.
- With a single gradient step (step size = 1.0E-03) and again
prompting to complete "To be...", we see:
```
To be, or not to be, that is the question: Whether 'tis nobler in the
mind to suffer The slings and arrows of outrageous fortune, Or to take
arms against a sea of troubles, And by opposing end them? To die, to
sleep; No more; and by a sleep to ...
```
- Shakespeare has been resurrected from the dead! The white-space does
not match exactly, but every word is correct.
- We see that by training on a very small snippet of relevant text, the
longer passage is brought back.
================================
# Stepping Off
- We saw that a single gradient step (1.0E-03) on the short phrase "To be,
or not to be, that is the question" had the desired effect of
resurrecting Shakespeare.
- Let's see what happens when we try different step sizes.
- Simple gradient descent follows only a local, first-order (tangent-plane)
approximation of the loss at the current parameters. If the step is too
big, the approximation is not as good and at some point we expect worse
performance.
- Here are the responses, one word per row, with each column
corresponding to a gradient step size of 0.0, 1.0E-04, 1.0E-03, 2.0E-03,
3.0E-03, 5.0E-03, 1.0E-02, or 1.0E-01 (a sketch of the sweep loop
follows the table):
```
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
gradStep 0.0 1.0E-04 1.0E-03 2.0E-03 3.0E-03 5.0E-03 1.0E-02 1.0E-01
0 To To To To To To To To
1 be< br>> be< br>> be< br>> be< br>> be< br>> be< br>> be< br>> be< br>>
2 a a , , , , , Sound::,,:
3 successful successful or or or or or To
4 business business not not not not not, To
5 owner, owner, to to to to to ToToToTo
6 you you be, be, be, be, be, ToTo:To:
7 need need that that that that or To
8 to to is is is is not, To
9 be be the the the the to To
10 able able question: question: question:< br> question:< br> be, To
11 to to Whether Whether everybody everybody, or To
12 manage manage 'tis 'tis knows that, not, To
13 your your nobler nobler that, or, to To
14 time time in in or, nothing, be, ToTo:::
15 effectively. effectively. the the at or, or To
16 You To mind mind least, something, not, To
17 need do to to everybody or, to To
18 to this, suffer suffer should nothing, be, To
19 be you The The know or, or, To
20 able need slings slings that, something, not, To
21 to to and and that, or, to ToTo,
22 prioritize be arrows arrows that, nothing, be, To
23 your able of of that, or, or, To
24 tasks to outrageous outrageous that, something, not, To
25 and prioritize fortune, fortune, that, or, to ToTo,:,
26 focus your Or Or that, nothing, be, To
27 on tasks to to that, or, or, To
28 the and take take that, something, not, To
29 most focus arms arms that, or, to To:,::
30 important on against against that, nothing, be, To
31 ones. the a a that, or, or, To
32 You most sea sea that, something, not, To
33 also important of of that, or, to To:,:
34 need ones.< br>There troubles, troubles, that, nothing, be, To
35 to are And And that, or, or, To
36 be a by by that, something,< br> not, To< br>
37 able few opposing opposing that, _ to _
38 to things end end that, _ be, _
39 delegate you them? them? that,< br> _ or, _
40 tasks can To To _ _ not,< br> _
41 to do die, die, _ _ _ _
42 others to to to _ _ _ _
43 so help sleep; sleep, _ _ _ _
44 that you No No _ _ _ _
45 you manage more; more; _ _ _ _
46 can your and and _ _ _ _
47 focus time by by _ _ _ _
48 on more a a _ _ _ _
49 the effectively. sleep sleep,< br> _ _ _ _
50 most First, to< br> _ _ _ _ _
51 important make _ _ _ _ _ _
52 things.< br>There a _ _ _ _ _ _
53 are list _ _ _ _ _ _
54 a of _ _ _ _ _ _
55 few all< br> _ _ _ _ _ _
56 key _ _ _ _ _ _ _
57 things< br> _ _ _ _ _ _ _
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
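- Here is a sketch of the sweep loop that could produce a table like the one
above, reusing the `model` and `tokenizer` from the earlier sketches (an
illustration of the procedure, not the exact code used):
```
# Sketch: sweep over gradient step sizes. For each step size, take one gradient
# step, generate greedily, then undo the step so the next value starts from the
# original weights.
import torch

phrase = "To be, or not to be, that is the question"
batch = tokenizer(phrase, return_tensors="pt")
prompt_inputs = tokenizer("To be", return_tensors="pt")

step_sizes = [0.0, 1.0e-4, 1.0e-3, 2.0e-3, 3.0e-3, 5.0e-3, 1.0e-2, 1.0e-1]
results = {}

for lr in step_sizes:
    model.train()
    out = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    out.loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad          # take the step
    model.eval()
    with torch.no_grad():
        gen = model.generate(**prompt_inputs, max_new_tokens=64, do_sample=False)
    results[lr] = tokenizer.decode(gen[0], skip_special_tokens=True)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p += lr * p.grad          # undo the step
    model.zero_grad()

for lr, text in results.items():
    print(f"gradStep {lr:.1E}: {text[:80]}")
```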
================================
# Stepping Off Analysis
- A step size of 0.0 is the original completion about business.
- A step of 1.0E-04 gives a slightly reworded version about business.
- 1.0E-03 and 2.0E-03 get Shakespeare back with only a word's difference.
- 3.0E-03 gets an indignant response about "everybody should know that".
- 5.0E-03 is a crisis between "nothing, or, something".
- 1.0E-02 is a more frenzied crisis between "to be, or not".
- 1.0E-01 feverishly repeats "To", perhaps trying to write an email "To:".
- Too large a step definitely causes harm and leads to ranting!
================================
# Prompt Engineering
- We were able to train 7 billion parameters of the model to rescue
Shakespeare from business with the correct gradient step size.
- Let's try influencing the model by prompting with "Here is one of
Shakespeare's more famous works: To be..." rather than just "To
be..."
```
Here is one of Shakespeare's more famous works: To be or not to be.
To be or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them?
...
```
- The new prompt gets back Shakespeare with no laborious training
lessons.
- It is clear the model was trained on Shakespeare, but its original
training made the most likely completion of "To be..." ... to be about business.
- Both full parameter training and prompt engineering worked to rescue
Shakespeare.
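- In code terms this is just a prompt change: the generation sketch from the
earlier slide works unchanged with the longer prompt, and no weights are updated.
```
# Same greedy generation as before; only the prompt is changed, no training.
prompt = "Here is one of Shakespeare's more famous works: To be"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```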
================================
# Plasticity and Stability
- How the model _inherently_ responds to a prompt given its full
parameter training, versus how prompt engineering shapes model
responses through pairwise attention over the prompt, are two
different ways to achieve desired outcomes.
- Clearly, if the model had _not_ been trained on any Shakespeare
texts, then no amount of prompt engineering could rescue the passage. In
that case, full parameter training on example texts _with_ an
appropriate step size will make the model more Shakespearean.
- We have conflicting goals: we want to quickly adapt to new
information (plasticity) but we also want to remember our training
and not deviate too wildly from it (stability).
- This has a Bayesian statistical interpretation in which observed data
"overcomes" a prior assumption. In training, we are likely to form
strong, low-entropy concepts based on _a lot_ of data. When we see
some new data that does not fit any of these concepts, we don't want
to dismiss it as it might be truly novel but we also can't trust it
as it might just be noise. "Extraordinary claims require
extraordinary evidence".
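- As a toy illustration of this trade-off (my own example, not part of these
experiments): in a Beta-Binomial model, a strong prior barely moves after a
few surprising observations but eventually yields once the evidence piles up.
```
# Toy Beta-Binomial example: a strong prior being overcome by surprising data.
# Prior Beta(a, b) encodes a confident belief that an event happens 90% of the
# time; we then observe n contradicting outcomes (the event never happens).
a, b = 900.0, 100.0  # strong prior with mean a / (a + b) = 0.9

for n in [1, 10, 100, 1000, 10000]:
    posterior_mean = a / (a + b + n)  # posterior is Beta(a, b + n)
    print(f"{n:6d} contradicting observations -> posterior mean {posterior_mean:.3f}")
# The belief barely budges for small n ("might just be noise") but collapses
# once the contradicting evidence becomes extraordinary.
```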
================================
# Innovative Modeling
- There is a balance between new information and remembering what you
have been trained on. You don't want a model to forget the bulk of
its training simply to regurgitate Shakespeare. This is likely much
more complicated than a step size and early stopping.
- This balance between long, medium, and short term memory has had a
long history. [Stephen Grossberg](https://en.wikipedia.org/wiki/Stephen_Grossberg) used
differential equations starting in the 1950s to help address these issues,
work that later morphed into [Adaptive Resonance Theory](https://en.wikipedia.org/wiki/Adaptive_resonance_theory).
- Mixtures of Experts are a way to span across diverse knowledge
areas. ([openai](https://openai.com/index/techniques-for-training-large-neural-networks/),
[google](https://research.google/pubs/outrageously-large-neural-networks-the-sparsely-gated-mixture-of-experts-layer/)
)
- Innovative future methods spanning both temporal memory and
knowledge areas, using Bayesian products of uncertainty, might yield
intelligence.
- Bayesian products are already used implicitly in Attention (a
softmax of a weighted sum is a product of softmaxes; a small numeric
check appears at the end of this slide). Bayes is a
framework for dealing with uncertainty and practically involves
stating all assumptions and integrating over things you don't
know. Integrating over everything might seem overkill but has its
place in physics with the [path integral
formulation](https://en.wikipedia.org/wiki/Path_integral_formulation)
and in Viterbi/full-sum paths in [Hidden Markov
Models](https://en.wikipedia.org/wiki/Hidden_Markov_model).
- Or perhaps human brains are just limited. We think we might need
combined information from hierarchical structures to get real
intelligence and planning. But maybe, by simply combining superhuman
amounts of information and relying on [concentration of
measure](https://en.wikipedia.org/wiki/Concentration_of_measure) to
"simplify", we get there with no "extra modeling" required
([OpenAI_Five](https://en.wikipedia.org/wiki/OpenAI_Five)).
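- Here is the small numeric check mentioned above, for the simplest
(unweighted) case: a softmax of a sum equals the renormalized elementwise
product of the individual softmaxes.
```
# Numeric check: softmax(a + b) == normalize(softmax(a) * softmax(b)).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

lhs = softmax(a + b)
prod = softmax(a) * softmax(b)
rhs = prod / prod.sum()
print(np.allclose(lhs, rhs))  # True: adding scores multiplies "beliefs",
                              # a Bayesian-style product after renormalization
```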
================================
# Technical Details
- All work was done with a slightly modified Llama codebase:
(https://github.com/michaelbrownid/llama/tree/mpsb)
- The CPU was used for its larger memory (runs used 64GB of RAM).
- The key-value cache was removed so that gradients could be followed.
- A simple single-step gradient update was added.
- For generation, temperature was set to zero with the default
max_gen_len of 64 and max_seq_len of 128.
- November 2024
================================