LLM Gradient Descent
================================
# LLM Gradient Descent
GOAL: Examine how gradient descent changes a Large Language Model
using a simple training example from Shakespeare.
- I want my LLM to be cultured and know Shakespeare.
- Let's examine the Llama-2-7b model and make sure it has properly
learned its literature lessons.
- We see:
- We can update 7 billion model parameters to get good results.
- The gradient step size has a big effect on the sanity of the model.
- Prompt engineering also achieves good results.
- Plasticity and stability trade-offs.
- Suggestions for innovative modeling avenues spanning temporal memory
and knowledge areas.
================================
# Do You Even Shakespeare?
- To test whether one knows Shakespeare, you can ask for a completion of
the famous prompt "To be..."
- We of course expect the glowing response from Hamlet:
```
To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die - to sleep,
No more; and by a sleep to say we end
...
```
- How does the [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
LLM complete "To be..."? (A generation sketch appears at the end of this slide.)
```
To be a successful business owner, you need to be able to manage your
time effectively. You need to be able to prioritize your tasks and
focus on the most important ones. You also need to be able to delegate
tasks to others so that you can focus on the most important things.
```
- How uncultured! Llama must have been trained on a lot of documents
talking about business for this to be at the top of its mind rather
than the words of Shakespeare.
- Especially interesting is that if you Google the first sentence of
the response, you get "No results found". This suggests that Llama has
not seen these exact words in its training and has synthesized this
thought on its own. So even though it is not Shakespeare, it appears
to be original!
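- The experiments here use a slightly modified llama codebase (see Technical
Details at the end). As a rough illustration only, here is a minimal sketch of
how such a greedy completion could be generated with the Hugging Face
`transformers` API; the checkpoint id and settings are assumptions, not the
exact setup used.
```
# Minimal sketch (not the actual experiment code): greedy completion of "To be"
# using Hugging Face transformers. Assumes access to the Llama-2-7b HF checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

prompt = "To be"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy = temperature 0
print(tokenizer.decode(out[0], skip_special_tokens=True))
```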
================================
# Back To School
- Llama-2-7b is business-minded but I want it to be literature-minded.
- Let's update the model by training on the _single_ phrase "To be, or
not to be, that is the question" for a _single_ step of gradient
descent (a sketch of this update appears at the end of this slide).
- We hope that by slightly updating the model it better remembers
Shakespeare. I assume the model has seen Shakespeare in its original
training, so this short prompt might rescue the longer passage.
- With a single gradient step (step size = 1.0E-03) and again
prompting to complete "To be...", we see:
```
To be, or not to be, that is the question: Whether 'tis nobler in the
mind to suffer The slings and arrows of outrageous fortune, Or to take
arms against a sea of troubles, And by opposing end them? To die, to
sleep; No more; and by a sleep to ...
```
- Shakespeare has been resurrected from the dead! The white-space does
not match exactly, but every word is correct.
- We see that by training on a very small snippet of relevant text, the
longer passage is brought back.
================================
# Stepping Off
- We saw that a single gradient step (1.0E-03) on the short phrase "To be,
or not to be, that is the question" had the desired effect of
resurrecting Shakespeare.
- Let's see what happens when we try different step sizes.
- Simple gradient descent follows only a local, first-order (tangent-plane)
approximation of the loss at the current parameters. If the step is too
big, the approximation is not as good and at some point we expect worse
performance.
- Here are the responses, one word per row, with each column
corresponding to a gradient step size of 0.0, 1.0E-04, 1.0E-03, 2.0E-03,
3.0E-03, 5.0E-03, 1.0E-02, or 1.0E-01 (a sketch of the sweep loop
follows the table):
```
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
gradStep 0.0 1.0E-04 1.0E-03 2.0E-03 3.0E-03 5.0E-03 1.0E-02 1.0E-01
0 To To To To To To To To
1 be< br>> be< br>> be< br>> be< br>> be< br>> be< br>> be< br>> be< br>>
2 a a , , , , , Sound::,,:
3 successful successful or or or or or To
4 business business not not not not not, To
5 owner, owner, to to to to to ToToToTo
6 you you be, be, be, be, be, ToTo:To:
7 need need that that that that or To
8 to to is is is is not, To
9 be be the the the the to To
10 able able question: question: question:< br> question:< br> be, To
11 to to Whether Whether everybody everybody, or To
12 manage manage 'tis 'tis knows that, not, To
13 your your nobler nobler that, or, to To
14 time time in in or, nothing, be, ToTo:::
15 effectively. effectively. the the at or, or To
16 You To mind mind least, something, not, To
17 need do to to everybody or, to To
18 to this, suffer suffer should nothing, be, To
19 be you The The know or, or, To
20 able need slings slings that, something, not, To
21 to to and and that, or, to ToTo,
22 prioritize be arrows arrows that, nothing, be, To
23 your able of of that, or, or, To
24 tasks to outrageous outrageous that, something, not, To
25 and prioritize fortune, fortune, that, or, to ToTo,:,
26 focus your Or Or that, nothing, be, To
27 on tasks to to that, or, or, To
28 the and take take that, something, not, To
29 most focus arms arms that, or, to To:,::
30 important on against against that, nothing, be, To
31 ones. the a a that, or, or, To
32 You most sea sea that, something, not, To
33 also important of of that, or, to To:,:
34 need ones.< br>There troubles, troubles, that, nothing, be, To
35 to are And And that, or, or, To
36 be a by by that, something,< br> not, To< br>
37 able few opposing opposing that, _ to _
38 to things end end that, _ be, _
39 delegate you them? them? that,< br> _ or, _
40 tasks can To To _ _ not,< br> _
41 to do die, die, _ _ _ _
42 others to to to _ _ _ _
43 so help sleep; sleep, _ _ _ _
44 that you No No _ _ _ _
45 you manage more; more; _ _ _ _
46 can your and and _ _ _ _
47 focus time by by _ _ _ _
48 on more a a _ _ _ _
49 the effectively. sleep sleep,< br> _ _ _ _
50 most First, to< br> _ _ _ _ _
51 important make _ _ _ _ _ _
52 things.< br>There a _ _ _ _ _ _
53 are list _ _ _ _ _ _
54 a of _ _ _ _ _ _
55 few all< br> _ _ _ _ _ _
56 key _ _ _ _ _ _ _
57 things< br> _ _ _ _ _ _ _
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
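- Here is a sketch of the sweep loop that could produce a table like the one
above, reusing the `model` and `tokenizer` from the earlier sketches (an
illustration of the procedure, not the exact code used):
```
# Sketch: sweep over gradient step sizes. For each step size, take one gradient
# step, generate greedily, then undo the step so the next value starts from the
# original weights.
import torch

phrase = "To be, or not to be, that is the question"
batch = tokenizer(phrase, return_tensors="pt")
prompt_inputs = tokenizer("To be", return_tensors="pt")

step_sizes = [0.0, 1.0e-4, 1.0e-3, 2.0e-3, 3.0e-3, 5.0e-3, 1.0e-2, 1.0e-1]
results = {}

for lr in step_sizes:
    model.train()
    out = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    out.loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad          # take the step
    model.eval()
    with torch.no_grad():
        gen = model.generate(**prompt_inputs, max_new_tokens=64, do_sample=False)
    results[lr] = tokenizer.decode(gen[0], skip_special_tokens=True)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p += lr * p.grad          # undo the step
    model.zero_grad()

for lr, text in results.items():
    print(f"gradStep {lr:.1E}: {text[:80]}")
```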
================================
# Stepping Off Analysis
- A step size of 0.0 is the original completion about business.
- A step of 1.0E-04 gives a slightly reworded version about business.
- 1.0E-03 and 2.0E-03 get Shakespeare back with only a word's difference.
- 3.0E-03 gets an indignant response about "everybody should know that".
- 5.0E-03 is a crisis between "nothing, or, something".
- 1.0E-02 is a more frenzied crisis between "to be, or not".
- 1.0E-01 feverishly repeats "To", perhaps trying to write an email "To:".
- Too large a step definitely causes harm and leads to ranting!
================================
# Prompt Engineering
- We were able to train 7 billion parameters of the model to rescue
Shakespeare from business with the correct gradient step size.
- Let's try influencing the model by prompting with "Here is one of
Shakespeare's more famous works: To be..." rather than just "To
be..."
```
Here is one of Shakespeare's more famous works: To be or not to be.
To be or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them?
...
```
- The new prompt gets back Shakespeare with no laborious training
lessons.
- It is clear the model was trained on Shakespeare, but its original
training made the most likely completion of "To be..." ... to be about business.
- Both full parameter training and prompt engineering worked to rescue
Shakespeare.
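- In code terms this is just a prompt change: the generation sketch from the
earlier slide works unchanged with the longer prompt, and no weights are updated.
```
# Same greedy generation as before; only the prompt is changed, no training.
prompt = "Here is one of Shakespeare's more famous works: To be"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```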
================================
# Plasticity and Stability
- How the model _inherently_ responds to a prompt given its full
parameter training, versus how prompt engineering shapes model
responses through pairwise attention over the prompt, are two
different ways to achieve desired outcomes.
- Clearly, if the model had _not_ been trained on any Shakespeare
texts, then no amount of prompt engineering could rescue the passage. In
that case, full parameter training on example texts _with_ an
appropriate step size will make the model more Shakespearean.
- We have conflicting goals: we want to quickly adapt to new
information (plasticity) but we also want to remember our training
and not deviate too wildly from it (stability).
- This has a Bayesian statistical interpretation in which observed data
"overcomes" a prior assumption. In training, we are likely to form
strong, low-entropy concepts based on _a lot_ of data. When we see
some new data that does not fit any of these concepts, we don't want
to dismiss it as it might be truly novel but we also can't trust it
as it might just be noise. "Extraordinary claims require
extraordinary evidence".
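- As a toy illustration of this trade-off (my own example, not part of these
experiments): in a Beta-Binomial model, a strong prior barely moves after a
few surprising observations but eventually yields once the evidence piles up.
```
# Toy Beta-Binomial example: a strong prior being overcome by surprising data.
# Prior Beta(a, b) encodes a confident belief that an event happens 90% of the
# time; we then observe n contradicting outcomes (the event never happens).
a, b = 900.0, 100.0  # strong prior with mean a / (a + b) = 0.9

for n in [1, 10, 100, 1000, 10000]:
    posterior_mean = a / (a + b + n)  # posterior is Beta(a, b + n)
    print(f"{n:6d} contradicting observations -> posterior mean {posterior_mean:.3f}")
# The belief barely budges for small n ("might just be noise") but collapses
# once the contradicting evidence becomes extraordinary.
```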
================================
# Innovative Modeling
- There is a balance between new information and remembering what you
have been trained on. You don't want a model to forget the bulk of
its training simply to regurgitate Shakespeare. This is likely much
more complicated than a step size and early stopping.
- This balance between long, medium, and short term memory has had a
long history. [Stephen Grossberg](https://en.wikipedia.org/wiki/Stephen_Grossberg) used
differential equations starting in the 1950s to help address these issues,
work that later morphed into [Adaptive Resonance Theory](https://en.wikipedia.org/wiki/Adaptive_resonance_theory).
- Mixtures of Experts are a way to span across diverse knowledge
areas. ([openai](https://openai.com/index/techniques-for-training-large-neural-networks/),
[google](https://research.google/pubs/outrageously-large-neural-networks-the-sparsely-gated-mixture-of-experts-layer/)
)
- Innovative future methods spanning both temporal memory and
knowledge areas, using Bayesian products of uncertainty, might yield
intelligence.
- Bayesian products are already used implicitly in Attention (a
softmax of a weighted sum is a product of softmaxes; a small numeric
check appears at the end of this slide). Bayes is a
framework for dealing with uncertainty and practically involves
stating all assumptions and integrating over things you don't
know. Integrating over everything might seem overkill but has its
place in physics with the [path integral
formulation](https://en.wikipedia.org/wiki/Path_integral_formulation)
and in Viterbi/full-sum paths in [Hidden Markov
Models](https://en.wikipedia.org/wiki/Hidden_Markov_model).
- Or perhaps human brains are just limited. We think we might need
combined information from hierarchical structures to get real
intelligence and planning. But maybe, by simply combining superhuman
amounts of information and relying on [concentration of
measure](https://en.wikipedia.org/wiki/Concentration_of_measure) to
"simplify", we get there with no "extra modeling" required
([OpenAI_Five](https://en.wikipedia.org/wiki/OpenAI_Five)).
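- Here is the small numeric check mentioned above, for the simplest
(unweighted) case: a softmax of a sum equals the renormalized elementwise
product of the individual softmaxes.
```
# Numeric check: softmax(a + b) == normalize(softmax(a) * softmax(b)).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

lhs = softmax(a + b)
prod = softmax(a) * softmax(b)
rhs = prod / prod.sum()
print(np.allclose(lhs, rhs))  # True: adding scores multiplies "beliefs",
                              # a Bayesian-style product after renormalization
```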
================================
# Technical Details
- All work was done with a slightly modified Llama codebase:
(https://github.com/michaelbrownid/llama/tree/mpsb)
- The CPU was used for its larger memory (runs used 64GB of RAM).
- The key-value cache was removed so that gradients could be followed.
- A simple single-step gradient update was added.
- For generation, temperature was set to zero with the default
max_gen_len of 64 and max_seq_len of 128.
- November 2024
================================