<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Extending the Tiktoken BPE Tokenizer with New Tokens

- This notebook explains how we can extend an existing BPE tokenizer; specifically, we will focus on how to do it for the popular [tiktoken](https://github.com/openai/tiktoken) implementation
- For a general introduction to tokenization, please refer to [Chapter 2](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb) and the BPE from Scratch [link] tutorial
- For example, suppose we have a GPT-2 tokenizer and want to encode the following text:

In [1]:
import tiktoken

base_tokenizer = tiktoken.get_encoding("gpt2")
sample_text = "Hello, MyNewToken_1 is a new token. <|endoftext|>"

token_ids = base_tokenizer.encode(sample_text, allowed_special={"<|endoftext|>"})
print(token_ids)

[15496, 11, 2011, 3791, 30642, 62, 16, 318, 257, 649, 11241, 13, 220, 50256]


- Iterating over each token ID can give us a better understanding of how the token IDs are decoded via the vocabulary:

In [2]:
for token_id in token_ids:
    print(f"{token_id} -> {base_tokenizer.decode([token_id])}")

15496 -> Hello
11 -> ,
2011 ->  My
3791 -> New
30642 -> Token
62 -> _
16 -> 1
318 ->  is
257 ->  a
649 ->  new
11241 ->  token
13 -> .
220 ->  
50256 -> <|endoftext|>


- As we can see above, `"MyNewToken_1"` is broken down into 5 individual subword tokens -- this is normal behavior for BPE when handling unknown words
- However, suppose that it's a special token that we want to encode as a single token, similar to some of the other words or `"<|endoftext|>"`; this notebook explains how to do that

&nbsp;
## 1. Adding special tokens

- Note that we have to add new tokens as special tokens; the reason is that the new tokens have no "merges", which are only created during the tokenizer training process -- even if we had them, it would be very challenging to incorporate them without breaking the existing tokenization scheme (see the BPE from scratch notebook [link] to understand the "merges")
- Suppose we want to add 2 new tokens:

In [3]:
# Define custom tokens and their token IDs
custom_tokens = ["MyNewToken_1", "MyNewToken_2"]
custom_token_ids = {
    token: base_tokenizer.n_vocab + i for i, token in enumerate(custom_tokens)
}

- Next, we create a custom `Encoding` object that holds our special tokens as follows:

In [4]:
# Create a new Encoding object with extended tokens
extended_tokenizer = tiktoken.Encoding(
    name="gpt2_custom",
    pat_str=base_tokenizer._pat_str,
    mergeable_ranks=base_tokenizer._mergeable_ranks,
    special_tokens={**base_tokenizer._special_tokens, **custom_token_ids},
)
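
- To make the ID bookkeeping in the two cells above concrete, the following is a minimal sketch that runs without tiktoken; the hard-coded `base_n_vocab` and `base_special_tokens` values are assumptions that stand in for `base_tokenizer.n_vocab` and `base_tokenizer._special_tokens`, and the sanity checks are an addition for illustration

```python
# Assumed stand-ins for the GPT-2 tokenizer attributes used above:
# 50,257 vocabulary entries, with "<|endoftext|>" at the last ID (50256)
base_n_vocab = 50257
base_special_tokens = {"<|endoftext|>": 50256}

custom_tokens = ["MyNewToken_1", "MyNewToken_2"]

# New IDs start right after the last existing ID
custom_token_ids = {
    token: base_n_vocab + i for i, token in enumerate(custom_tokens)
}

# Sanity checks before merging the two dictionaries
assert len(set(custom_tokens)) == len(custom_tokens), "duplicate token strings"
assert not set(custom_token_ids) & set(base_special_tokens), "name collision"
assert min(custom_token_ids.values()) >= base_n_vocab, "ID collision"

# Same merge expression as in the Encoding constructor call above
merged_special_tokens = {**base_special_tokens, **custom_token_ids}
print(merged_special_tokens)
```

- The checks are cheap insurance: a new token that reuses an existing ID or name would silently overwrite an entry in the merged dictionary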

- That's it, we can now check that it can encode the sample text:

In [5]:
special_tokens_set = set(custom_tokens) | {"<|endoftext|>"}

token_ids = extended_tokenizer.encode(
    "Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>",
    allowed_special=special_tokens_set
)
print(token_ids)

[36674, 2420, 351, 220, 50257, 290, 220, 50258, 13, 220, 50256]

- As we can see, the new tokens `50257` and `50258` are now encoded in the output


- Again, we can also look at it on a per-token level:

In [6]:
for token_id in token_ids:
    print(f"{token_id} -> {extended_tokenizer.decode([token_id])}")

36674 -> Sample
2420 ->  text
351 ->  with
220 ->  
50257 -> MyNewToken_1
290 ->  and
220 ->  
50258 -> MyNewToken_2
13 -> .
220 ->  
50256 -> <|endoftext|>


- As we can see above, we have successfully updated the tokenizer
- However, to use it with a pretrained LLM, we also have to update the embedding and output layers of the LLM, which is discussed in the next section
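
- Conceptually, special tokens are matched as whole strings before the regular BPE step runs on the text in between; the pure-Python sketch below only illustrates that idea and is not tiktoken's actual implementation (the helper name `split_on_special` is hypothetical)

```python
import re

def split_on_special(text, special_tokens):
    """Split text into ordinary chunks and special-token chunks (illustrative helper)."""
    # Try longer tokens first so a token is never shadowed by a shorter prefix
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    # The capturing group makes re.split keep the matched special tokens
    return [piece for piece in re.split(pattern, text) if piece]

special = {"MyNewToken_1", "MyNewToken_2", "<|endoftext|>"}
text = "Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>"

pieces = split_on_special(text, special)
for piece in pieces:
    kind = "special " if piece in special else "ordinary"
    print(f"{kind}: {piece!r}")
```

- Note that the ordinary chunks end in a space that has no following word to attach to, which is consistent with the standalone space token `220` that appears before each special token in the output above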

&nbsp;
## 2. Updating a pretrained LLM

- In this section, we will take a look at how we have to update an existing pretrained LLM after updating the tokenizer
- For this, we are using the original pretrained GPT-2 model that is used in the main book

&nbsp;
### 2.1 Loading a pretrained GPT model

In [7]:
from llms_from_scratch.ch05 import download_and_load_gpt2
# For llms_from_scratch installation instructions, see:
# https://github.com/rasbt/LLMs-from-scratch/tree/main/pkg

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")

checkpoint: 100%|███████████████████████████| 77.0/77.0 [00:00<00:00, 34.4kiB/s]
encoder.json: 100%|███████████████████████| 1.04M/1.04M [00:00<00:00, 4.78MiB/s]
hparams.json: 100%|█████████████████████████| 90.0/90.0 [00:00<00:00, 24.7kiB/s]
model.ckpt.data-00000-of-00001: 100%|███████| 498M/498M [00:33<00:00, 14.7MiB/s]
model.ckpt.index: 100%|███████████████████| 5.21k/5.21k [00:00<00:00, 1.05MiB/s]
model.ckpt.meta: 100%|██████████████████████| 471k/471k [00:00<00:00, 2.33MiB/s]
vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:00, 2.45MiB/s]


In [8]:
from llms_from_scratch.ch04 import GPTModel
# For llms_from_scratch installation instructions, see:
# https://github.com/rasbt/LLMs-from-scratch/tree/main/pkg

GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

# Define model configurations in a dictionary for compactness
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Copy the base configuration and update with specific model settings
model_name = "gpt2-small (124M)"  # Example model name
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

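# Note: this cell only instantiates the architecture; `params` from the
# previous cell is never copied into `gpt`, so the outputs shown below come
# from randomly initialized weights. To use the pretrained weights, they would
# have to be loaded first, e.g., via `load_weights_into_gpt(gpt, params)`
# (assumed here to be available in llms_from_scratch.ch05)
gpt = GPTModel(NEW_CONFIG)
gpt.eval();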

### 2.2 Using the pretrained GPT model

- Next, consider our sample text below, which we tokenize using the original and the new tokenizer:

In [9]:
sample_text = "Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>"

original_token_ids = base_tokenizer.encode(
    sample_text, allowed_special={"<|endoftext|>"}
)

In [10]:
new_token_ids = extended_tokenizer.encode(
    "Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>",
    allowed_special=special_tokens_set
)

- Now, let's feed the original token IDs to the GPT model:

In [11]:
import torch

with torch.no_grad():
    out = gpt(torch.tensor([original_token_ids]))

print(out)

tensor([[[ 0.2204,  0.8901,  1.0138,  ...,  0.2585, -0.9192, -0.2298],
         [ 0.6745, -0.0726,  0.8218,  ..., -0.1768, -0.4217,  0.0703],
         [-0.2009,  0.0814,  0.2417,  ...,  0.3166,  0.3629,  1.3400],
         ...,
         [ 0.1137, -0.1258,  2.0193,  ..., -0.0314, -0.4288, -0.1487],
         [-1.1983, -0.2050, -0.1337,  ..., -0.0849, -0.4863, -0.1076],
         [-1.0675, -0.5905,  0.2873,  ..., -0.0979, -0.8713,  0.8415]]])


- As we can see above, this works without problems (note that, for simplicity, the code shows the raw output without converting it back into text; for more details on that, please check out the `generate` function in Chapter 5 [link], section 5.3.3)

- What happens if we try the same on the token IDs generated by the updated tokenizer now?

```python
with torch.no_grad():
    out = gpt(torch.tensor([new_token_ids]))

print(out)

...
# IndexError: index out of range in self
```

- As we can see, this results in an index error
- The reason is that the GPT model expects a fixed vocabulary size via its input embedding layer and its output layer:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/extend-tiktoken/gpt-updates.webp" width="400px">
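
- The same error can be reproduced with a tiny embedding layer; the sketch below uses toy sizes (5 rows, 3 dimensions) that are assumptions for illustration and runs on the CPU

```python
import torch

torch.manual_seed(123)

# Toy embedding layer with 5 rows, i.e., valid token IDs are 0..4
toy_emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)

# Both IDs are in range, so the lookup succeeds
print(toy_emb(torch.tensor([0, 4])).shape)

try:
    toy_emb(torch.tensor([5]))  # ID 5 has no row in the weight matrix
    lookup_failed = False
except IndexError as err:
    lookup_failed = True
    print("IndexError:", err)
```

- In the GPT-2 case, the IDs `50257` and `50258` play the role of ID 5 here: they point past the last row of the 50,257-row embedding matrix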

&nbsp;
### 2.3 Updating the embedding layer

- Let's start with updating the embedding layer
- First, notice that the embedding layer has 50,257 entries, which corresponds to the vocabulary size:

In [12]:
gpt.tok_emb

Embedding(50257, 768)

- We want to extend this embedding layer by adding 2 more entries
- In short, we create a new embedding layer with a bigger size, and then we copy over the old embedding layer values

In [13]:
num_tokens, emb_size = gpt.tok_emb.weight.shape
new_num_tokens = num_tokens + 2

# Create a new embedding layer
new_embedding = torch.nn.Embedding(new_num_tokens, emb_size)

# Copy weights from the old embedding layer
new_embedding.weight.data[:num_tokens] = gpt.tok_emb.weight.data

# Replace the old embedding layer with the new one in the model
gpt.tok_emb = new_embedding

print(gpt.tok_emb)

Embedding(50259, 768)


- As we can see above, we now have an enlarged embedding layer with 50,259 entries
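
- The rows for the new tokens keep whatever random values `torch.nn.Embedding` assigns by default; a commonly used alternative, which is not part of the code above, is to start the new rows at the mean of the existing embeddings. The helper name `extend_embedding` and the toy sizes below are assumptions for illustration

```python
import torch

def extend_embedding(old_emb, num_new_tokens):
    """Return a larger nn.Embedding whose first rows equal old_emb (illustrative helper)."""
    old_num_tokens, emb_dim = old_emb.weight.shape
    new_emb = torch.nn.Embedding(old_num_tokens + num_new_tokens, emb_dim)
    new_emb = new_emb.to(device=old_emb.weight.device, dtype=old_emb.weight.dtype)
    with torch.no_grad():
        # Keep the existing rows unchanged
        new_emb.weight[:old_num_tokens] = old_emb.weight
        # Assumption: start the new rows at the average existing embedding
        new_emb.weight[old_num_tokens:] = old_emb.weight.mean(dim=0, keepdim=True)
    return new_emb

torch.manual_seed(123)
toy_old = torch.nn.Embedding(10, 4)  # toy stand-in for gpt.tok_emb
toy_new = extend_embedding(toy_old, num_new_tokens=2)
print(toy_new)
```

- Whether mean initialization helps depends on the model and data; either way, the new rows still need to be trained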

&nbsp;
### 2.4 Updating the output layer

- Next, we have to extend the output layer, which has 50,257 output features corresponding to the vocabulary size similar to the embedding layer (by the way, you may find the bonus material, which discusses the similarity between Linear and Embedding layers in PyTorch, useful)

In [14]:
gpt.out_head

Linear(in_features=768, out_features=50257, bias=False)

- The procedure for extending the output layer is similar to extending the embedding layer:

In [15]:
original_out_features, original_in_features = gpt.out_head.weight.shape

# Define the new number of output features (e.g., adding 2 new tokens)
new_out_features = original_out_features + 2

# Create a new linear layer with the extended output size
new_linear = torch.nn.Linear(original_in_features, new_out_features)

# Copy the weights and biases from the original linear layer
with torch.no_grad():
    new_linear.weight[:original_out_features] = gpt.out_head.weight
    if gpt.out_head.bias is not None:
        new_linear.bias[:original_out_features] = gpt.out_head.bias

# Replace the original linear layer with the new one
gpt.out_head = new_linear

print(gpt.out_head)

Linear(in_features=768, out_features=50259, bias=True)
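
- Note that the printed layer now reports `bias=True`, whereas the original output layer used `bias=False`, because `torch.nn.Linear` creates a bias unless told otherwise. The following sketch shows one way to carry the original bias setting over so that the logits for the existing tokens stay unchanged; the helper name `extend_linear` and the toy sizes are assumptions for illustration

```python
import torch

def extend_linear(old_linear, num_new_outputs):
    """Return a wider nn.Linear that reproduces old_linear on the old outputs (illustrative helper)."""
    has_bias = old_linear.bias is not None
    old_out, in_features = old_linear.weight.shape
    # Mirror the original layer's bias setting instead of the default bias=True
    new_linear = torch.nn.Linear(in_features, old_out + num_new_outputs, bias=has_bias)
    with torch.no_grad():
        new_linear.weight[:old_out] = old_linear.weight
        if has_bias:
            new_linear.bias[:old_out] = old_linear.bias
    return new_linear

torch.manual_seed(123)
toy_head = torch.nn.Linear(4, 10, bias=False)  # toy stand-in for gpt.out_head
toy_head_ext = extend_linear(toy_head, num_new_outputs=2)

x = torch.randn(3, 4)
with torch.no_grad():
    # The first 10 logits should match the original layer's logits
    same_logits = torch.allclose(toy_head_ext(x)[:, :10], toy_head(x), atol=1e-6)

print(toy_head_ext)
print("Old logits preserved:", same_logits)
```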


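- Note that `torch.nn.Linear` adds a bias by default, so the new output layer has `bias=True` even though the original layer used `bias=False`; this randomly initialized bias is why the logits below differ slightly from those in section 2.2 (passing `bias=gpt.out_head.bias is not None` when creating the new layer would avoid this)
- Let's try this updated model on the original token IDs first: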

In [16]:
with torch.no_grad():
    output = gpt(torch.tensor([original_token_ids]))
print(output)

tensor([[[ 0.2267,  0.9132,  1.0494,  ..., -0.2330, -0.3008, -1.1458],
         [ 0.6808, -0.0495,  0.8574,  ...,  0.0671,  0.5572, -0.7873],
         [-0.1947,  0.1045,  0.2773,  ...,  1.3368,  0.8479, -0.9660],
         ...,
         [ 0.1200, -0.1027,  2.0549,  ..., -0.1519, -0.2096,  0.5651],
         [-1.1920, -0.1819, -0.0981,  ..., -0.1108,  0.8435, -0.3771],
         [-1.0612, -0.5674,  0.3229,  ...,  0.8383, -0.7121, -0.4850]]])


- Next, let's try it on the updated tokens:

In [17]:
with torch.no_grad():
    output = gpt(torch.tensor([new_token_ids]))
print(output)

tensor([[[ 0.2267,  0.9132,  1.0494,  ..., -0.2330, -0.3008, -1.1458],
         [ 0.6808, -0.0495,  0.8574,  ...,  0.0671,  0.5572, -0.7873],
         [-0.1947,  0.1045,  0.2773,  ...,  1.3368,  0.8479, -0.9660],
         ...,
         [-0.0656, -1.2451,  0.7957,  ..., -1.2124,  0.1044,  0.5088],
         [-1.1561, -0.7380, -0.0645,  ..., -0.4373,  1.1401, -0.3903],
         [-0.8961, -0.6437, -0.1667,  ...,  0.5663, -0.5862, -0.4020]]])


- As we can see, the model works on the extended token set
- In practice, we would now finetune (or continually pretrain) the model, specifically the new embedding and output layers, on data containing the new tokens
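
- If only the rows for the new tokens should be updated during finetuning, one possible approach (an assumption here, not something the code above does) is to mask the gradient of the pretrained rows with a tensor hook; the sketch below uses toy sizes and a placeholder loss

```python
import torch

torch.manual_seed(123)
old_vocab, new_vocab, emb_dim = 10, 12, 4  # toy sizes, not GPT-2's
emb = torch.nn.Embedding(new_vocab, emb_dim)

# 1.0 for rows that belong to new tokens, 0.0 for pretrained rows
row_mask = torch.zeros(new_vocab, 1)
row_mask[old_vocab:] = 1.0

# The hook rescales the gradient before it is stored in emb.weight.grad
emb.weight.register_hook(lambda grad: grad * row_mask)

token_ids = torch.tensor([[1, 3, 10, 11]])  # mixes old and new token IDs
loss = emb(token_ids).sum()                 # placeholder loss for illustration
loss.backward()

# Per-row gradient magnitude: zero for old rows, nonzero for new rows
print(emb.weight.grad.abs().sum(dim=1))
```

- Keep in mind that a zero gradient alone does not freeze the old rows if the optimizer applies weight decay, so that setting would need to be handled separately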

**A note about weight tying**

- If the model uses weight tying, which means that the embedding layer and output layer share the same weights, similar to Llama 3 [link], updating the output layer is much simpler
- In this case, we can simply assign the embedding layer's weights to the output layer, so that both layers share the same parameter:

In [18]:
gpt.out_head.weight = gpt.tok_emb.weight

In [19]:
with torch.no_grad():
    output = gpt(torch.tensor([new_token_ids]))
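
- To see that the assignment in the weight-tying cell shares one parameter rather than creating a copy, we can run the same assignment on two toy layers (the sizes below are assumptions for illustration)

```python
import torch

torch.manual_seed(123)
toy_emb = torch.nn.Embedding(12, 4)
toy_head = torch.nn.Linear(4, 12, bias=False)

# Tie the layers: both attributes now point to the same Parameter object
toy_head.weight = toy_emb.weight
is_shared = toy_head.weight is toy_emb.weight

# An in-place change made through one layer is visible through the other
with torch.no_grad():
    toy_emb.weight[0].fill_(1.0)

print("Shared parameter:", is_shared)
print(toy_head.weight[0])
```

- Because both layers use the same tensor, extending the embedding layer once is enough; there is no second weight matrix to resize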