In this article, we're diving into the world of generative AI, focusing on a core concept called tokenization. Whether you're a curious learner, a tech enthusiast, or someone just trying to keep up with the rapid advancements in AI, this piece is for you.
We'll explore what tokenization is, why it's important, and how it impacts the functioning and costs of AI models like ChatGPT.
With my experience in breaking down complex tech topics, I'll ensure that you come away with a solid understanding of tokenization and its role in AI, without needing a degree in computer science.
In this article, we'll cover:
How tokenization breaks down a prompt
How a real prompt is tokenized, step by step
What a prompt costs
How Tokenization Breaks Down a Prompt
Tokenization, in the realm of Artificial Intelligence (AI), refers to the process of converting input text into smaller units, or ‘tokens’, such as words or subwords. This is foundational for Natural Language Processing (NLP) tasks, enabling AI to analyze and understand human language (Iguazio).
In other words, tokenization is like chopping a log into kindling: it breaks down text into smaller, bite-sized pieces that the AI can understand and work with.
Breaking Down Text into Tokens
Basic Units
In tokenization, the text is broken down into smaller pieces. These pieces can be words, parts of words, or even individual characters, depending on the tokenization approach.
Subword Tokenization
Many modern LLMs use subword tokenization, where common words might be kept as whole tokens, but less common words are broken down into smaller, more frequently occurring pieces.
This approach balances the vocabulary size and the ability to handle rare or unknown words.
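Let's make this concrete. Below is a minimal sketch using OpenAI's open-source tiktoken library (the tokenizer family behind ChatGPT-era models); assuming you have it installed (pip install tiktoken), it shows how a sentence gets split into subword pieces.

```python
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4 era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization breaks uncommon words into smaller pieces."
token_ids = enc.encode(text)

# Decode each ID individually to reveal the subword pieces.
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)  # common words stay whole; rarer words split -- exact splits vary by vocabulary
```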
Vocabulary Mapping
Pre-defined Vocabulary
LLMs have a pre-defined vocabulary list. Each unique token in this list is assigned a specific numeric ID.
Out-of-Vocabulary Words
Words or phrases not in the vocabulary are handled using special tokens or by breaking them down into smaller tokens that are in the vocabulary.
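Here's a toy, word-level illustration of vocabulary mapping. The vocabulary and the [UNK] fallback below are made up for demonstration; real LLM vocabularies contain tens of thousands of learned subword tokens.

```python
# Toy word-level vocabulary (illustrative only).
vocab = {"[UNK]": 0, "if": 1, "a": 2, "human": 3, "what": 4, "would": 5}

def encode(text: str) -> list[int]:
    # Look each lowercased word up in the vocabulary; words that
    # aren't present fall back to the special [UNK] token's ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

print(encode("if a burrito ate a human"))  # [1, 2, 0, 0, 2, 3] -- 'burrito' and 'ate' become [UNK]
```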
Encoding and Embedding
Encoding
Once the text is tokenized, the tokens are converted into their corresponding numeric IDs based on the model's vocabulary.
Embeddings
These numeric IDs are then transformed into embeddings, which are high-dimensional vectors.
Embeddings capture semantic and syntactic information about the tokens, allowing the model to understand the context and relationships between words.
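Conceptually, an embedding layer is just a big lookup table: one learned vector per vocabulary ID. Here's a sketch where a random NumPy matrix stands in for the learned weights (in a trained model, these values are exactly what training adjusts).

```python
import numpy as np

vocab_size, embedding_dim = 50_000, 768  # typical orders of magnitude
rng = np.random.default_rng(0)

# Random values stand in for learned weights in this sketch.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

token_ids = [2746, 264, 1989]         # some example token IDs
vectors = embedding_table[token_ids]  # lookup: one row (vector) per token

print(vectors.shape)  # (3, 768) -- three tokens, each a 768-dimensional vector
```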
Contextual Understanding
Input to Model
The embeddings are fed into the neural network layers of the LLM. The model processes these embeddings to understand the context and meaning of the text.
Attention Mechanisms
Techniques like attention mechanisms (especially in models like transformers) help the model focus on relevant parts of the input for a given task, further enhancing understanding and generating appropriate responses.
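At its core, attention boils down to a short computation: each token's "query" is compared against every token's "key", and the resulting weights decide how much of each token's "value" to mix in. A minimal NumPy sketch of scaled dot-product attention:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Compare every query against every key, scale by sqrt(d_k),
    # then normalize each row into attention weights (softmax).
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # tiny toy dimensions
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```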
Handling Special Tokens
Special Purpose Tokens
LLMs often use special tokens for specific purposes, like marking the beginning and end of a sentence, padding, and handling unknown words.
Role in Processing
These tokens play a role in helping the model understand the structure of the input text and manage varying lengths of input sequences.
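Here's a toy sketch of that idea: wrapping sequences in boundary markers and padding them to a common length. The token names are illustrative; different models use different conventions ([CLS]/[SEP] in BERT, for example).

```python
START, END, PAD = "[START]", "[END]", "[PAD]"

def prepare_batch(sequences: list[list[str]]) -> list[list[str]]:
    # Mark the boundaries of each sequence, then pad the shorter
    # ones so the whole batch has a uniform length.
    wrapped = [[START] + seq + [END] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [PAD] * (max_len - len(seq)) for seq in wrapped]

for seq in prepare_batch([["hello", "world"], ["one", "more", "token"]]):
    print(seq)
# ['[START]', 'hello', 'world', '[END]', '[PAD]']
# ['[START]', 'one', 'more', 'token', '[END]']
```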
A Real-World Prompt
Now let's take a real-world prompt and walk through the tokenization process step by step.
Initial Text Input
The prompt "If a burrito ate a human, what would it taste like?" is our starting point.
This is the raw text that we want the language model to process.
Breaking Down into Tokens
In this step, the sentence is split into smaller units. Depending on the tokenization method, these could be words, subwords, or characters. For simplicity, let's assume a basic word-level tokenization.
If the model uses subword tokenization, more common words might remain as they are, while less common words are broken down further. For example, in our prompt, “burrito” is split into two subwords: “bur” and “rito”.
Mapping to Vocabulary IDs
Each token is then mapped to a unique ID based on the model's vocabulary. For instance, if the vocabulary has the word "If" as ID 2746, "a" as ID 264, and so on, the tokens are converted to these IDs.
Our sentence becomes a series of numbers, each representing a word or subword.
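You can reproduce this step yourself with tiktoken. Note that the exact IDs depend on which tokenizer you pick, so they may differ from the illustrative IDs above.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 era tokenizer
prompt = "If a burrito ate a human, what would it taste like?"

# Print each numeric ID next to the text piece it represents.
for token_id in enc.encode(prompt):
    print(token_id, repr(enc.decode([token_id])))
```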
Handling Out-of-Vocabulary Words
If a word in the sentence doesn't exist in the model's vocabulary as a whole token, it is handled using special techniques: it might be replaced with a special token like [UNK] (unknown), or broken down into known subwords or characters, as we saw with “burrito”.
Adding Special Tokens
Depending on the model, special tokens might be added. For example, tokens indicating the start and end of a sentence, like [START] and [END], or [CLS] and [SEP] in some models.
This helps the model understand sentence boundaries. In my example, the trailing space I typed at the end of the sentence even gets its own token (ID 220).
Encoding and Embeddings
The numeric IDs are then converted into embeddings. These are high-dimensional vectors that represent each token in a way that captures semantic and syntactic information, allowing the model to process the text meaningfully.
Feeding into the Language Model
Finally, these embeddings are fed into the language model. The model, through its layers and mechanisms (like attention in transformer models), processes these embeddings, understanding the context and relationships to generate a response or perform the desired task.
The cost of a prompt
So how much does it cost to get a result from a prompt? A helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text, or roughly ¾ of a word (so 100 tokens ≈ 75 words).
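That rule of thumb makes a quick back-of-the-envelope estimator easy to write (for an exact count, you'd run the model's actual tokenizer):

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb for common English text: ~4 characters per token.
    return round(len(text) / 4)

prompt = "If a burrito ate a human, what would it taste like?"
print(estimate_tokens(prompt))  # ~13 tokens for this 51-character prompt
```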
Cost
According to OpenAI’s pricing below…
[Image: OpenAI pricing table]
If I take my input…
[Image: my input prompt]
And ChatGPT’s sassy output…
[Image: ChatGPT’s response]
…this would result in a total cost of $4.59.
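If you want to run the numbers yourself, the arithmetic looks like this. The rates below are placeholders I made up for illustration; check OpenAI's pricing page for the real, current figures, since they change and vary by model.

```python
# Hypothetical per-1,000-token rates, for illustration only.
INPUT_RATE_PER_1K = 0.01   # $ per 1,000 input (prompt) tokens
OUTPUT_RATE_PER_1K = 0.03  # $ per 1,000 output (completion) tokens

def prompt_cost(input_tokens: int, output_tokens: int) -> float:
    # Input and output tokens are billed at separate rates.
    return (input_tokens / 1000) * INPUT_RATE_PER_1K + (
        output_tokens / 1000
    ) * OUTPUT_RATE_PER_1K

# e.g. a 14-token prompt that produces a 120-token reply:
print(f"${prompt_cost(14, 120):.4f}")  # $0.0037 with these placeholder rates
```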
Conclusion
The world of generative AI is fascinating and complex, but understanding its core concepts like tokenization doesn't have to be daunting. By grasping the basics of how AI models process language, you'll be better equipped to appreciate the intricacies of AI communication and the factors that influence its efficiency and costs.
Remember, the journey into AI literacy is ongoing, and understanding tokenization is a significant step in demystifying this revolutionary technology. Stay curious, and don't hesitate to explore further!