tokenization

  • Tokenization is the process of taking raw text, generally represented as a Unicode string, and turning it into a sequence of integers (tokens).
  • A tokenizer is a class that implements the encode and decode methods (see the sketch after this list).
  • The vocabulary size is the number of possible tokens (integers). Try it interactively: https://tiktokenizer.vercel.app/?model=gpt-4o
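
A minimal sketch of that interface in Python; the `Tokenizer` class name and abstract-base-class structure here are illustrative, not any particular library's API:

```python
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """A tokenizer maps Unicode strings to token IDs and back."""

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Turn a Unicode string into a sequence of integers (token IDs)."""

    @abstractmethod
    def decode(self, ids: list[int]) -> str:
        """Turn a sequence of token IDs back into a Unicode string."""
```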

Character-based tokenization

  • This method allocates one slot in the vocabulary for every Unicode character, using its code point as the token ID (see the sketch after this list).
  • Some characters appear far more frequently than others, and some code points are very large numbers (e.g., ord("🌍") == 127757).
  • This is not an effective way to use your vocabulary budget.
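
A minimal character-level sketch; the `CharacterTokenizer` name is hypothetical:

```python
class CharacterTokenizer:
    """Each Unicode code point is its own token ID (a toy sketch)."""

    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(chr(i) for i in ids)


tok = CharacterTokenizer()
print(tok.encode("Hi 🌍"))              # [72, 105, 32, 127757] -- note the huge code point
print(tok.decode(tok.encode("Hi 🌍")))  # "Hi 🌍"
```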

Byte-based tokenization

  • A Unicode string can be represented as a sequence of bytes, i.e., integers between 0 and 255 (one byte consists of 8 bits).
  • The most common Unicode encoding is UTF-8, a variable-length character encoding widely used for transmitting and storing text on the internet.
  • The compression ratio (bytes per token) is exactly 1, which is terrible: the token sequence is as long as the raw byte sequence (see the sketch after this list).
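
A minimal byte-level sketch, assuming UTF-8; the `ByteTokenizer` name is illustrative:

```python
class ByteTokenizer:
    """Every UTF-8 byte is a token, so the vocabulary size is just 256."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


tok = ByteTokenizer()
ids = tok.encode("Hi 🌍")
print(ids)              # [72, 105, 32, 240, 159, 140, 141] -- 🌍 alone takes 4 bytes
print(tok.decode(ids))  # "Hi 🌍"
```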

Word-based tokenization

  • The number of words is huge (as with Unicode characters).
  • Many words are rare, so the model won't learn much about them.
  • New words not seen during training get mapped to a special UNK token, which is ugly and can distort perplexity (PPL); a sketch follows this list.
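
A minimal word-level sketch that splits on whitespace and reserves an ID for out-of-vocabulary words; the class name and regex are illustrative choices, not a standard:

```python
import re


class WordTokenizer:
    """Whitespace-split words; unknown words map to a reserved UNK ID (a toy sketch)."""

    def __init__(self, corpus: str):
        words = sorted(set(re.findall(r"\S+", corpus)))
        self.unk_id = 0  # reserve ID 0 for out-of-vocabulary words
        self.word_to_id = {w: i + 1 for i, w in enumerate(words)}
        self.id_to_word = {i: w for w, i in self.word_to_id.items()}

    def encode(self, text: str) -> list[int]:
        return [self.word_to_id.get(w, self.unk_id) for w in re.findall(r"\S+", text)]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.id_to_word.get(i, "<UNK>") for i in ids)


tok = WordTokenizer("the cat sat on the mat")
print(tok.encode("the dog sat"))  # [5, 0, 4] -- "dog" was never seen, so it becomes UNK
```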

Byte Pair Encoding (BPE)

  • Basic idea: train the tokenizer on raw text to automatically determine the vocabulary (see the sketch after this list).
  • Common sequences of characters are represented by a single token; rare sequences are represented by many tokens.
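
A toy sketch of the core BPE training loop over UTF-8 bytes; the `train_bpe` name and structure are illustrative, and real implementations (e.g., subword-nmt, tiktoken) add pre-tokenization and many optimizations:

```python
from collections import Counter


def train_bpe(text: str, num_merges: int) -> list[tuple[int, int]]:
    """Start from the 256 byte tokens and repeatedly merge the most
    frequent adjacent pair into a new token ID (a toy sketch)."""
    ids = list(text.encode("utf-8"))
    merges = []
    next_id = 256  # IDs 0-255 are the raw bytes
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the pair with the new token ID.
        new_ids, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                new_ids.append(next_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids = new_ids
        next_id += 1
    return merges


merges = train_bpe("low lower lowest", num_merges=3)
print(merges)  # first merge is (108, 111), the bytes of 'l' and 'o'
```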

https://huggingface.co/learn/llm-course/zh-CN/chapter6/5
https://github.com/rsennrich/subword-nmt