Byte pair encoding

Byte pair encoding[1][2] (also known as digram coding)[3] is an algorithm, first described in 1994 by Philip Gage, for encoding strings of text into smaller strings by creating and using a translation table.[4] A slightly modified version of the algorithm is used in large language model tokenizers.

The original version of the algorithm focused on compression: it repeatedly replaces the highest-frequency pair of bytes with a new byte value that does not occur in the data. A lookup table of the replacements is required to rebuild the original data.
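
A minimal sketch of this compression scheme in Python may help (the function names are this sketch's own; Gage's original C implementation differs in detail, e.g. it processes data block-wise):

    from collections import Counter

    def bpe_compress(data: bytes) -> tuple[bytes, dict[int, tuple[int, int]]]:
        """Repeatedly merge the most frequent byte pair into an unused byte value."""
        symbols = list(data)
        table: dict[int, tuple[int, int]] = {}
        unused = [b for b in range(256) if b not in set(symbols)]
        while unused:
            pairs = Counter(zip(symbols, symbols[1:]))
            if not pairs:
                break
            pair, count = pairs.most_common(1)[0]
            if count < 2:  # no pair repeats, so merging cannot shrink the data
                break
            new_byte = unused.pop()
            table[new_byte] = pair
            out, i = [], 0
            while i < len(symbols):  # greedy left-to-right replacement
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(new_byte)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
        return bytes(symbols), table

    def bpe_decompress(data: bytes, table: dict[int, tuple[int, int]]) -> bytes:
        """Undo the replacements in reverse order of creation."""
        symbols = list(data)
        for new_byte, pair in reversed(list(table.items())):
            symbols = [x for b in symbols for x in (pair if b == new_byte else (b,))]
        return bytes(symbols)

For example, bpe_compress(b"aaabdaaabac") first merges the most frequent pair "aa" into a placeholder byte and continues until no pair occurs twice; bpe_decompress then restores the original bytes from the table.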

The modified version builds "tokens" (units of recognition) that match varying amounts of source text, from single characters (including single digits or single punctuation marks) to whole words (even long compound words).[5][6][7] The modified tokenization algorithm initially treats the set of unique characters as 1-character-long n-grams (the initial tokens). Then, successively, the most frequent pair of adjacent tokens is merged into a new, longer n-gram and all instances of the pair are replaced by this new token. This is repeated until a vocabulary of prescribed size is obtained. Note that new words can always be constructed from final vocabulary tokens and initial-set characters.[8] This algorithmic approach has been extended from spoken language to sign language in recent years.[9]
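
The training loop can be sketched in a few lines of Python (function and variable names here are illustrative, not taken from any particular library):

    from collections import Counter

    def learn_bpe(corpus: list[str], vocab_size: int) -> list[tuple[str, str]]:
        """Learn merge rules until the vocabulary reaches vocab_size."""
        # Each word starts as a sequence of 1-character tokens, weighted by frequency.
        words = Counter(tuple(word) for word in corpus)
        vocab = {ch for word in words for ch in word}
        merges: list[tuple[str, str]] = []
        while len(vocab) < vocab_size:
            # Count every adjacent token pair across the corpus.
            pairs = Counter()
            for word, freq in words.items():
                for pair in zip(word, word[1:]):
                    pairs[pair] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # most frequent adjacent pair
            merges.append(best)
            merged = best[0] + best[1]
            vocab.add(merged)
            # Replace every occurrence of the pair with the new, longer token.
            new_words = Counter()
            for word, freq in words.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(merged)
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                new_words[tuple(out)] += freq
            words = new_words
        return merges

On the toy corpus "low lower lowest newest widest".split(), for instance, the earliest merges join frequent character pairs such as ("l", "o"), gradually building subword tokens like "low" and "est".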

All the unique tokens found in a corpus are listed in a token vocabulary, the size of which, in the case of GPT-3.5 and GPT-4, is 100256.
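
As an illustration (a sketch assuming OpenAI's open-source tiktoken library is installed; note that the size it reports also counts the models' special tokens):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5 and GPT-4
    print(enc.n_vocab)                          # vocabulary size, special tokens included
    print(enc.encode("Byte pair encoding"))     # token IDs for a sample string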

The modified algorithm is effective for tokenization because it has low computational overhead and, for a fixed set of learned merges, tokenizes the same text the same way every time.

  1. ^ Gage, Philip (1994). "A New Algorithm for Data Compression". The C User Journal.
  2. ^ "A New Algorithm for Data Compression". Dr. Dobb's Journal. 1 February 1994. Retrieved 10 August 2020.
  3. ^ Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1994). Managing Gigabytes. New York: Van Nostrand Reinhold. ISBN 978-0-442-01863-4.
  4. ^ "Byte Pair Encoding". Archived from the original on 2016-03-26.
  5. ^ Sennrich, Rico; Birch, Alexandra; Haddow, Barry (2015-08-31). "Neural Machine Translation of Rare Words with Subword Units". arXiv:1508.07909 [cs.CL].
  6. ^ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini (2020-06-04). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
  7. ^ "google/sentencepiece". Google. 2021-03-02. Retrieved 2021-03-02.
  8. ^ Paaß, Gerhard; Giesselbach, Sven (2022). "Pre-trained Language Models". Foundation Models for Natural Language Processing. Artificial Intelligence: Foundations, Theory, and Algorithms. pp. 19–78. doi:10.1007/978-3-031-23190-2_2. ISBN 9783031231902. Retrieved 3 August 2023.
  9. ^ Miyazaki, Taro; Tan, Sihan; Uchida, Tsubasa; Kaneko, Hiroyuki (May 25, 2024). "Sign Language Translation with Gloss Pair Encoding" (PDF). Proceedings of the 11th Workshop on the Representation and Processing of Sign Languages.