TokenTextSplitter Class — langchain Architecture
Architecture documentation for the TokenTextSplitter class in base.py from the langchain codebase.
Entity Profile
Dependency Diagram
```mermaid
graph TD
  TokenTextSplitter["TokenTextSplitter"]
  TextSplitter["TextSplitter"]
  TokenTextSplitter -->|extends| TextSplitter
  base_py["base.py"]
  TokenTextSplitter -->|defined in| base_py
  init_method["__init__()"]
  TokenTextSplitter -->|method| init_method
  split_text_method["split_text()"]
  TokenTextSplitter -->|method| split_text_method
```
Source Code
libs/text-splitters/langchain_text_splitters/base.py lines 298–369
```python
class TokenTextSplitter(TextSplitter):
    """Splitting text to tokens using model tokenizer."""

    def __init__(
        self,
        encoding_name: str = "gpt2",
        model_name: str | None = None,
        allowed_special: Literal["all"] | AbstractSet[str] = set(),
        disallowed_special: Literal["all"] | Collection[str] = "all",
        **kwargs: Any,
    ) -> None:
        """Create a new `TextSplitter`.

        Args:
            encoding_name: The name of the tiktoken encoding to use.
            model_name: The name of the model to use.
                If provided, this will override the `encoding_name`.
            allowed_special: Special tokens that are allowed during encoding.
            disallowed_special: Special tokens that are disallowed during encoding.

        Raises:
            ImportError: If the tiktoken package is not installed.
        """
        super().__init__(**kwargs)
        if not _HAS_TIKTOKEN:
            msg = (
                "Could not import tiktoken python package. "
                "This is needed for TokenTextSplitter. "
                "Please install it with `pip install tiktoken`."
            )
            raise ImportError(msg)
        if model_name is not None:
            enc = tiktoken.encoding_for_model(model_name)
        else:
            enc = tiktoken.get_encoding(encoding_name)
        self._tokenizer = enc
        self._allowed_special = allowed_special
        self._disallowed_special = disallowed_special

    def split_text(self, text: str) -> list[str]:
        """Splits the input text into smaller chunks based on tokenization.

        This method uses a custom tokenizer configuration to encode the input
        text into tokens, processes the tokens in chunks of a specified size
        with overlap, and decodes them back into text chunks. The splitting is
        performed using the `split_text_on_tokens` function.

        Args:
            text: The input text to be split into smaller chunks.

        Returns:
            A list of text chunks, where each chunk is derived from a portion
            of the input text based on the tokenization and chunking rules.
        """

        def _encode(_text: str) -> list[int]:
            return self._tokenizer.encode(
                _text,
                allowed_special=self._allowed_special,
                disallowed_special=self._disallowed_special,
            )

        tokenizer = Tokenizer(
            chunk_overlap=self._chunk_overlap,
            tokens_per_chunk=self._chunk_size,
            decode=self._tokenizer.decode,
            encode=_encode,
        )
        return split_text_on_tokens(text=text, tokenizer=tokenizer)
```
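The chunking that `split_text` delegates to `split_text_on_tokens` amounts to a sliding window over the encoded token ids: each chunk holds at most `tokens_per_chunk` ids, and the window advances by `tokens_per_chunk - chunk_overlap` so consecutive chunks share `chunk_overlap` ids. A simplified sketch of that windowing (the real helper additionally decodes each window back to text, and the function name here is illustrative):

```python
def split_tokens_with_overlap(
    token_ids: list[int], tokens_per_chunk: int, chunk_overlap: int
) -> list[list[int]]:
    """Window over token ids: each chunk has up to tokens_per_chunk ids,
    and consecutive chunks share chunk_overlap ids."""
    chunks = []
    start = 0
    step = tokens_per_chunk - chunk_overlap  # how far the window advances
    while start < len(token_ids):
        chunks.append(token_ids[start : start + tokens_per_chunk])
        start += step
    return chunks


# With 10 ids, chunks of 4, and overlap of 1, the window advances 3 ids at a time:
print(split_tokens_with_overlap(list(range(10)), tokens_per_chunk=4, chunk_overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Note the short trailing chunk: the loop keeps emitting windows until the start index passes the end of the id list, so the last chunk may be smaller than `tokens_per_chunk`.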
Frequently Asked Questions
What is the TokenTextSplitter class?
TokenTextSplitter is a text splitter in the langchain codebase that chunks text by token count using a tiktoken tokenizer. It is defined in libs/text-splitters/langchain_text_splitters/base.py.
Where is TokenTextSplitter defined?
TokenTextSplitter is defined in libs/text-splitters/langchain_text_splitters/base.py at line 298.
What does TokenTextSplitter extend?
TokenTextSplitter extends TextSplitter.
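The division of labor in that inheritance is that the base class owns the chunk-sizing configuration (`_chunk_size`, `_chunk_overlap`) while the subclass supplies tokenization. A toy sketch of the same pattern (these classes are illustrative stand-ins, not the real langchain ones; whitespace splitting stands in for tiktoken encoding):

```python
class ToyTextSplitter:
    """Stand-in for TextSplitter: holds chunk sizing shared by all splitters."""

    def __init__(self, chunk_size: int = 4, chunk_overlap: int = 1) -> None:
        self._chunk_size = chunk_size
        self._chunk_overlap = chunk_overlap


class ToyTokenSplitter(ToyTextSplitter):
    """Stand-in for TokenTextSplitter: adds encode/decode and windowed splitting."""

    def encode(self, text: str) -> list[str]:
        return text.split()  # toy tokenizer in place of tiktoken

    def decode(self, tokens: list[str]) -> str:
        return " ".join(tokens)

    def split_text(self, text: str) -> list[str]:
        tokens = self.encode(text)
        step = self._chunk_size - self._chunk_overlap
        return [
            self.decode(tokens[i : i + self._chunk_size])
            for i in range(0, len(tokens), step)
        ]


splitter = ToyTokenSplitter(chunk_size=3, chunk_overlap=1)
print(splitter.split_text("a b c d e"))
# → ['a b c', 'c d e', 'e']
```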