TokenTextSplitter Class — langchain Architecture
Architecture documentation for the TokenTextSplitter class in base.py from the langchain codebase.
Entity Profile
Dependency Diagram
```mermaid
graph TD
  TokenTextSplitter["TokenTextSplitter"]
  TextSplitter["TextSplitter"]
  TokenTextSplitter -->|extends| TextSplitter
  base_py["base.py"]
  TokenTextSplitter -->|defined in| base_py
  init_method["__init__()"]
  TokenTextSplitter -->|method| init_method
  split_text_method["split_text()"]
  TokenTextSplitter -->|method| split_text_method
```
Source Code
libs/text-splitters/langchain_text_splitters/base.py lines 298–369
```python
class TokenTextSplitter(TextSplitter):
    """Splitting text to tokens using model tokenizer."""

    def __init__(
        self,
        encoding_name: str = "gpt2",
        model_name: str | None = None,
        allowed_special: Literal["all"] | AbstractSet[str] = set(),
        disallowed_special: Literal["all"] | Collection[str] = "all",
        **kwargs: Any,
    ) -> None:
        """Create a new `TextSplitter`.

        Args:
            encoding_name: The name of the tiktoken encoding to use.
            model_name: The name of the model to use.
                If provided, this will override the `encoding_name`.
            allowed_special: Special tokens that are allowed during encoding.
            disallowed_special: Special tokens that are disallowed during encoding.

        Raises:
            ImportError: If the tiktoken package is not installed.
        """
        super().__init__(**kwargs)
        if not _HAS_TIKTOKEN:
            msg = (
                "Could not import tiktoken python package. "
                "This is needed for TokenTextSplitter. "
                "Please install it with `pip install tiktoken`."
            )
            raise ImportError(msg)
        if model_name is not None:
            enc = tiktoken.encoding_for_model(model_name)
        else:
            enc = tiktoken.get_encoding(encoding_name)
        self._tokenizer = enc
        self._allowed_special = allowed_special
        self._disallowed_special = disallowed_special

    def split_text(self, text: str) -> list[str]:
        """Splits the input text into smaller chunks based on tokenization.

        This method uses a custom tokenizer configuration to encode the input
        text into tokens, processes the tokens in chunks of a specified size
        with overlap, and decodes them back into text chunks. The splitting is
        performed using the `split_text_on_tokens` function.

        Args:
            text: The input text to be split into smaller chunks.

        Returns:
            A list of text chunks, where each chunk is derived from a portion
            of the input text based on the tokenization and chunking rules.
        """

        def _encode(_text: str) -> list[int]:
            return self._tokenizer.encode(
                _text,
                allowed_special=self._allowed_special,
                disallowed_special=self._disallowed_special,
            )

        tokenizer = Tokenizer(
            chunk_overlap=self._chunk_overlap,
            tokens_per_chunk=self._chunk_size,
            decode=self._tokenizer.decode,
            encode=_encode,
        )
        return split_text_on_tokens(text=text, tokenizer=tokenizer)
```
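The chunking that `split_text` delegates to `split_text_on_tokens` amounts to a sliding window over the encoded token ids: each chunk holds at most `tokens_per_chunk` ids, and the window advances by `tokens_per_chunk - chunk_overlap` so consecutive chunks share `chunk_overlap` ids. A simplified sketch of that windowing (the real helper additionally decodes each window back to text, and the function name here is illustrative):

```python
def split_tokens_with_overlap(
    token_ids: list[int], tokens_per_chunk: int, chunk_overlap: int
) -> list[list[int]]:
    """Window over token ids: each chunk has up to tokens_per_chunk ids,
    and consecutive chunks share chunk_overlap ids."""
    chunks = []
    start = 0
    step = tokens_per_chunk - chunk_overlap  # how far the window advances
    while start < len(token_ids):
        chunks.append(token_ids[start : start + tokens_per_chunk])
        start += step
    return chunks


# With 10 ids, chunks of 4, and overlap of 1, the window advances 3 ids at a time:
print(split_tokens_with_overlap(list(range(10)), tokens_per_chunk=4, chunk_overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Note the short trailing chunk: the loop keeps emitting windows until the start index passes the end of the id list, so the last chunk may be smaller than `tokens_per_chunk`.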
Frequently Asked Questions
What is the TokenTextSplitter class?
TokenTextSplitter is a text splitter in the langchain codebase that chunks text by token count using a tiktoken tokenizer. It is defined in libs/text-splitters/langchain_text_splitters/base.py.
Where is TokenTextSplitter defined?
TokenTextSplitter is defined in libs/text-splitters/langchain_text_splitters/base.py at line 298.
What does TokenTextSplitter extend?
TokenTextSplitter extends TextSplitter.
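The division of labor in that inheritance is that the base class owns the chunk-sizing configuration (`_chunk_size`, `_chunk_overlap`) while the subclass supplies tokenization. A toy sketch of the same pattern (these classes are illustrative stand-ins, not the real langchain ones; whitespace splitting stands in for tiktoken encoding):

```python
class ToyTextSplitter:
    """Stand-in for TextSplitter: holds chunk sizing shared by all splitters."""

    def __init__(self, chunk_size: int = 4, chunk_overlap: int = 1) -> None:
        self._chunk_size = chunk_size
        self._chunk_overlap = chunk_overlap


class ToyTokenSplitter(ToyTextSplitter):
    """Stand-in for TokenTextSplitter: adds encode/decode and windowed splitting."""

    def encode(self, text: str) -> list[str]:
        return text.split()  # toy tokenizer in place of tiktoken

    def decode(self, tokens: list[str]) -> str:
        return " ".join(tokens)

    def split_text(self, text: str) -> list[str]:
        tokens = self.encode(text)
        step = self._chunk_size - self._chunk_overlap
        return [
            self.decode(tokens[i : i + self._chunk_size])
            for i in range(0, len(tokens), step)
        ]


splitter = ToyTokenSplitter(chunk_size=3, chunk_overlap=1)
print(splitter.split_text("a b c d e"))
# → ['a b c', 'c d e', 'e']
```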