SentenceTransformersTokenTextSplitter Class — langchain Architecture
Architecture documentation for the SentenceTransformersTokenTextSplitter class in sentence_transformers.py from the langchain codebase.
Dependency Diagram
graph TD
    059dfb7c_30ac_164c_5a3e_708a02d51601["SentenceTransformersTokenTextSplitter"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8["TextSplitter"]
    059dfb7c_30ac_164c_5a3e_708a02d51601 -->|extends| c86e37d5_f962_cc1e_9821_b665e1359ae8
    7a1ee38d_b22f_3305_565e_328c5832dd13["sentence_transformers.py"]
    059dfb7c_30ac_164c_5a3e_708a02d51601 -->|defined in| 7a1ee38d_b22f_3305_565e_328c5832dd13
    25780884_cd03_8085_93f5_98f4900e872a["__init__()"]
    059dfb7c_30ac_164c_5a3e_708a02d51601 -->|method| 25780884_cd03_8085_93f5_98f4900e872a
    33e23f92_388e_6a33_2cb1_ba80be2a53fa["_initialize_chunk_configuration()"]
    059dfb7c_30ac_164c_5a3e_708a02d51601 -->|method| 33e23f92_388e_6a33_2cb1_ba80be2a53fa
    a7a0dc6a_7652_b658_2bb9_d850d67979ca["split_text()"]
    059dfb7c_30ac_164c_5a3e_708a02d51601 -->|method| a7a0dc6a_7652_b658_2bb9_d850d67979ca
    0d40cab5_841b_edff_d310_8ac26d084015["count_tokens()"]
    059dfb7c_30ac_164c_5a3e_708a02d51601 -->|method| 0d40cab5_841b_edff_d310_8ac26d084015
    0172994e_4917_2bec_356d_ac072e832565["_encode()"]
    059dfb7c_30ac_164c_5a3e_708a02d51601 -->|method| 0172994e_4917_2bec_356d_ac072e832565
Source Code
libs/text-splitters/langchain_text_splitters/sentence_transformers.py lines 20–123
class SentenceTransformersTokenTextSplitter(TextSplitter):
    """Splitting text to tokens using sentence model tokenizer."""

    def __init__(
        self,
        chunk_overlap: int = 50,
        model_name: str = "sentence-transformers/all-mpnet-base-v2",
        tokens_per_chunk: int | None = None,
        **kwargs: Any,
    ) -> None:
        """Create a new `TextSplitter`.

        Args:
            chunk_overlap: The number of tokens to overlap between chunks.
            model_name: The name of the sentence transformer model to use.
            tokens_per_chunk: The number of tokens per chunk.
                If `None`, uses the maximum tokens allowed by the model.

        Raises:
            ImportError: If the `sentence_transformers` package is not installed.
        """
        super().__init__(**kwargs, chunk_overlap=chunk_overlap)

        if not _HAS_SENTENCE_TRANSFORMERS:
            msg = (
                "Could not import sentence_transformers python package. "
                "This is needed in order to use SentenceTransformersTokenTextSplitter. "
                "Please install it with `pip install sentence-transformers`."
            )
            raise ImportError(msg)

        self.model_name = model_name
        self._model = SentenceTransformer(self.model_name)
        self.tokenizer = self._model.tokenizer
        self._initialize_chunk_configuration(tokens_per_chunk=tokens_per_chunk)

    def _initialize_chunk_configuration(self, *, tokens_per_chunk: int | None) -> None:
        self.maximum_tokens_per_chunk = self._model.max_seq_length

        if tokens_per_chunk is None:
            self.tokens_per_chunk = self.maximum_tokens_per_chunk
        else:
            self.tokens_per_chunk = tokens_per_chunk

        if self.tokens_per_chunk > self.maximum_tokens_per_chunk:
            msg = (
                f"The token limit of the models '{self.model_name}'"
                f" is: {self.maximum_tokens_per_chunk}."
                f" Argument tokens_per_chunk={self.tokens_per_chunk}"
                f" > maximum token limit."
            )
            raise ValueError(msg)

    def split_text(self, text: str) -> list[str]:
        """Splits the input text into smaller components by splitting text on tokens.

        This method encodes the input text using a private `_encode` method, then
        strips the start and stop token IDs from the encoded result. It returns the
        processed segments as a list of strings.

        Args:
            text: The input text to be split.

        Returns:
            A list of string components derived from the input text after encoding and
            processing.
        """

        def encode_strip_start_and_stop_token_ids(text: str) -> list[int]:
            return self._encode(text)[1:-1]

        tokenizer = Tokenizer(
            chunk_overlap=self._chunk_overlap,
            tokens_per_chunk=self.tokens_per_chunk,
            decode=self.tokenizer.decode,
            encode=encode_strip_start_and_stop_token_ids,
        )

        return split_text_on_tokens(text=text, tokenizer=tokenizer)
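
The listing above ends by handing the text to split_text_on_tokens, whose body is not part of this excerpt. The sketch below illustrates the general idea of token-window chunking with overlap, using a toy whitespace tokenizer so that it runs without any model. It is a simplified stand-in under stated assumptions, not the library's implementation: window_token_ids, toy_encode, toy_decode, toy_split, and the START_ID/STOP_ID markers are all hypothetical names introduced for illustration.

```python
def window_token_ids(
    token_ids: list, tokens_per_chunk: int, chunk_overlap: int
) -> list:
    """Slice token IDs into windows of at most `tokens_per_chunk` IDs.

    Consecutive windows share `chunk_overlap` IDs.
    """
    if chunk_overlap >= tokens_per_chunk:
        # Guard added for this sketch: the stride below must be positive.
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    stride = tokens_per_chunk - chunk_overlap
    windows = []
    start = 0
    while start < len(token_ids):
        end = min(start + tokens_per_chunk, len(token_ids))
        windows.append(token_ids[start:end])
        if end == len(token_ids):
            break  # the last window reached the end of the input
        start += stride
    return windows


# Toy vocabulary standing in for a real model tokenizer (hypothetical).
START_ID, STOP_ID = 0, 1
SENTENCE = "the quick brown fox jumps over the lazy dog"
vocab = {w: i + 2 for i, w in enumerate(dict.fromkeys(SENTENCE.split()))}
inverse = {i: w for w, i in vocab.items()}


def toy_encode(text: str) -> list:
    # Wrap the IDs in start and stop markers, as the split_text docstring
    # describes for the real encoder.
    return [START_ID] + [vocab[w] for w in text.split()] + [STOP_ID]


def toy_decode(ids: list) -> str:
    return " ".join(inverse[i] for i in ids)


def toy_split(text: str, tokens_per_chunk: int, chunk_overlap: int) -> list:
    ids = toy_encode(text)[1:-1]  # drop the start and stop markers
    windows = window_token_ids(ids, tokens_per_chunk, chunk_overlap)
    return [toy_decode(w) for w in windows]


chunks = toy_split(SENTENCE, tokens_per_chunk=4, chunk_overlap=1)
print(chunks)
# ['the quick brown fox', 'fox jumps over the', 'the lazy dog']
```

With four tokens per chunk and an overlap of one, each chunk begins with the last token of the previous chunk. Stripping the markers with [1:-1] before windowing keeps start and stop IDs from appearing in the middle of a decoded chunk.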
Frequently Asked Questions
What is the SentenceTransformersTokenTextSplitter class?
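SentenceTransformersTokenTextSplitter is a TextSplitter subclass in the langchain codebase that splits text into chunks by token count, using the tokenizer of a sentence-transformers model. It is defined in libs/text-splitters/langchain_text_splitters/sentence_transformers.py.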
Where is SentenceTransformersTokenTextSplitter defined?
SentenceTransformersTokenTextSplitter is defined in libs/text-splitters/langchain_text_splitters/sentence_transformers.py at line 20.
What does SentenceTransformersTokenTextSplitter extend?
SentenceTransformersTokenTextSplitter extends TextSplitter.
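
How might SentenceTransformersTokenTextSplitter be used?
A minimal usage sketch, assuming the langchain-text-splitters and sentence-transformers packages are installed and the default model can be downloaded. The helper name split_with_sentence_transformers is hypothetical; the constructor arguments and split_text call follow the source shown on this page.

```python
from typing import List, Optional


def split_with_sentence_transformers(
    text: str,
    tokens_per_chunk: Optional[int] = None,
    chunk_overlap: int = 50,
) -> List[str]:
    """Split `text` into token-bounded chunks (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the
    # optional dependencies; calling it does.
    from langchain_text_splitters import SentenceTransformersTokenTextSplitter

    splitter = SentenceTransformersTokenTextSplitter(
        chunk_overlap=chunk_overlap,
        model_name="sentence-transformers/all-mpnet-base-v2",
        # None means "use the model's maximum sequence length".
        tokens_per_chunk=tokens_per_chunk,
    )
    return splitter.split_text(text)


if __name__ == "__main__":
    # Loads the model on first use, which may download it.
    for chunk in split_with_sentence_transformers("Some long document text. " * 200):
        print(len(chunk), chunk[:60])
```

Per the source above, passing a tokens_per_chunk larger than the model's max_seq_length raises ValueError, and constructing the splitter without sentence-transformers installed raises ImportError.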