split_text() — langchain Function Reference

Architecture documentation for the split_text() function in sentence_transformers.py from the langchain codebase.

Function · python · DocumentProcessing · TextSplitters · calls 2 · called by 1

Dependency Diagram

graph TD
  a7a0dc6a_7652_b658_2bb9_d850d67979ca["split_text()"]
  059dfb7c_30ac_164c_5a3e_708a02d51601["SentenceTransformersTokenTextSplitter"]
  a7a0dc6a_7652_b658_2bb9_d850d67979ca -->|defined in| 059dfb7c_30ac_164c_5a3e_708a02d51601
  a7a0dc6a_7652_b658_2bb9_d850d67979ca -->|calls| a7a0dc6a_7652_b658_2bb9_d850d67979ca
  0172994e_4917_2bec_356d_ac072e832565["_encode()"]
  a7a0dc6a_7652_b658_2bb9_d850d67979ca -->|calls| 0172994e_4917_2bec_356d_ac072e832565
  style a7a0dc6a_7652_b658_2bb9_d850d67979ca fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/text-splitters/langchain_text_splitters/sentence_transformers.py lines 74–99

    def split_text(self, text: str) -> list[str]:
        """Splits the input text into smaller components by splitting text on tokens.

        This method encodes the input text using a private `_encode` method, then
        strips the start and stop token IDs from the encoded result. It returns the
        processed segments as a list of strings.

        Args:
            text: The input text to be split.

        Returns:
            A list of string components derived from the input text after encoding and
                processing.
        """

        def encode_strip_start_and_stop_token_ids(text: str) -> list[int]:
            return self._encode(text)[1:-1]

        tokenizer = Tokenizer(
            chunk_overlap=self._chunk_overlap,
            tokens_per_chunk=self.tokens_per_chunk,
            decode=self.tokenizer.decode,
            encode=encode_strip_start_and_stop_token_ids,
        )

        return split_text_on_tokens(text=text, tokenizer=tokenizer)
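
Illustrative Sketch

The method above does two things: it wraps `_encode` so the first and last token IDs (the start and stop tokens) are dropped, and it hands the resulting encode/decode pair to a token-window splitter. The sketch below mimics that flow with stand-ins so it runs without langchain or a model download. `ToyTokenizer`, `WindowConfig`, `split_on_token_windows`, and the `START_ID`/`STOP_ID` values are hypothetical names invented for this illustration; the sliding-window loop is an assumption about how a token splitter of this kind typically behaves, not a copy of `split_text_on_tokens`.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical special-token IDs for the toy tokenizer (not real model IDs).
START_ID, STOP_ID = 0, 1


class ToyTokenizer:
    """Whitespace tokenizer that wraps every encoding in start/stop IDs."""

    def __init__(self) -> None:
        self._vocab: dict[str, int] = {}
        self._words: dict[int, str] = {}

    def encode(self, text: str) -> list[int]:
        ids = [START_ID]
        for word in text.split():
            if word not in self._vocab:
                token_id = len(self._vocab) + 2  # IDs 0 and 1 are reserved
                self._vocab[word] = token_id
                self._words[token_id] = word
            ids.append(self._vocab[word])
        ids.append(STOP_ID)
        return ids

    def decode(self, ids: list[int]) -> str:
        # Only knows ordinary word IDs, so special tokens must be stripped first.
        return " ".join(self._words[i] for i in ids)


@dataclass(frozen=True)
class WindowConfig:
    """Stand-in for the Tokenizer settings object used in the source excerpt."""

    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]


def split_on_token_windows(text: str, cfg: WindowConfig) -> list[str]:
    """Decode overlapping windows of token IDs into text chunks."""
    if cfg.chunk_overlap >= cfg.tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    ids = cfg.encode(text)
    step = cfg.tokens_per_chunk - cfg.chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(ids):
        chunks.append(cfg.decode(ids[start:start + cfg.tokens_per_chunk]))
        if start + cfg.tokens_per_chunk >= len(ids):
            break  # this window already reached the final token
        start += step
    return chunks


toy = ToyTokenizer()
cfg = WindowConfig(
    chunk_overlap=1,
    tokens_per_chunk=3,
    decode=toy.decode,
    # Same idea as encode_strip_start_and_stop_token_ids in the excerpt.
    encode=lambda t: toy.encode(t)[1:-1],
)
chunks = split_on_token_windows("a b c d e f g", cfg)
print(chunks)
```

With a window of 3 tokens and an overlap of 1, each chunk starts 2 tokens after the previous one, so neighbouring chunks share one token. Stripping the start and stop IDs before windowing keeps special tokens out of the middle of the ID sequence, so the decoded chunks contain only text from the input.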