TextSplitter Class — langchain Architecture

Architecture documentation for the TextSplitter class in base.py from the langchain codebase.

Dependency Diagram

graph TD
  c86e37d5_f962_cc1e_9821_b665e1359ae8["TextSplitter"]
  91ea4f6e_168e_8d34_bca6_53e61cdc1840["BaseDocumentTransformer"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|extends| 91ea4f6e_168e_8d34_bca6_53e61cdc1840
  fee5f91c_52d7_4d25_94a2_c45ac6b35d65["TokenTextSplitter"]
  fee5f91c_52d7_4d25_94a2_c45ac6b35d65 -->|extends| c86e37d5_f962_cc1e_9821_b665e1359ae8
  d96ff4b9_fcc1_8428_729e_f75b099397b4["base.py"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|defined in| d96ff4b9_fcc1_8428_729e_f75b099397b4
  c7195a4f_fa21_284a_6792_0765592ec3fd["__init__()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| c7195a4f_fa21_284a_6792_0765592ec3fd
  01cef059_4479_0a04_53ff_2c366fd5c5bf["split_text()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| 01cef059_4479_0a04_53ff_2c366fd5c5bf
  a4cdf08b_5d25_7d6b_a425_7a96372e8666["create_documents()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| a4cdf08b_5d25_7d6b_a425_7a96372e8666
  d14f3e1b_dd57_6268_5d47_c8b53356440d["split_documents()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| d14f3e1b_dd57_6268_5d47_c8b53356440d
  20289806_e8d6_9514_562e_2bd46282553b["_join_docs()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| 20289806_e8d6_9514_562e_2bd46282553b
  38fe665f_16f3_7590_557b_a39c4678e7f6["_merge_splits()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| 38fe665f_16f3_7590_557b_a39c4678e7f6
  e3cb5fd5_0149_e230_e0f1_05d16edbd1ed["from_huggingface_tokenizer()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| e3cb5fd5_0149_e230_e0f1_05d16edbd1ed
  1eee98c3_ae63_e2ab_49ab_943e1721d020["from_tiktoken_encoder()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| 1eee98c3_ae63_e2ab_49ab_943e1721d020
  e269bebc_7f11_3715_6f4a_f53fba4229cb["transform_documents()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| e269bebc_7f11_3715_6f4a_f53fba4229cb

Source Code

libs/text-splitters/langchain_text_splitters/base.py lines 44–295

class TextSplitter(BaseDocumentTransformer, ABC):
    """Interface for splitting text into chunks."""

    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
        length_function: Callable[[str], int] = len,
        keep_separator: bool | Literal["start", "end"] = False,  # noqa: FBT001,FBT002
        add_start_index: bool = False,  # noqa: FBT001,FBT002
        strip_whitespace: bool = True,  # noqa: FBT001,FBT002
    ) -> None:
        """Create a new `TextSplitter`.

        Args:
            chunk_size: Maximum size of chunks to return
            chunk_overlap: Overlap in characters between chunks
            length_function: Function that measures the length of given chunks
            keep_separator: Whether to keep the separator and where to place it
                in each corresponding chunk `(True='start')`
            add_start_index: If `True`, includes chunk's start index in metadata
            strip_whitespace: If `True`, strips whitespace from the start and end of
                every document

        Raises:
            ValueError: If `chunk_size` is less than or equal to 0
            ValueError: If `chunk_overlap` is less than 0
            ValueError: If `chunk_overlap` is greater than `chunk_size`
        """
        if chunk_size <= 0:
            msg = f"chunk_size must be > 0, got {chunk_size}"
            raise ValueError(msg)
        if chunk_overlap < 0:
            msg = f"chunk_overlap must be >= 0, got {chunk_overlap}"
            raise ValueError(msg)
        if chunk_overlap > chunk_size:
            msg = (
                f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
                f"({chunk_size}), should be smaller."
            )
            raise ValueError(msg)
        self._chunk_size = chunk_size
        self._chunk_overlap = chunk_overlap
        self._length_function = length_function
        self._keep_separator = keep_separator
        self._add_start_index = add_start_index
        self._strip_whitespace = strip_whitespace

    @abstractmethod
    def split_text(self, text: str) -> list[str]:
        """Split text into multiple components.

        Args:
            text: The text to split.

        Returns:
            A list of text chunks.
        """

    def create_documents(
        self, texts: list[str], metadatas: list[dict[Any, Any]] | None = None
    ) -> list[Document]:
        """Create a list of `Document` objects from a list of texts.

        Args:
            texts: A list of texts to be split and converted into documents.
            metadatas: Optional list of metadata to associate with each document.

        Returns:
            A list of `Document` objects.
        """
        metadatas_ = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            index = 0
            previous_chunk_len = 0
            for chunk in self.split_text(text):
                metadata = copy.deepcopy(metadatas_[i])
                if self._add_start_index:
                    offset = index + previous_chunk_len - self._chunk_overlap
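                    index = text.find(chunk, max(0, offset))
                    metadata["start_index"] = index
                    previous_chunk_len = len(chunk)
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents

    # The remaining methods listed in the dependency diagram are omitted from this excerpt.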
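
The start-index bookkeeping in `create_documents` can be illustrated in isolation. The following is a minimal standalone sketch, not the library's code: `find_start_indices` is a hypothetical helper that mirrors the search logic of the excerpt, assuming every chunk is a literal substring of the text and that the previous chunk length is updated with `len(chunk)` after each search.

```python
def find_start_indices(text: str, chunks: list[str], chunk_overlap: int) -> list[int]:
    """Locate each chunk in text, searching forward from the previous match."""
    indices = []
    index = 0
    previous_chunk_len = 0
    for chunk in chunks:
        # Begin the search no earlier than the point where the overlap
        # with the previous chunk could start.
        offset = index + previous_chunk_len - chunk_overlap
        index = text.find(chunk, max(0, offset))
        indices.append(index)
        # Assumption: the length is recorded after each search.
        previous_chunk_len = len(chunk)
    return indices


# Hand-made chunks that overlap by two characters.
overlapping = find_start_indices(
    "abcdefghij", ["abcd", "cdef", "efgh", "ghij"], chunk_overlap=2
)

# Identical chunks: the moving offset keeps each search from matching at 0 again.
repeated = find_start_indices("ab ab ab", ["ab", "ab", "ab"], chunk_overlap=0)

print(overlapping)
print(repeated)
```

Because the search offset advances with each chunk, repeated passages in a text resolve to successive positions rather than to the first occurrence every time.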

Defined In

libs/text-splitters/langchain_text_splitters/base.py

Extends

BaseDocumentTransformer, ABC


Frequently Asked Questions

What is the TextSplitter class?

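TextSplitter is the abstract interface for splitting text into chunks in the langchain codebase. It is defined in libs/text-splitters/langchain_text_splitters/base.py, and subclasses must implement its abstract split_text() method.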
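
The constructor in the source excerpt rejects three kinds of invalid arguments. As a rough illustration, the checks can be reproduced without the library. The helpers `validate_chunk_params` and `is_valid` below are hypothetical names introduced for this sketch; only the three conditions come from the excerpt.

```python
def validate_chunk_params(chunk_size: int, chunk_overlap: int) -> None:
    """Raise ValueError under the same three conditions as the excerpt's __init__."""
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must be >= 0, got {chunk_overlap}")
    if chunk_overlap > chunk_size:
        raise ValueError(
            f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
            f"({chunk_size}), should be smaller."
        )


def is_valid(chunk_size: int, chunk_overlap: int) -> bool:
    """Hypothetical convenience wrapper: True if the pair passes validation."""
    try:
        validate_chunk_params(chunk_size, chunk_overlap)
    except ValueError:
        return False
    return True


print(is_valid(4000, 200))  # the defaults from the excerpt
print(is_valid(100, 200))   # overlap larger than the chunk size
print(is_valid(100, 100))   # overlap equal to the chunk size
```

Note that the third check uses a strict comparison, so an overlap equal to the chunk size passes validation even though the error message says the overlap "should be smaller".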

Where is TextSplitter defined?

TextSplitter is defined in libs/text-splitters/langchain_text_splitters/base.py at line 44.

What does TextSplitter extend?

TextSplitter extends BaseDocumentTransformer and ABC, as its class statement shows. TokenTextSplitter is a subclass of TextSplitter, not a parent.
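
What inheriting from `ABC` with an abstract `split_text()` means in practice can be shown with stand-in classes. This is a sketch under stated assumptions: `BaseTransformerStandIn`, `SplitterSketch`, and `FixedWidthSplitter` are hypothetical names that only imitate the shape of the class hierarchy and are not part of langchain.

```python
from abc import ABC, abstractmethod


class BaseTransformerStandIn(ABC):
    """Hypothetical stand-in for BaseDocumentTransformer."""


class SplitterSketch(BaseTransformerStandIn, ABC):
    """Imitates the shape of TextSplitter: concrete __init__, abstract split_text."""

    def __init__(self, chunk_size: int = 4000) -> None:
        self._chunk_size = chunk_size

    @abstractmethod
    def split_text(self, text: str) -> list[str]:
        """Split text into multiple components."""


class FixedWidthSplitter(SplitterSketch):
    """Hypothetical subclass that cuts text into fixed-width slices."""

    def split_text(self, text: str) -> list[str]:
        step = self._chunk_size
        return [text[i : i + step] for i in range(0, len(text), step)]


# The abstract class cannot be instantiated directly.
try:
    SplitterSketch()
    abstract_instantiable = True
except TypeError:
    abstract_instantiable = False

pieces = FixedWidthSplitter(chunk_size=4).split_text("abcdefghij")
print(abstract_instantiable, pieces)
```

The base class supplies shared configuration while each subclass supplies only the splitting strategy, which is the division of labor the abstract `split_text()` in the excerpt suggests.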
