TextSplitter Class — langchain Architecture
Architecture documentation for the TextSplitter class in base.py from the langchain codebase.
Dependency Diagram
graph TD
    c86e37d5_f962_cc1e_9821_b665e1359ae8["TextSplitter"]
    91ea4f6e_168e_8d34_bca6_53e61cdc1840["BaseDocumentTransformer"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|extends| 91ea4f6e_168e_8d34_bca6_53e61cdc1840
    fee5f91c_52d7_4d25_94a2_c45ac6b35d65["TokenTextSplitter"]
    fee5f91c_52d7_4d25_94a2_c45ac6b35d65 -->|extends| c86e37d5_f962_cc1e_9821_b665e1359ae8
    d96ff4b9_fcc1_8428_729e_f75b099397b4["base.py"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|defined in| d96ff4b9_fcc1_8428_729e_f75b099397b4
    c7195a4f_fa21_284a_6792_0765592ec3fd["__init__()"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| c7195a4f_fa21_284a_6792_0765592ec3fd
    01cef059_4479_0a04_53ff_2c366fd5c5bf["split_text()"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| 01cef059_4479_0a04_53ff_2c366fd5c5bf
    a4cdf08b_5d25_7d6b_a425_7a96372e8666["create_documents()"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| a4cdf08b_5d25_7d6b_a425_7a96372e8666
    d14f3e1b_dd57_6268_5d47_c8b53356440d["split_documents()"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| d14f3e1b_dd57_6268_5d47_c8b53356440d
    20289806_e8d6_9514_562e_2bd46282553b["_join_docs()"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| 20289806_e8d6_9514_562e_2bd46282553b
    38fe665f_16f3_7590_557b_a39c4678e7f6["_merge_splits()"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| 38fe665f_16f3_7590_557b_a39c4678e7f6
    e3cb5fd5_0149_e230_e0f1_05d16edbd1ed["from_huggingface_tokenizer()"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| e3cb5fd5_0149_e230_e0f1_05d16edbd1ed
    1eee98c3_ae63_e2ab_49ab_943e1721d020["from_tiktoken_encoder()"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| 1eee98c3_ae63_e2ab_49ab_943e1721d020
    e269bebc_7f11_3715_6f4a_f53fba4229cb["transform_documents()"]
    c86e37d5_f962_cc1e_9821_b665e1359ae8 -->|method| e269bebc_7f11_3715_6f4a_f53fba4229cb
Source Code
libs/text-splitters/langchain_text_splitters/base.py lines 44–295
class TextSplitter(BaseDocumentTransformer, ABC):
    """Interface for splitting text into chunks."""

    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
        length_function: Callable[[str], int] = len,
        keep_separator: bool | Literal["start", "end"] = False,  # noqa: FBT001,FBT002
        add_start_index: bool = False,  # noqa: FBT001,FBT002
        strip_whitespace: bool = True,  # noqa: FBT001,FBT002
    ) -> None:
        """Create a new `TextSplitter`.

        Args:
            chunk_size: Maximum size of chunks to return
            chunk_overlap: Overlap in characters between chunks
            length_function: Function that measures the length of given chunks
            keep_separator: Whether to keep the separator and where to place it
                in each corresponding chunk `(True='start')`
            add_start_index: If `True`, includes chunk's start index in metadata
            strip_whitespace: If `True`, strips whitespace from the start and end of
                every document

        Raises:
            ValueError: If `chunk_size` is less than or equal to 0
            ValueError: If `chunk_overlap` is less than 0
            ValueError: If `chunk_overlap` is greater than `chunk_size`
        """
        if chunk_size <= 0:
            msg = f"chunk_size must be > 0, got {chunk_size}"
            raise ValueError(msg)
        if chunk_overlap < 0:
            msg = f"chunk_overlap must be >= 0, got {chunk_overlap}"
            raise ValueError(msg)
        if chunk_overlap > chunk_size:
            msg = (
                f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
                f"({chunk_size}), should be smaller."
            )
            raise ValueError(msg)
        self._chunk_size = chunk_size
        self._chunk_overlap = chunk_overlap
        self._length_function = length_function
        self._keep_separator = keep_separator
        self._add_start_index = add_start_index
        self._strip_whitespace = strip_whitespace

    @abstractmethod
    def split_text(self, text: str) -> list[str]:
        """Split text into multiple components.

        Args:
            text: The text to split.

        Returns:
            A list of text chunks.
        """

    def create_documents(
        self, texts: list[str], metadatas: list[dict[Any, Any]] | None = None
    ) -> list[Document]:
        """Create a list of `Document` objects from a list of texts.

        Args:
            texts: A list of texts to be split and converted into documents.
            metadatas: Optional list of metadata to associate with each document.

        Returns:
            A list of `Document` objects.
        """
        metadatas_ = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            index = 0
            previous_chunk_len = 0
            for chunk in self.split_text(text):
                metadata = copy.deepcopy(metadatas_[i])
                if self._add_start_index:
                    offset = index + previous_chunk_len - self._chunk_overlap
                    index = text.find(chunk, max(0, offset))
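The excerpt stops in the middle of the start-index computation, so the sketch below isolates just that step in a stand-alone function to show why the search offset matters. The function name find_chunk_starts is hypothetical, and the line that updates previous_chunk_len is an assumption about the truncated remainder of the loop, not something shown above.

```python
def find_chunk_starts(text: str, chunks: list[str], chunk_overlap: int) -> list[int]:
    """Locate each chunk in `text`, searching forward from the previous match.

    Hypothetical helper that mirrors the offset arithmetic in the excerpt.
    """
    starts: list[int] = []
    index = 0
    previous_chunk_len = 0
    for chunk in chunks:
        # Earliest plausible start of this chunk: the end of the previous
        # chunk, moved back by the configured overlap.
        offset = index + previous_chunk_len - chunk_overlap
        # Clamp to 0 so the first chunk is searched from the beginning.
        index = text.find(chunk, max(0, offset))
        starts.append(index)
        # Assumption: the truncated part of the loop records the chunk
        # length like this before the next iteration.
        previous_chunk_len = len(chunk)
    return starts


# Two chunks sharing a two-character overlap ("ef").
overlapping = find_chunk_starts("abcdefghij", ["abcdef", "efghij"], chunk_overlap=2)

# Repeated content: a plain text.find(chunk) would report 0 every time.
repeated = find_chunk_starts("spam spam spam", ["spam", "spam", "spam"], chunk_overlap=0)
```

Searching from the moving offset rather than from position 0 is what lets identical chunks resolve to distinct positions in the original text.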
Frequently Asked Questions
What is the TextSplitter class?
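TextSplitter is the abstract interface for splitting text into chunks in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/base.py. Its constructor validates chunk_size and chunk_overlap, and subclasses must implement split_text().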
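As a rough illustration of that contract, the sketch below mirrors the three constructor checks and the abstract split_text() from the source excerpt in a stand-alone class that does not import langchain. MiniTextSplitter and FixedWidthSplitter are hypothetical names invented for this example, and fixed-width windowing is only one possible splitting strategy, not the library's.

```python
from abc import ABC, abstractmethod


class MiniTextSplitter(ABC):
    """Hypothetical stand-in for the base class; not the langchain implementation."""

    def __init__(self, chunk_size: int = 4000, chunk_overlap: int = 200) -> None:
        # Same three checks, in the same order, as the excerpt's __init__.
        if chunk_size <= 0:
            raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
        if chunk_overlap < 0:
            raise ValueError(f"chunk_overlap must be >= 0, got {chunk_overlap}")
        if chunk_overlap > chunk_size:
            raise ValueError(
                f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
                f"({chunk_size}), should be smaller."
            )
        self._chunk_size = chunk_size
        self._chunk_overlap = chunk_overlap

    @abstractmethod
    def split_text(self, text: str) -> list[str]:
        """Split text into chunks; concrete subclasses must provide this."""


class FixedWidthSplitter(MiniTextSplitter):
    """Hypothetical subclass that slices text into overlapping fixed-width windows."""

    def split_text(self, text: str) -> list[str]:
        # Advance by at least one character so the loop always terminates,
        # even when chunk_overlap equals chunk_size (which the checks allow).
        step = max(1, self._chunk_size - self._chunk_overlap)
        chunks: list[str] = []
        start = 0
        while start < len(text):
            chunks.append(text[start : start + self._chunk_size])
            if start + self._chunk_size >= len(text):
                break  # the current window already reaches the end of the text
            start += step
        return chunks


splitter = FixedWidthSplitter(chunk_size=4, chunk_overlap=1)
chunks = splitter.split_text("abcdefghij")
```

Instantiating MiniTextSplitter directly raises TypeError because split_text() is abstract, which is the same mechanism that makes the base class in the excerpt an interface rather than a usable splitter.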
Where is TextSplitter defined?
TextSplitter is defined in libs/text-splitters/langchain_text_splitters/base.py at line 44.
What does TextSplitter extend?
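TextSplitter extends BaseDocumentTransformer and ABC, as shown in its class definition. TokenTextSplitter is not a base class of TextSplitter; it is a subclass that extends it.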