_merge_splits() — langchain Function Reference

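Architecture documentation for the _merge_splits() method of TextSplitter, defined in base.py in the langchain codebase. The method combines small text splits into chunks no longer than the configured chunk size where possible, and carries trailing splits over into the next chunk up to the configured chunk overlap.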

Dependency Diagram

graph TD
  38fe665f_16f3_7590_557b_a39c4678e7f6["_merge_splits()"]
  c86e37d5_f962_cc1e_9821_b665e1359ae8["TextSplitter"]
  38fe665f_16f3_7590_557b_a39c4678e7f6 -->|defined in| c86e37d5_f962_cc1e_9821_b665e1359ae8
  20289806_e8d6_9514_562e_2bd46282553b["_join_docs()"]
  38fe665f_16f3_7590_557b_a39c4678e7f6 -->|calls| 20289806_e8d6_9514_562e_2bd46282553b
  style 38fe665f_16f3_7590_557b_a39c4678e7f6 fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/text-splitters/langchain_text_splitters/base.py lines 152–194

    def _merge_splits(self, splits: Iterable[str], separator: str) -> list[str]:
        # We now want to combine these smaller pieces into medium size
        # chunks to send to the LLM.
        separator_len = self._length_function(separator)

        docs = []
        current_doc: list[str] = []
        total = 0
        for d in splits:
            len_ = self._length_function(d)
            if (
                total + len_ + (separator_len if len(current_doc) > 0 else 0)
                > self._chunk_size
            ):
                if total > self._chunk_size:
                    logger.warning(
                        "Created a chunk of size %d, which is longer than the "
                        "specified %d",
                        total,
                        self._chunk_size,
                    )
                if len(current_doc) > 0:
                    doc = self._join_docs(current_doc, separator)
                    if doc is not None:
                        docs.append(doc)
                    # Keep on popping if:
                    # - we have a larger chunk than in the chunk overlap
                    # - or if we still have any chunks and the length is long
                    while total > self._chunk_overlap or (
                        total + len_ + (separator_len if len(current_doc) > 0 else 0)
                        > self._chunk_size
                        and total > 0
                    ):
                        total -= self._length_function(current_doc[0]) + (
                            separator_len if len(current_doc) > 1 else 0
                        )
                        current_doc = current_doc[1:]
            current_doc.append(d)
            total += len_ + (separator_len if len(current_doc) > 1 else 0)
        doc = self._join_docs(current_doc, separator)
        if doc is not None:
            docs.append(doc)
        return docs
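
The merging loop above can be tried outside the class with a small standalone sketch. This is not the library's API: `merge_splits` is a hypothetical free function that takes `chunk_size`, `chunk_overlap`, and `length_function` as plain parameters in place of the `self._...` attributes, and it assumes `_join_docs()` simply joins the pieces with the separator, strips whitespace, and returns nothing for an empty result.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def merge_splits(
    splits: Iterable[str],
    separator: str,
    chunk_size: int,
    chunk_overlap: int,
    length_function: Callable[[str], int] = len,
) -> list[str]:
    """Hypothetical standalone version of the merging loop shown above."""
    separator_len = length_function(separator)
    docs: list[str] = []
    current: list[str] = []
    total = 0  # length of `current` once joined with the separator

    def join(parts: list[str]) -> str:
        # Assumption: stands in for _join_docs(); empty text means "no chunk".
        return separator.join(parts).strip()

    for piece in splits:
        piece_len = length_function(piece)
        # Would adding this piece push the current chunk over the limit?
        if total + piece_len + (separator_len if current else 0) > chunk_size:
            if total > chunk_size:
                logger.warning(
                    "Created a chunk of size %d, which is longer than the "
                    "specified %d",
                    total,
                    chunk_size,
                )
            if current:
                text = join(current)
                if text:
                    docs.append(text)
                # Drop pieces from the front until what remains fits within
                # the overlap and leaves room for the incoming piece.
                while total > chunk_overlap or (
                    total + piece_len + (separator_len if current else 0)
                    > chunk_size
                    and total > 0
                ):
                    total -= length_function(current[0]) + (
                        separator_len if len(current) > 1 else 0
                    )
                    current = current[1:]
        current.append(piece)
        total += piece_len + (separator_len if len(current) > 1 else 0)

    text = join(current)
    if text:
        docs.append(text)
    return docs


if __name__ == "__main__":
    pieces = ["aaa", "bbb", "ccc", "ddd"]
    print(merge_splits(pieces, " ", chunk_size=7, chunk_overlap=3))
    print(merge_splits(pieces, " ", chunk_size=7, chunk_overlap=0))
```

With a chunk size of 7 and an overlap of 3, the last piece of each chunk is retained as the start of the next one, giving `['aaa bbb', 'bbb ccc', 'ccc ddd']`. With an overlap of 0 the buffer is emptied after each emitted chunk, giving `['aaa bbb', 'ccc ddd']`. A single piece longer than the chunk size is never split here; it is emitted as its own oversized chunk, and the warning is logged when the next piece arrives.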