aggregate_lines_to_chunks() — langchain Function Reference

Architecture documentation for the aggregate_lines_to_chunks() function in markdown.py from the langchain codebase.

Function | Python | DocumentProcessing > TextSplitters | Called by 1 function

Dependency Diagram

graph TD
  cd7326ce_b97a_382e_8cf3_d4647f3a82a6["aggregate_lines_to_chunks()"]
  6a11b5bb_e2e9_6671_54b0_3ed10f3c9672["MarkdownHeaderTextSplitter"]
  cd7326ce_b97a_382e_8cf3_d4647f3a82a6 -->|defined in| 6a11b5bb_e2e9_6671_54b0_3ed10f3c9672
  b18c92c3_4d24_0e77_6322_b71c795c08ff["split_text()"]
  b18c92c3_4d24_0e77_6322_b71c795c08ff -->|calls| cd7326ce_b97a_382e_8cf3_d4647f3a82a6
  style cd7326ce_b97a_382e_8cf3_d4647f3a82a6 fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/text-splitters/langchain_text_splitters/markdown.py lines 88–132

    def aggregate_lines_to_chunks(self, lines: list[LineType]) -> list[Document]:
        """Combine lines with common metadata into chunks.

        Args:
            lines: Line of text / associated header metadata

        Returns:
            List of `Document` objects with common metadata aggregated.
        """
        aggregated_chunks: list[LineType] = []

        for line in lines:
            if (
                aggregated_chunks
                and aggregated_chunks[-1]["metadata"] == line["metadata"]
            ):
                # If the last line in the aggregated list
                # has the same metadata as the current line,
                # append the current content to the last line's content
                aggregated_chunks[-1]["content"] += "  \n" + line["content"]
            elif (
                aggregated_chunks
                and aggregated_chunks[-1]["metadata"] != line["metadata"]
                # may be issues if other metadata is present
                and len(aggregated_chunks[-1]["metadata"]) < len(line["metadata"])
                and aggregated_chunks[-1]["content"].split("\n")[-1][0] == "#"
                and not self.strip_headers
            ):
                # If the last line in the aggregated list
                # has different metadata than the current line,
                # and has shallower header level than the current line,
                # and the last line is a header,
                # and we are not stripping headers,
                # append the current content to the last line's content
                aggregated_chunks[-1]["content"] += "  \n" + line["content"]
                # and update the last line's metadata
                aggregated_chunks[-1]["metadata"] = line["metadata"]
            else:
                # Otherwise, append the current line to the aggregated list
                aggregated_chunks.append(line)

        return [
            Document(page_content=chunk["content"], metadata=chunk["metadata"])
            for chunk in aggregated_chunks
        ]
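
The three branches above are easier to follow on a small input. The following is a minimal standalone sketch, not the library code: `Line` and `aggregate` are hypothetical names, plain dictionaries stand in for `LineType` and `Document`, and `strip_headers` is passed as an argument instead of being read from `self`. Two deliberate deviations from the listing: the sketch copies each line before storing it so the input is not mutated, and it uses `startswith("#")` in place of indexing the first character.

```python
from typing import TypedDict


class Line(TypedDict):
    """Hypothetical stand-in for the library's LineType."""

    content: str
    metadata: dict[str, str]


def aggregate(lines: list[Line], strip_headers: bool = True) -> list[Line]:
    """Merge consecutive lines following the three branches of the listing."""
    chunks: list[Line] = []
    for line in lines:
        last = chunks[-1] if chunks else None
        if last is not None and last["metadata"] == line["metadata"]:
            # Branch 1: same metadata, so extend the previous chunk.
            last["content"] += "  \n" + line["content"]
        elif (
            last is not None
            and len(last["metadata"]) < len(line["metadata"])
            and last["content"].split("\n")[-1].startswith("#")
            and not strip_headers
        ):
            # Branch 2: the previous chunk ends in a header and the current
            # line sits under a deeper header, so fold it in and adopt the
            # deeper metadata.
            last["content"] += "  \n" + line["content"]
            last["metadata"] = dict(line["metadata"])
        else:
            # Branch 3: start a new chunk (copied, unlike the original,
            # which appends the input dictionary itself).
            chunks.append(
                {"content": line["content"], "metadata": dict(line["metadata"])}
            )
    return chunks


lines: list[Line] = [
    {"content": "# Intro", "metadata": {"h1": "Intro"}},
    {"content": "## Details", "metadata": {"h1": "Intro", "h2": "Details"}},
    {"content": "Body.", "metadata": {"h1": "Intro", "h2": "Details"}},
]

merged = aggregate(lines, strip_headers=False)
separate = aggregate(lines, strip_headers=True)

print(len(merged), repr(merged[0]["content"]))
print(len(separate), [chunk["content"] for chunk in separate])
```

With `strip_headers=False`, the parent header `# Intro` is immediately followed by a deeper header, so everything collapses into one chunk that carries the deeper metadata. With `strip_headers=True` on the same input, the second branch is disabled and `# Intro` stays in its own chunk. The same input is reused for both calls only to isolate the effect of the flag; in the real splitter, header lines would not normally appear in the content when headers are stripped.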