split_text() — langchain Function Reference

Architecture documentation for the split_text() function in markdown.py from the langchain codebase.
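
As orientation before the listing, the following is a minimal, standalone sketch of two ideas visible in the source further down: tracking fenced code blocks so that a `#` inside code is not read as a header, and keeping a stack of open headers that is popped whenever a header of the same or a shallower level appears. It is not the library's implementation. The function name `sketch_header_metadata` and its return shape are hypothetical, and the sketch assumes only standard `#`-style headers (no custom header patterns, no non-printable-character filtering, no chunk aggregation).

```python
from typing import Dict, List, Tuple


def sketch_header_metadata(
    text: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[Tuple[str, Dict[str, str]]]:
    """Pair each non-empty content line with the headers enclosing it.

    Hypothetical helper for illustration only; simplified from the listing.
    """
    results: List[Tuple[str, Dict[str, str]]] = []
    # Each stack entry is (level, metadata name, header text).
    header_stack: List[Tuple[int, str, str]] = []
    in_code_block = False
    opening_fence = ""

    def current_metadata() -> Dict[str, str]:
        return {name: data for _, name, data in header_stack}

    for line in text.split("\n"):
        stripped = line.strip()

        # Open or close a fenced code block (``` or ~~~).
        if not in_code_block:
            if stripped.startswith("```") and stripped.count("```") == 1:
                in_code_block, opening_fence = True, "```"
            elif stripped.startswith("~~~"):
                in_code_block, opening_fence = True, "~~~"
        elif stripped.startswith(opening_fence):
            in_code_block, opening_fence = False, ""

        # Lines inside a fence are content, never headers.
        if in_code_block:
            results.append((stripped, current_metadata()))
            continue

        matched = False
        for sep, name in headers_to_split_on:
            # A header is the separator alone, or the separator plus a space.
            is_header = stripped.startswith(sep) and (
                len(stripped) == len(sep) or stripped[len(sep)] == " "
            )
            if not is_header:
                continue
            level = sep.count("#")
            # Drop headers at the same or a deeper level than the new one.
            while header_stack and header_stack[-1][0] >= level:
                header_stack.pop()
            header_stack.append((level, name, stripped[len(sep):].strip()))
            matched = True
            break

        if not matched and stripped:
            results.append((stripped, current_metadata()))

    return results


sample = "# A\nintro\n## B\nbody\n```\n# not a header\n```\n# C\ntail"
for content, metadata in sketch_header_metadata(sample, [("#", "h1"), ("##", "h2")]):
    print(repr(content), metadata)
```

In this sample, the line `# not a header` sits inside a fence and is kept as content under `h1: A, h2: B`, while the later `# C` pops both open headers so that `tail` carries only `h1: C`.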

Function | Python | DocumentProcessing > TextSplitters | calls 3 | called by 1

Dependency Diagram

graph TD
  b18c92c3_4d24_0e77_6322_b71c795c08ff["split_text()"]
  6a11b5bb_e2e9_6671_54b0_3ed10f3c9672["MarkdownHeaderTextSplitter"]
  b18c92c3_4d24_0e77_6322_b71c795c08ff -->|defined in| 6a11b5bb_e2e9_6671_54b0_3ed10f3c9672
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13["split_text()"]
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|calls| b18c92c3_4d24_0e77_6322_b71c795c08ff
  c0f7b205_386f_81f7_010d_c9dac195bb30["_is_custom_header()"]
  b18c92c3_4d24_0e77_6322_b71c795c08ff -->|calls| c0f7b205_386f_81f7_010d_c9dac195bb30
  cd7326ce_b97a_382e_8cf3_d4647f3a82a6["aggregate_lines_to_chunks()"]
  b18c92c3_4d24_0e77_6322_b71c795c08ff -->|calls| cd7326ce_b97a_382e_8cf3_d4647f3a82a6
  b18c92c3_4d24_0e77_6322_b71c795c08ff -->|calls| ca4b44a0_217b_9ee3_738c_a86f47cf5d13
  style b18c92c3_4d24_0e77_6322_b71c795c08ff fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/text-splitters/langchain_text_splitters/markdown.py lines 134–280

    def split_text(self, text: str) -> list[Document]:
        """Split markdown file.

        Args:
            text: Markdown file

        Returns:
            List of `Document` objects.
        """
        # Split the input text by newline character ("\n").
        lines = text.split("\n")

        # Final output
        lines_with_metadata: list[LineType] = []

        # Content and metadata of the chunk currently being processed
        current_content: list[str] = []

        current_metadata: dict[str, str] = {}

        # Keep track of the nested header structure
        header_stack: list[HeaderType] = []

        initial_metadata: dict[str, str] = {}

        in_code_block = False

        opening_fence = ""

        for line in lines:
            stripped_line = line.strip()
            # Remove all non-printable characters from the string, keeping only visible
            # text.
            stripped_line = "".join(filter(str.isprintable, stripped_line))
            if not in_code_block:
                # Exclude inline code spans
                if stripped_line.startswith("```") and stripped_line.count("```") == 1:
                    in_code_block = True
                    opening_fence = "```"
                elif stripped_line.startswith("~~~"):
                    in_code_block = True
                    opening_fence = "~~~"
            elif stripped_line.startswith(opening_fence):
                in_code_block = False
                opening_fence = ""

            if in_code_block:
                current_content.append(stripped_line)
                continue

            # Check each line against each of the header types (e.g., #, ##)
            for sep, name in self.headers_to_split_on:
                is_standard_header = stripped_line.startswith(sep) and (
                    # Header with no text OR header is followed by space
                    # Both are valid conditions that sep is being used as a header
                    len(stripped_line) == len(sep) or stripped_line[len(sep)] == " "
                )
                is_custom_header = self._is_custom_header(stripped_line, sep)

                # Check if line matches either standard or custom header pattern
                if is_standard_header or is_custom_header:
                    # Ensure we are tracking the header as metadata
                    if name is not None:
                        # Get the current header level
                        if sep in self.custom_header_patterns:
                            current_header_level = self.custom_header_patterns[sep]
                        else:
                            current_header_level = sep.count("#")

                        # Pop out headers of lower or same level from the stack
                        while (
                            header_stack
                            and header_stack[-1]["level"] >= current_header_level
                        ):
                            # We have encountered a new header
                            # at the same or higher level
                            popped_header = header_stack.pop()
                            # Clear the metadata for the
                            # popped header in initial_metadata
                            if popped_header["name"] in initial_metadata:
                                initial_metadata.pop(popped_header["name"])