MarkdownHeaderTextSplitter Class — langchain Architecture

Architecture documentation for the MarkdownHeaderTextSplitter class in markdown.py from the langchain codebase.

Dependency Diagram

graph TD
  6a11b5bb_e2e9_6671_54b0_3ed10f3c9672["MarkdownHeaderTextSplitter"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25["markdown.py"]
  6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|defined in| 06b29b39_6256_6eec_e45e_4ff2a6e60f25
  e7138765_af2e_2047_7003_e6f5f9952e55["__init__()"]
  6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|method| e7138765_af2e_2047_7003_e6f5f9952e55
  c0f7b205_386f_81f7_010d_c9dac195bb30["_is_custom_header()"]
  6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|method| c0f7b205_386f_81f7_010d_c9dac195bb30
  cd7326ce_b97a_382e_8cf3_d4647f3a82a6["aggregate_lines_to_chunks()"]
  6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|method| cd7326ce_b97a_382e_8cf3_d4647f3a82a6
  b18c92c3_4d24_0e77_6322_b71c795c08ff["split_text()"]
  6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|method| b18c92c3_4d24_0e77_6322_b71c795c08ff

Source Code

libs/text-splitters/langchain_text_splitters/markdown.py lines 23–280

class MarkdownHeaderTextSplitter:
    """Splitting markdown files based on specified headers."""

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_line: bool = False,  # noqa: FBT001,FBT002
        strip_headers: bool = True,  # noqa: FBT001,FBT002
        custom_header_patterns: dict[str, int] | None = None,
    ) -> None:
        """Create a new `MarkdownHeaderTextSplitter`.

        Args:
            headers_to_split_on: Headers we want to track
            return_each_line: Return each line w/ associated headers
            strip_headers: Strip split headers from the content of the chunk
            custom_header_patterns: Optional dict mapping header patterns to their
                levels.

                For example: `{"**": 1, "***": 2}` to treat `**Header**` as level 1 and
                `***Header***` as level 2 headers.
        """
        # Output line-by-line or aggregated into chunks w/ common headers
        self.return_each_line = return_each_line
        # Given the headers we want to split on (e.g., "#", "##"),
        # order them by separator length, longest first
        self.headers_to_split_on = sorted(
            headers_to_split_on, key=lambda split: len(split[0]), reverse=True
        )
        # Strip split headers from the content of the chunk
        self.strip_headers = strip_headers
        # Custom header patterns with their levels
        self.custom_header_patterns = custom_header_patterns or {}

    def _is_custom_header(self, line: str, sep: str) -> bool:
        """Check if line matches a custom header pattern.

        Args:
            line: The line to check
            sep: The separator pattern to match

        Returns:
            `True` if the line matches the custom pattern format
        """
        if sep not in self.custom_header_patterns:
            return False

        # Escape special regex characters in the separator
        escaped_sep = re.escape(sep)
        # Create regex pattern to match exactly one separator at start and end
        # with content in between
        pattern = (
            f"^{escaped_sep}(?!{escaped_sep})(.+?)(?<!{escaped_sep}){escaped_sep}$"
        )

        match = re.match(pattern, line)
        if match:
            # Extract the content between the patterns
            content = match.group(1).strip()
            # Valid header if there's actual content (not just whitespace or separators)
            # Check that content doesn't consist only of separator characters
            if content and not all(c in sep for c in content.replace(" ", "")):
                return True
        return False

    def aggregate_lines_to_chunks(self, lines: list[LineType]) -> list[Document]:
        """Combine lines with common metadata into chunks.

        Args:
            lines: Lines of text with their associated header metadata

        Returns:
            List of `Document` objects with common metadata aggregated.
        """
        aggregated_chunks: list[LineType] = []

        for line in lines:
            if (
                aggregated_chunks
                and aggregated_chunks[-1]["metadata"] == line["metadata"]
            ):
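
The custom-header check in `_is_custom_header` can be tried in isolation. The following is a minimal sketch that restates the method from the listing above as a free function; the function name `is_custom_header` and the explicit `patterns` argument are illustrative assumptions, not part of the library.

```python
import re


def is_custom_header(line: str, sep: str, patterns: dict[str, int]) -> bool:
    """Return True if `line` is wrapped in exactly one `sep` on each side.

    Hypothetical standalone version of the method shown above; `patterns`
    stands in for `self.custom_header_patterns`.
    """
    if sep not in patterns:
        return False
    escaped = re.escape(sep)
    # One separator at the start, one at the end, with lazily matched content
    # in between; the lookarounds reject a doubled separator on either side.
    pattern = f"^{escaped}(?!{escaped})(.+?)(?<!{escaped}){escaped}$"
    match = re.match(pattern, line)
    if not match:
        return False
    content = match.group(1).strip()
    # Reject content that is empty or made up only of separator characters.
    return bool(content) and not all(c in sep for c in content.replace(" ", ""))


patterns = {"**": 1}
print(is_custom_header("**Overview**", "**", patterns))  # True
print(is_custom_header("****", "**", patterns))          # False
print(is_custom_header("** **", "**", patterns))         # False
print(is_custom_header("**Overview**", "__", patterns))  # False
```

The last call returns `False` because `"__"` is not a registered pattern, which mirrors the early return at the top of the method.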

Defined In

libs/text-splitters/langchain_text_splitters/markdown.py

Frequently Asked Questions

What is the MarkdownHeaderTextSplitter class?

MarkdownHeaderTextSplitter is a class in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/markdown.py.

Where is MarkdownHeaderTextSplitter defined?

MarkdownHeaderTextSplitter is defined in libs/text-splitters/langchain_text_splitters/markdown.py at line 23.
