ExperimentalMarkdownSyntaxTextSplitter Class — langchain Architecture

Architecture documentation for the ExperimentalMarkdownSyntaxTextSplitter class in markdown.py from the langchain codebase.

Entity Profile

Dependency Diagram

graph TD
  cd7394a9_9856_dc15_cb00_078cf42f0529["ExperimentalMarkdownSyntaxTextSplitter"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25["markdown.py"]
  cd7394a9_9856_dc15_cb00_078cf42f0529 -->|defined in| 06b29b39_6256_6eec_e45e_4ff2a6e60f25
  fc2b67ed_e223_c683_5ac5_a115772a6829["__init__()"]
  cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| fc2b67ed_e223_c683_5ac5_a115772a6829
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13["split_text()"]
  cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| ca4b44a0_217b_9ee3_738c_a86f47cf5d13
  d853fccb_3a1e_9745_c402_faf93b6c62b2["_resolve_header_stack()"]
  cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| d853fccb_3a1e_9745_c402_faf93b6c62b2
  7401677c_acf5_67de_69f0_eb4e531b0f66["_resolve_code_chunk()"]
  cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| 7401677c_acf5_67de_69f0_eb4e531b0f66
  b10a7aa1_da71_adf1_c53e_7aa30fa1d45c["_complete_chunk_doc()"]
  cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| b10a7aa1_da71_adf1_c53e_7aa30fa1d45c
  e4272ad6_fa6c_2270_2b87_1b76f6930a95["_match_header()"]
  cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| e4272ad6_fa6c_2270_2b87_1b76f6930a95
  a98db0b3_1cd6_99eb_e5f6_1cfcd76b1885["_match_code()"]
  cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| a98db0b3_1cd6_99eb_e5f6_1cfcd76b1885
  79e4075c_b84b_1d21_02bd_fab2d78f732e["_match_horz()"]
  cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| 79e4075c_b84b_1d21_02bd_fab2d78f732e

Source Code

libs/text-splitters/langchain_text_splitters/markdown.py lines 298–481

class ExperimentalMarkdownSyntaxTextSplitter:
    """An experimental text splitter for handling Markdown syntax.

    This splitter aims to retain the exact whitespace of the original text while
    extracting structured metadata, such as headers. It is a re-implementation of the
    `MarkdownHeaderTextSplitter` with notable changes to the approach and additional
    features.

    Key Features:

    * Retains the original whitespace and formatting of the Markdown text.
    * Extracts headers, code blocks, and horizontal rules as metadata.
    * Splits out code blocks and includes the language in the "Code" metadata key.
    * Splits text on horizontal rules (`---`) as well.
    * Defaults to sensible splitting behavior, which can be overridden using the
        `headers_to_split_on` parameter.

    Example:
        ```python
        headers_to_split_on = [
            ("#", "Header 1"),
            ("##", "Header 2"),
        ]
        splitter = ExperimentalMarkdownSyntaxTextSplitter(
            headers_to_split_on=headers_to_split_on
        )
        chunks = splitter.split_text(text)
        for chunk in chunks:
            print(chunk)
        ```

    This class is currently experimental and subject to change based on feedback and
    further development.
    """

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]] | None = None,
        return_each_line: bool = False,  # noqa: FBT001,FBT002
        strip_headers: bool = True,  # noqa: FBT001,FBT002
    ) -> None:
        """Initialize the text splitter with header splitting and formatting options.

        This constructor sets up the required configuration for splitting text into
        chunks based on specified headers and formatting preferences.

        Args:
            headers_to_split_on: A list of tuples, where each tuple contains a
                Markdown header marker (e.g., "#") and its corresponding metadata key.

                If `None`, default headers are used.
            return_each_line: Whether to return each line as an individual chunk.

                Defaults to `False`, which aggregates lines into larger chunks.
            strip_headers: Whether to exclude headers from the resulting chunks.
        """
        self.chunks: list[Document] = []
        self.current_chunk = Document(page_content="")
        self.current_header_stack: list[tuple[int, str]] = []
        self.strip_headers = strip_headers
        if headers_to_split_on:
            self.splittable_headers = dict(headers_to_split_on)
        else:
            self.splittable_headers = {
                "#": "Header 1",
                "##": "Header 2",
                "###": "Header 3",
                "####": "Header 4",
                "#####": "Header 5",
                "######": "Header 6",
            }

        self.return_each_line = return_each_line

    def split_text(self, text: str) -> list[Document]:
        """Split the input text into structured chunks.

        This method processes the input text line by line, detecting headers, code
        blocks, and horizontal rules and using them as boundaries between chunks.

Defined In

libs/text-splitters/langchain_text_splitters/markdown.py
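
The line-by-line processing described in the `split_text` docstring can be illustrated with a small standalone sketch. It uses only the standard library and is not the library's implementation: the regular expressions and the names `classify_line` and `split_markdown` are hypothetical, chosen for illustration. As a simplification, the sketch drops horizontal-rule lines and returns plain strings rather than `Document` objects.

```python
import re

# Hypothetical patterns for illustration; the library's own regexes may differ.
FENCE = "`" * 3  # built programmatically to avoid a literal fence in this listing
HEADER_RE = re.compile(r"^(#{1,6}) (.*)")
CODE_FENCE_RE = re.compile("^" + FENCE)
HORZ_RE = re.compile(r"^(\*\*\*+|---+|___+)\s*$")


def classify_line(line: str) -> str:
    """Label one line as 'header', 'code', 'horz', or 'text'."""
    if HEADER_RE.match(line):
        return "header"
    if CODE_FENCE_RE.match(line):
        return "code"
    if HORZ_RE.match(line):
        return "horz"
    return "text"


def split_markdown(text: str) -> list[str]:
    """Split text at headers, code fences, and horizontal rules.

    Line endings are kept, so whitespace inside each chunk is unchanged.
    """
    chunks: list[str] = []
    current = ""
    in_code = False

    def flush() -> None:
        nonlocal current
        if current:
            chunks.append(current)
        current = ""

    for line in text.splitlines(keepends=True):
        if in_code:
            # Inside a code block nothing is interpreted except the closing fence.
            current += line
            if CODE_FENCE_RE.match(line):
                in_code = False
                flush()
            continue
        kind = classify_line(line)
        if kind == "code":
            flush()
            in_code = True
            current = line
        elif kind == "header":
            flush()
            current = line
        elif kind == "horz":
            flush()  # the rule line itself is dropped in this sketch
        else:
            current += line
    flush()
    return chunks
```

Because a line is only classified when the loop is outside a code block, a `# comment` inside a fenced block is never mistaken for a header in this sketch.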

Frequently Asked Questions

What is the ExperimentalMarkdownSyntaxTextSplitter class?

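ExperimentalMarkdownSyntaxTextSplitter is an experimental Markdown text splitter in the langchain codebase. According to its docstring, it re-implements `MarkdownHeaderTextSplitter`, retains the original whitespace of the text, and records headers and code-block languages as metadata. It is defined in libs/text-splitters/langchain_text_splitters/markdown.py.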
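
To make the header metadata concrete, here is a minimal sketch of how a stack of `(depth, text)` tuples, shaped like the `current_header_stack` attribute set in the constructor, could be maintained and turned into metadata. The functions `push_header` and `stack_to_metadata` are hypothetical and only approximate what `_resolve_header_stack()` might do; the real method may differ.

```python
def push_header(
    stack: list[tuple[int, str]], depth: int, title: str
) -> list[tuple[int, str]]:
    """Return a new stack after encountering a header at the given depth."""
    # A new header ends every open header at the same or a deeper level,
    # so only strictly shallower ancestors are kept.
    kept = [(d, t) for d, t in stack if d < depth]
    kept.append((depth, title))
    return kept


def stack_to_metadata(
    stack: list[tuple[int, str]], names: dict[int, str]
) -> dict[str, str]:
    """Map each header on the stack to its metadata key, e.g. 'Header 2'."""
    return {names[d]: t for d, t in stack if d in names}
```

Under these assumptions, pushing "Intro" (depth 1), "Setup" (depth 2), "Linux" (depth 3), and then "Usage" (depth 2) leaves only "Intro" and "Usage" on the stack, so a chunk under "Usage" would carry just those two headers as metadata.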

Where is ExperimentalMarkdownSyntaxTextSplitter defined?

ExperimentalMarkdownSyntaxTextSplitter is defined in libs/text-splitters/langchain_text_splitters/markdown.py at line 298.
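
The constructor in the source listing chooses between caller-supplied headers and six defaults with a plain truthiness check. The sketch below mirrors that logic in a standalone function (`build_header_map` is a hypothetical name, not part of the library) to show one consequence: an empty list behaves like `None` and falls back to the defaults.

```python
from typing import Optional

# Same six defaults as in the constructor: "#" -> "Header 1", ..., "######" -> "Header 6".
DEFAULT_HEADERS = {"#" * n: f"Header {n}" for n in range(1, 7)}


def build_header_map(
    headers_to_split_on: Optional[list[tuple[str, str]]] = None,
) -> dict[str, str]:
    """Convert (marker, metadata key) tuples to a dict, or use the defaults."""
    if headers_to_split_on:  # None and [] are both falsy
        return dict(headers_to_split_on)
    return dict(DEFAULT_HEADERS)
```

Since the tuples are passed through `dict()`, a marker listed twice keeps only its last metadata key.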
