ExperimentalMarkdownSyntaxTextSplitter Class — langchain Architecture
Architecture documentation for the ExperimentalMarkdownSyntaxTextSplitter class in markdown.py from the langchain codebase.
Dependency Diagram
graph TD
    cd7394a9_9856_dc15_cb00_078cf42f0529["ExperimentalMarkdownSyntaxTextSplitter"]
    06b29b39_6256_6eec_e45e_4ff2a6e60f25["markdown.py"]
    cd7394a9_9856_dc15_cb00_078cf42f0529 -->|defined in| 06b29b39_6256_6eec_e45e_4ff2a6e60f25
    fc2b67ed_e223_c683_5ac5_a115772a6829["__init__()"]
    cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| fc2b67ed_e223_c683_5ac5_a115772a6829
    ca4b44a0_217b_9ee3_738c_a86f47cf5d13["split_text()"]
    cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| ca4b44a0_217b_9ee3_738c_a86f47cf5d13
    d853fccb_3a1e_9745_c402_faf93b6c62b2["_resolve_header_stack()"]
    cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| d853fccb_3a1e_9745_c402_faf93b6c62b2
    7401677c_acf5_67de_69f0_eb4e531b0f66["_resolve_code_chunk()"]
    cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| 7401677c_acf5_67de_69f0_eb4e531b0f66
    b10a7aa1_da71_adf1_c53e_7aa30fa1d45c["_complete_chunk_doc()"]
    cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| b10a7aa1_da71_adf1_c53e_7aa30fa1d45c
    e4272ad6_fa6c_2270_2b87_1b76f6930a95["_match_header()"]
    cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| e4272ad6_fa6c_2270_2b87_1b76f6930a95
    a98db0b3_1cd6_99eb_e5f6_1cfcd76b1885["_match_code()"]
    cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| a98db0b3_1cd6_99eb_e5f6_1cfcd76b1885
    79e4075c_b84b_1d21_02bd_fab2d78f732e["_match_horz()"]
    cd7394a9_9856_dc15_cb00_078cf42f0529 -->|method| 79e4075c_b84b_1d21_02bd_fab2d78f732e
Source Code
libs/text-splitters/langchain_text_splitters/markdown.py lines 298–481
class ExperimentalMarkdownSyntaxTextSplitter:
    """An experimental text splitter for handling Markdown syntax.

    This splitter aims to retain the exact whitespace of the original text while
    extracting structured metadata, such as headers. It is a re-implementation of the
    `MarkdownHeaderTextSplitter` with notable changes to the approach and additional
    features.

    Key Features:

    * Retains the original whitespace and formatting of the Markdown text.
    * Extracts headers, code blocks, and horizontal rules as metadata.
    * Splits out code blocks and includes the language in the "Code" metadata key.
    * Splits text on horizontal rules (`---`) as well.
    * Defaults to splitting on header levels 1 through 6 (`#` to `######`), which
        can be overridden using the `headers_to_split_on` parameter.

    Example:
        ```python
        headers_to_split_on = [
            ("#", "Header 1"),
            ("##", "Header 2"),
        ]
        splitter = ExperimentalMarkdownSyntaxTextSplitter(
            headers_to_split_on=headers_to_split_on
        )
        chunks = splitter.split_text(text)
        for chunk in chunks:
            print(chunk)
        ```

    This class is currently experimental and subject to change based on feedback and
    further development.
    """

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]] | None = None,
        return_each_line: bool = False,  # noqa: FBT001,FBT002
        strip_headers: bool = True,  # noqa: FBT001,FBT002
    ) -> None:
        """Initialize the text splitter with header splitting and formatting options.

        This constructor sets up the required configuration for splitting text into
        chunks based on specified headers and formatting preferences.

        Args:
            headers_to_split_on: A list of tuples, where each tuple contains a header
                tag (e.g., "#") and its corresponding metadata key.
                If `None`, default headers are used.
            return_each_line: Whether to return each line as an individual chunk.
                Defaults to `False`, which aggregates lines into larger chunks.
            strip_headers: Whether to exclude headers from the resulting chunks.
        """
        self.chunks: list[Document] = []
        self.current_chunk = Document(page_content="")
        self.current_header_stack: list[tuple[int, str]] = []
        self.strip_headers = strip_headers
        if headers_to_split_on:
            self.splittable_headers = dict(headers_to_split_on)
        else:
            self.splittable_headers = {
                "#": "Header 1",
                "##": "Header 2",
                "###": "Header 3",
                "####": "Header 4",
                "#####": "Header 5",
                "######": "Header 6",
            }
        self.return_each_line = return_each_line

    def split_text(self, text: str) -> list[Document]:
        """Split the input text into structured chunks.

        This method processes the input text line by line, detecting headers,
        code blocks, and horizontal rules and using them as the boundaries
        between structured chunks.
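The listing above is cut off before the body of `split_text`, so the following is only a minimal standard-library sketch of the line-by-line approach the docstring describes, not the library's implementation. The function name `sketch_split`, the three regular expressions, and the choice to drop horizontal-rule lines and whitespace-only chunks are all assumptions made for illustration. It keeps a header stack of `(depth, title)` pairs, similar in shape to `current_header_stack`, and returns plain `(content, metadata)` tuples instead of `Document` objects.

```python
import re

# Hypothetical patterns for this sketch; the library's own regexes may differ.
HEADER_RE = re.compile(r"^(#{1,6}) (.*)")
FENCE_RE = re.compile(r"^(`{3}|~{3})(.*)")
HORZ_RE = re.compile(r"^(\*{3,}|-{3,}|_{3,})\s*$")

# Same default mapping as in __init__: "#" -> "Header 1", ..., "######" -> "Header 6".
DEFAULT_HEADERS = {"#" * n: f"Header {n}" for n in range(1, 7)}


def sketch_split(text, headers=None):
    """Return (content, metadata) pairs; headers are always stripped here."""
    headers = headers or DEFAULT_HEADERS
    chunks = []    # finished (content, metadata) pairs
    stack = []     # currently open headers as (depth, title)
    buffer = []    # raw lines of the chunk being built
    fence = None   # opening fence marker while inside a code block
    lang = ""      # language tag of the current code block

    def flush(extra=None):
        content = "".join(buffer)
        buffer.clear()
        if content.strip():  # skip whitespace-only chunks (an assumption)
            meta = {headers["#" * depth]: title for depth, title in stack}
            meta.update(extra or {})
            chunks.append((content, meta))

    for line in text.splitlines(keepends=True):
        if fence is not None:  # inside a code block: copy lines verbatim
            buffer.append(line)
            if line.startswith(fence):
                flush({"Code": lang})
                fence = None
            continue
        header = HEADER_RE.match(line)
        code = FENCE_RE.match(line)
        if header and header.group(1) in headers:
            flush()  # close the previous chunk under the old header stack
            depth = len(header.group(1))
            while stack and stack[-1][0] >= depth:
                stack.pop()  # a new header closes deeper or equal ones
            stack.append((depth, header.group(2).strip()))
        elif code:
            flush()
            fence, lang = code.group(1), code.group(2).strip()
            buffer.append(line)
        elif HORZ_RE.match(line):
            flush()  # the rule line itself is dropped in this sketch
        else:
            buffer.append(line)
    flush()
    return chunks


FENCE = "`" * 3  # built this way to avoid a literal fence inside this example
sample = (
    "# Title\n\nIntro text.\n\n"
    "## Section\n\nBody.\n\n"
    f"{FENCE}python\nprint('hi')\n{FENCE}\n\n"
    "---\n\nAfter rule.\n"
)
chunks = sketch_split(sample)
for content, meta in chunks:
    print(repr(content), meta)
```

Because lines are split with `keepends=True` and joined back unchanged, each chunk keeps its original whitespace, which is the property the class docstring emphasizes. Flushing before the header stack is updated ensures a chunk is tagged with the headers that were in effect while it was being collected.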
Frequently Asked Questions
What is the ExperimentalMarkdownSyntaxTextSplitter class?
ExperimentalMarkdownSyntaxTextSplitter is a class in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/markdown.py.
Where is ExperimentalMarkdownSyntaxTextSplitter defined?
ExperimentalMarkdownSyntaxTextSplitter is defined in libs/text-splitters/langchain_text_splitters/markdown.py at line 298.
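Given that file location, a short usage sketch might look like the following. The import path is inferred from the module path above and assumes the `langchain_text_splitters` package is installed; the sample text is made up for illustration, and the example skips itself if the import fails. No output is shown because the exact chunk boundaries depend on the installed version.

```python
try:
    # Import path inferred from the file location given above.
    from langchain_text_splitters.markdown import (
        ExperimentalMarkdownSyntaxTextSplitter,
    )
except ImportError:  # package not installed; the example is skipped
    ExperimentalMarkdownSyntaxTextSplitter = None

# Made-up sample input for illustration.
sample = "# Guide\n\nIntro paragraph.\n\n## Setup\n\nInstall the package.\n"

documents = None
if ExperimentalMarkdownSyntaxTextSplitter is not None:
    splitter = ExperimentalMarkdownSyntaxTextSplitter(
        headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")],
        strip_headers=True,
    )
    documents = splitter.split_text(sample)
    for doc in documents:
        # Each Document carries the chunk text and its header metadata.
        print(repr(doc.page_content), doc.metadata)
```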