MarkdownHeaderTextSplitter Class — langchain Architecture
Architecture documentation for the MarkdownHeaderTextSplitter class in markdown.py from the langchain codebase.
Dependency Diagram
graph TD
    6a11b5bb_e2e9_6671_54b0_3ed10f3c9672["MarkdownHeaderTextSplitter"]
    06b29b39_6256_6eec_e45e_4ff2a6e60f25["markdown.py"]
    6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|defined in| 06b29b39_6256_6eec_e45e_4ff2a6e60f25
    e7138765_af2e_2047_7003_e6f5f9952e55["__init__()"]
    6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|method| e7138765_af2e_2047_7003_e6f5f9952e55
    c0f7b205_386f_81f7_010d_c9dac195bb30["_is_custom_header()"]
    6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|method| c0f7b205_386f_81f7_010d_c9dac195bb30
    cd7326ce_b97a_382e_8cf3_d4647f3a82a6["aggregate_lines_to_chunks()"]
    6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|method| cd7326ce_b97a_382e_8cf3_d4647f3a82a6
    b18c92c3_4d24_0e77_6322_b71c795c08ff["split_text()"]
    6a11b5bb_e2e9_6671_54b0_3ed10f3c9672 -->|method| b18c92c3_4d24_0e77_6322_b71c795c08ff
Source Code
libs/text-splitters/langchain_text_splitters/markdown.py lines 23–280
class MarkdownHeaderTextSplitter:
    """Splitting markdown files based on specified headers."""

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_line: bool = False,  # noqa: FBT001,FBT002
        strip_headers: bool = True,  # noqa: FBT001,FBT002
        custom_header_patterns: dict[str, int] | None = None,
    ) -> None:
        """Create a new `MarkdownHeaderTextSplitter`.

        Args:
            headers_to_split_on: Headers we want to track
            return_each_line: Return each line w/ associated headers
            strip_headers: Strip split headers from the content of the chunk
            custom_header_patterns: Optional dict mapping header patterns to their
                levels.

                For example: `{"**": 1, "***": 2}` to treat `**Header**` as level 1 and
                `***Header***` as level 2 headers.
        """
        # Output line-by-line or aggregated into chunks w/ common headers
        self.return_each_line = return_each_line
        # Given the headers we want to split on,
        # (e.g., "#, ##, etc") order by length
        self.headers_to_split_on = sorted(
            headers_to_split_on, key=lambda split: len(split[0]), reverse=True
        )
        # Strip headers split headers from the content of the chunk
        self.strip_headers = strip_headers
        # Custom header patterns with their levels
        self.custom_header_patterns = custom_header_patterns or {}

    def _is_custom_header(self, line: str, sep: str) -> bool:
        """Check if line matches a custom header pattern.

        Args:
            line: The line to check
            sep: The separator pattern to match

        Returns:
            `True` if the line matches the custom pattern format
        """
        if sep not in self.custom_header_patterns:
            return False

        # Escape special regex characters in the separator
        escaped_sep = re.escape(sep)
        # Create regex pattern to match exactly one separator at start and end
        # with content in between
        pattern = (
            f"^{escaped_sep}(?!{escaped_sep})(.+?)(?<!{escaped_sep}){escaped_sep}$"
        )

        match = re.match(pattern, line)
        if match:
            # Extract the content between the patterns
            content = match.group(1).strip()
            # Valid header if there's actual content (not just whitespace or separators)
            # Check that content doesn't consist only of separator characters
            if content and not all(c in sep for c in content.replace(" ", "")):
                return True
        return False

    def aggregate_lines_to_chunks(self, lines: list[LineType]) -> list[Document]:
        """Combine lines with common metadata into chunks.

        Args:
            lines: Line of text / associated header metadata

        Returns:
            List of `Document` objects with common metadata aggregated.
        """
        aggregated_chunks: list[LineType] = []

        for line in lines:
            if (
                aggregated_chunks
                and aggregated_chunks[-1]["metadata"] == line["metadata"]
            ):
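The custom-header check in `_is_custom_header()` can be tried in isolation. The sketch below copies the regex and the content test from the listing into a standalone function; the function name `is_custom_header` and the explicit `patterns` argument (standing in for `self.custom_header_patterns`) are assumptions made for illustration and are not part of the class.

```python
import re


def is_custom_header(line: str, sep: str, patterns: dict[str, int]) -> bool:
    """Standalone copy of the check: `line` must be wrapped in `sep` on both sides."""
    # Only separators registered as custom header patterns are considered.
    if sep not in patterns:
        return False

    # Escape the separator so characters such as "*" are matched literally.
    escaped_sep = re.escape(sep)
    # Separator at the start (not immediately repeated), some content,
    # then the separator at the end (not immediately preceded by itself).
    pattern = f"^{escaped_sep}(?!{escaped_sep})(.+?)(?<!{escaped_sep}){escaped_sep}$"

    match = re.match(pattern, line)
    if not match:
        return False

    # Reject matches whose content is empty or made only of separator characters.
    content = match.group(1).strip()
    return bool(content) and not all(c in sep for c in content.replace(" ", ""))


patterns = {"**": 1}
print(is_custom_header("**Title**", "**", patterns))
print(is_custom_header("**Title", "**", patterns))
print(is_custom_header("** **", "**", patterns))
```

With `patterns = {"**": 1}`, a line such as `**Title**` passes the check, while an unterminated `**Title` or a whitespace-only `** **` does not.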
Frequently Asked Questions
What is the MarkdownHeaderTextSplitter class?
MarkdownHeaderTextSplitter is a class in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/markdown.py.
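As a rough illustration of the idea behind the class, the following is a simplified, standard-library-only sketch of header-based splitting. It is not the langchain implementation: the function `split_by_headers`, the plain-dict chunk format, and the rule that a header's level equals the length of its marker are all assumptions made for this example. It borrows two ideas visible in the source listing: markers are sorted longest first, and consecutive lines with equal metadata are merged into one chunk.

```python
def split_by_headers(
    text: str, headers_to_split_on: list[tuple[str, str]]
) -> list[dict]:
    """Group non-header lines into chunks tagged with their enclosing headers."""
    # Longest markers first, so "##" is tested before "#".
    headers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)

    chunks: list[dict] = []
    stack: list[tuple[int, str, str]] = []  # (level, metadata key, header title)

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        for marker, name in headers:
            if line.startswith(marker + " "):
                # Assumption: the level is the marker length ("#" -> 1, "##" -> 2).
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                stack = [h for h in stack if h[0] < level]
                stack.append((level, name, line[len(marker):].strip()))
                break
        else:
            # Ordinary line: attach the currently open headers as metadata.
            metadata = {name: title for _, name, title in stack}
            if chunks and chunks[-1]["metadata"] == metadata:
                # Same metadata as the previous chunk, so extend that chunk.
                chunks[-1]["content"] += "\n" + line
            else:
                chunks.append({"content": line, "metadata": metadata})
    return chunks


sample = "# Intro\nHello\nWorld\n## Details\nMore\n# Next\nEnd"
for chunk in split_by_headers(sample, [("#", "Header 1"), ("##", "Header 2")]):
    print(chunk)
```

Header lines are dropped from the chunk content here, which loosely corresponds to the `strip_headers=True` default in the constructor; the real class returns `Document` objects rather than dicts.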
Where is MarkdownHeaderTextSplitter defined?
MarkdownHeaderTextSplitter is defined in libs/text-splitters/langchain_text_splitters/markdown.py at line 23.