markdown.py — langchain Source File
Architecture documentation for markdown.py, a Python file in the langchain codebase. 5 imports, 0 dependents.
Dependency Diagram
graph LR
  06b29b39_6256_6eec_e45e_4ff2a6e60f25["markdown.py"]
  67ec3255_645e_8b6e_1eff_1eb3c648ed95["re"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> 67ec3255_645e_8b6e_1eff_1eb3c648ed95
  8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3["typing"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> 8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3
  c554676d_b731_47b2_a98f_c1c2d537c0aa["langchain_core.documents"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> c554676d_b731_47b2_a98f_c1c2d537c0aa
  885a8262_5dd0_fc53_460c_b7a8de727b5e["langchain_text_splitters.base"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> 885a8262_5dd0_fc53_460c_b7a8de727b5e
  26e26c06_c107_2778_a237_35607f5a6d20["langchain_text_splitters.character"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> 26e26c06_c107_2778_a237_35607f5a6d20
  style 06b29b39_6256_6eec_e45e_4ff2a6e60f25 fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
"""Markdown text splitters."""

from __future__ import annotations

import re
from typing import Any, TypedDict

from langchain_core.documents import Document

from langchain_text_splitters.base import Language
from langchain_text_splitters.character import RecursiveCharacterTextSplitter


class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    """Attempts to split the text along Markdown-formatted headings."""

    def __init__(self, **kwargs: Any) -> None:
        """Initialize a `MarkdownTextSplitter`."""
        separators = self.get_separators_for_language(Language.MARKDOWN)
        super().__init__(separators=separators, **kwargs)


class MarkdownHeaderTextSplitter:
    """Splitting markdown files based on specified headers."""

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_line: bool = False,  # noqa: FBT001,FBT002
        strip_headers: bool = True,  # noqa: FBT001,FBT002
        custom_header_patterns: dict[str, int] | None = None,
    ) -> None:
        """Create a new `MarkdownHeaderTextSplitter`.

        Args:
            headers_to_split_on: Headers we want to track
            return_each_line: Return each line w/ associated headers
            strip_headers: Strip split headers from the content of the chunk
            custom_header_patterns: Optional dict mapping header patterns to their
                levels.

                For example: `{"**": 1, "***": 2}` to treat `**Header**` as level 1 and
                `***Header***` as level 2 headers.
        """
        # Output line-by-line or aggregated into chunks w/ common headers
        self.return_each_line = return_each_line
        # Given the headers we want to split on,
        # (e.g., "#, ##, etc") order by length
        self.headers_to_split_on = sorted(
            headers_to_split_on, key=lambda split: len(split[0]), reverse=True
        )
        # Strip split headers from the content of the chunk
        self.strip_headers = strip_headers
        # Custom header patterns with their levels
        self.custom_header_patterns = custom_header_patterns or {}

    def _is_custom_header(self, line: str, sep: str) -> bool:
        """Check if line matches a custom header pattern.

        Args:
// ... (422 more lines)
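The constructor in the excerpt sorts headers_to_split_on by marker length, longest first. One plausible reason is prefix matching: a line that starts with `##` also starts with `#`, so a naive prefix check gives different answers depending on the order of candidates. The sketch below illustrates that effect with the standard library only. `first_prefix_match` is a hypothetical helper written for this illustration, not part of langchain, and the excerpt does not show whether the elided code relies on this ordering.

```python
from __future__ import annotations

# Same shape as headers_to_split_on: (marker, metadata name) pairs.
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]

# The sort expression is copied from the constructor excerpt.
ordered = sorted(headers, key=lambda split: len(split[0]), reverse=True)


def first_prefix_match(line: str, candidates: list[tuple[str, str]]) -> str | None:
    """Hypothetical helper: name of the first marker that prefixes the line."""
    for marker, name in candidates:
        if line.startswith(marker):
            return name
    return None


# With the shortest marker first, "#" shadows "##".
unsorted_hit = first_prefix_match("## Setup", headers)
# With the longest marker first, "##" is found before "#".
sorted_hit = first_prefix_match("## Setup", ordered)
print(unsorted_hit, "|", sorted_hit)
```

A stricter check, such as requiring a space after the marker, would also avoid the shadowing; the sort simply makes a plain prefix test safe.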
Domain
- DocumentProcessing
Subdomains
- TextSplitters
Classes
Dependencies
- langchain_core.documents
- langchain_text_splitters.base
- langchain_text_splitters.character
- re
- typing
Source
libs/text-splitters/langchain_text_splitters/markdown.py
Frequently Asked Questions
What does markdown.py do?
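markdown.py is a Python source file in the langchain codebase. It defines Markdown text splitters, including MarkdownTextSplitter (which splits on Markdown-specific separators) and MarkdownHeaderTextSplitter (which splits on a caller-specified set of headers). It belongs to the DocumentProcessing domain, TextSplitters subdomain.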
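To make the idea of header-based splitting concrete, here is a minimal standard-library sketch. It is not the library's implementation: `split_by_headers` is a hypothetical function, it assumes markers are runs of `#` followed by a space, it always strips header lines from the content, and it ignores code fences and custom header patterns.

```python
from __future__ import annotations


def split_by_headers(
    text: str, headers_to_split_on: list[tuple[str, str]]
) -> list[dict]:
    """Group lines under the headers that precede them (illustrative only)."""
    # Longest markers first, mirroring the sort in the constructor excerpt.
    ordered = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    chunks: list[dict] = []
    active: dict[int, tuple[str, str]] = {}  # level -> (metadata name, title)
    buffer: list[str] = []

    def flush() -> None:
        """Emit the buffered lines as one chunk tagged with the active headers."""
        content = "\n".join(buffer).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"content": content, "metadata": metadata})
        buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for marker, name in ordered:
            if stripped.startswith(marker + " "):
                flush()
                level = len(marker)  # assumption: level equals marker length
                # A new header closes any header at the same or a deeper level.
                for closed in [lvl for lvl in active if lvl >= level]:
                    del active[closed]
                active[level] = (name, stripped[len(marker):].strip())
                break
        else:
            buffer.append(line)  # not a tracked header: ordinary content
    flush()
    return chunks


md = "# Intro\nhello\n## Setup\nstep one\nstep two\n# Next\nbye"
chunks = split_by_headers(md, [("#", "Header 1"), ("##", "Header 2")])
for chunk in chunks:
    print(chunk["metadata"], "->", repr(chunk["content"]))
```

Each chunk carries the headers in force when its text appeared, which is the kind of metadata a header-aware splitter can attach to its output.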
What does markdown.py depend on?
markdown.py imports five modules: langchain_core.documents, langchain_text_splitters.base, langchain_text_splitters.character, re, typing.
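A dependency list like this can be reproduced from the file's import statements. The sketch below uses Python's standard `ast` module on the import header shown in the source excerpt. `top_level_modules` is a hypothetical helper, and skipping `__future__` is an assumption made so the count matches the five modules reported here; how the documentation tool actually computes its list is not stated.

```python
import ast


def top_level_modules(source: str) -> list[str]:
    """Return sorted names of modules imported at the top level of `source`."""
    modules = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Assumption: __future__ is a compiler directive, not a dependency.
            if node.module != "__future__":
                modules.add(node.module)
    return sorted(modules)


# Import header copied from the source excerpt on this page.
header = '''
from __future__ import annotations
import re
from typing import Any, TypedDict
from langchain_core.documents import Document
from langchain_text_splitters.base import Language
from langchain_text_splitters.character import RecursiveCharacterTextSplitter
'''

found = top_level_modules(header)
print(len(found), found)
```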
Where is markdown.py in the architecture?
markdown.py is located at libs/text-splitters/langchain_text_splitters/markdown.py (domain: DocumentProcessing, subdomain: TextSplitters, directory: libs/text-splitters/langchain_text_splitters).