markdown.py — langchain Source File

Architecture documentation for markdown.py, a Python file in the langchain codebase. 5 imports, 0 dependents.

Type: File | Language: Python | Domain: DocumentProcessing | Subdomain: TextSplitters | 5 imports | 5 classes

Dependency Diagram

graph LR
  06b29b39_6256_6eec_e45e_4ff2a6e60f25["markdown.py"]
  67ec3255_645e_8b6e_1eff_1eb3c648ed95["re"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> 67ec3255_645e_8b6e_1eff_1eb3c648ed95
  8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3["typing"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> 8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3
  c554676d_b731_47b2_a98f_c1c2d537c0aa["langchain_core.documents"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> c554676d_b731_47b2_a98f_c1c2d537c0aa
  885a8262_5dd0_fc53_460c_b7a8de727b5e["langchain_text_splitters.base"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> 885a8262_5dd0_fc53_460c_b7a8de727b5e
  26e26c06_c107_2778_a237_35607f5a6d20["langchain_text_splitters.character"]
  06b29b39_6256_6eec_e45e_4ff2a6e60f25 --> 26e26c06_c107_2778_a237_35607f5a6d20
  style 06b29b39_6256_6eec_e45e_4ff2a6e60f25 fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

"""Markdown text splitters."""

from __future__ import annotations

import re
from typing import Any, TypedDict

from langchain_core.documents import Document

from langchain_text_splitters.base import Language
from langchain_text_splitters.character import RecursiveCharacterTextSplitter


class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    """Attempts to split the text along Markdown-formatted headings."""

    def __init__(self, **kwargs: Any) -> None:
        """Initialize a `MarkdownTextSplitter`."""
        separators = self.get_separators_for_language(Language.MARKDOWN)
        super().__init__(separators=separators, **kwargs)


class MarkdownHeaderTextSplitter:
    """Splitting markdown files based on specified headers."""

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_line: bool = False,  # noqa: FBT001,FBT002
        strip_headers: bool = True,  # noqa: FBT001,FBT002
        custom_header_patterns: dict[str, int] | None = None,
    ) -> None:
        """Create a new `MarkdownHeaderTextSplitter`.

        Args:
            headers_to_split_on: Headers we want to track
            return_each_line: Return each line w/ associated headers
            strip_headers: Strip split headers from the content of the chunk
            custom_header_patterns: Optional dict mapping header patterns to their
                levels.

                For example: `{"**": 1, "***": 2}` to treat `**Header**` as level 1 and
                `***Header***` as level 2 headers.
        """
        # Output line-by-line or aggregated into chunks w/ common headers
        self.return_each_line = return_each_line
        # Given the headers we want to split on (e.g., "#", "##"),
        # order by marker length, longest first
        self.headers_to_split_on = sorted(
            headers_to_split_on, key=lambda split: len(split[0]), reverse=True
        )
        # Strip split headers from the content of the chunk
        self.strip_headers = strip_headers
        # Custom header patterns with their levels
        self.custom_header_patterns = custom_header_patterns or {}

    def _is_custom_header(self, line: str, sep: str) -> bool:
        """Check if line matches a custom header pattern.

        Args:
// ... (422 more lines)
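
The constructor above sorts `headers_to_split_on` by marker length, longest first. The listing is truncated before the matching logic, so the following is only a minimal sketch of why such an ordering can matter, assuming a deliberately naive prefix check. The helper `first_prefix_match` is hypothetical and is not part of the library.

```python
from __future__ import annotations


def first_prefix_match(line: str, headers: list[tuple[str, str]]) -> str | None:
    """Return the metadata name of the first marker that prefixes the line.

    Hypothetical helper for illustration only; the real matching code is
    not shown in the listing above.
    """
    for sep, name in headers:
        if line.startswith(sep):  # naive check, deliberately order-sensitive
            return name
    return None


headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]

# Same ordering as the constructor above: longest marker first.
by_length = sorted(headers, key=lambda split: len(split[0]), reverse=True)

# In the caller's order, "#" also prefixes "## Setup" and wins too early.
unsorted_hit = first_prefix_match("## Setup", headers)

# With the longest marker first, "##" is tried before "#".
sorted_hit = first_prefix_match("## Setup", by_length)

print(unsorted_hit, sorted_hit)
```

With a prefix test like this one, trying the longest marker first keeps a level-2 heading from being labeled as level 1. A stricter check (for example, requiring a space after the marker) would be less sensitive to the order.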

Domain

DocumentProcessing

Subdomains

TextSplitters

Dependencies

langchain_core.documents
langchain_text_splitters.base
langchain_text_splitters.character
re
typing

Frequently Asked Questions

What does markdown.py do?

markdown.py defines the Markdown text splitters in the langchain codebase, including MarkdownTextSplitter and MarkdownHeaderTextSplitter. It is written in Python and belongs to the DocumentProcessing domain, TextSplitters subdomain.

What does markdown.py depend on?

markdown.py imports 5 modules: langchain_core.documents, langchain_text_splitters.base, langchain_text_splitters.character, re, typing.
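
A dependency list like this can be reproduced from the import statements with the standard-library `ast` module. The sketch below is one possible approach, not necessarily how this page was generated. It assumes that `from __future__ import annotations` is not counted as a dependency, which matches the count of 5. The `SOURCE` string copies the imports from the listing, and `imported_modules` is a hypothetical helper name.

```python
import ast

# Import section copied from the source listing of markdown.py.
SOURCE = """
from __future__ import annotations

import re
from typing import Any, TypedDict

from langchain_core.documents import Document

from langchain_text_splitters.base import Language
from langchain_text_splitters.character import RecursiveCharacterTextSplitter
"""


def imported_modules(source: str) -> list[str]:
    """Collect the distinct module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a, b" lists one alias per module.
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from a.b import c" depends on module "a.b".
            modules.add(node.module)
    # Assumption: the __future__ pseudo-module is not a real dependency.
    modules.discard("__future__")
    return sorted(modules)


modules = imported_modules(SOURCE)
print(len(modules), modules)
```

The code is only parsed, never executed, so this works even when the langchain packages are not installed.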

Where is markdown.py in the architecture?

markdown.py is located at libs/text-splitters/langchain_text_splitters/markdown.py (domain: DocumentProcessing, subdomain: TextSplitters, directory: libs/text-splitters/langchain_text_splitters).
