HTMLHeaderTextSplitter Class — langchain Architecture

Architecture documentation for the HTMLHeaderTextSplitter class in html.py from the langchain codebase.

Kind: class. Language: Python. Domain: DocumentProcessing.

Dependency Diagram

graph TD
  86dc20d4_404a_b608_01da_8dea923ef2c9["HTMLHeaderTextSplitter"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f["html.py"]
  86dc20d4_404a_b608_01da_8dea923ef2c9 -->|defined in| e3efe57c_5b49_c26c_6ca5_45acccb8037f
  95c4d16f_9ef1_6b54_c03d_da25e987fd1c["__init__()"]
  86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| 95c4d16f_9ef1_6b54_c03d_da25e987fd1c
  3a8f906a_02bf_a0ff_6dbb_2ffbc48f937d["split_text()"]
  86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| 3a8f906a_02bf_a0ff_6dbb_2ffbc48f937d
  982f8e7f_63e2_a8f4_7f7f_3def7fb3d84b["split_text_from_url()"]
  86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| 982f8e7f_63e2_a8f4_7f7f_3def7fb3d84b
  cf1e77cb_9fca_ca93_1428_c967d5cb0c97["split_text_from_file()"]
  86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| cf1e77cb_9fca_ca93_1428_c967d5cb0c97
  f984e4e0_af18_0fc8_0a3e_18771966d43d["_generate_documents()"]
  86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| f984e4e0_af18_0fc8_0a3e_18771966d43d

Source Code

libs/text-splitters/langchain_text_splitters/html.py lines 83–345

class HTMLHeaderTextSplitter:
    """Split HTML content into structured Documents based on specified headers.

    Splits HTML content by detecting specified header tags and creating hierarchical
    `Document` objects that reflect the semantic structure of the original content. For
    each identified section, the splitter associates the extracted text with metadata
    corresponding to the encountered headers.

    If no specified headers are found, the entire content is returned as a single
    `Document`. This allows for flexible handling of HTML input, ensuring that
    information is organized according to its semantic headers.

    The splitter provides the option to return each HTML element as a separate
    `Document` or aggregate them into semantically meaningful chunks. It also
    gracefully handles multiple levels of nested headers, creating a rich,
    hierarchical representation of the content.

    Example:
        ```python
        from langchain_text_splitters import HTMLHeaderTextSplitter

        # Define headers for splitting on h1 and h2 tags.
        headers_to_split_on = [("h1", "Main Topic"), ("h2", "Sub Topic")]

        splitter = HTMLHeaderTextSplitter(
            headers_to_split_on=headers_to_split_on,
            return_each_element=False
        )

        html_content = \"\"\"
        <html>
            <body>
                <h1>Introduction</h1>
                <p>Welcome to the introduction section.</p>
                <h2>Background</h2>
                <p>Some background details here.</p>
                <h1>Conclusion</h1>
                <p>Final thoughts.</p>
            </body>
        </html>
        \"\"\"

        documents = splitter.split_text(html_content)

        # 'documents' now contains Document objects reflecting the hierarchy:
        # - Document with metadata={"Main Topic": "Introduction"} and
        #   content="Introduction"
        # - Document with metadata={"Main Topic": "Introduction"} and
        #   content="Welcome to the introduction section."
        # - Document with metadata={"Main Topic": "Introduction",
        #   "Sub Topic": "Background"} and content="Background"
        # - Document with metadata={"Main Topic": "Introduction",
        #   "Sub Topic": "Background"} and content="Some background details here."
        # - Document with metadata={"Main Topic": "Conclusion"} and
        #   content="Conclusion"
        # - Document with metadata={"Main Topic": "Conclusion"} and
        #   content="Final thoughts."
        ```
    """

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_element: bool = False,  # noqa: FBT001,FBT002
    ) -> None:
        """Initialize with headers to split on.

        Args:
            headers_to_split_on: A list of `(header_tag,
                header_name)` pairs representing the headers that define splitting
                boundaries.

                For example, `[("h1", "Header 1"), ("h2", "Header 2")]` will split
                content by `h1` and `h2` tags, assigning their textual content to the
                `Document` metadata.
            return_each_element: If `True`, every HTML element encountered
                (including headers, paragraphs, etc.) is returned as a separate
                `Document`.
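
The docstring above describes three behaviors: text is split at the configured header tags, each section carries the headers currently in effect as metadata, and input without any configured header comes back as a single document. The following is a minimal, standard-library sketch of that idea, not the langchain implementation. `Doc`, `HeaderSplitSketch`, and `split_html` are hypothetical names introduced here for illustration, and the sketch assumes header tags of the form `h1` to `h6` so that the level can be read from the tag name.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document."""

    page_content: str
    metadata: dict[str, str] = field(default_factory=dict)


class HeaderSplitSketch(HTMLParser):
    """Simplified header-based splitter (illustration only)."""

    def __init__(self, headers_to_split_on: list[tuple[str, str]]) -> None:
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # tag -> metadata key
        self.active: dict[str, tuple[int, str]] = {}  # key -> (level, text)
        self.docs: list[Doc] = []
        self.buffer: list[str] = []  # body text of the current section
        self.header_tag: str | None = None  # header currently open, if any
        self.header_text: list[str] = []

    def _meta(self) -> dict[str, str]:
        return {key: text for key, (_, text) in self.active.items()}

    def flush(self) -> None:
        """Emit the buffered body text as one aggregated section."""
        text = " ".join(self.buffer).strip()
        self.buffer.clear()
        if text:
            self.docs.append(Doc(text, self._meta()))

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.flush()  # close the section that preceded this header
            self.header_tag = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.header_tag:
            level = int(tag[1:])  # assumption: tags look like "h1".."h6"
            text = " ".join(self.header_text).strip()
            # A new header replaces any header at the same or a deeper level.
            self.active = {k: v for k, v in self.active.items() if v[0] < level}
            self.active[self.header_names[tag]] = (level, text)
            self.docs.append(Doc(text, self._meta()))
            self.header_tag = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        target = self.header_text if self.header_tag else self.buffer
        target.append(data)


def split_html(html: str, headers_to_split_on: list[tuple[str, str]]) -> list[Doc]:
    parser = HeaderSplitSketch(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit whatever text follows the last header
    return parser.docs


html = """
<html><body>
  <h1>Introduction</h1><p>Welcome to the introduction section.</p>
  <h2>Background</h2><p>Some background details here.</p>
  <h1>Conclusion</h1><p>Final thoughts.</p>
</body></html>
"""
for doc in split_html(html, [("h1", "Main Topic"), ("h2", "Sub Topic")]):
    print(doc.metadata, "->", doc.page_content)
```

On the sample HTML this sketch yields six sections with the same content and metadata as the list in the docstring example. The second `h1` clears the `Sub Topic` entry because a new header drops every active header at the same or a deeper level. This sketch always aggregates body text between headers, which corresponds only to the `return_each_element=False` case.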

Domain

DocumentProcessing

Defined In

libs/text-splitters/langchain_text_splitters/html.py

Frequently Asked Questions

What is the HTMLHeaderTextSplitter class?

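HTMLHeaderTextSplitter is a class in the langchain codebase that splits HTML content into `Document` objects at specified header tags and attaches the headers in effect for each section as metadata. It is defined in libs/text-splitters/langchain_text_splitters/html.py.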

Where is HTMLHeaderTextSplitter defined?

HTMLHeaderTextSplitter is defined in libs/text-splitters/langchain_text_splitters/html.py at line 83.
