split_html_by_headers() — langchain Function Reference

Architecture documentation for split_html_by_headers(), a method of HTMLSectionSplitter defined in html.py in the langchain codebase.

Type: Function | Language: Python | Category: DocumentProcessing → TextSplitters | Calls: 1 | Called by: 1

Dependency Diagram

graph TD
  219ab3b6_0b12_7f58_ba5f_9bfbebda0057["split_html_by_headers()"]
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c["HTMLSectionSplitter"]
  219ab3b6_0b12_7f58_ba5f_9bfbebda0057 -->|defined in| 0c8a5f97_7cb0_fe24_746d_9689c4e5426c
  170f66e6_a026_8fd5_9128_33eeefb7dd62["split_text_from_file()"]
  170f66e6_a026_8fd5_9128_33eeefb7dd62 -->|calls| 219ab3b6_0b12_7f58_ba5f_9bfbebda0057
  4134a695_a3ab_4bed_f7a0_3a766652fc3e["_find_all_tags()"]
  219ab3b6_0b12_7f58_ba5f_9bfbebda0057 -->|calls| 4134a695_a3ab_4bed_f7a0_3a766652fc3e
  style 219ab3b6_0b12_7f58_ba5f_9bfbebda0057 fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/text-splitters/langchain_text_splitters/html.py lines 432–491

    def split_html_by_headers(self, html_doc: str) -> list[dict[str, str | None]]:
        """Split an HTML document into sections based on specified header tags.

        This method uses BeautifulSoup to parse the HTML content and divides it into
        sections based on headers defined in `headers_to_split_on`. Each section
        contains the header text, content under the header, and the tag name.

        Args:
            html_doc: The HTML document to be split into sections.

        Returns:
            A list of dictionaries representing sections.

                Each dictionary contains:

                * `'header'`: The header text or a default title for the first section.
                * `'content'`: The content under the header.
                * `'tag_name'`: The name of the header tag (e.g., `h1`, `h2`).

        Raises:
            ImportError: If BeautifulSoup is not installed.
        """
        if not _HAS_BS4:
            msg = "Unable to import BeautifulSoup/PageElement, \
                    please install with `pip install \
                    bs4`."
            raise ImportError(msg)

        soup = BeautifulSoup(html_doc, "html.parser")
        header_names = list(self.headers_to_split_on.keys())
        sections: list[dict[str, str | None]] = []

        headers = _find_all_tags(soup, name=["body", *header_names])

        for i, header in enumerate(headers):
            if i == 0:
                current_header = "#TITLE#"
                current_header_tag = "h1"
                section_content: list[str] = []
            else:
                current_header = header.text.strip()
                current_header_tag = header.name
                section_content = []
            for element in header.next_elements:
                if i + 1 < len(headers) and element == headers[i + 1]:
                    break
                if isinstance(element, str):
                    section_content.append(element)
            content = " ".join(section_content).strip()

            if content:
                sections.append(
                    {
                        "header": current_header,
                        "content": content,
                        "tag_name": current_header_tag,
                    }
                )

        return sections
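
The sectioning logic in the listing above can be approximated without BeautifulSoup. The following is a minimal sketch, not the library's implementation: it swaps in the standard-library `html.parser.HTMLParser`, and the names `SectionSketch` and `split_by_headers_sketch` are hypothetical helpers introduced only for illustration. It assumes headers are not nested inside each other and mirrors three behaviours visible in the source: the first matched tag is labelled `#TITLE#` with tag `h1`, the header's own text is part of `content`, and empty sections are dropped.

```python
from html.parser import HTMLParser


class SectionSketch(HTMLParser):
    """Stdlib-only approximation of header-based sectioning (hypothetical)."""

    def __init__(self, header_names):
        super().__init__()
        # Same boundary tags as the listing: "body" plus the configured headers.
        self.split_tags = {"body", *header_names}
        # One (header_parts, tag, text_nodes) entry per boundary tag seen.
        self.raw = []
        # Name of the boundary tag we are currently inside, if any.
        self.open_header = None

    def handle_starttag(self, tag, attrs):
        if tag in self.split_tags:
            self.raw.append(([], tag, []))
            self.open_header = tag

    def handle_endtag(self, tag):
        if tag == self.open_header:
            self.open_header = None

    def handle_data(self, data):
        if not self.raw:
            return  # text before the first boundary tag is not collected
        header_parts, _, text_nodes = self.raw[-1]
        text_nodes.append(data)
        if self.open_header is not None:
            header_parts.append(data)


def split_by_headers_sketch(html_doc, header_names):
    """Return sections shaped like the dictionaries in the listing above."""
    parser = SectionSketch(header_names)
    parser.feed(html_doc)
    parser.close()

    sections = []
    for i, (header_parts, tag, text_nodes) in enumerate(parser.raw):
        content = " ".join(text_nodes).strip()
        if not content:
            continue  # empty sections are skipped, as in the source
        if i == 0:
            # The first boundary tag always gets the placeholder title.
            header, tag_name = "#TITLE#", "h1"
        else:
            header, tag_name = "".join(header_parts).strip(), tag
        sections.append(
            {"header": header, "content": content, "tag_name": tag_name}
        )
    return sections


if __name__ == "__main__":
    doc = (
        "<body><p>Intro</p><h1>First</h1><p>Alpha</p>"
        "<h2>Second</h2><p>Beta</p></body>"
    )
    for section in split_by_headers_sketch(doc, ["h1", "h2"]):
        print(section)
```

For this sample document the sketch yields three sections: a `#TITLE#` section holding `Intro`, a `First` section whose content is `First Alpha`, and a `Second` section whose content is `Second Beta`. Note that the placeholder title is assigned by position (`i == 0`), so in both the sketch and the listing a document with no `body` tag has its first header relabelled `#TITLE#`.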