_process_html() — langchain Function Reference

Architecture documentation for the _process_html() function in html.py from the langchain codebase.

Type: Function | Language: Python | Domain: DocumentProcessing | Subdomain: TextSplitters | Calls: 4 | Called by: 1

Dependency Diagram

graph TD
  252723d0_ba69_6fd1_f520_2ee9bc89cc3e["_process_html()"]
  5af47ada_f6e1_33df_ed07_12ca64351fa0["HTMLSemanticPreservingSplitter"]
  252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|defined in| 5af47ada_f6e1_33df_ed07_12ca64351fa0
  127c75d0_d814_d16e_a93c_928f021add9c["split_text()"]
  127c75d0_d814_d16e_a93c_928f021add9c -->|calls| 252723d0_ba69_6fd1_f520_2ee9bc89cc3e
  c8888fd5_e6f3_6afa_7df1_2497296339db["_normalize_and_clean_text()"]
  252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|calls| c8888fd5_e6f3_6afa_7df1_2497296339db
  5c2975ee_08fc_2de6_69ef_e1ab9fb5ded8["_create_documents()"]
  252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|calls| 5c2975ee_08fc_2de6_69ef_e1ab9fb5ded8
  4134a695_a3ab_4bed_f7a0_3a766652fc3e["_find_all_tags()"]
  252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|calls| 4134a695_a3ab_4bed_f7a0_3a766652fc3e
  c152abff_8100_9c7d_3485_bc87dc010c4f["_find_all_strings()"]
  252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|calls| c152abff_8100_9c7d_3485_bc87dc010c4f
  style 252723d0_ba69_6fd1_f520_2ee9bc89cc3e fill:#6366f1,stroke:#818cf8,color:#fff
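The diagram shows split_text() handing the parsed tree to _process_html(), which then relies on its helpers to clean text and build documents. The control flow in the listing further down (flush the accumulated content when a header tag appears, and swap preserved elements for PRESERVED_n placeholders) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: `Chunk`, `split_by_headers`, and the flat `(tag, text)` input are hypothetical stand-ins, not part of langchain, and the way placeholders are restored is a guess, since the listing does not show how `_create_documents()` handles them.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for langchain's Document."""

    content: str
    metadata: dict[str, str] = field(default_factory=dict)


def split_by_headers(
    elements: list[tuple[str, str]],
    headers_to_split_on: list[tuple[str, str]],
    elements_to_preserve: list[str],
) -> list[Chunk]:
    """Split a flat list of (tag, text) pairs into header-scoped chunks."""
    header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
    chunks: list[Chunk] = []
    current_headers: dict[str, str] = {}
    current_content: list[str] = []
    preserved: dict[str, str] = {}
    placeholder_count = 0

    def flush() -> None:
        # Emit one chunk for the content gathered under the current header.
        if not current_content:
            return
        text = " ".join(current_content)
        # Assumption: restore placeholders highest-numbered first so that
        # PRESERVED_1 never clobbers the prefix of PRESERVED_10.
        for placeholder, original in reversed(list(preserved.items())):
            text = text.replace(placeholder, original)
        chunks.append(Chunk(text, dict(current_headers)))
        current_content.clear()
        preserved.clear()

    for tag, text in elements:
        if tag in header_names:
            # A header closes the previous section and replaces the metadata.
            flush()
            current_headers = {header_names[tag]: text.strip()}
        elif tag in elements_to_preserve:
            # Keep the element intact behind a placeholder token.
            placeholder = f"PRESERVED_{placeholder_count}"
            preserved[placeholder] = text
            current_content.append(placeholder)
            placeholder_count += 1
        else:
            current_content.append(text.strip())

    flush()  # Emit whatever follows the last header.
    return chunks
```

As in the listing, a new header replaces the whole `current_headers` dict instead of merging into it, so each chunk carries only the most recent header. The placeholder counter is never reset between sections, which keeps tokens unique across the document.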

Source Code

libs/text-splitters/langchain_text_splitters/html.py lines 846–989

    def _process_html(self, soup: BeautifulSoup) -> list[Document]:
        """Processes the HTML content using BeautifulSoup and splits it using headers.

        Args:
            soup: Parsed HTML content using BeautifulSoup.

        Returns:
            A list of `Document` objects containing the split content.
        """
        documents: list[Document] = []
        current_headers: dict[str, str] = {}
        current_content: list[str] = []
        preserved_elements: dict[str, str] = {}
        placeholder_count: int = 0

        def _get_element_text(element: PageElement) -> str:
            """Recursively extracts and processes the text of an element.

            Applies custom handlers where applicable, and ensures correct spacing.

            Args:
                element: The HTML element to process.

            Returns:
                The processed text of the element.
            """
            element = cast("Tag | NavigableString", element)
            if element.name in self._custom_handlers:
                return self._custom_handlers[element.name](element)

            text = ""

            if element.name is not None:
                for child in element.children:
                    child_text = _get_element_text(child).strip()
                    if text and child_text:
                        text += " "
                    text += child_text
            elif element.string:
                text += element.string

            return self._normalize_and_clean_text(text)

        elements = _find_all_tags(soup, recursive=False)

        def _process_element(
            element: ResultSet[Tag],
            documents: list[Document],
            current_headers: dict[str, str],
            current_content: list[str],
            preserved_elements: dict[str, str],
            placeholder_count: int,
        ) -> tuple[list[Document], dict[str, str], list[str], dict[str, str], int]:
            for elem in element:
                if elem.name in [h[0] for h in self._headers_to_split_on]:
                    if current_content:
                        documents.extend(
                            self._create_documents(
                                current_headers,
                                " ".join(current_content),
                                preserved_elements,
                            )
                        )
                        current_content.clear()
                        preserved_elements.clear()
                    header_name = elem.get_text(strip=True)
                    current_headers = {
                        dict(self._headers_to_split_on)[elem.name]: header_name
                    }
                elif elem.name in self._elements_to_preserve:
                    placeholder = f"PRESERVED_{placeholder_count}"
                    preserved_elements[placeholder] = _get_element_text(elem)
                    current_content.append(placeholder)
                    placeholder_count += 1
                else:
                    # Recursively process children to find nested headers or
                    # preserved elements.
                    children = _find_all_tags(elem, recursive=False)
                    if children:
                        # Element has children - recursively process them.
                        (