_process_html() — langchain Function Reference
Architecture documentation for the _process_html() function in html.py from the langchain codebase.
Dependency Diagram
graph TD
    252723d0_ba69_6fd1_f520_2ee9bc89cc3e["_process_html()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0["HTMLSemanticPreservingSplitter"]
    252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|defined in| 5af47ada_f6e1_33df_ed07_12ca64351fa0
    127c75d0_d814_d16e_a93c_928f021add9c["split_text()"]
    127c75d0_d814_d16e_a93c_928f021add9c -->|calls| 252723d0_ba69_6fd1_f520_2ee9bc89cc3e
    c8888fd5_e6f3_6afa_7df1_2497296339db["_normalize_and_clean_text()"]
    252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|calls| c8888fd5_e6f3_6afa_7df1_2497296339db
    5c2975ee_08fc_2de6_69ef_e1ab9fb5ded8["_create_documents()"]
    252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|calls| 5c2975ee_08fc_2de6_69ef_e1ab9fb5ded8
    4134a695_a3ab_4bed_f7a0_3a766652fc3e["_find_all_tags()"]
    252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|calls| 4134a695_a3ab_4bed_f7a0_3a766652fc3e
    c152abff_8100_9c7d_3485_bc87dc010c4f["_find_all_strings()"]
    252723d0_ba69_6fd1_f520_2ee9bc89cc3e -->|calls| c152abff_8100_9c7d_3485_bc87dc010c4f
    style 252723d0_ba69_6fd1_f520_2ee9bc89cc3e fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
libs/text-splitters/langchain_text_splitters/html.py lines 846–989
def _process_html(self, soup: BeautifulSoup) -> list[Document]:
    """Processes the HTML content using BeautifulSoup and splits it using headers.

    Args:
        soup: Parsed HTML content using BeautifulSoup.

    Returns:
        A list of `Document` objects containing the split content.
    """
    documents: list[Document] = []
    current_headers: dict[str, str] = {}
    current_content: list[str] = []
    preserved_elements: dict[str, str] = {}
    placeholder_count: int = 0

    def _get_element_text(element: PageElement) -> str:
        """Recursively extracts and processes the text of an element.

        Applies custom handlers where applicable, and ensures correct spacing.

        Args:
            element: The HTML element to process.

        Returns:
            The processed text of the element.
        """
        element = cast("Tag | NavigableString", element)
        if element.name in self._custom_handlers:
            return self._custom_handlers[element.name](element)

        text = ""

        if element.name is not None:
            for child in element.children:
                child_text = _get_element_text(child).strip()
                if text and child_text:
                    text += " "
                text += child_text
        elif element.string:
            text += element.string

        return self._normalize_and_clean_text(text)

    elements = _find_all_tags(soup, recursive=False)

    def _process_element(
        element: ResultSet[Tag],
        documents: list[Document],
        current_headers: dict[str, str],
        current_content: list[str],
        preserved_elements: dict[str, str],
        placeholder_count: int,
    ) -> tuple[list[Document], dict[str, str], list[str], dict[str, str], int]:
        for elem in element:
            if elem.name in [h[0] for h in self._headers_to_split_on]:
                if current_content:
                    documents.extend(
                        self._create_documents(
                            current_headers,
                            " ".join(current_content),
                            preserved_elements,
                        )
                    )
                    current_content.clear()
                    preserved_elements.clear()
                header_name = elem.get_text(strip=True)
                current_headers = {
                    dict(self._headers_to_split_on)[elem.name]: header_name
                }
            elif elem.name in self._elements_to_preserve:
                placeholder = f"PRESERVED_{placeholder_count}"
                preserved_elements[placeholder] = _get_element_text(elem)
                current_content.append(placeholder)
                placeholder_count += 1
            else:
                # Recursively process children to find nested headers or
                # preserved elements.
                children = _find_all_tags(elem, recursive=False)
                if children:
                    # Element has children - recursively process them.
                    (
Frequently Asked Questions
What does _process_html() do?
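_process_html() is a method of HTMLSemanticPreservingSplitter in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/html.py. It walks a parsed BeautifulSoup tree, emits the content collected so far whenever it reaches a tag listed in the splitter's headers to split on, replaces elements marked for preservation with PRESERVED_n placeholders, and returns the result as a list of Document objects.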
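The header-splitting and placeholder steps visible in the source excerpt above can be sketched in isolation. This is a minimal, simplified sketch and not the library's implementation: it works on flat `(tag_name, text)` pairs instead of BeautifulSoup tags, it does not recurse into children, and the names `Chunk` and `split_on_headers` are hypothetical. The final flush after the loop is an assumption, since the excerpt is truncated before that point.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document: header metadata plus text."""

    headers: dict[str, str]
    text: str
    preserved: dict[str, str] = field(default_factory=dict)


def split_on_headers(
    elements: list[tuple[str, str]],
    headers_to_split_on: list[tuple[str, str]],
    elements_to_preserve: list[str],
) -> list[Chunk]:
    """Group flat (tag_name, text) pairs into chunks at each header tag."""
    header_labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
    chunks: list[Chunk] = []
    current_headers: dict[str, str] = {}
    current_content: list[str] = []
    preserved: dict[str, str] = {}
    placeholder_count = 0

    for name, text in elements:
        if name in header_labels:
            # A header closes the chunk collected so far.
            if current_content:
                chunks.append(
                    Chunk(dict(current_headers), " ".join(current_content), dict(preserved))
                )
                current_content.clear()
                preserved.clear()
            # As in the excerpt, the header dict is replaced, not merged.
            current_headers = {header_labels[name]: text.strip()}
        elif name in elements_to_preserve:
            # Store the element text aside and leave a placeholder in the content.
            placeholder = f"PRESERVED_{placeholder_count}"
            preserved[placeholder] = text
            current_content.append(placeholder)
            placeholder_count += 1
        else:
            # Simplification: the real method recurses into child tags here.
            current_content.append(text)

    # Assumption: flush whatever remains once all elements are consumed.
    if current_content:
        chunks.append(Chunk(dict(current_headers), " ".join(current_content), dict(preserved)))
    return chunks


chunks = split_on_headers(
    [("h1", "Intro"), ("p", "Hello"), ("table", "a|b"), ("h2", "Next"), ("p", "World")],
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
    elements_to_preserve=["table"],
)
for chunk in chunks:
    print(chunk)
```

Keeping the preserved text out of `current_content` until later is what lets an element such as a table pass through any downstream chunking as a single opaque token.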
Where is _process_html() defined?
_process_html() is defined in libs/text-splitters/langchain_text_splitters/html.py at line 846.
What does _process_html() call?
_process_html() calls four functions: _create_documents(), _find_all_strings(), _find_all_tags(), and _normalize_and_clean_text().
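In the source excerpt, _normalize_and_clean_text() is reached through the nested _get_element_text() helper, which joins child text recursively and lets custom handlers short-circuit the walk. The sketch below mirrors that control flow on a tiny hypothetical `Node` class rather than on BeautifulSoup objects. The `normalize` function is an assumed stand-in that only collapses whitespace; the real method's behavior is not shown in the excerpt.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """Hypothetical tree node: name is None for a text node."""

    name: str | None = None
    string: str = ""
    children: list[Node] = field(default_factory=list)


def normalize(text: str) -> str:
    # Assumed stand-in for _normalize_and_clean_text(): collapse whitespace.
    return " ".join(text.split())


def element_text(node: Node, custom_handlers: dict[str, Callable[[Node], str]]) -> str:
    """Recursively collect text, joining non-empty child texts with one space."""
    # A custom handler for this tag name takes over completely.
    if node.name in custom_handlers:
        return custom_handlers[node.name](node)

    text = ""
    if node.name is not None:
        # Tag node: visit children in order and join their stripped text.
        for child in node.children:
            child_text = element_text(child, custom_handlers).strip()
            if text and child_text:
                text += " "
            text += child_text
    elif node.string:
        # Text node: take the raw string.
        text += node.string

    return normalize(text)


tree = Node(
    "div",
    children=[
        Node(string="  Hello  "),
        Node("b", children=[Node(string="big")]),
        Node("a", children=[Node(string="link")]),
    ],
)
result = element_text(tree, {"a": lambda node: "[link]"})
print(result)
```

Adding the separating space only when both sides are non-empty avoids leading, trailing, or doubled spaces when a child contributes no text.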
What calls _process_html()?
_process_html() is called by one function: split_text().