HTMLHeaderTextSplitter Class — langchain Architecture
Architecture documentation for the HTMLHeaderTextSplitter class in html.py from the langchain codebase.
Dependency Diagram
graph TD
    86dc20d4_404a_b608_01da_8dea923ef2c9["HTMLHeaderTextSplitter"]
    e3efe57c_5b49_c26c_6ca5_45acccb8037f["html.py"]
    86dc20d4_404a_b608_01da_8dea923ef2c9 -->|defined in| e3efe57c_5b49_c26c_6ca5_45acccb8037f
    95c4d16f_9ef1_6b54_c03d_da25e987fd1c["__init__()"]
    86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| 95c4d16f_9ef1_6b54_c03d_da25e987fd1c
    3a8f906a_02bf_a0ff_6dbb_2ffbc48f937d["split_text()"]
    86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| 3a8f906a_02bf_a0ff_6dbb_2ffbc48f937d
    982f8e7f_63e2_a8f4_7f7f_3def7fb3d84b["split_text_from_url()"]
    86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| 982f8e7f_63e2_a8f4_7f7f_3def7fb3d84b
    cf1e77cb_9fca_ca93_1428_c967d5cb0c97["split_text_from_file()"]
    86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| cf1e77cb_9fca_ca93_1428_c967d5cb0c97
    f984e4e0_af18_0fc8_0a3e_18771966d43d["_generate_documents()"]
    86dc20d4_404a_b608_01da_8dea923ef2c9 -->|method| f984e4e0_af18_0fc8_0a3e_18771966d43d
Source Code
libs/text-splitters/langchain_text_splitters/html.py lines 83–345
class HTMLHeaderTextSplitter:
    """Split HTML content into structured Documents based on specified headers.

    Splits HTML content by detecting specified header tags and creating hierarchical
    `Document` objects that reflect the semantic structure of the original content. For
    each identified section, the splitter associates the extracted text with metadata
    corresponding to the encountered headers.

    If no specified headers are found, the entire content is returned as a single
    `Document`. This allows for flexible handling of HTML input, ensuring that
    information is organized according to its semantic headers.

    The splitter provides the option to return each HTML element as a separate
    `Document` or aggregate them into semantically meaningful chunks. It also
    gracefully handles multiple levels of nested headers, creating a rich,
    hierarchical representation of the content.

    Example:
        ```python
        from langchain_text_splitters import HTMLHeaderTextSplitter

        # Define headers for splitting on h1 and h2 tags.
        headers_to_split_on = [("h1", "Main Topic"), ("h2", "Sub Topic")]

        splitter = HTMLHeaderTextSplitter(
            headers_to_split_on=headers_to_split_on,
            return_each_element=False
        )

        html_content = \"\"\"
        <html>
            <body>
                <h1>Introduction</h1>
                <p>Welcome to the introduction section.</p>
                <h2>Background</h2>
                <p>Some background details here.</p>
                <h1>Conclusion</h1>
                <p>Final thoughts.</p>
            </body>
        </html>
        \"\"\"

        documents = splitter.split_text(html_content)

        # 'documents' now contains Document objects reflecting the hierarchy:
        # - Document with metadata={"Main Topic": "Introduction"} and
        #   content="Introduction"
        # - Document with metadata={"Main Topic": "Introduction"} and
        #   content="Welcome to the introduction section."
        # - Document with metadata={"Main Topic": "Introduction",
        #   "Sub Topic": "Background"} and content="Background"
        # - Document with metadata={"Main Topic": "Introduction",
        #   "Sub Topic": "Background"} and content="Some background details here."
        # - Document with metadata={"Main Topic": "Conclusion"} and
        #   content="Conclusion"
        # - Document with metadata={"Main Topic": "Conclusion"} and
        #   content="Final thoughts."
        ```
    """

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_element: bool = False,  # noqa: FBT001,FBT002
    ) -> None:
        """Initialize with headers to split on.

        Args:
            headers_to_split_on: A list of `(header_tag,
                header_name)` pairs representing the headers that define splitting
                boundaries.

                For example, `[("h1", "Header 1"), ("h2", "Header 2")]` will split
                content by `h1` and `h2` tags, assigning their textual content to the
                `Document` metadata.
            return_each_element: If `True`, every HTML element encountered
                (including headers, paragraphs, etc.) is returned as a separate
                `Document`.
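
The listing above is cut off before the method bodies, so the following is only a minimal sketch of the idea the docstrings describe: track which headers are currently in scope and attach them as metadata to each piece of text. It is not the langchain implementation. It uses only the standard-library `html.parser`, and the names `_HeaderTracker` and `split_html_sketch` are hypothetical. It assumes header tags have the form `h1` to `h6` and that text is not nested inside inline tags. The aggregation rule (merge consecutive non-header elements that share the same metadata) is likewise an assumption chosen to match the example output in the class docstring.

```python
from html.parser import HTMLParser


class _HeaderTracker(HTMLParser):
    """Hypothetical helper: collects (text, metadata, is_header) triples."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # tag -> metadata key
        self.active = {}          # tag -> header text currently in scope
        self.current_tag = None   # innermost tag that is currently open
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        is_header = self.current_tag in self.header_names
        if is_header:
            level = int(self.current_tag[1])  # assumes tags like "h1".."h6"
            # A new header ends every tracked header at the same or a deeper level.
            self.active = {
                tag: value
                for tag, value in self.active.items()
                if int(tag[1]) < level
            }
            self.active[self.current_tag] = text
        metadata = {self.header_names[tag]: value for tag, value in self.active.items()}
        self.elements.append((text, metadata, is_header))


def split_html_sketch(html, headers_to_split_on, return_each_element=False):
    """Return (text, metadata) pairs; a stand-in for Document objects."""
    parser = _HeaderTracker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    if return_each_element:
        return [(text, meta) for text, meta, _ in parser.elements]
    chunks = []
    prev_mergeable = False
    for text, meta, is_header in parser.elements:
        # Merge runs of non-header text that sit under the same headers.
        if not is_header and prev_mergeable and chunks[-1][1] == meta:
            chunks[-1] = (chunks[-1][0] + "\n" + text, meta)
        else:
            chunks.append((text, meta))
        prev_mergeable = not is_header
    return chunks


html_content = """
<html><body>
<h1>Introduction</h1>
<p>Welcome to the introduction section.</p>
<p>A second paragraph.</p>
<h2>Background</h2>
<p>Some background details here.</p>
<h1>Conclusion</h1>
<p>Final thoughts.</p>
</body></html>
"""
headers = [("h1", "Main Topic"), ("h2", "Sub Topic")]
for text, metadata in split_html_sketch(html_content, headers):
    print(metadata, "->", repr(text))
```

In this sketch the second `h1` clears the `Sub Topic` entry, which is how the "Conclusion" section ends up with only `Main Topic` in its metadata. Passing `return_each_element=True` skips the merge step, so the two paragraphs under "Introduction" come back as separate entries.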
Frequently Asked Questions
What is the HTMLHeaderTextSplitter class?
HTMLHeaderTextSplitter is a class in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/html.py.
Where is HTMLHeaderTextSplitter defined?
HTMLHeaderTextSplitter is defined in libs/text-splitters/langchain_text_splitters/html.py at line 83.