split_html_by_headers() — langchain Function Reference
Architecture documentation for the split_html_by_headers() function in html.py from the langchain codebase.
Dependency Diagram
graph TD
    219ab3b6_0b12_7f58_ba5f_9bfbebda0057["split_html_by_headers()"]
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c["HTMLSectionSplitter"]
    219ab3b6_0b12_7f58_ba5f_9bfbebda0057 -->|defined in| 0c8a5f97_7cb0_fe24_746d_9689c4e5426c
    170f66e6_a026_8fd5_9128_33eeefb7dd62["split_text_from_file()"]
    170f66e6_a026_8fd5_9128_33eeefb7dd62 -->|calls| 219ab3b6_0b12_7f58_ba5f_9bfbebda0057
    4134a695_a3ab_4bed_f7a0_3a766652fc3e["_find_all_tags()"]
    219ab3b6_0b12_7f58_ba5f_9bfbebda0057 -->|calls| 4134a695_a3ab_4bed_f7a0_3a766652fc3e
    style 219ab3b6_0b12_7f58_ba5f_9bfbebda0057 fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
libs/text-splitters/langchain_text_splitters/html.py lines 432–491
def split_html_by_headers(self, html_doc: str) -> list[dict[str, str | None]]:
    """Split an HTML document into sections based on specified header tags.

    This method uses BeautifulSoup to parse the HTML content and divides it into
    sections based on headers defined in `headers_to_split_on`. Each section
    contains the header text, content under the header, and the tag name.

    Args:
        html_doc: The HTML document to be split into sections.

    Returns:
        A list of dictionaries representing sections.

        Each dictionary contains:

        * `'header'`: The header text or a default title for the first section.
        * `'content'`: The content under the header.
        * `'tag_name'`: The name of the header tag (e.g., `h1`, `h2`).

    Raises:
        ImportError: If BeautifulSoup is not installed.
    """
    if not _HAS_BS4:
        msg = "Unable to import BeautifulSoup/PageElement, \
please install with `pip install \
bs4`."
        raise ImportError(msg)

    soup = BeautifulSoup(html_doc, "html.parser")
    header_names = list(self.headers_to_split_on.keys())
    sections: list[dict[str, str | None]] = []

    headers = _find_all_tags(soup, name=["body", *header_names])

    for i, header in enumerate(headers):
        if i == 0:
            current_header = "#TITLE#"
            current_header_tag = "h1"
            section_content: list[str] = []
        else:
            current_header = header.text.strip()
            current_header_tag = header.name
            section_content = []
        for element in header.next_elements:
            if i + 1 < len(headers) and element == headers[i + 1]:
                break
            if isinstance(element, str):
                section_content.append(element)
        content = " ".join(section_content).strip()

        if content:
            sections.append(
                {
                    "header": current_header,
                    "content": content,
                    "tag_name": current_header_tag,
                }
            )

    return sections
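To make the return shape concrete, here is a minimal standard-library sketch of the same sectioning idea. It is not the library's implementation: it swaps BeautifulSoup for `html.parser.HTMLParser`, and the names `SectionSketch` and `sketch_split` are hypothetical. Text before the first header is collected under the default `#TITLE#` header with tag `h1`, mirroring the defaults in the source above. In this sketch the header's own text is kept out of `content`, and whitespace-only text is dropped; the library's traversal may treat both differently.

```python
from html.parser import HTMLParser


class SectionSketch(HTMLParser):
    """Hypothetical helper: groups text under the header tag that precedes it."""

    def __init__(self, header_tags):
        super().__init__()
        self.header_tags = set(header_tags)
        # Text before the first header goes into a default section,
        # mirroring the "#TITLE#" / "h1" defaults in the source above.
        self.sections = [{"header": "#TITLE#", "content": [], "tag_name": "h1"}]
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        # A header tag opens a new section.
        if tag in self.header_tags:
            self.sections.append({"header": "", "content": [], "tag_name": tag})
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # simplification: ignore whitespace-only text nodes
        current = self.sections[-1]
        if self._in_header:
            current["header"] = (current["header"] + " " + text).strip()
        else:
            current["content"].append(text)


def sketch_split(html_doc, header_tags=("h1", "h2")):
    """Return a list of {'header', 'content', 'tag_name'} dicts (sketch only)."""
    parser = SectionSketch(header_tags)
    parser.feed(html_doc)
    parser.close()
    # Like the source, omit sections whose content is empty.
    return [
        {
            "header": s["header"],
            "content": " ".join(s["content"]),
            "tag_name": s["tag_name"],
        }
        for s in parser.sections
        if s["content"]
    ]


sections = sketch_split(
    "<body><p>Preamble</p><h1>Intro</h1><p>Hello</p>"
    "<h2>Details</h2><p>More text</p></body>"
)
for section in sections:
    print(section)
```

Each dictionary in `sections` has the same three keys that the docstring lists, so the sketch can stand in for the real method when experimenting with downstream code that only depends on that shape.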
Frequently Asked Questions
What does split_html_by_headers() do?
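split_html_by_headers() is a method of HTMLSectionSplitter in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/html.py. It parses an HTML document with BeautifulSoup and splits it into sections at the header tags listed in headers_to_split_on, returning a list of dictionaries with 'header', 'content', and 'tag_name' keys. Sections whose content is empty are omitted.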
Where is split_html_by_headers() defined?
split_html_by_headers() is defined in libs/text-splitters/langchain_text_splitters/html.py at line 432.
What does split_html_by_headers() call?
split_html_by_headers() calls one function in the codebase: _find_all_tags().
What calls split_html_by_headers()?
split_html_by_headers() is called by one function: split_text_from_file().
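As a rough illustration of how a caller might consume the returned sections, the sketch below pairs each section's content with header metadata. It does not reproduce split_text_from_file(): the function `sections_to_records` and the record layout are hypothetical, and the label values in the mapping are assumptions, since the source above shows only that `headers_to_split_on` is keyed by tag name.

```python
def sections_to_records(sections, headers_to_split_on):
    """Hypothetical consumer: attach header metadata to each section's content."""
    records = []
    for section in sections:
        # Look up a label for the tag; fall back to the tag name itself.
        label = headers_to_split_on.get(section["tag_name"], section["tag_name"])
        records.append(
            {"text": section["content"], "metadata": {label: section["header"]}}
        )
    return records


# Section dicts in the shape documented by split_html_by_headers().
example_sections = [
    {"header": "#TITLE#", "content": "Preamble", "tag_name": "h1"},
    {"header": "Details", "content": "More text", "tag_name": "h2"},
]

# Assumed mapping from tag name to a metadata label.
records = sections_to_records(example_sections, {"h1": "Header 1", "h2": "Header 2"})
for record in records:
    print(record)
```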