HTMLSectionSplitter Class — langchain Architecture
Architecture documentation for the HTMLSectionSplitter class in html.py from the langchain codebase.
Dependency Diagram
graph TD
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c["HTMLSectionSplitter"]
    e3efe57c_5b49_c26c_6ca5_45acccb8037f["html.py"]
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|defined in| e3efe57c_5b49_c26c_6ca5_45acccb8037f
    884c7870_4829_e49a_c22a_6f89bea245e3["__init__()"]
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| 884c7870_4829_e49a_c22a_6f89bea245e3
    242347f4_37b6_e8c6_d9d5_c00530e34196["split_documents()"]
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| 242347f4_37b6_e8c6_d9d5_c00530e34196
    cdce0dab_74f2_fff9_b284_195643913ed5["split_text()"]
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| cdce0dab_74f2_fff9_b284_195643913ed5
    fb63895c_3000_9932_3530_3357c6736f4f["create_documents()"]
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| fb63895c_3000_9932_3530_3357c6736f4f
    219ab3b6_0b12_7f58_ba5f_9bfbebda0057["split_html_by_headers()"]
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| 219ab3b6_0b12_7f58_ba5f_9bfbebda0057
    c2708424_8958_9cb1_390e_d816b56479f3["convert_possible_tags_to_header()"]
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| c2708424_8958_9cb1_390e_d816b56479f3
    170f66e6_a026_8fd5_9128_33eeefb7dd62["split_text_from_file()"]
    0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| 170f66e6_a026_8fd5_9128_33eeefb7dd62
Source Code
libs/text-splitters/langchain_text_splitters/html.py lines 348–553
class HTMLSectionSplitter:
    """Splitting HTML files based on specified tag and font sizes.

    Requires lxml package.
    """

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        **kwargs: Any,
    ) -> None:
        """Create a new `HTMLSectionSplitter`.

        Args:
            headers_to_split_on: List of tuples of headers we want to track mapped to
                (arbitrary) keys for metadata.

                Allowed header values: `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, e.g.:
                `[("h1", "Header 1"), ("h2", "Header 2")]`.
            **kwargs: Additional optional arguments for customizations.
        """
        self.headers_to_split_on = dict(headers_to_split_on)
        self.xslt_path = (
            pathlib.Path(__file__).parent / "xsl/converting_to_header.xslt"
        ).absolute()
        self.kwargs = kwargs

    def split_documents(self, documents: Iterable[Document]) -> list[Document]:
        """Split documents.

        Args:
            documents: Iterable of `Document` objects to be split.

        Returns:
            A list of split `Document` objects.
        """
        texts, metadatas = [], []
        for doc in documents:
            texts.append(doc.page_content)
            metadatas.append(doc.metadata)
        results = self.create_documents(texts, metadatas=metadatas)

        text_splitter = RecursiveCharacterTextSplitter(**self.kwargs)

        return text_splitter.split_documents(results)

    def split_text(self, text: str) -> list[Document]:
        """Split HTML text string.

        Args:
            text: HTML text

        Returns:
            A list of split `Document` objects.
        """
        return self.split_text_from_file(StringIO(text))

    def create_documents(
        self, texts: list[str], metadatas: list[dict[Any, Any]] | None = None
    ) -> list[Document]:
        """Create a list of `Document` objects from a list of texts.

        Args:
            texts: A list of texts to be split and converted into documents.
            metadatas: Optional list of metadata to associate with each document.

        Returns:
            A list of `Document` objects.
        """
        metadatas_ = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            for chunk in self.split_text(text):
                metadata = copy.deepcopy(metadatas_[i])

                for key in chunk.metadata:
                    if chunk.metadata[key] == "#TITLE#":
                        chunk.metadata[key] = metadata["Title"]
                metadata = {**metadata, **chunk.metadata}
                new_doc = Document(page_content=chunk.page_content, metadata=metadata)
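The metadata handling visible in `create_documents` can be illustrated in isolation. The following is a minimal sketch with plain dictionaries, not the library's code: `merge_metadata` and `TITLE_PLACEHOLDER` are hypothetical names introduced here. It mirrors the loop body shown in the listing: deep-copy the source metadata, replace any `#TITLE#` value in the chunk metadata with the source's `Title` entry, then merge so that chunk keys take precedence. Unlike the listing, the sketch works on a copy of the chunk metadata rather than mutating it in place.

```python
import copy
from typing import Any

TITLE_PLACEHOLDER = "#TITLE#"  # sentinel value checked in the listing


def merge_metadata(
    source_metadata: dict[str, Any], chunk_metadata: dict[str, Any]
) -> dict[str, Any]:
    """Hypothetical helper mirroring the loop body of create_documents."""
    # Deep copy so nested values in the caller's metadata are never shared.
    metadata = copy.deepcopy(source_metadata)
    resolved = dict(chunk_metadata)
    for key, value in resolved.items():
        if value == TITLE_PLACEHOLDER:
            # Raises KeyError when the source metadata has no "Title" entry,
            # as the metadata["Title"] lookup in the listing would.
            resolved[key] = metadata["Title"]
    # Chunk keys win over source keys on collision.
    return {**metadata, **resolved}


source = {"Title": "User Guide", "lang": "en"}
print(merge_metadata(source, {"Header 1": "#TITLE#"}))
# {'Title': 'User Guide', 'lang': 'en', 'Header 1': 'User Guide'}
```

One consequence worth noting: a chunk carrying the `#TITLE#` placeholder needs a `Title` key in the source metadata, otherwise the lookup fails with a `KeyError`.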
Frequently Asked Questions
What is the HTMLSectionSplitter class?
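HTMLSectionSplitter is a class in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/html.py. According to its docstring, it splits HTML files into sections based on specified tags and font sizes, and it requires the lxml package.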
Where is HTMLSectionSplitter defined?
HTMLSectionSplitter is defined in libs/text-splitters/langchain_text_splitters/html.py at line 348.
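To make the idea of header-based section splitting concrete, here is a simplified sketch that uses only the standard library's `html.parser`. It is not how `HTMLSectionSplitter` works internally (the class references an XSLT file and requires lxml); `SectionParser` and `split_by_headers` are hypothetical names, and the output shape (plain dictionaries rather than `Document` objects) is an assumption made for illustration. The only detail borrowed from the class is the `headers_to_split_on` list of `(tag, metadata key)` tuples.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Hypothetical parser: starts a new section at each tracked header tag."""

    def __init__(self, headers_to_split_on: list[tuple[str, str]]) -> None:
        super().__init__()
        self.headers = dict(headers_to_split_on)  # tag name -> metadata key
        self.sections: list[dict[str, str]] = []
        self._open_header = ""  # tracked header tag currently open, if any
        self._key, self._title, self._body = "", "", ""

    def handle_starttag(self, tag, attrs):
        if tag in self.headers:
            self.flush()  # a tracked header closes the previous section
            self._open_header = tag
            self._key = self.headers[tag]

    def handle_endtag(self, tag):
        if tag == self._open_header:
            self._open_header = ""

    def handle_data(self, data):
        if self._open_header:
            self._title += data
        else:
            self._body += data

    def flush(self):
        """Store the section collected so far (if non-empty) and reset."""
        title = " ".join(self._title.split())
        body = " ".join(self._body.split())
        if title or body:
            self.sections.append({"key": self._key, "header": title, "content": body})
        self._key, self._title, self._body = "", "", ""


def split_by_headers(html, headers_to_split_on):
    parser = SectionParser(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the final section
    return parser.sections


html = "<h1>Intro</h1><p>Hello world.</p><h2>Details</h2><p>More text.</p>"
for section in split_by_headers(html, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(section)
```

The sketch concatenates text nodes without inserting separators between adjacent elements and ignores font sizes entirely, so it should be read as an illustration of the splitting idea rather than a substitute for the class.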