HTMLSemanticPreservingSplitter Class — langchain Architecture
Architecture documentation for the HTMLSemanticPreservingSplitter class in html.py from the langchain codebase.
Dependency Diagram
graph TD
    5af47ada_f6e1_33df_ed07_12ca64351fa0["HTMLSemanticPreservingSplitter"]
    91ea4f6e_168e_8d34_bca6_53e61cdc1840["BaseDocumentTransformer"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|extends| 91ea4f6e_168e_8d34_bca6_53e61cdc1840
    e3efe57c_5b49_c26c_6ca5_45acccb8037f["html.py"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|defined in| e3efe57c_5b49_c26c_6ca5_45acccb8037f
    1d51f8a8_72d7_33cb_0ea1_b4f720cde381["__init__()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| 1d51f8a8_72d7_33cb_0ea1_b4f720cde381
    127c75d0_d814_d16e_a93c_928f021add9c["split_text()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| 127c75d0_d814_d16e_a93c_928f021add9c
    e9c69e37_40ed_2949_d6dc_f6a7770ff7b8["transform_documents()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| e9c69e37_40ed_2949_d6dc_f6a7770ff7b8
    2030eaef_a33b_19d9_d540_9d9919faafba["_process_media()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| 2030eaef_a33b_19d9_d540_9d9919faafba
    ff63d8f1_7353_0b16_2f96_7dadb57a8348["_process_links()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| ff63d8f1_7353_0b16_2f96_7dadb57a8348
    33b3aaec_4039_d612_320f_cc74c2e5758c["_filter_tags()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| 33b3aaec_4039_d612_320f_cc74c2e5758c
    c8888fd5_e6f3_6afa_7df1_2497296339db["_normalize_and_clean_text()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| c8888fd5_e6f3_6afa_7df1_2497296339db
    252723d0_ba69_6fd1_f520_2ee9bc89cc3e["_process_html()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| 252723d0_ba69_6fd1_f520_2ee9bc89cc3e
    5c2975ee_08fc_2de6_69ef_e1ab9fb5ded8["_create_documents()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| 5c2975ee_08fc_2de6_69ef_e1ab9fb5ded8
    1ad208c7_864b_e6dd_1344_b5ed70211298["_further_split_chunk()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| 1ad208c7_864b_e6dd_1344_b5ed70211298
    f7ca6eae_27af_591b_5082_e978259ac965["_reinsert_preserved_elements()"]
    5af47ada_f6e1_33df_ed07_12ca64351fa0 -->|method| f7ca6eae_27af_591b_5082_e978259ac965
Source Code
libs/text-splitters/langchain_text_splitters/html.py lines 557–1060
class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
    """Split HTML content preserving semantic structure.

    Splits HTML content by headers into generalized chunks, preserving semantic
    structure. If chunks exceed the maximum chunk size, it uses
    `RecursiveCharacterTextSplitter` for further splitting.

    The splitter preserves full HTML elements and converts links to Markdown-like links.
    It can also preserve images, videos, and audio elements by converting them into
    Markdown format. Note that some chunks may exceed the maximum size to maintain
    semantic integrity.

    !!! version-added "Added in `langchain-text-splitters` 0.3.5"

    Example:
        ```python
        from langchain_text_splitters.html import HTMLSemanticPreservingSplitter

        def custom_iframe_extractor(iframe_tag):
            ```
            Custom handler function to extract the 'src' attribute from an <iframe> tag.
            Converts the iframe to a Markdown-like link: [iframe:<src>](src).

            Args:
                iframe_tag (bs4.element.Tag): The <iframe> tag to be processed.

            Returns:
                str: A formatted string representing the iframe in Markdown-like format.
            ```
            iframe_src = iframe_tag.get('src', '')
            return f"[iframe:{iframe_src}]({iframe_src})"

        text_splitter = HTMLSemanticPreservingSplitter(
            headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
            max_chunk_size=500,
            preserve_links=True,
            preserve_images=True,
            custom_handlers={"iframe": custom_iframe_extractor}
        )
        ```
    """  # noqa: D214

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        *,
        max_chunk_size: int = 1000,
        chunk_overlap: int = 0,
        separators: list[str] | None = None,
        elements_to_preserve: list[str] | None = None,
        preserve_links: bool = False,
        preserve_images: bool = False,
        preserve_videos: bool = False,
        preserve_audio: bool = False,
        custom_handlers: dict[str, Callable[[Tag], str]] | None = None,
        stopword_removal: bool = False,
        stopword_lang: str = "english",
        normalize_text: bool = False,
        external_metadata: dict[str, str] | None = None,
        allowlist_tags: list[str] | None = None,
        denylist_tags: list[str] | None = None,
        preserve_parent_metadata: bool = False,
        keep_separator: bool | Literal["start", "end"] = True,
    ) -> None:
        """Initialize splitter.

        Args:
            headers_to_split_on: HTML headers (e.g., `h1`, `h2`) that define content
                sections.
            max_chunk_size: Maximum size for each chunk, with allowance for exceeding
                this limit to preserve semantics.
            chunk_overlap: Number of characters to overlap between chunks to ensure
                contextual continuity.
            separators: Delimiters used by `RecursiveCharacterTextSplitter` for
                further splitting.
            elements_to_preserve: HTML tags (e.g., `table`, `ul`) to remain
                intact during splitting.
            preserve_links: Converts `a` tags to Markdown links (`[text](url)`).
            preserve_images: Converts `img` tags to Markdown images (`![alt](src)`).
            preserve_videos: Converts `video` tags to Markdown video links
                (`![video](src)`).
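
The `preserve_links` and `preserve_images` arguments above describe turning tags into Markdown text. The sketch below illustrates that idea with the standard library only. `MarkdownLinkConverter` and `html_to_markdown_links` are hypothetical names introduced here, and this is not the library's implementation: the `custom_handlers` type hint suggests the real class works on parsed `Tag` objects instead.

```python
from html.parser import HTMLParser


class MarkdownLinkConverter(HTMLParser):
    """Hypothetical helper: rewrites <a> and <img> tags as Markdown text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # output fragments, joined at the end
        self._href = None      # href of the <a> tag currently open, if any
        self._link_text = []   # text collected inside the open <a> tag

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a":
            self._href = attr_map.get("href") or ""
            self._link_text = []
        elif tag == "img":
            src = attr_map.get("src") or ""
            alt = attr_map.get("alt") or ""
            self._emit(f"![{alt}]({src})")

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._link_text).strip()
            href, self._href = self._href, None
            self.parts.append(f"[{text}]({href})")

    def handle_data(self, data):
        self._emit(data)

    def _emit(self, text):
        # Text inside an open link becomes the link label; all other
        # text (including text of non-link tags) passes through as-is.
        if self._href is not None:
            self._link_text.append(text)
        else:
            self.parts.append(text)


def html_to_markdown_links(html):
    """Strip tags from `html`, keeping links and images as Markdown."""
    converter = MarkdownLinkConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts)


sample = '<p>See <a href="https://example.com">the docs</a> and <img src="a.png" alt="logo">.</p>'
print(html_to_markdown_links(sample))
```

Keeping the link target inline means a chunk still carries its references after the surrounding markup is discarded.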
Frequently Asked Questions
What is the HTMLSemanticPreservingSplitter class?
HTMLSemanticPreservingSplitter is a class in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/html.py.
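
The class docstring describes a two-stage strategy: split by headers first, then split any section that exceeds the maximum chunk size. A rough, standard-library sketch of that flow is shown here. All function names are hypothetical, the regex-based tag handling is a simplification, and the greedy word splitter merely stands in for `RecursiveCharacterTextSplitter`.

```python
import re


def split_by_headers(html, header_tags=("h1", "h2")):
    """Return (header_text, body_text) pairs; tags are stripped naively."""
    tag_alt = "|".join(header_tags)
    pattern = re.compile(
        rf"<({tag_alt})\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL
    )
    matches = list(pattern.finditer(html))
    sections = []
    for i, match in enumerate(matches):
        # A section body runs from the end of this header to the next header.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        body = re.sub(r"<[^>]+>", " ", html[match.end():body_end])
        header = re.sub(r"<[^>]+>", " ", match.group(2))
        sections.append((" ".join(header.split()), " ".join(body.split())))
    return sections


def split_oversized(text, max_chunk_size):
    """Greedy word-based fallback; a single long word may still exceed the limit."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def semantic_split(html, max_chunk_size=40):
    """Split by headers, then re-split only the sections that are too long."""
    results = []
    for header, body in split_by_headers(html):
        if not body:
            continue
        if len(body) <= max_chunk_size:
            pieces = [body]
        else:
            pieces = split_oversized(body, max_chunk_size)
        for piece in pieces:
            results.append({"header": header, "text": piece})
    return results


html = (
    "<h1>Intro</h1><p>Short text.</p>"
    "<h2>Details</h2><p>one two three four five six seven eight nine ten</p>"
)
for chunk in semantic_split(html, max_chunk_size=20):
    print(chunk)
```

Each piece keeps the header it came from, so a chunk retrieved later can still be traced to its section.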
Where is HTMLSemanticPreservingSplitter defined?
HTMLSemanticPreservingSplitter is defined in libs/text-splitters/langchain_text_splitters/html.py at line 557.
What does HTMLSemanticPreservingSplitter extend?
HTMLSemanticPreservingSplitter extends BaseDocumentTransformer.
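
As a rough illustration of the relationship in the diagram, where the class exposes both `split_text()` and `transform_documents()`, the sketch below uses stand-in types only. `Doc`, `DocumentTransformer`, and `ParagraphSplitter` are hypothetical, and the way metadata is merged is an assumption, since the excerpt above does not show the real method bodies.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class DocumentTransformer(ABC):
    """Stand-in base class: subclasses must implement transform_documents."""

    @abstractmethod
    def transform_documents(self, documents):
        ...


class ParagraphSplitter(DocumentTransformer):
    """Hypothetical subclass that splits text on blank lines."""

    def split_text(self, text):
        return [Doc(part.strip()) for part in text.split("\n\n") if part.strip()]

    def transform_documents(self, documents):
        transformed = []
        for doc in documents:
            for piece in self.split_text(doc.page_content):
                # Assumption: each piece inherits the parent's metadata.
                merged = {**doc.metadata, **piece.metadata}
                transformed.append(Doc(piece.page_content, merged))
        return transformed


docs = [Doc("first paragraph\n\nsecond paragraph", {"source": "page.html"})]
for d in ParagraphSplitter().transform_documents(docs):
    print(d.page_content, d.metadata)
```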