HTMLSectionSplitter Class — langchain Architecture

Architecture documentation for the HTMLSectionSplitter class in html.py from the langchain codebase.

Class · Python · DocumentProcessing

Entity Profile

Dependency Diagram

graph TD
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c["HTMLSectionSplitter"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f["html.py"]
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|defined in| e3efe57c_5b49_c26c_6ca5_45acccb8037f
  884c7870_4829_e49a_c22a_6f89bea245e3["__init__()"]
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| 884c7870_4829_e49a_c22a_6f89bea245e3
  242347f4_37b6_e8c6_d9d5_c00530e34196["split_documents()"]
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| 242347f4_37b6_e8c6_d9d5_c00530e34196
  cdce0dab_74f2_fff9_b284_195643913ed5["split_text()"]
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| cdce0dab_74f2_fff9_b284_195643913ed5
  fb63895c_3000_9932_3530_3357c6736f4f["create_documents()"]
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| fb63895c_3000_9932_3530_3357c6736f4f
  219ab3b6_0b12_7f58_ba5f_9bfbebda0057["split_html_by_headers()"]
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| 219ab3b6_0b12_7f58_ba5f_9bfbebda0057
  c2708424_8958_9cb1_390e_d816b56479f3["convert_possible_tags_to_header()"]
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| c2708424_8958_9cb1_390e_d816b56479f3
  170f66e6_a026_8fd5_9128_33eeefb7dd62["split_text_from_file()"]
  0c8a5f97_7cb0_fe24_746d_9689c4e5426c -->|method| 170f66e6_a026_8fd5_9128_33eeefb7dd62

Source Code

libs/text-splitters/langchain_text_splitters/html.py lines 348–553 (the excerpt below is truncated partway through create_documents())

class HTMLSectionSplitter:
    """Splitting HTML files based on specified tag and font sizes.

    Requires lxml package.
    """

    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        **kwargs: Any,
    ) -> None:
        """Create a new `HTMLSectionSplitter`.

        Args:
            headers_to_split_on: List of tuples of headers we want to track mapped to
                (arbitrary) keys for metadata.

                Allowed header values: `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, e.g.:
                `[("h1", "Header 1"), ("h2", "Header 2")]`.
            **kwargs: Additional optional arguments for customizations.

        """
        self.headers_to_split_on = dict(headers_to_split_on)
        self.xslt_path = (
            pathlib.Path(__file__).parent / "xsl/converting_to_header.xslt"
        ).absolute()
        self.kwargs = kwargs

    def split_documents(self, documents: Iterable[Document]) -> list[Document]:
        """Split documents.

        Args:
            documents: Iterable of `Document` objects to be split.

        Returns:
            A list of split `Document` objects.
        """
        texts, metadatas = [], []
        for doc in documents:
            texts.append(doc.page_content)
            metadatas.append(doc.metadata)
        results = self.create_documents(texts, metadatas=metadatas)

        text_splitter = RecursiveCharacterTextSplitter(**self.kwargs)

        return text_splitter.split_documents(results)

    def split_text(self, text: str) -> list[Document]:
        """Split HTML text string.

        Args:
            text: HTML text

        Returns:
            A list of split `Document` objects.
        """
        return self.split_text_from_file(StringIO(text))

    def create_documents(
        self, texts: list[str], metadatas: list[dict[Any, Any]] | None = None
    ) -> list[Document]:
        """Create a list of `Document` objects from a list of texts.

        Args:
            texts: A list of texts to be split and converted into documents.
            metadatas: Optional list of metadata to associate with each document.

        Returns:
            A list of `Document` objects.
        """
        metadatas_ = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            for chunk in self.split_text(text):
                metadata = copy.deepcopy(metadatas_[i])

                for key in chunk.metadata:
                    if chunk.metadata[key] == "#TITLE#":
                        chunk.metadata[key] = metadata["Title"]
                metadata = {**metadata, **chunk.metadata}
                new_doc = Document(page_content=chunk.page_content, metadata=metadata)
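
The excerpt breaks off inside create_documents(), so the per-chunk metadata handling is easiest to see in isolation. The following is a minimal sketch of that step using plain dicts instead of `Document` objects. The helper name `merge_chunk_metadata` and the constant `TITLE_PLACEHOLDER` are hypothetical and do not exist in langchain; only the "#TITLE#" value and the "Title" key come from the excerpt. Unlike the excerpt, the sketch does not mutate the chunk metadata in place.

```python
import copy
from typing import Any

# Sentinel value taken from the excerpt; the constant name is an assumption.
TITLE_PLACEHOLDER = "#TITLE#"


def merge_chunk_metadata(
    base: dict[str, Any], chunk_meta: dict[str, Any]
) -> dict[str, Any]:
    """Hypothetical helper mirroring the inner loop of create_documents()."""
    # Deep-copy the per-text metadata so the caller's dict is never shared
    # between output documents.
    base_copy = copy.deepcopy(base)

    # Swap any "#TITLE#" placeholder for the parent document's "Title" value.
    # base_copy["Title"] is only looked up when a placeholder is present.
    resolved = {
        key: (base_copy["Title"] if value == TITLE_PLACEHOLDER else value)
        for key, value in chunk_meta.items()
    }

    # Chunk-level keys win over document-level keys on conflict.
    return {**base_copy, **resolved}


if __name__ == "__main__":
    base = {"Title": "Guide", "source": "a.html"}
    print(merge_chunk_metadata(base, {"Header 1": "#TITLE#", "Header 2": "Setup"}))

    # The default `[{}] * len(texts)` repeats one dict object, which is why
    # copying before merging matters.
    defaults = [{}] * 2
    print(defaults[0] is defaults[1])
```

As in the excerpt, a "#TITLE#" placeholder combined with metadata that has no "Title" key raises a `KeyError`, so callers relying on the placeholder would need to supply that key.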

Domain

DocumentProcessing

Defined In

libs/text-splitters/langchain_text_splitters/html.py

Frequently Asked Questions

What is the HTMLSectionSplitter class?

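HTMLSectionSplitter is a class in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/html.py. According to its docstring, it splits HTML files based on specified tags and font sizes, and it requires the lxml package.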

Where is HTMLSectionSplitter defined?

HTMLSectionSplitter is defined in libs/text-splitters/langchain_text_splitters/html.py at line 348.
