_generate_documents() — langchain Function Reference

Architecture documentation for the _generate_documents() function in html.py from the langchain codebase.

Dependency Diagram

graph TD
  f984e4e0_af18_0fc8_0a3e_18771966d43d["_generate_documents()"]
  86dc20d4_404a_b608_01da_8dea923ef2c9["HTMLHeaderTextSplitter"]
  f984e4e0_af18_0fc8_0a3e_18771966d43d -->|defined in| 86dc20d4_404a_b608_01da_8dea923ef2c9
  cf1e77cb_9fca_ca93_1428_c967d5cb0c97["split_text_from_file()"]
  cf1e77cb_9fca_ca93_1428_c967d5cb0c97 -->|calls| f984e4e0_af18_0fc8_0a3e_18771966d43d
  c152abff_8100_9c7d_3485_bc87dc010c4f["_find_all_strings()"]
  f984e4e0_af18_0fc8_0a3e_18771966d43d -->|calls| c152abff_8100_9c7d_3485_bc87dc010c4f
  style f984e4e0_af18_0fc8_0a3e_18771966d43d fill:#6366f1,stroke:#818cf8,color:#fff
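
The function listed under Source Code walks the DOM with an explicit stack instead of recursion, pushing each node's children in reverse so they are popped in document order. The sketch below illustrates only that traversal pattern. It is not the library's implementation: `Node` and `iter_document_order` are hypothetical names invented here, and `Node` is a stand-in for a parsed HTML element rather than a BeautifulSoup class.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element (not a bs4 class)."""

    name: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


def iter_document_order(root: Node) -> Iterator[Node]:
    """Yield nodes depth-first, in document order, using an explicit stack."""
    stack = [root]
    while stack:
        node = stack.pop()
        # Push children reversed so the first child is the next one popped.
        stack.extend(reversed(node.children))
        yield node


tree = Node(
    "body",
    children=[
        Node("h1", "Intro"),
        Node("div", children=[Node("p", "First"), Node("p", "Second")]),
        Node("h2", "Details"),
    ],
)

visited = [n.name for n in iter_document_order(tree)]
texts = [n.text for n in iter_document_order(tree) if n.text]
print(visited)
print(texts)
```

Without the `reversed()` call the last child would be visited first, so header text would be encountered after the content that follows it in the page. An explicit stack also avoids Python's recursion limit on deeply nested markup.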

Source Code

libs/text-splitters/langchain_text_splitters/html.py lines 230–345

    def _generate_documents(self, html_content: str) -> Iterator[Document]:
        """Perform a DFS traversal over the DOM and yield Document objects on the fly.

        This approach maintains the same splitting logic (headers vs. non-headers,
        chunking, etc.) while walking the DOM explicitly in code.

        Args:
            html_content: The raw HTML content.

        Yields:
            Document objects as they are created.

        Raises:
            ImportError: If BeautifulSoup is not installed.
        """
        if not _HAS_BS4:
            msg = (
                "Unable to import BeautifulSoup. Please install via `pip install bs4`."
            )
            raise ImportError(msg)

        soup = BeautifulSoup(html_content, "html.parser")
        body = soup.body or soup

        # Dictionary of active headers:
        #   key = user-defined header name (e.g. "Header 1")
        #   value = tuple of header_text, level, dom_depth
        active_headers: dict[str, tuple[str, int, int]] = {}
        current_chunk: list[str] = []

        def finalize_chunk() -> Document | None:
            """Finalize the accumulated chunk into a single Document."""
            if not current_chunk:
                return None

            final_text = "  \n".join(line for line in current_chunk if line.strip())
            current_chunk.clear()
            if not final_text.strip():
                return None

            final_meta = {k: v[0] for k, v in active_headers.items()}
            return Document(page_content=final_text, metadata=final_meta)

        # We'll use a stack for DFS traversal
        stack = [body]
        while stack:
            node = stack.pop()
            children = list(node.children)

            stack.extend(
                child for child in reversed(children) if isinstance(child, Tag)
            )

            tag = getattr(node, "name", None)
            if not tag:
                continue

            text_elements = [
                str(child).strip() for child in _find_all_strings(node, recursive=False)
            ]
            node_text = " ".join(elem for elem in text_elements if elem)
            if not node_text:
                continue

            dom_depth = len(list(node.parents))

            # If this node is one of our headers
            if tag in self.header_tags:
                # If we're aggregating, finalize whatever chunk we had
                if not self.return_each_element:
                    doc = finalize_chunk()
                    if doc:
                        yield doc

                # Determine numeric level (h1->1, h2->2, etc.)
                try:
                    level = int(tag[1:])
                except ValueError:
                    level = 9999