_generate_documents() — langchain Function Reference
Architecture documentation for the _generate_documents() function in html.py from the langchain codebase.
Entity Profile
Dependency Diagram
graph TD
    f984e4e0_af18_0fc8_0a3e_18771966d43d["_generate_documents()"]
    86dc20d4_404a_b608_01da_8dea923ef2c9["HTMLHeaderTextSplitter"]
    f984e4e0_af18_0fc8_0a3e_18771966d43d -->|defined in| 86dc20d4_404a_b608_01da_8dea923ef2c9
    cf1e77cb_9fca_ca93_1428_c967d5cb0c97["split_text_from_file()"]
    cf1e77cb_9fca_ca93_1428_c967d5cb0c97 -->|calls| f984e4e0_af18_0fc8_0a3e_18771966d43d
    c152abff_8100_9c7d_3485_bc87dc010c4f["_find_all_strings()"]
    f984e4e0_af18_0fc8_0a3e_18771966d43d -->|calls| c152abff_8100_9c7d_3485_bc87dc010c4f
    style f984e4e0_af18_0fc8_0a3e_18771966d43d fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
libs/text-splitters/langchain_text_splitters/html.py lines 230–345
def _generate_documents(self, html_content: str) -> Iterator[Document]:
    """Private method that performs a DFS traversal over the DOM and yields
    Document objects on-the-fly. This approach maintains the same splitting logic
    (headers vs. non-headers, chunking, etc.) while walking the DOM explicitly in
    code.

    Args:
        html_content: The raw HTML content.

    Yields:
        Document objects as they are created.

    Raises:
        ImportError: If BeautifulSoup is not installed.
    """
    if not _HAS_BS4:
        msg = (
            "Unable to import BeautifulSoup. Please install via `pip install bs4`."
        )
        raise ImportError(msg)

    soup = BeautifulSoup(html_content, "html.parser")
    body = soup.body or soup

    # Dictionary of active headers:
    #   key = user-defined header name (e.g. "Header 1")
    #   value = tuple of (header_text, level, dom_depth)
    active_headers: dict[str, tuple[str, int, int]] = {}
    current_chunk: list[str] = []

    def finalize_chunk() -> Document | None:
        """Finalize the accumulated chunk into a single Document."""
        if not current_chunk:
            return None
        final_text = " \n".join(line for line in current_chunk if line.strip())
        current_chunk.clear()
        if not final_text.strip():
            return None
        final_meta = {k: v[0] for k, v in active_headers.items()}
        return Document(page_content=final_text, metadata=final_meta)

    # We'll use a stack for DFS traversal
    stack = [body]
    while stack:
        node = stack.pop()
        children = list(node.children)
        stack.extend(
            child for child in reversed(children) if isinstance(child, Tag)
        )

        tag = getattr(node, "name", None)
        if not tag:
            continue

        text_elements = [
            str(child).strip() for child in _find_all_strings(node, recursive=False)
        ]
        node_text = " ".join(elem for elem in text_elements if elem)
        if not node_text:
            continue

        dom_depth = len(list(node.parents))

        # If this node is one of our headers
        if tag in self.header_tags:
            # If we're aggregating, finalize whatever chunk we had
            if not self.return_each_element:
                doc = finalize_chunk()
                if doc:
                    yield doc

            # Determine numeric level (h1 -> 1, h2 -> 2, etc.)
            try:
                level = int(tag[1:])
            except ValueError:
                level = 9999

    # … (listing truncated; the full function runs through line 345)
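The core pattern of the listing above — an explicit stack-based DFS where pushing children in reversed order preserves document order, header tags update a dictionary of active metadata, and non-header text accumulates into chunks — can be sketched with the standard library alone. This is an illustrative simplification, not the langchain implementation: it uses `xml.etree.ElementTree` instead of BeautifulSoup, ignores tail text, and all names (`split_by_headers`, `HEADER_TAGS`, `finalize`) are hypothetical.

```python
from xml.etree import ElementTree as ET

# Illustrative header-tag -> metadata-key mapping (the real splitter
# takes this from a user-configurable headers_to_split_on argument).
HEADER_TAGS = {"h1": "Header 1", "h2": "Header 2"}

def split_by_headers(html: str) -> list[tuple[str, dict]]:
    """Stdlib-only sketch of the _generate_documents() pattern:
    stack-based DFS, active-header metadata, chunk finalization."""
    active: dict[str, tuple[str, int]] = {}  # header name -> (text, level)
    chunk: list[str] = []
    docs: list[tuple[str, dict]] = []

    def finalize() -> None:
        text = " \n".join(t for t in chunk if t.strip())
        chunk.clear()
        if text.strip():
            docs.append((text, {k: v[0] for k, v in active.items()}))

    stack = [ET.fromstring(html)]
    while stack:
        node = stack.pop()
        # Push children reversed so the LIFO stack yields document order.
        stack.extend(reversed(list(node)))
        text = (node.text or "").strip()
        if node.tag in HEADER_TAGS:
            finalize()                 # close the chunk under the old headers
            level = int(node.tag[1:])  # h1 -> 1, h2 -> 2, ...
            for name, (_, lvl) in list(active.items()):
                if lvl >= level:       # a new header evicts same-or-deeper ones
                    del active[name]
            if text:
                active[HEADER_TAGS[node.tag]] = (text, level)
        elif text:
            chunk.append(text)
    finalize()
    return docs

docs = split_by_headers(
    "<div><h1>Intro</h1><p>First.</p><h2>Details</h2><p>Second.</p></div>"
)
# -> [("First.", {"Header 1": "Intro"}),
#     ("Second.", {"Header 1": "Intro", "Header 2": "Details"})]
```

Note how the text under `<h2>Details</h2>` still carries the `Header 1` metadata: shallower headers stay active until a header of the same or shallower level replaces them, which is what gives each chunk its full heading context.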
Frequently Asked Questions
What does _generate_documents() do?
_generate_documents() is a private method of HTMLHeaderTextSplitter in the langchain codebase, defined in libs/text-splitters/langchain_text_splitters/html.py. It performs a depth-first traversal of the parsed DOM and yields Document objects on the fly as chunks are finalized.
Where is _generate_documents() defined?
_generate_documents() is defined in libs/text-splitters/langchain_text_splitters/html.py at line 230.
What does _generate_documents() call?
_generate_documents() calls one function: _find_all_strings().
What calls _generate_documents()?
_generate_documents() is called by one function: split_text_from_file().
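The generator/wrapper split named above — a public method that reads input and materializes what the private generator yields — can be sketched as follows. This toy `MiniSplitter` is purely illustrative: its generator body is a placeholder, and only the delegation shape mirrors the relationship between split_text_from_file() and _generate_documents().

```python
from io import StringIO
from typing import IO, Iterator

class MiniSplitter:
    """Toy stand-in showing only the public-wrapper ->
    private-generator delegation pattern."""

    def _generate_documents(self, html_content: str) -> Iterator[str]:
        # Placeholder body: yield one "document" per non-blank line.
        for line in html_content.splitlines():
            if line.strip():
                yield line.strip()

    def split_text_from_file(self, file: IO[str]) -> list[str]:
        # Public wrapper: read the file-like object and collect
        # everything the private generator yields.
        return list(self._generate_documents(file.read()))

docs = MiniSplitter().split_text_from_file(StringIO("<p>a</p>\n\n<p>b</p>"))
# -> ["<p>a</p>", "<p>b</p>"]
```

Keeping the traversal in a generator lets callers stream Documents one at a time, while the file-level wrapper offers the convenient list-returning entry point.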