test_html_security.py — langchain Source File
Architecture documentation for test_html_security.py, a Python file in the langchain codebase. 2 imports, 0 dependents.
Dependency Diagram
graph LR
    1dc7ce96_c79b_910a_383d_c0c57040500a["test_html_security.py"]
    120e2591_3e15_b895_72b6_cb26195e40a6["pytest"]
    1dc7ce96_c79b_910a_383d_c0c57040500a --> 120e2591_3e15_b895_72b6_cb26195e40a6
    e39c01af_a371_ebc0_15ec_4d64e7690fd7["langchain_text_splitters.html"]
    1dc7ce96_c79b_910a_383d_c0c57040500a --> e39c01af_a371_ebc0_15ec_4d64e7690fd7
    style 1dc7ce96_c79b_910a_383d_c0c57040500a fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
"""Security tests for HTML splitters to prevent XXE attacks."""

import pytest

from langchain_text_splitters.html import HTMLSectionSplitter


@pytest.mark.requires("lxml", "bs4")
class TestHTMLSectionSplitterSecurity:
    """Security tests for HTMLSectionSplitter to ensure XXE prevention."""

    def test_xxe_entity_attack_blocked(self) -> None:
        """Test that external entity attacks are blocked."""
        # Create HTML content to process
        html_content = """<html><body><p>Test content</p></body></html>"""

        # Since xslt_path parameter is removed, this attack vector is eliminated
        # The splitter should use only the default XSLT
        splitter = HTMLSectionSplitter(headers_to_split_on=[("h1", "Header 1")])

        # Process the HTML - should not contain any external entity content
        result = splitter.split_text(html_content)

        # Verify that no external entity content is present
        all_content = " ".join([doc.page_content for doc in result])
        assert "root:" not in all_content  # /etc/passwd content
        assert "XXE Attack Result" not in all_content

    def test_xxe_document_function_blocked(self) -> None:
        """Test that XSLT document() function attacks are blocked."""
        # Even if someone modifies the default XSLT internally,
        # the secure parser configuration should block document() attacks
        html_content = (
            """<html><body><h1>Test Header</h1><p>Test content</p></body></html>"""
        )

        splitter = HTMLSectionSplitter(headers_to_split_on=[("h1", "Header 1")])

        # Process the HTML safely
        result = splitter.split_text(html_content)

        # Should process normally without any security issues
        assert len(result) > 0
        assert any("Test content" in doc.page_content for doc in result)

    def test_secure_parser_configuration(self) -> None:
        """Test that parsers are configured with security settings."""
        # This test verifies our security hardening is in place
        html_content = """<html><body><h1>Test</h1></body></html>"""

        splitter = HTMLSectionSplitter(headers_to_split_on=[("h1", "Header 1")])

        # The convert_possible_tags_to_header method should use secure parsers
        result = splitter.convert_possible_tags_to_header(html_content)

        # Result should be valid transformed HTML
        assert result is not None
        assert isinstance(result, str)
// ... (71 more lines)
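The leak check used in test_xxe_entity_attack_blocked can be tried out without the splitter or its optional dependencies. The following is a minimal sketch, assuming a hypothetical FakeDocument stand-in for the objects returned by split_text and a hypothetical find_leak_markers helper. Neither name exists in langchain; only the two marker strings come from the test above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing only the page_content attribute."""

    page_content: str


# Marker strings taken from the assertions in test_xxe_entity_attack_blocked.
LEAK_MARKERS: Sequence[str] = ("root:", "XXE Attack Result")


def find_leak_markers(
    docs: Iterable[FakeDocument],
    markers: Sequence[str] = LEAK_MARKERS,
) -> List[str]:
    """Return every marker that appears in the joined page content."""
    # Same joining step as the test: one string built from all chunks.
    all_content = " ".join(doc.page_content for doc in docs)
    return [marker for marker in markers if marker in all_content]


# Clean output: no marker is present.
clean = [FakeDocument("Test content")]
# Simulated leak: a line shaped like an /etc/passwd entry.
leaked = [FakeDocument("Test content"), FakeDocument("root:x:0:0:root:/root:/bin/sh")]
```

In this sketch, an empty list from find_leak_markers corresponds to both `not in` assertions in the test passing. A non-empty list names the markers that leaked.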
Domain
- CoreAbstractions
Subdomains
- Serialization
Classes
- TestHTMLSectionSplitterSecurity
Dependencies
- langchain_text_splitters.html
- pytest
Frequently Asked Questions
What does test_html_security.py do?
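test_html_security.py is a Python unit-test file in the langchain codebase. It defines TestHTMLSectionSplitterSecurity, a pytest class that checks HTMLSectionSplitter against XXE (XML external entity) and XSLT document() attacks. It belongs to the CoreAbstractions domain, Serialization subdomain.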
What does test_html_security.py depend on?
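test_html_security.py imports two modules: langchain_text_splitters.html and pytest. The test class is also marked with @pytest.mark.requires("lxml", "bs4"), so it expects those two packages at run time.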
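The requires marker on the test class is a custom pytest marker, and its handling lives outside this file. As a rough sketch of the idea only, the helper below reports which named packages cannot be found, which is the kind of check a conftest hook could use to skip a test. The function name missing_packages is hypothetical, and this is an assumption about the general approach, not the project's actual implementation.

```python
from importlib.util import find_spec
from typing import List


def missing_packages(*names: str) -> List[str]:
    """Return the top-level package names that are not importable.

    find_spec returns None when no module with that top-level name
    can be located, and it does not import the module itself.
    """
    return [name for name in names if find_spec(name) is None]


# "json" ships with the standard library, so it is never reported missing.
stdlib_check = missing_packages("json")
# A deliberately made-up name that should not exist in any environment.
bogus_check = missing_packages("no_such_package_for_this_sketch")
```

A collection hook could call such a helper with the marker's arguments, for example missing_packages("lxml", "bs4"), and skip the test when the result is non-empty.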
Where is test_html_security.py in the architecture?
test_html_security.py is located at libs/text-splitters/tests/unit_tests/test_html_security.py (domain: CoreAbstractions, subdomain: Serialization, directory: libs/text-splitters/tests/unit_tests).