test_html_security.py — langchain Source File

Architecture documentation for test_html_security.py, a Python file in the langchain codebase. It has 2 imports and 0 dependents.

Dependency Diagram

graph LR
  1dc7ce96_c79b_910a_383d_c0c57040500a["test_html_security.py"]
  120e2591_3e15_b895_72b6_cb26195e40a6["pytest"]
  1dc7ce96_c79b_910a_383d_c0c57040500a --> 120e2591_3e15_b895_72b6_cb26195e40a6
  e39c01af_a371_ebc0_15ec_4d64e7690fd7["langchain_text_splitters.html"]
  1dc7ce96_c79b_910a_383d_c0c57040500a --> e39c01af_a371_ebc0_15ec_4d64e7690fd7
  style 1dc7ce96_c79b_910a_383d_c0c57040500a fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

"""Security tests for HTML splitters to prevent XXE attacks."""

import pytest

from langchain_text_splitters.html import HTMLSectionSplitter


@pytest.mark.requires("lxml", "bs4")
class TestHTMLSectionSplitterSecurity:
    """Security tests for HTMLSectionSplitter to ensure XXE prevention."""

    def test_xxe_entity_attack_blocked(self) -> None:
        """Test that external entity attacks are blocked."""
        # Create HTML content to process
        html_content = """<html><body><p>Test content</p></body></html>"""

        # Since the xslt_path parameter has been removed, this attack vector is eliminated
        # The splitter should use only the default XSLT
        splitter = HTMLSectionSplitter(headers_to_split_on=[("h1", "Header 1")])

        # Process the HTML - should not contain any external entity content
        result = splitter.split_text(html_content)

        # Verify that no external entity content is present
        all_content = " ".join([doc.page_content for doc in result])
        assert "root:" not in all_content  # /etc/passwd content
        assert "XXE Attack Result" not in all_content

    def test_xxe_document_function_blocked(self) -> None:
        """Test that XSLT document() function attacks are blocked."""
        # Even if someone modifies the default XSLT internally,
        # the secure parser configuration should block document() attacks

        html_content = (
            """<html><body><h1>Test Header</h1><p>Test content</p></body></html>"""
        )

        splitter = HTMLSectionSplitter(headers_to_split_on=[("h1", "Header 1")])

        # Process the HTML safely
        result = splitter.split_text(html_content)

        # Should process normally without any security issues
        assert len(result) > 0
        assert any("Test content" in doc.page_content for doc in result)

    def test_secure_parser_configuration(self) -> None:
        """Test that parsers are configured with security settings."""
        # This test verifies our security hardening is in place
        html_content = """<html><body><h1>Test</h1></body></html>"""

        splitter = HTMLSectionSplitter(headers_to_split_on=[("h1", "Header 1")])

        # The convert_possible_tags_to_header method should use secure parsers
        result = splitter.convert_possible_tags_to_header(html_content)

        # Result should be valid transformed HTML
        assert result is not None
        assert isinstance(result, str)

# ... (71 more lines)
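The inputs in the excerpt above contain no entity declarations, so on their own they cannot show what an external-entity payload looks like (the remaining lines are not shown here and may cover this). The following is a minimal sketch of such a payload together with a leak check in the spirit of the `"root:" not in all_content` assertion. It is not the langchain implementation: it uses the standard library's `xml.etree.ElementTree` as a stand-in parser rather than `HTMLSectionSplitter` or lxml, and `XXE_PAYLOAD`, `LEAK_MARKERS`, and `leaks_external_content` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: an external entity that points at a local file.
XXE_PAYLOAD = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE root [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>'
    "<root>&xxe;</root>"
)

# Same markers the excerpted test asserts against.
LEAK_MARKERS = ("root:", "XXE Attack Result")


def leaks_external_content(xml_text: str) -> bool:
    """Return True if the parsed text contains any leak marker."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser rejected the document, so nothing was expanded.
        return False
    text = "".join(root.itertext())
    return any(marker in text for marker in LEAK_MARKERS)


if __name__ == "__main__":
    print(leaks_external_content(XXE_PAYLOAD))
```

The helper treats a parse error as "no leak" because a parser that refuses the document has not read the referenced file. A test against the real splitter would feed a comparable payload through `split_text` and apply the same marker check to the joined `page_content`.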

Dependencies

  • langchain_text_splitters.html
  • pytest
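A dependency list like the one above can be recovered from a file's import statements. The sketch below shows one way to do that with the standard library's `ast` module. How this page's listing was actually produced is not stated, so treat the approach as an assumption; `top_level_imports` and `SAMPLE` are hypothetical names.

```python
import ast


def top_level_imports(source: str) -> list[str]:
    """Return the sorted module names imported at the top level of source."""
    modules: set[str] = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            # "import a, b" contributes every listed module name.
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from pkg.mod import X" contributes "pkg.mod".
            modules.add(node.module)
    return sorted(modules)


# The two import lines from the source listing on this page.
SAMPLE = (
    "import pytest\n"
    "\n"
    "from langchain_text_splitters.html import HTMLSectionSplitter\n"
)

if __name__ == "__main__":
    for name in top_level_imports(SAMPLE):
        print(name)
```

Because `ast.parse` only reads the source text, the listed packages do not need to be installed for this to run.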

Frequently Asked Questions

What does test_html_security.py do?
test_html_security.py is a Python source file in the langchain codebase. It contains security tests for HTMLSectionSplitter that check XXE attacks are prevented. It belongs to the CoreAbstractions domain, Serialization subdomain.
What does test_html_security.py depend on?
test_html_security.py imports 2 modules: langchain_text_splitters.html and pytest.
Where is test_html_security.py in the architecture?
test_html_security.py is located at libs/text-splitters/tests/unit_tests/test_html_security.py (domain: CoreAbstractions, subdomain: Serialization, directory: libs/text-splitters/tests/unit_tests).