html.py — langchain Source File

Architecture documentation for html.py, a Python file in the langchain codebase. It has 4 imports and 0 dependents.

Dependency Diagram

graph LR
  12c46b04_27d4_b4f1_cde0_e519281fdaeb["html.py"]
  2a7f66a7_8738_3d47_375b_70fcaa6ac169["logging"]
  12c46b04_27d4_b4f1_cde0_e519281fdaeb --> 2a7f66a7_8738_3d47_375b_70fcaa6ac169
  67ec3255_645e_8b6e_1eff_1eb3c648ed95["re"]
  12c46b04_27d4_b4f1_cde0_e519281fdaeb --> 67ec3255_645e_8b6e_1eff_1eb3c648ed95
  cfe2bde5_180e_e3b0_df2b_55b3ebaca8e7["collections.abc"]
  12c46b04_27d4_b4f1_cde0_e519281fdaeb --> cfe2bde5_180e_e3b0_df2b_55b3ebaca8e7
  c89186be_3766_27dd_efaa_6092bf0ccc74["urllib.parse"]
  12c46b04_27d4_b4f1_cde0_e519281fdaeb --> c89186be_3766_27dd_efaa_6092bf0ccc74
  style 12c46b04_27d4_b4f1_cde0_e519281fdaeb fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

"""Utilities for working with HTML."""

import logging
import re
from collections.abc import Sequence
from urllib.parse import urljoin, urlparse

logger = logging.getLogger(__name__)

PREFIXES_TO_IGNORE = ("javascript:", "mailto:", "#")

SUFFIXES_TO_IGNORE = (
    ".css",
    ".js",
    ".ico",
    ".png",
    ".jpg",
    ".jpeg",
    ".gif",
    ".svg",
    ".csv",
    ".bz2",
    ".zip",
    ".epub",
    ".webp",
    ".pdf",
    ".docx",
    ".xlsx",
    ".pptx",
    ".pptm",
)

SUFFIXES_TO_IGNORE_REGEX = (
    "(?!" + "|".join([re.escape(s) + r"[\#'\"]" for s in SUFFIXES_TO_IGNORE]) + ")"
)

PREFIXES_TO_IGNORE_REGEX = (
    "(?!" + "|".join([re.escape(s) for s in PREFIXES_TO_IGNORE]) + ")"
)

DEFAULT_LINK_REGEX = (
    rf"href=[\"']{PREFIXES_TO_IGNORE_REGEX}((?:{SUFFIXES_TO_IGNORE_REGEX}.)*?)[\#'\"]"
)


def find_all_links(
    raw_html: str, *, pattern: str | re.Pattern | None = None
) -> list[str]:
    """Extract all links from a raw HTML string.

    Args:
        raw_html: original HTML.
        pattern: Regex to use for extracting links from raw HTML.

    Returns:
        A list of all links found in the HTML.
    """
    pattern = pattern or DEFAULT_LINK_REGEX
    return list(set(re.findall(pattern, raw_html)))

// ... (73 more lines)
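To make the behaviour of the default pattern concrete, here is a minimal sketch that runs find_all_links against a small page. The constants and function body are copied from the listing above so the example runs without langchain installed; SAMPLE_HTML and the example.com URLs are made up for illustration. Because the function de-duplicates through a set, result order is arbitrary, so the sketch sorts before printing.

```python
import re

# Constants copied from the listing above (same values, compacted layout).
PREFIXES_TO_IGNORE = ("javascript:", "mailto:", "#")
SUFFIXES_TO_IGNORE = (
    ".css", ".js", ".ico", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".csv",
    ".bz2", ".zip", ".epub", ".webp", ".pdf", ".docx", ".xlsx", ".pptx", ".pptm",
)
# Negative lookahead: refuse to consume an ignored suffix that ends the link.
SUFFIXES_TO_IGNORE_REGEX = (
    "(?!" + "|".join([re.escape(s) + r"[\#'\"]" for s in SUFFIXES_TO_IGNORE]) + ")"
)
# Negative lookahead: refuse links that start with an ignored prefix.
PREFIXES_TO_IGNORE_REGEX = (
    "(?!" + "|".join([re.escape(s) for s in PREFIXES_TO_IGNORE]) + ")"
)
DEFAULT_LINK_REGEX = (
    rf"href=[\"']{PREFIXES_TO_IGNORE_REGEX}((?:{SUFFIXES_TO_IGNORE_REGEX}.)*?)[\#'\"]"
)


def find_all_links(raw_html, *, pattern=None):
    """Same body as the listing: regex match, then de-duplicate via a set."""
    pattern = pattern or DEFAULT_LINK_REGEX
    return list(set(re.findall(pattern, raw_html)))


# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<a href="https://example.com/docs">Docs</a>
<a href="/about#team">About (team)</a>
<a href="/about">About</a>
<link href="/static/site.css" rel="stylesheet">
<a href="mailto:hi@example.com">Mail</a>
<a href="#top">Top</a>
<a href="javascript:void(0)">Menu</a>
"""

# Stylesheet, mailto, fragment-only and javascript links are skipped;
# "/about#team" is cut at the "#" and collapses into "/about".
links = sorted(find_all_links(SAMPLE_HTML))
print(links)  # ['/about', 'https://example.com/docs']

# A custom pattern (string or compiled) replaces the default entirely.
src_links = sorted(
    find_all_links(
        '<img src="/a.png"><img src="/b.png">',
        pattern=re.compile(r'src="([^"]*)"'),
    )
)
print(src_links)  # ['/a.png', '/b.png']
```

Note that the suffix filter only applies to the default pattern: once a custom pattern is passed, nothing from SUFFIXES_TO_IGNORE or PREFIXES_TO_IGNORE is enforced.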

Domain

CoreAbstractions

Subdomains

RunnableInterface

Functions

extract_sub_links
find_all_links

Dependencies

collections.abc
logging
re
urllib.parse

Frequently Asked Questions

What does html.py do?

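html.py is a Python source file in the langchain codebase that provides utilities for working with HTML, such as extracting links from a raw HTML string with a regular expression. It belongs to the CoreAbstractions domain and the RunnableInterface subdomain.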

What functions are defined in html.py?

html.py defines two functions: extract_sub_links and find_all_links.

What does html.py depend on?

html.py imports four standard-library modules: collections.abc, logging, re, and urllib.parse.
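The visible part of the listing never calls urljoin or urlparse, so they are presumably used in the elided portion. As a rough illustration of what these two standard-library functions offer when handling extracted links, the sketch below resolves relative links against a page URL and compares hosts. The helpers resolve_links and same_host are hypothetical names invented here; they are not taken from html.py and do not describe how extract_sub_links is implemented.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(page_url, links):
    """Hypothetical helper: turn relative links into absolute, de-duplicated URLs."""
    return sorted({urljoin(page_url, link) for link in links})


def same_host(url, base_url):
    """Hypothetical check: True when both URLs share a network location."""
    return urlparse(url).netloc == urlparse(base_url).netloc


# Made-up page URL and links, for illustration only.
page = "https://example.com/docs/intro"
absolute = resolve_links(page, ["/about", "guide", "https://other.org/x"])
print(absolute)
# ['https://example.com/about', 'https://example.com/docs/guide', 'https://other.org/x']

# Keep only links that stay on the same host as the page.
internal = [u for u in absolute if same_host(u, page)]
print(internal)
# ['https://example.com/about', 'https://example.com/docs/guide']
```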

Where is html.py in the architecture?

html.py is located at libs/core/langchain_core/utils/html.py (domain: CoreAbstractions, subdomain: RunnableInterface, directory: libs/core/langchain_core/utils).
