html.py — langchain Source File
Architecture documentation for html.py, a Python file in the langchain codebase. 4 imports, 0 dependents.
Dependency Diagram
graph LR
  12c46b04_27d4_b4f1_cde0_e519281fdaeb["html.py"]
  2a7f66a7_8738_3d47_375b_70fcaa6ac169["logging"]
  12c46b04_27d4_b4f1_cde0_e519281fdaeb --> 2a7f66a7_8738_3d47_375b_70fcaa6ac169
  67ec3255_645e_8b6e_1eff_1eb3c648ed95["re"]
  12c46b04_27d4_b4f1_cde0_e519281fdaeb --> 67ec3255_645e_8b6e_1eff_1eb3c648ed95
  cfe2bde5_180e_e3b0_df2b_55b3ebaca8e7["collections.abc"]
  12c46b04_27d4_b4f1_cde0_e519281fdaeb --> cfe2bde5_180e_e3b0_df2b_55b3ebaca8e7
  c89186be_3766_27dd_efaa_6092bf0ccc74["urllib.parse"]
  12c46b04_27d4_b4f1_cde0_e519281fdaeb --> c89186be_3766_27dd_efaa_6092bf0ccc74
  style 12c46b04_27d4_b4f1_cde0_e519281fdaeb fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
"""Utilities for working with HTML."""

import logging
import re
from collections.abc import Sequence
from urllib.parse import urljoin, urlparse

logger = logging.getLogger(__name__)

PREFIXES_TO_IGNORE = ("javascript:", "mailto:", "#")

SUFFIXES_TO_IGNORE = (
    ".css",
    ".js",
    ".ico",
    ".png",
    ".jpg",
    ".jpeg",
    ".gif",
    ".svg",
    ".csv",
    ".bz2",
    ".zip",
    ".epub",
    ".webp",
    ".pdf",
    ".docx",
    ".xlsx",
    ".pptx",
    ".pptm",
)
SUFFIXES_TO_IGNORE_REGEX = (
    "(?!" + "|".join([re.escape(s) + r"[\#'\"]" for s in SUFFIXES_TO_IGNORE]) + ")"
)
PREFIXES_TO_IGNORE_REGEX = (
    "(?!" + "|".join([re.escape(s) for s in PREFIXES_TO_IGNORE]) + ")"
)
DEFAULT_LINK_REGEX = (
    rf"href=[\"']{PREFIXES_TO_IGNORE_REGEX}((?:{SUFFIXES_TO_IGNORE_REGEX}.)*?)[\#'\"]"
)


def find_all_links(
    raw_html: str, *, pattern: str | re.Pattern | None = None
) -> list[str]:
    """Extract all links from a raw HTML string.

    Args:
        raw_html: original HTML.
        pattern: Regex to use for extracting links from raw HTML.

    Returns:
        A list of all links found in the HTML.
    """
    pattern = pattern or DEFAULT_LINK_REGEX
    return list(set(re.findall(pattern, raw_html)))
// ... (73 more lines)
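To see how the default pattern behaves, the following is a minimal, self-contained sketch rather than the library module itself. It copies the regex construction and find_all_links from the listing above, but with a shortened suffix list, and SAMPLE_HTML is a made-up input for illustration. It suggests that links with an ignored prefix or suffix are skipped, fragments after "#" are cut off, and the result is deduplicated. Because the function returns list(set(...)), order is not guaranteed, so the sketch sorts the result.

```python
import re

# Copied from the listing above; the suffix tuple is shortened for this example.
PREFIXES_TO_IGNORE = ("javascript:", "mailto:", "#")
SUFFIXES_TO_IGNORE = (".css", ".js", ".png", ".pdf")

# Negative lookahead: reject an ignored suffix directly followed by '#', "'" or '"'.
SUFFIXES_TO_IGNORE_REGEX = (
    "(?!" + "|".join([re.escape(s) + r"[\#'\"]" for s in SUFFIXES_TO_IGNORE]) + ")"
)
# Negative lookahead: reject hrefs that start with an ignored prefix.
PREFIXES_TO_IGNORE_REGEX = (
    "(?!" + "|".join([re.escape(s) for s in PREFIXES_TO_IGNORE]) + ")"
)
DEFAULT_LINK_REGEX = (
    rf"href=[\"']{PREFIXES_TO_IGNORE_REGEX}((?:{SUFFIXES_TO_IGNORE_REGEX}.)*?)[\#'\"]"
)


def find_all_links(raw_html, *, pattern=None):
    """Same body as the function in the listing above (type hints omitted)."""
    pattern = pattern or DEFAULT_LINK_REGEX
    return list(set(re.findall(pattern, raw_html)))


# Hypothetical input, invented for this example.
SAMPLE_HTML = """
<a href="https://example.com/docs">Docs</a>
<a href="https://example.com/docs">Docs again (duplicate)</a>
<a href="/about#team">About</a>
<a href='/guide'>Guide</a>
<a href="mailto:hi@example.com">Mail</a>
<a href="javascript:void(0)">Script</a>
<a href="#top">Top</a>
<link href="/static/site.css">
"""

links = sorted(find_all_links(SAMPLE_HTML))
print(links)
# ['/about', '/guide', 'https://example.com/docs']
```

Note that relative links such as /about come back unresolved from find_all_links; the module's urljoin and urlparse imports are presumably used in the elided part of the file to deal with that, but that code is not shown here.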
Domain
- CoreAbstractions
Subdomains
- RunnableInterface
Functions
- extract_sub_links
- find_all_links
Dependencies
- collections.abc
- logging
- re
- urllib.parse
Source
- libs/core/langchain_core/utils/html.py
Frequently Asked Questions
What does html.py do?
html.py is a Python source file in the langchain codebase that provides utilities for working with HTML, such as extracting links from a raw HTML string. It belongs to the CoreAbstractions domain, RunnableInterface subdomain.
What functions are defined in html.py?
html.py defines two functions: extract_sub_links and find_all_links.
What does html.py depend on?
html.py imports four modules: collections.abc, logging, re, and urllib.parse.
Where is html.py in the architecture?
html.py is located at libs/core/langchain_core/utils/html.py (domain: CoreAbstractions, subdomain: RunnableInterface, directory: libs/core/langchain_core/utils).