html.py — langchain Source File

Architecture documentation for html.py, a Python file in the langchain codebase. It has 15 imports and 0 dependents.

Type: File | Language: Python | Domain: DocumentProcessing | Subdomain: TextSplitters | 15 imports | 9 functions | 4 classes

Dependency Diagram

graph LR
  e3efe57c_5b49_c26c_6ca5_45acccb8037f["html.py"]
  e874d8a4_cef0_9d0b_d1ee_84999c07cc2c["copy"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> e874d8a4_cef0_9d0b_d1ee_84999c07cc2c
  b6ee5de5_719a_eeb5_1e11_e9c63bc22ef8["pathlib"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> b6ee5de5_719a_eeb5_1e11_e9c63bc22ef8
  67ec3255_645e_8b6e_1eff_1eb3c648ed95["re"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> 67ec3255_645e_8b6e_1eff_1eb3c648ed95
  4e334bc1_18d9_a6a4_18e5_7a3030396c51["io"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> 4e334bc1_18d9_a6a4_18e5_7a3030396c51
  8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3["typing"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> 8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3
  792c09b7_7372_31d2_e29c_dc98949aa3c2["requests"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> 792c09b7_7372_31d2_e29c_dc98949aa3c2
  b19a8b7e_fbee_95b1_65b8_509a1ed3cad7["langchain_core._api"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> b19a8b7e_fbee_95b1_65b8_509a1ed3cad7
  c554676d_b731_47b2_a98f_c1c2d537c0aa["langchain_core.documents"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> c554676d_b731_47b2_a98f_c1c2d537c0aa
  91721f45_4909_e489_8c1f_084f8bd87145["typing_extensions"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> 91721f45_4909_e489_8c1f_084f8bd87145
  26e26c06_c107_2778_a237_35607f5a6d20["langchain_text_splitters.character"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> 26e26c06_c107_2778_a237_35607f5a6d20
  cfe2bde5_180e_e3b0_df2b_55b3ebaca8e7["collections.abc"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> cfe2bde5_180e_e3b0_df2b_55b3ebaca8e7
  12ec29de_c252_354e_e837_1cd86b8f7af4["bs4.element"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> 12ec29de_c252_354e_e837_1cd86b8f7af4
  0a45c4a1_846f_03df_b842_eb6b566c6404["nltk.py"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> 0a45c4a1_846f_03df_b842_eb6b566c6404
  eb7a0951_cedf_4f9a_c480_750414eb0f4e["bs4"]
  e3efe57c_5b49_c26c_6ca5_45acccb8037f --> eb7a0951_cedf_4f9a_c480_750414eb0f4e
  style e3efe57c_5b49_c26c_6ca5_45acccb8037f fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

"""HTML text splitters."""

from __future__ import annotations

import copy
import pathlib
import re
from io import StringIO
from typing import (
    IO,
    TYPE_CHECKING,
    Any,
    Literal,
    TypedDict,
    cast,
)

import requests
from langchain_core._api import beta
from langchain_core.documents import BaseDocumentTransformer, Document
from typing_extensions import override

from langchain_text_splitters.character import RecursiveCharacterTextSplitter

if TYPE_CHECKING:
    from collections.abc import Callable, Iterable, Iterator, Sequence

    from bs4.element import ResultSet

try:
    import nltk

    _HAS_NLTK = True
except ImportError:
    _HAS_NLTK = False

try:
    from bs4 import BeautifulSoup, Tag
    from bs4.element import NavigableString, PageElement

    _HAS_BS4 = True
except ImportError:
    _HAS_BS4 = False

try:
    from lxml import etree

    _HAS_LXML = True
except ImportError:
    _HAS_LXML = False


class ElementType(TypedDict):
    """Element type as typed dict."""

    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

// ... (1004 more lines)
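
The `ElementType` class in the excerpt is a `TypedDict`, so at runtime it is an ordinary `dict` whose keys and value types are checked only by static type checkers. As a minimal sketch of how such a record might be built, the class is copied from the excerpt and all field values are hypothetical placeholders (they are not taken from html.py):

```python
from typing import TypedDict


class ElementType(TypedDict):
    """Element type as typed dict (copied from the excerpt)."""

    url: str
    xpath: str
    content: str
    metadata: dict[str, str]


# Hypothetical values, for illustration only.
element: ElementType = {
    "url": "https://example.com/page.html",
    "xpath": "/html/body/h1",
    "content": "Introduction",
    "metadata": {"Header 1": "Introduction"},
}

# A TypedDict instance is a plain dict at runtime.
print(type(element) is dict)
print(element["metadata"]["Header 1"])
```

Because no validation happens at runtime, a missing or misspelled key is only caught if a type checker such as mypy or pyright is run over the code.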

Dependencies

  • bs4
  • bs4.element
  • collections.abc
  • copy
  • io
  • langchain_core._api
  • langchain_core.documents
  • langchain_text_splitters.character
  • lxml
  • nltk.py
  • pathlib
  • re
  • requests
  • typing
  • typing_extensions
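
Three of these dependencies (`bs4`, `lxml`, and `nltk`) are imported inside `try`/`except ImportError` blocks in the source excerpt, so the module can load without them. A quick way to see which of them are installed in a given environment could look like the sketch below. The helper name `available` is hypothetical and not part of html.py; it relies only on the standard library's `importlib.util.find_spec`.

```python
from importlib.util import find_spec

# Top-level packages that html.py imports inside try/except blocks.
OPTIONAL_PACKAGES = ("bs4", "lxml", "nltk")


def available(names):
    """Map each top-level package name to whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available(OPTIONAL_PACKAGES).items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

`find_spec` only locates a package without importing it, so the check is cheap and has no side effects for top-level names.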

Frequently Asked Questions

What does html.py do?
html.py is a Python source file in the langchain codebase. Its module docstring describes it as "HTML text splitters." It belongs to the DocumentProcessing domain, TextSplitters subdomain.
What functions are defined in html.py?
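The analyzer reports 9 function-level symbol(s) for html.py: _HAS_BS4, _HAS_LXML, _HAS_NLTK, _find_all_strings, _find_all_tags, bs4, collections, lxml, nltk. Judging from the source excerpt, only _find_all_strings and _find_all_tags look like functions (their definitions fall in the elided part of the file). _HAS_BS4, _HAS_LXML, and _HAS_NLTK are module-level boolean flags set by the try/except import blocks, and bs4, collections, lxml, and nltk are imported modules.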
What does html.py depend on?
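html.py imports 15 module(s): bs4, bs4.element, collections.abc, copy, io, langchain_core._api, langchain_core.documents, langchain_text_splitters.character, lxml, nltk.py, pathlib, re, requests, typing, and typing_extensions. Of these, bs4, lxml, and nltk are imported inside try/except ImportError blocks, so they are optional at import time.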
Where is html.py in the architecture?
html.py is located at libs/text-splitters/langchain_text_splitters/html.py (domain: DocumentProcessing, subdomain: TextSplitters, directory: libs/text-splitters/langchain_text_splitters).
