crawler.py — langchain Source File

Architecture documentation for crawler.py, a Python file in the langchain codebase. 5 imports, 0 dependents.

Type: File. Language: Python. Domain: CoreAbstractions. Subdomain: RunnableInterface. 5 imports, 1 function, 2 classes.

Dependency Diagram

graph LR
  3917f38f_3078_cc58_74a8_71a235ab29ed["crawler.py"]
  2a7f66a7_8738_3d47_375b_70fcaa6ac169["logging"]
  3917f38f_3078_cc58_74a8_71a235ab29ed --> 2a7f66a7_8738_3d47_375b_70fcaa6ac169
  0c1d9a1b_c553_0388_dbc1_58af49567aa2["time"]
  3917f38f_3078_cc58_74a8_71a235ab29ed --> 0c1d9a1b_c553_0388_dbc1_58af49567aa2
  d76a28c2_c3ab_00a8_5208_77807a49449d["sys"]
  3917f38f_3078_cc58_74a8_71a235ab29ed --> d76a28c2_c3ab_00a8_5208_77807a49449d
  8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3["typing"]
  3917f38f_3078_cc58_74a8_71a235ab29ed --> 8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3
  958c58cc_a30e_96f7_1ed2_ce3683f10d86["playwright.sync_api"]
  3917f38f_3078_cc58_74a8_71a235ab29ed --> 958c58cc_a30e_96f7_1ed2_ce3683f10d86
  style 3917f38f_3078_cc58_74a8_71a235ab29ed fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

import logging
import time
from sys import platform
from typing import (
    TYPE_CHECKING,
    Any,
    TypedDict,
)

if TYPE_CHECKING:
    from playwright.sync_api import Browser, CDPSession, Page

logger = logging.getLogger(__name__)

black_listed_elements: set[str] = {
    "html",
    "head",
    "title",
    "meta",
    "iframe",
    "body",
    "script",
    "style",
    "path",
    "svg",
    "br",
    "::marker",
}


class ElementInViewPort(TypedDict):
    """A typed dictionary containing information about elements in the viewport."""

    node_index: str
    backend_node_id: int
    node_name: str | None
    node_value: str | None
    node_meta: list[str]
    is_clickable: bool
    origin_x: int
    origin_y: int
    center_x: int
    center_y: int


class Crawler:
    """A crawler for web pages.

    **Security Note**: This is an implementation of a crawler that uses a browser via
        Playwright.

        This crawler can be used to load arbitrary webpages INCLUDING content
        from the local file system.

        Control access to who can submit crawling requests and what network access
        the crawler has.

        Make sure to scope permissions to the minimal permissions necessary for
        the application.

// ... (420 more lines)
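
The visible part of the listing defines a set of excluded tag names and a typed record for elements in the viewport. The sketch below shows one way the two could work together. It is an illustration, not the logic of the elided Crawler code: the helpers `is_black_listed` and `keep_candidates`, the lowercase comparison, and the sample records are all assumptions made for this example.

```python
from typing import Optional, TypedDict

# Same tag names as black_listed_elements in the listing above.
BLACK_LISTED_ELEMENTS: set[str] = {
    "html", "head", "title", "meta", "iframe", "body",
    "script", "style", "path", "svg", "br", "::marker",
}


class ElementInViewPort(TypedDict):
    """Same fields as the TypedDict in the listing above."""

    node_index: str
    backend_node_id: int
    node_name: Optional[str]
    node_value: Optional[str]
    node_meta: list[str]
    is_clickable: bool
    origin_x: int
    origin_y: int
    center_x: int
    center_y: int


def is_black_listed(node_name: Optional[str]) -> bool:
    """Hypothetical helper: True when the tag name is in the excluded set.

    Lowercasing the name before the lookup is an assumption of this sketch.
    """
    return node_name is not None and node_name.lower() in BLACK_LISTED_ELEMENTS


def keep_candidates(elements: list[ElementInViewPort]) -> list[ElementInViewPort]:
    """Hypothetical helper: drop records whose tag name is excluded."""
    return [e for e in elements if not is_black_listed(e["node_name"])]


# Made-up sample records, for illustration only.
SAMPLE: list[ElementInViewPort] = [
    ElementInViewPort(
        node_index="0", backend_node_id=11, node_name="script",
        node_value=None, node_meta=[], is_clickable=False,
        origin_x=0, origin_y=0, center_x=0, center_y=0,
    ),
    ElementInViewPort(
        node_index="1", backend_node_id=12, node_name="button",
        node_value="Search", node_meta=['type="submit"'], is_clickable=True,
        origin_x=100, origin_y=40, center_x=140, center_y=55,
    ),
]

KEPT = keep_candidates(SAMPLE)
```

A TypedDict is an ordinary dict at runtime, so the records can be indexed with `e["node_name"]` and serialized without conversion. The field types are checked only by static type checkers.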

Domain

CoreAbstractions

Subdomains

RunnableInterface

Functions

playwright()

Classes

ElementInViewPort
Crawler

Dependencies

logging
playwright.sync_api
sys
time
typing
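
The dependency list above can be reproduced from the import statements in the source listing. The sketch below uses the standard-library `ast` module to collect module names. It is one possible approach, not necessarily how this page was generated, and the function name `imported_modules` is hypothetical.

```python
import ast

# Import section copied from the source listing above.
SOURCE = '''
import logging
import time
from sys import platform
from typing import (
    TYPE_CHECKING,
    Any,
    TypedDict,
)

if TYPE_CHECKING:
    from playwright.sync_api import Browser, CDPSession, Page
'''


def imported_modules(source: str) -> list[str]:
    """Return the sorted, de-duplicated module names imported by `source`."""
    modules: set[str] = set()
    # ast.walk visits nested nodes too, so imports inside `if` blocks are found.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return sorted(modules)


MODULES = imported_modules(SOURCE)
print(MODULES)
```

Because the walk is purely syntactic, it counts `playwright.sync_api` even though that import sits behind a `TYPE_CHECKING` guard.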

Frequently Asked Questions

What does crawler.py do?

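crawler.py is a Python source file in the langchain codebase. It defines Crawler, a web page crawler that drives a browser through Playwright, along with the ElementInViewPort typed dictionary and a set of element names that are excluded from processing. It belongs to the CoreAbstractions domain, RunnableInterface subdomain.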

What functions are defined in crawler.py?

crawler.py defines 1 function: playwright().

What does crawler.py depend on?

crawler.py imports 5 modules: logging, playwright.sync_api, sys, time, typing.
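
In the source listing, the `playwright.sync_api` import sits under `if TYPE_CHECKING:`, so that statement serves type checkers and does not execute at runtime. The sketch below illustrates the pattern in isolation; the function `describe` is hypothetical, and the sketch says nothing about imports made in the elided part of the file.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only. At runtime TYPE_CHECKING is False,
    # so this line never runs and playwright does not need to be installed.
    from playwright.sync_api import Page


def describe(page: "Page") -> str:
    """Hypothetical function: the quoted annotation avoids a runtime lookup."""
    return f"page object of type {type(page).__name__}"


# Any object works at runtime because annotations are not enforced.
RESULT = describe(object())
print(RESULT)
```

The guard keeps the module importable in environments without Playwright while still giving editors and type checkers the real types.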

Where is crawler.py in the architecture?

crawler.py is located at libs/langchain/langchain_classic/chains/natbot/crawler.py (domain: CoreAbstractions, subdomain: RunnableInterface, directory: libs/langchain/langchain_classic/chains/natbot).
