Crawler Class — langchain Architecture

Architecture documentation for the Crawler class in crawler.py from the langchain codebase.

Class python

Dependency Diagram

graph TD
  73034b47_6ada_6cee_4b85_e74b6a3e14f1["Crawler"]
  145fa990_11fa_b0bb_ae0a_7f5499aed5f1["crawler.py"]
  73034b47_6ada_6cee_4b85_e74b6a3e14f1 -->|defined in| 145fa990_11fa_b0bb_ae0a_7f5499aed5f1
  26714a04_6ec4_5020_1632_5abdeca3bb3f["__init__()"]
  73034b47_6ada_6cee_4b85_e74b6a3e14f1 -->|method| 26714a04_6ec4_5020_1632_5abdeca3bb3f
  6ad987c3_70f5_176e_ae15_70693a21b69c["go_to_page()"]
  73034b47_6ada_6cee_4b85_e74b6a3e14f1 -->|method| 6ad987c3_70f5_176e_ae15_70693a21b69c
  b1787df7_7f98_f476_b1ef_b22bcc573a1f["scroll()"]
  73034b47_6ada_6cee_4b85_e74b6a3e14f1 -->|method| b1787df7_7f98_f476_b1ef_b22bcc573a1f
  a568a47c_223d_5e43_b052_ae9cd4915f78["click()"]
  73034b47_6ada_6cee_4b85_e74b6a3e14f1 -->|method| a568a47c_223d_5e43_b052_ae9cd4915f78
  0d17c403_8681_a93a_1e43_a269b13f41dc["type()"]
  73034b47_6ada_6cee_4b85_e74b6a3e14f1 -->|method| 0d17c403_8681_a93a_1e43_a269b13f41dc
  232a9aa6_3df9_48f4_e9c9_87da7ce52476["enter()"]
  73034b47_6ada_6cee_4b85_e74b6a3e14f1 -->|method| 232a9aa6_3df9_48f4_e9c9_87da7ce52476
  c0e7550e_b556_2c60_2efa_05b08fe640eb["crawl()"]
  73034b47_6ada_6cee_4b85_e74b6a3e14f1 -->|method| c0e7550e_b556_2c60_2efa_05b08fe640eb

Source Code

libs/langchain/langchain_classic/chains/natbot/crawler.py lines 46–479 (the excerpt below is truncated partway through click())

class Crawler:
    """A crawler for web pages.

    **Security Note**: This is an implementation of a crawler that uses a browser via
        Playwright.

        This crawler can be used to load arbitrary webpages INCLUDING content
        from the local file system.

        Control access to who can submit crawling requests and what network access
        the crawler has.

        Make sure to scope permissions to the minimal permissions necessary for
        the application.

        See https://docs.langchain.com/oss/python/security-policy for more information.
    """

    def __init__(self) -> None:
        """Initialize the crawler."""
        try:
            from playwright.sync_api import sync_playwright
        except ImportError as e:
            msg = (
                "Could not import playwright python package. "
                "Please install it with `pip install playwright`."
            )
            raise ImportError(msg) from e
        self.browser: Browser = (
            sync_playwright().start().chromium.launch(headless=False)
        )
        self.page: Page = self.browser.new_page()
        self.page.set_viewport_size({"width": 1280, "height": 1080})
        self.page_element_buffer: dict[int, ElementInViewPort]
        self.client: CDPSession

    def go_to_page(self, url: str) -> None:
        """Navigate to the given URL.

        Args:
            url: The URL to navigate to. If it does not contain a scheme, it will be
                prefixed with "http://".
        """
        self.page.goto(url=url if "://" in url else "http://" + url)
        self.client = self.page.context.new_cdp_session(self.page)
        self.page_element_buffer = {}

    def scroll(self, direction: str) -> None:
        """Scroll the page in the given direction.

        Args:
            direction: The direction to scroll in, either "up" or "down".
        """
        if direction == "up":
            self.page.evaluate(
                "(document.scrollingElement || document.body).scrollTop = "
                "(document.scrollingElement || document.body).scrollTop - "
                "window.innerHeight;"
            )
        elif direction == "down":
            self.page.evaluate(
                "(document.scrollingElement || document.body).scrollTop = "
                "(document.scrollingElement || document.body).scrollTop + "
                "window.innerHeight;"
            )

    def click(self, id_: str | int) -> None:
        """Click on an element with the given id.

        Args:
            id_: The id of the element to click on.
        """
        # Inject javascript into the page which removes the target= attribute from links
        js = """
		links = document.getElementsByTagName("a");
		for (var i = 0; i < links.length; i++) {
			links[i].removeAttribute("target");
		}
		"""
        self.page.evaluate(js)
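
The scheme handling in go_to_page() can be looked at in isolation. The sketch below copies the expression from the excerpt into a standalone helper; the name with_default_scheme is hypothetical and does not exist in crawler.py. It shows that only a missing scheme is filled in, so a URL that already names one (including file://) passes through unchanged.

```python
def with_default_scheme(url: str) -> str:
    """Mirror the expression used in go_to_page().

    Hypothetical helper for illustration only: it returns the URL unchanged
    when it contains "://", and prefixes "http://" otherwise.
    """
    return url if "://" in url else "http://" + url


if __name__ == "__main__":
    for candidate in ("example.com", "https://example.com", "file:///etc/hosts"):
        print(candidate, "->", with_default_scheme(candidate))
```

Because the check is a plain substring test, it does not validate the URL in any other way. Any restriction on which schemes or hosts are reachable has to be applied by the caller.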

Defined In

libs/langchain/langchain_classic/chains/natbot/crawler.py


Frequently Asked Questions

What is the Crawler class?

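Crawler is a Playwright-based web page crawler class in the langchain codebase, defined in libs/langchain/langchain_classic/chains/natbot/crawler.py. It launches a Chromium browser and exposes methods to navigate to a page, scroll, click, type, press enter, and crawl.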
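
The security note in the class docstring warns that the crawler can load arbitrary pages, including content from the local file system, and asks callers to control who can submit crawling requests. One possible way to act on that is to check the URL scheme before handing the URL to go_to_page(). This is a minimal sketch under stated assumptions: ALLOWED_SCHEMES and is_allowed_url are hypothetical names, not part of the Crawler class, and the allowlist itself is an example policy.

```python
from urllib.parse import urlsplit

# Hypothetical policy for illustration; Crawler itself enforces no allowlist.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_url(url: str) -> bool:
    """Return True if the URL would be fetched over an allowed scheme.

    The same default-scheme rule as go_to_page() is applied first, so a
    bare host such as "example.com" is treated as "http://example.com".
    """
    normalized = url if "://" in url else "http://" + url
    return urlsplit(normalized).scheme.lower() in ALLOWED_SCHEMES


if __name__ == "__main__":
    for candidate in ("example.com", "https://example.com/page", "file:///etc/passwd"):
        print(candidate, "->", is_allowed_url(candidate))
```

A scheme check of this kind only addresses local file access. It does not limit which hosts the browser can reach, so the network access restrictions mentioned in the docstring still need to be handled separately.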

Where is Crawler defined?

Crawler is defined in libs/langchain/langchain_classic/chains/natbot/crawler.py at line 46.
