crawl() — langchain Function Reference
Architecture documentation for the crawl() function in crawler.py from the langchain codebase.
Entity Profile
Dependency Diagram
graph TD
  c0e7550e_b556_2c60_2efa_05b08fe640eb["crawl()"]
  73034b47_6ada_6cee_4b85_e74b6a3e14f1["Crawler"]
  c0e7550e_b556_2c60_2efa_05b08fe640eb -->|defined in| 73034b47_6ada_6cee_4b85_e74b6a3e14f1
  style c0e7550e_b556_2c60_2efa_05b08fe640eb fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
libs/langchain/langchain_classic/chains/natbot/crawler.py lines 150–479
def crawl(self) -> list[str]:
    """Crawl the current page.

    Returns:
        A list of the elements in the viewport.
    """
    page = self.page
    page_element_buffer = self.page_element_buffer
    start = time.time()

    page_state_as_text = []

    device_pixel_ratio: float = page.evaluate("window.devicePixelRatio")
    if platform == "darwin" and device_pixel_ratio == 1:  # lies
        device_pixel_ratio = 2

    win_upper_bound: float = page.evaluate("window.pageYOffset")
    win_left_bound: float = page.evaluate("window.pageXOffset")
    win_width: float = page.evaluate("window.screen.width")
    win_height: float = page.evaluate("window.screen.height")
    win_right_bound: float = win_left_bound + win_width
    win_lower_bound: float = win_upper_bound + win_height

    # percentage_progress_start = (win_upper_bound / document_scroll_height) * 100
    # percentage_progress_end = (
    #     (win_height + win_upper_bound) / document_scroll_height
    # ) * 100
    percentage_progress_start = 1
    percentage_progress_end = 2

    page_state_as_text.append(
        {
            "x": 0,
            "y": 0,
            "text": f"[scrollbar {percentage_progress_start:0.2f}-"
            f"{percentage_progress_end:0.2f}%]",
        }
    )

    tree = self.client.send(
        "DOMSnapshot.captureSnapshot",
        {"computedStyles": [], "includeDOMRects": True, "includePaintOrder": True},
    )
    strings: dict[int, str] = tree["strings"]
    document: dict[str, Any] = tree["documents"][0]
    nodes: dict[str, Any] = document["nodes"]
    backend_node_id: dict[int, int] = nodes["backendNodeId"]
    attributes: dict[int, dict[int, Any]] = nodes["attributes"]
    node_value: dict[int, int] = nodes["nodeValue"]
    parent: dict[int, int] = nodes["parentIndex"]
    node_names: dict[int, int] = nodes["nodeName"]
    is_clickable: set[int] = set(nodes["isClickable"]["index"])

    input_value: dict[str, Any] = nodes["inputValue"]
    input_value_index: list[int] = input_value["index"]
    input_value_values: list[int] = input_value["value"]

    layout: dict[str, Any] = document["layout"]
    layout_node_index: list[int] = layout["nodeIndex"]
    bounds: dict[int, list[float]] = layout["bounds"]

    cursor: int = 0

    child_nodes: dict[str, list[dict[str, Any]]] = {}
    elements_in_view_port: list[ElementInViewPort] = []

    anchor_ancestry: dict[str, tuple[bool, int | None]] = {"-1": (False, None)}
    button_ancestry: dict[str, tuple[bool, int | None]] = {"-1": (False, None)}

    def convert_name(
        node_name: str | None,
        has_click_handler: bool | None,  # noqa: FBT001
    ) -> str:
        if node_name == "a":
            return "link"
        if node_name == "input":
            return "input"
        if node_name == "img":
            return "img"
        if (
            node_name == "button" or has_click_handler
        ):
            return "button"
        return "text"

    # … excerpt ends here; crawl() continues through line 479.
Frequently Asked Questions
What does crawl() do?
crawl() captures a snapshot of the current page via the Chrome DevTools Protocol (DOMSnapshot.captureSnapshot) and returns a list of text descriptions of the elements visible in the viewport. It is a method of the Crawler class in libs/langchain/langchain_classic/chains/natbot/crawler.py.
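One detail worth making concrete: a DOMSnapshot payload stores node fields as integer indices into a shared `strings` table, which is why crawl() begins by pulling out `tree["strings"]` and the per-node arrays under `document["nodes"]`. A minimal sketch of that dereferencing step, using a hand-made stand-in payload (not real CDP output):

```python
# Hand-made stand-in for a DOMSnapshot.captureSnapshot response (assumption:
# real payloads carry many more fields, but the indexing scheme is the same).
snapshot = {
    "strings": ["BODY", "A", "IMG"],
    "documents": [
        {
            "nodes": {
                # Each entry is an index into snapshot["strings"].
                "nodeName": [0, 1, 2],
                "backendNodeId": [10, 11, 12],
            }
        }
    ],
}

strings = snapshot["strings"]
nodes = snapshot["documents"][0]["nodes"]

# Resolve the string-table indices back into lowercase tag names.
tag_names = [strings[i].lower() for i in nodes["nodeName"]]
print(tag_names)  # ['body', 'a', 'img']
```

This indirection keeps the snapshot compact: a tag name that appears thousands of times is stored once and referenced by index.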
Where is crawl() defined?
crawl() is defined in libs/langchain/langchain_classic/chains/natbot/crawler.py at line 150.
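In the quoted excerpt, crawl() derives `win_right_bound` and `win_lower_bound` from the scroll offsets and screen size so that only on-screen elements are kept. The exact containment predicate falls outside the excerpt, so the rectangle-overlap test below is an assumption: a sketch of how those four bounds could be used, with a hypothetical `in_viewport` helper.

```python
def in_viewport(
    elem_left: float,
    elem_top: float,
    width: float,
    height: float,
    win_left_bound: float,
    win_upper_bound: float,
    win_right_bound: float,
    win_lower_bound: float,
) -> bool:
    """Hypothetical overlap test: does the element's box intersect the window?

    The window rectangle mirrors the bounds crawl() computes:
    right = left + screen width, lower = upper + screen height.
    """
    return (
        elem_left < win_right_bound
        and elem_left + width > win_left_bound
        and elem_top < win_lower_bound
        and elem_top + height > win_upper_bound
    )


# A 1280x800 window scrolled to the top of the page:
print(in_viewport(10, 10, 50, 20, 0, 0, 1280, 800))   # True
print(in_viewport(10, 900, 50, 20, 0, 0, 1280, 800))  # False: below the fold
```

An overlap test (rather than full containment) keeps elements that are only partially scrolled into view, which matches the docstring's promise of "the elements in the viewport".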