_normalize_and_clean_text() — langchain Function Reference

Architecture documentation for _normalize_and_clean_text(), a method of HTMLSemanticPreservingSplitter defined in html.py in the langchain codebase.

Function | Python | DataProcessing | TextSplitters | Called by 1

Dependency Diagram

graph TD
  06dfaecf_69d8_2d95_85d1_3aef3a299e2e["_normalize_and_clean_text()"]
  c05f2267_6bc4_946d_a7f8_3d7745082745["HTMLSemanticPreservingSplitter"]
  06dfaecf_69d8_2d95_85d1_3aef3a299e2e -->|defined in| c05f2267_6bc4_946d_a7f8_3d7745082745
  bccaa6fd_208f_1a39_f885_e3cca863a319["_process_html()"]
  bccaa6fd_208f_1a39_f885_e3cca863a319 -->|calls| 06dfaecf_69d8_2d95_85d1_3aef3a299e2e
  style 06dfaecf_69d8_2d95_85d1_3aef3a299e2e fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/text-splitters/langchain_text_splitters/html.py lines 825–844

    def _normalize_and_clean_text(self, text: str) -> str:
        """Normalizes the text by removing extra spaces and newlines.

        Args:
            text: The text to be normalized.

        Returns:
            The normalized text.
        """
        if self._normalize_text:
            text = text.lower()
            text = re.sub(r"[^\w\s]", "", text)
            text = re.sub(r"\s+", " ", text).strip()

        if self._stopword_removal:
            text = " ".join(
                [word for word in text.split() if word not in self._stopwords]
            )

        return text
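
The method does more than its docstring summary suggests: with `_normalize_text` set it lowercases the text, strips every character that is neither a word character nor whitespace, and collapses whitespace runs; with `_stopword_removal` set it also drops stopwords. A minimal standalone sketch of that behavior follows. The free function `normalize_and_clean_text` and its keyword parameters are hypothetical stand-ins for the method and the instance attributes it reads, not part of the langchain API.

```python
import re


def normalize_and_clean_text(
    text: str,
    *,
    normalize_text: bool,
    stopword_removal: bool,
    stopwords: set,
) -> str:
    """Hypothetical standalone mirror of the method body shown above."""
    if normalize_text:
        text = text.lower()
        # Remove anything that is not a word character or whitespace.
        # \w keeps letters, digits, and underscores.
        text = re.sub(r"[^\w\s]", "", text)
        # Collapse runs of whitespace (including newlines) to one space.
        text = re.sub(r"\s+", " ", text).strip()

    if stopword_removal:
        # Exact, case-sensitive membership test against the stopword set.
        text = " ".join(w for w in text.split() if w not in stopwords)

    return text


sample = "The  Quick, brown FOX!\n doesn't jump."

both = normalize_and_clean_text(
    sample, normalize_text=True, stopword_removal=True, stopwords={"the"}
)
print(both)  # quick brown fox doesnt jump

only_stopwords = normalize_and_clean_text(
    sample, normalize_text=False, stopword_removal=True, stopwords={"the"}
)
print(only_stopwords)  # The Quick, brown FOX! doesn't jump.
```

Two details are visible in the second call. Stopword matching is case-sensitive, so "The" survives when lowercasing is skipped and the stopword list is lowercase. The `split()`/`join()` round trip still collapses whitespace even though normalization is off.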