character.py — langchain Source File

Architecture documentation for character.py, a Python file in the langchain codebase. 3 imports, 0 dependents.

Type: file. Language: Python. Domain: DocumentProcessing. Subdomain: TextSplitters. 3 imports, 1 function, 2 classes.

Dependency Diagram

graph LR
  2928a4a1_9408_cbea_fa7c_7f66eab697a2["character.py"]
  67ec3255_645e_8b6e_1eff_1eb3c648ed95["re"]
  2928a4a1_9408_cbea_fa7c_7f66eab697a2 --> 67ec3255_645e_8b6e_1eff_1eb3c648ed95
  8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3["typing"]
  2928a4a1_9408_cbea_fa7c_7f66eab697a2 --> 8e2034b7_ceb8_963f_29fc_2ea6b50ef9b3
  885a8262_5dd0_fc53_460c_b7a8de727b5e["langchain_text_splitters.base"]
  2928a4a1_9408_cbea_fa7c_7f66eab697a2 --> 885a8262_5dd0_fc53_460c_b7a8de727b5e
  style 2928a4a1_9408_cbea_fa7c_7f66eab697a2 fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

"""Character text splitters."""

from __future__ import annotations

import re
from typing import Any, Literal

from langchain_text_splitters.base import Language, TextSplitter


class CharacterTextSplitter(TextSplitter):
    """Splitting text that looks at characters."""

    def __init__(
        self,
        separator: str = "\n\n",
        is_separator_regex: bool = False,  # noqa: FBT001,FBT002
        **kwargs: Any,
    ) -> None:
        """Create a new TextSplitter."""
        super().__init__(**kwargs)
        self._separator = separator
        self._is_separator_regex = is_separator_regex

    def split_text(self, text: str) -> list[str]:
        """Split into chunks without re-inserting lookaround separators.

        Args:
            text: The text to split.

        Returns:
            A list of text chunks.
        """
        # 1. Determine split pattern: raw regex or escaped literal
        sep_pattern = (
            self._separator if self._is_separator_regex else re.escape(self._separator)
        )

        # 2. Initial split (keep separator if requested)
        splits = _split_text_with_regex(
            text, sep_pattern, keep_separator=self._keep_separator
        )

        # 3. Detect zero-width lookaround so we never re-insert it
        lookaround_prefixes = ("(?=", "(?<!", "(?<=", "(?!")
        is_lookaround = self._is_separator_regex and any(
            self._separator.startswith(p) for p in lookaround_prefixes
        )

        # 4. Decide merge separator:
        #    - if keep_separator or lookaround -> don't re-insert
        #    - else -> re-insert literal separator
        merge_sep = ""
        if not (self._keep_separator or is_lookaround):
            merge_sep = self._separator

        # 5. Merge adjacent splits and return
        return self._merge_splits(splits, merge_sep)


// ... (744 more lines)
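
Steps 1, 3, and 4 of `split_text` above can be tried in isolation. The following is a minimal sketch that only mirrors the logic visible in the listing; the function name `choose_patterns` and the constant `LOOKAROUND_PREFIXES` are hypothetical and are not part of the library.

```python
import re

# Same four prefixes that split_text checks for in step 3.
LOOKAROUND_PREFIXES = ("(?=", "(?<!", "(?<=", "(?!")


def choose_patterns(
    separator: str, is_separator_regex: bool, keep_separator: bool
) -> tuple[str, str]:
    """Return (split_pattern, merge_separator), mirroring steps 1, 3 and 4.

    Hypothetical helper for illustration only.
    """
    # Step 1: use the raw regex, or escape the literal separator.
    split_pattern = separator if is_separator_regex else re.escape(separator)

    # Step 3: a zero-width lookaround consumes no text, so there is
    # nothing to re-insert when chunks are merged.
    is_lookaround = is_separator_regex and separator.startswith(LOOKAROUND_PREFIXES)

    # Step 4: re-insert the separator only if it was dropped by the split.
    merge_sep = "" if (keep_separator or is_lookaround) else separator
    return split_pattern, merge_sep


# A literal "." is escaped and re-inserted on merge.
literal_pattern, literal_merge = choose_patterns(".", False, False)

# A lookahead splits before each capital letter and is never re-inserted.
look_pattern, look_merge = choose_patterns(r"(?=[A-Z])", True, False)
camel_parts = re.split(look_pattern, "fooBarBaz")
```

Note that the lookaround check only inspects the start of the separator, so a pattern that begins with ordinary characters and ends in a lookaround is treated as a normal separator.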

Dependencies

  • langchain_text_splitters.base
  • re
  • typing

Frequently Asked Questions

What does character.py do?
character.py is a source file in the langchain codebase, written in Python. It belongs to the DocumentProcessing domain and the TextSplitters subdomain.
What functions are defined in character.py?
character.py defines one module-level function: _split_text_with_regex.
What does character.py depend on?
character.py imports three modules: langchain_text_splitters.base, re, and typing.
Where is character.py in the architecture?
character.py is located at libs/text-splitters/langchain_text_splitters/character.py (domain: DocumentProcessing, subdomain: TextSplitters, directory: libs/text-splitters/langchain_text_splitters).
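
The body of `_split_text_with_regex` is not included in the excerpt above. As a rough illustration of what a helper in that role might do, here is a minimal sketch that splits on a regex and, when asked, keeps each separator attached to the chunk that follows it. The name `split_with_regex_sketch` and its exact behavior are assumptions for illustration, not the library's implementation.

```python
import re


def split_with_regex_sketch(
    text: str, pattern: str, *, keep_separator: bool
) -> list[str]:
    """Split text on a regex pattern, dropping empty pieces.

    Hypothetical stand-in; assumes `pattern` has no capturing groups.
    """
    if not pattern:
        # No separator: fall back to one piece per character.
        pieces = list(text)
    elif keep_separator:
        # Wrapping the pattern in a group makes re.split return the
        # separators too: [text, sep, text, sep, ..., text].
        parts = re.split(f"({pattern})", text)
        pieces = [parts[0]]
        for i in range(1, len(parts) - 1, 2):
            # Attach each separator to the text that follows it.
            pieces.append(parts[i] + parts[i + 1])
    else:
        pieces = re.split(pattern, text)
    return [p for p in pieces if p]


sample = "a\n\nb\n\nc"
dropped = split_with_regex_sketch(sample, re.escape("\n\n"), keep_separator=False)
kept = split_with_regex_sketch(sample, re.escape("\n\n"), keep_separator=True)
```

With `keep_separator=True` the pieces concatenate back to the original text, which is consistent with `split_text` using an empty merge separator in that case.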
