spacy.py — langchain Source File

Architecture documentation for spacy.py, a Python file in the langchain codebase. 6 imports, 1 dependent.

File · Python · DataProcessing · TextSplitters · 6 imports · 1 dependent · 3 functions · 1 class

Dependency Diagram

graph LR
  6511588c_fdc6_97a2_3753_2c61ff504a39["spacy.py"]
  feec1ec4_6917_867b_d228_b134d0ff8099["typing"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> feec1ec4_6917_867b_d228_b134d0ff8099
  f85fae70_1011_eaec_151c_4083140ae9e5["typing_extensions"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> f85fae70_1011_eaec_151c_4083140ae9e5
  8dcf5d75_3e05_1e6b_3ce2_4d8907e376c0["langchain_text_splitters.base"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> 8dcf5d75_3e05_1e6b_3ce2_4d8907e376c0
  6511588c_fdc6_97a2_3753_2c61ff504a39["spacy.py"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> 6511588c_fdc6_97a2_3753_2c61ff504a39
  d9ede8c0_1451_dd31_bbb7_41c515ba4657["spacy.lang.en"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> d9ede8c0_1451_dd31_bbb7_41c515ba4657
  b0563be8_2a1a_a970_40e8_9f3b7eeeb8a8["spacy.language"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> b0563be8_2a1a_a970_40e8_9f3b7eeeb8a8
  style 6511588c_fdc6_97a2_3753_2c61ff504a39 fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

"""Spacy text splitter."""

from __future__ import annotations

from typing import TYPE_CHECKING, Any

from typing_extensions import override

from langchain_text_splitters.base import TextSplitter

try:
    # Type ignores needed as long as spacy doesn't support Python 3.14.
    import spacy  # type: ignore[import-not-found, unused-ignore]
    from spacy.lang.en import English  # type: ignore[import-not-found, unused-ignore]

    if TYPE_CHECKING:
        from spacy.language import (  # type: ignore[import-not-found, unused-ignore]
            Language,
        )

    _HAS_SPACY = True
except ImportError:
    _HAS_SPACY = False


class SpacyTextSplitter(TextSplitter):
    """Splitting text using Spacy package.

    Per default, Spacy's `en_core_web_sm` model is used and
    its default max_length is 1000000 (it is the length of maximum character
    this model takes which can be increased for large files). For a faster, but
    potentially less accurate splitting, you can use `pipeline='sentencizer'`.
    """

    def __init__(
        self,
        separator: str = "\n\n",
        pipeline: str = "en_core_web_sm",
        max_length: int = 1_000_000,
        *,
        strip_whitespace: bool = True,
        **kwargs: Any,
    ) -> None:
        """Initialize the spacy text splitter."""
        super().__init__(**kwargs)
        self._tokenizer = _make_spacy_pipeline_for_splitting(
            pipeline, max_length=max_length
        )
        self._separator = separator
        self._strip_whitespace = strip_whitespace

    @override
    def split_text(self, text: str) -> list[str]:
        splits = (
            s.text if self._strip_whitespace else s.text_with_ws
            for s in self._tokenizer(text).sents
        )
        return self._merge_splits(splits, self._separator)


def _make_spacy_pipeline_for_splitting(
    pipeline: str, *, max_length: int = 1_000_000
) -> Language:
    if not _HAS_SPACY:
        msg = "Spacy is not installed, please install it with `pip install spacy`."
        raise ImportError(msg)
    if pipeline == "sentencizer":
        sentencizer: Language = English()
        sentencizer.add_pipe("sentencizer")
    else:
        sentencizer = spacy.load(pipeline, exclude=["ner", "tagger"])
        sentencizer.max_length = max_length
    return sentencizer
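
The flow in split_text above (segment the text into sentences, then merge the sentences into chunks) can be illustrated without spaCy. This is a minimal sketch under stated assumptions: `naive_sentences` and `greedy_merge` are hypothetical helpers, the regex is a crude stand-in for spaCy's sentence segmentation, and the greedy packing is a simplification of `TextSplitter._merge_splits`, which also handles chunk overlap.

```python
import re


def naive_sentences(text: str) -> list[str]:
    """Stand-in for spaCy's `doc.sents`: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def greedy_merge(sentences: list[str], separator: str, chunk_size: int) -> list[str]:
    """Pack sentences into chunks of at most `chunk_size` characters (no overlap)."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        # Cost of appending this sentence, including the separator if needed.
        extra = len(sentence) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The chunk would overflow: emit it and start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "First sentence. Second one! Is this the third? Yes."
chunks = greedy_merge(naive_sentences(text), separator=" ", chunk_size=30)
print(chunks)
```

A sentence longer than `chunk_size` is emitted as its own oversized chunk in this sketch, since sentences are never split further.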

Dependencies

  • langchain_text_splitters.base
  • spacy.lang.en
  • spacy.language
  • spacy (the third-party spaCy package, reported by the analyzer as spacy.py)
  • typing
  • typing_extensions

Frequently Asked Questions

What does spacy.py do?
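spacy.py is a Python source file in the langchain codebase (DataProcessing domain, TextSplitters subdomain). It defines SpacyTextSplitter, a TextSplitter subclass that segments text into sentences with a spaCy pipeline and merges those sentences into chunks joined by a configurable separator.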
What functions are defined in spacy.py?
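spacy.py defines 3 function(s): the methods SpacyTextSplitter.__init__ and SpacyTextSplitter.split_text, and the module-level helper _make_spacy_pipeline_for_splitting. _HAS_SPACY is a module-level flag and spacy is an imported package; neither is a function.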
What does spacy.py depend on?
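spacy.py imports 6 module(s): langchain_text_splitters.base, spacy, spacy.lang.en, spacy.language, typing, typing_extensions. The analyzer lists the third-party spacy package as spacy.py. The spaCy imports sit inside a try/except ImportError block, so spaCy is an optional dependency, and spacy.language is imported only under TYPE_CHECKING.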
What files import spacy.py?
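The analyzer reports 1 dependent, spacy.py itself. This self-reference comes from the import spacy statement being resolved to this file rather than to the third-party spaCy package.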
Where is spacy.py in the architecture?
spacy.py is located at libs/text-splitters/langchain_text_splitters/spacy.py (domain: DataProcessing, subdomain: TextSplitters, directory: libs/text-splitters/langchain_text_splitters).
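
As a usage illustration for the class documented on this page, here is a minimal sketch. It assumes the `langchain-text-splitters` and `spacy` packages are installed and falls back to an empty result otherwise. `pipeline="sentencizer"` selects the rule-based branch of `_make_spacy_pipeline_for_splitting`, so no trained model such as `en_core_web_sm` has to be downloaded. `chunk_size` and `chunk_overlap` are forwarded to `TextSplitter` through `**kwargs`, and the sample text is invented for the example.

```python
text = (
    "spaCy segments text into sentences. "
    "The splitter then merges them into chunks. "
    "Each chunk stays close to the requested size."
)

try:
    from langchain_text_splitters import SpacyTextSplitter

    splitter = SpacyTextSplitter(
        separator=" ",           # string used to join sentences within a chunk
        pipeline="sentencizer",  # rule-based sentence boundaries, no model download
        chunk_size=80,           # forwarded to TextSplitter via **kwargs
        chunk_overlap=0,
    )
    chunks = splitter.split_text(text)
except ImportError as exc:
    # Raised when langchain-text-splitters or spaCy is not installed.
    print(f"Optional dependency missing: {exc}")
    chunks = []

for chunk in chunks:
    print(repr(chunk))
```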
