spacy.py — langchain Source File
Architecture documentation for spacy.py, a Python file in the langchain codebase. 6 imports, 1 dependent.
Dependency Diagram
graph LR
  6511588c_fdc6_97a2_3753_2c61ff504a39["spacy.py"]
  feec1ec4_6917_867b_d228_b134d0ff8099["typing"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> feec1ec4_6917_867b_d228_b134d0ff8099
  f85fae70_1011_eaec_151c_4083140ae9e5["typing_extensions"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> f85fae70_1011_eaec_151c_4083140ae9e5
  8dcf5d75_3e05_1e6b_3ce2_4d8907e376c0["langchain_text_splitters.base"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> 8dcf5d75_3e05_1e6b_3ce2_4d8907e376c0
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> 6511588c_fdc6_97a2_3753_2c61ff504a39
  d9ede8c0_1451_dd31_bbb7_41c515ba4657["spacy.lang.en"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> d9ede8c0_1451_dd31_bbb7_41c515ba4657
  b0563be8_2a1a_a970_40e8_9f3b7eeeb8a8["spacy.language"]
  6511588c_fdc6_97a2_3753_2c61ff504a39 --> b0563be8_2a1a_a970_40e8_9f3b7eeeb8a8
  style 6511588c_fdc6_97a2_3753_2c61ff504a39 fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
"""Spacy text splitter."""

from __future__ import annotations

from typing import TYPE_CHECKING, Any

from typing_extensions import override

from langchain_text_splitters.base import TextSplitter

try:
    # Type ignores needed as long as spacy doesn't support Python 3.14.
    import spacy  # type: ignore[import-not-found, unused-ignore]
    from spacy.lang.en import English  # type: ignore[import-not-found, unused-ignore]

    if TYPE_CHECKING:
        from spacy.language import (  # type: ignore[import-not-found, unused-ignore]
            Language,
        )

    _HAS_SPACY = True
except ImportError:
    _HAS_SPACY = False


class SpacyTextSplitter(TextSplitter):
    """Splitting text using Spacy package.

    Per default, Spacy's `en_core_web_sm` model is used and
    its default max_length is 1000000 (it is the length of maximum character
    this model takes which can be increased for large files). For a faster, but
    potentially less accurate splitting, you can use `pipeline='sentencizer'`.
    """

    def __init__(
        self,
        separator: str = "\n\n",
        pipeline: str = "en_core_web_sm",
        max_length: int = 1_000_000,
        *,
        strip_whitespace: bool = True,
        **kwargs: Any,
    ) -> None:
        """Initialize the spacy text splitter."""
        super().__init__(**kwargs)
        self._tokenizer = _make_spacy_pipeline_for_splitting(
            pipeline, max_length=max_length
        )
        self._separator = separator
        self._strip_whitespace = strip_whitespace

    @override
    def split_text(self, text: str) -> list[str]:
        splits = (
            s.text if self._strip_whitespace else s.text_with_ws
            for s in self._tokenizer(text).sents
        )
        return self._merge_splits(splits, self._separator)


def _make_spacy_pipeline_for_splitting(
    pipeline: str, *, max_length: int = 1_000_000
) -> Language:
    if not _HAS_SPACY:
        msg = "Spacy is not installed, please install it with `pip install spacy`."
        raise ImportError(msg)
    if pipeline == "sentencizer":
        sentencizer: Language = English()
        sentencizer.add_pipe("sentencizer")
    else:
        sentencizer = spacy.load(pipeline, exclude=["ner", "tagger"])
    sentencizer.max_length = max_length
    return sentencizer

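A minimal usage sketch of the class above, assuming both langchain_text_splitters and spaCy are installed. It uses pipeline="sentencizer" so that no model download is needed; the helper name split_with_sentencizer and the sample text are hypothetical, and chunk_size and chunk_overlap are keyword arguments forwarded to the TextSplitter base class. If either package is missing, the helper returns None instead of raising.

```python
from typing import List, Optional


def split_with_sentencizer(text: str, chunk_size: int = 200) -> Optional[List[str]]:
    """Split text with SpacyTextSplitter; return None if a package is missing."""
    try:
        from langchain_text_splitters import SpacyTextSplitter

        splitter = SpacyTextSplitter(
            separator=" ",
            pipeline="sentencizer",  # rule-based sentence segmentation
            # Raise the limit when the input is longer than the default.
            max_length=max(1_000_000, len(text) + 1),
            chunk_size=chunk_size,  # forwarded to TextSplitter
            chunk_overlap=0,  # forwarded to TextSplitter
        )
    except ImportError:
        # Raised when langchain_text_splitters or spaCy is not installed.
        return None
    return splitter.split_text(text)


chunks = split_with_sentencizer("First sentence. Second sentence. Third one.")
print(chunks)
```

Passing a model name such as "en_core_web_sm" instead of "sentencizer" takes the spacy.load branch, which requires that model to be downloaded separately.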
Domain
- DataProcessing
Subdomains
- TextSplitters
Classes
- SpacyTextSplitter
Dependencies
- langchain_text_splitters.base
- spacy.lang.en
- spacy.language
- spacy
- typing
- typing_extensions
Frequently Asked Questions
What does spacy.py do?
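spacy.py is a Python source file in the langchain codebase, in the DataProcessing domain and TextSplitters subdomain. It defines SpacyTextSplitter, a TextSplitter subclass that uses a spaCy pipeline to segment text into sentences and then merges those sentences into chunks joined by a configurable separator (default "\n\n").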
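The two-step flow in split_text (segment into sentences, then merge them into chunks) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: naive_sentences is a hypothetical regex stand-in for spaCy's sentence segmentation, and merge_sentences is a simplified greedy merge with no overlap handling, unlike the base class's _merge_splits, whose code is not shown in this file.

```python
import re
from typing import Iterable, List


def naive_sentences(text: str) -> List[str]:
    """Hypothetical stand-in for spaCy: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_sentences(
    sentences: Iterable[str], separator: str = "\n\n", chunk_size: int = 40
) -> List[str]:
    """Greedily pack sentences into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        # Cost of adding this sentence, including a separator if needed.
        extra = len(sentence) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The chunk is full: emit it and start a new one.
            chunks.append(separator.join(current))
            current = []
            current_len = 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


sentences = naive_sentences("One. Two. Three.")
print(merge_sentences(sentences, separator=" ", chunk_size=9))
# prints ['One. Two.', 'Three.']
```

A sentence longer than chunk_size is kept whole in its own chunk in this sketch; it is never cut mid-sentence.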
What functions are defined in spacy.py?
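spacy.py defines one module-level function, _make_spacy_pipeline_for_splitting, and one class, SpacyTextSplitter, with the methods __init__ and split_text. _HAS_SPACY is a module-level flag recording whether spaCy could be imported, and spacy is the imported third-party package; neither is a function.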
What does spacy.py depend on?
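spacy.py imports 6 module(s): langchain_text_splitters.base, spacy, spacy.lang.en, spacy.language, typing, typing_extensions. The three spaCy imports are optional: they sit inside a try/except ImportError block, and spacy.language is imported only under TYPE_CHECKING.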
What files import spacy.py?
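The analysis lists 1 dependent: spacy.py itself. This self-reference appears to come from the file's `import spacy` statement, which targets the third-party spaCy package rather than this file.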
Where is spacy.py in the architecture?
spacy.py is located at libs/text-splitters/langchain_text_splitters/spacy.py (domain: DataProcessing, subdomain: TextSplitters, directory: libs/text-splitters/langchain_text_splitters).