_tokenize() — langchain Function Reference
Architecture documentation for the _tokenize() method of OpenAIEmbeddings, defined in libs/partners/openai/langchain_openai/embeddings/base.py in the langchain codebase.
Dependency Diagram
graph TD
    63131dd8_7b39_d355_1b39_29f63d60d98e["_tokenize()"]
    2f237d29_e276_c4ef_3a56_7139ce49b50e["OpenAIEmbeddings"]
    63131dd8_7b39_d355_1b39_29f63d60d98e -->|defined in| 2f237d29_e276_c4ef_3a56_7139ce49b50e
    bd7de307_7f7c_35fc_e574_e5dfd1b9a161["_get_len_safe_embeddings()"]
    bd7de307_7f7c_35fc_e574_e5dfd1b9a161 -->|calls| 63131dd8_7b39_d355_1b39_29f63d60d98e
    style 63131dd8_7b39_d355_1b39_29f63d60d98e fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
libs/partners/openai/langchain_openai/embeddings/base.py lines 429–525
def _tokenize(
    self, texts: list[str], chunk_size: int
) -> tuple[Iterable[int], list[list[int] | str], list[int], list[int]]:
    """Tokenize and batch input texts.

    Splits texts based on `embedding_ctx_length` and groups them into batches
    of size `chunk_size`.

    Args:
        texts: The list of texts to tokenize.
        chunk_size: The maximum number of texts to include in a single batch.

    Returns:
        A tuple containing:

        1. An iterable of starting indices in the token list for each batch.
        2. A list of tokenized texts (token arrays for tiktoken, strings for
           HuggingFace).
        3. An iterable mapping each token array to the index of the original
           text. Same length as the token list.
        4. A list of token counts for each tokenized text.
    """
    tokens: list[list[int] | str] = []
    indices: list[int] = []
    token_counts: list[int] = []
    model_name = self.tiktoken_model_name or self.model

    # If tiktoken flag set to False
    if not self.tiktoken_enabled:
        try:
            from transformers import AutoTokenizer
        except ImportError:
            msg = (
                "Could not import transformers python package. "
                "This is needed for OpenAIEmbeddings to work without "
                "`tiktoken`. Please install it with `pip install transformers`."
            )
            raise ValueError(msg)

        tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path=model_name
        )
        for i, text in enumerate(texts):
            # Tokenize the text using HuggingFace transformers
            tokenized: list[int] = tokenizer.encode(text, add_special_tokens=False)

            # Split tokens into chunks respecting the embedding_ctx_length
            for j in range(0, len(tokenized), self.embedding_ctx_length):
                token_chunk: list[int] = tokenized[
                    j : j + self.embedding_ctx_length
                ]

                # Convert token IDs back to a string
                chunk_text: str = tokenizer.decode(token_chunk)
                tokens.append(chunk_text)
                indices.append(i)
                token_counts.append(len(token_chunk))
    else:
        try:
            encoding = tiktoken.encoding_for_model(model_name)
        except KeyError:
            encoding = tiktoken.get_encoding("cl100k_base")
        encoder_kwargs: dict[str, Any] = {
            k: v
            for k, v in {
                "allowed_special": self.allowed_special,
                "disallowed_special": self.disallowed_special,
            }.items()
            if v is not None
        }
        for i, text in enumerate(texts):
            if self.model.endswith("001"):
                # See: https://github.com/openai/openai-python/
                # issues/418#issuecomment-1525939500
                # replace newlines, which can negatively affect performance.
                text = text.replace("\n", " ")
            if encoder_kwargs:
                token = encoding.encode(text, **encoder_kwargs)
            else:
                token = encoding.encode_ordinary(text)
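The excerpt ends mid-loop; the full function continues through line 525, where, per the docstring, the accumulated chunks are grouped into batches of at most chunk_size. As a standalone illustration of that contract, here is a minimal sketch of the tiktoken path. The function name tokenize_and_batch and the hard-coded cl100k_base encoding are assumptions for the example, not part of the langchain API.

from collections.abc import Iterable

import tiktoken


def tokenize_and_batch(
    texts: list[str], embedding_ctx_length: int, chunk_size: int
) -> tuple[Iterable[int], list[list[int]], list[int], list[int]]:
    """Hypothetical standalone sketch of _tokenize()'s return contract."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens: list[list[int]] = []
    indices: list[int] = []
    token_counts: list[int] = []
    for i, text in enumerate(texts):
        encoded = encoding.encode_ordinary(text)
        # Split each text into windows of at most embedding_ctx_length tokens.
        for j in range(0, len(encoded), embedding_ctx_length):
            chunk = encoded[j : j + embedding_ctx_length]
            tokens.append(chunk)
            indices.append(i)  # which input text this chunk came from
            token_counts.append(len(chunk))
    # Batches are expressed as starting offsets into `tokens`.
    return range(0, len(tokens), chunk_size), tokens, indices, token_counts


starts, tokens, indices, counts = tokenize_and_batch(["hello world"] * 3, 8191, 2)
print(list(starts))  # [0, 2]: two batches, tokens[0:2] and tokens[2:3]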
Frequently Asked Questions
What does _tokenize() do?
_tokenize() is a method of OpenAIEmbeddings, defined in libs/partners/openai/langchain_openai/embeddings/base.py. It tokenizes and batches input texts: each text is split into chunks of at most embedding_ctx_length tokens (using tiktoken by default, or a HuggingFace tokenizer when tiktoken_enabled is False), and the resulting chunks are grouped into batches of at most chunk_size per embedding request.
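For context, here is a minimal usage sketch (assuming the langchain-openai package is installed and OPENAI_API_KEY is set in the environment); _tokenize() itself is private and runs indirectly as part of the embedding flow:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Long inputs are transparently split by _tokenize() into chunks of at most
# embedding_ctx_length tokens before any API request is sent.
vectors = embeddings.embed_documents(["short text", "a very long document ..."])
print(len(vectors))  # one vector per input text, even if a text was chunked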
Where is _tokenize() defined?
_tokenize() is defined in libs/partners/openai/langchain_openai/embeddings/base.py at line 429.
What calls _tokenize()?
_tokenize() is called by one function: _get_len_safe_embeddings(), which embeds each batch and recombines the chunk embeddings into one vector per input text.
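The indices and token counts returned by _tokenize() are what make that recombination possible. The helper below is an illustrative reconstruction of the recombination step, not the actual langchain implementation (which differs in details, such as normalizing the averaged vectors); the function name combine_chunk_embeddings is hypothetical.

def combine_chunk_embeddings(
    chunk_embeddings: list[list[float]],
    indices: list[int],
    token_counts: list[int],
    num_texts: int,
) -> list[list[float]]:
    """Average chunk embeddings into one vector per original text, weighting
    each chunk by its token count. Assumes every text produced >= 1 chunk."""
    dims = len(chunk_embeddings[0])
    sums = [[0.0] * dims for _ in range(num_texts)]
    weights = [0] * num_texts
    for emb, idx, count in zip(chunk_embeddings, indices, token_counts):
        for d in range(dims):
            sums[idx][d] += emb[d] * count
        weights[idx] += count
    return [[v / weights[i] for v in row] for i, row in enumerate(sums)]


# Two chunks from text 0 (3 and 1 tokens) and one chunk from text 1:
combined = combine_chunk_embeddings(
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    indices=[0, 0, 1],
    token_counts=[3, 1, 2],
    num_texts=2,
)
print(combined)  # [[0.75, 0.25], [0.5, 0.5]]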