_tokenize() — langchain Function Reference
Architecture documentation for the _tokenize() method of OpenAIEmbeddings, defined in libs/partners/openai/langchain_openai/embeddings/base.py in the langchain codebase.
Dependency Diagram
graph TD
    63131dd8_7b39_d355_1b39_29f63d60d98e["_tokenize()"]
    2f237d29_e276_c4ef_3a56_7139ce49b50e["OpenAIEmbeddings"]
    63131dd8_7b39_d355_1b39_29f63d60d98e -->|defined in| 2f237d29_e276_c4ef_3a56_7139ce49b50e
    bd7de307_7f7c_35fc_e574_e5dfd1b9a161["_get_len_safe_embeddings()"]
    bd7de307_7f7c_35fc_e574_e5dfd1b9a161 -->|calls| 63131dd8_7b39_d355_1b39_29f63d60d98e
    style 63131dd8_7b39_d355_1b39_29f63d60d98e fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
libs/partners/openai/langchain_openai/embeddings/base.py lines 429–525
def _tokenize(
    self, texts: list[str], chunk_size: int
) -> tuple[Iterable[int], list[list[int] | str], list[int], list[int]]:
    """Tokenize and batch input texts.

    Splits texts based on `embedding_ctx_length` and groups them into batches
    of size `chunk_size`.

    Args:
        texts: The list of texts to tokenize.
        chunk_size: The maximum number of texts to include in a single batch.

    Returns:
        A tuple containing:

        1. An iterable of starting indices in the token list for each batch.
        2. A list of tokenized texts (token arrays for tiktoken, strings for
           HuggingFace).
        3. An iterable mapping each token array to the index of the original
           text. Same length as the token list.
        4. A list of token counts for each tokenized text.
    """
    tokens: list[list[int] | str] = []
    indices: list[int] = []
    token_counts: list[int] = []
    model_name = self.tiktoken_model_name or self.model

    # If tiktoken flag set to False
    if not self.tiktoken_enabled:
        try:
            from transformers import AutoTokenizer
        except ImportError:
            msg = (
                "Could not import transformers python package. "
                "This is needed for OpenAIEmbeddings to work without "
                "`tiktoken`. Please install it with `pip install transformers`."
            )
            raise ValueError(msg)

        tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path=model_name
        )
        for i, text in enumerate(texts):
            # Tokenize the text using HuggingFace transformers
            tokenized: list[int] = tokenizer.encode(text, add_special_tokens=False)

            # Split tokens into chunks respecting the embedding_ctx_length
            for j in range(0, len(tokenized), self.embedding_ctx_length):
                token_chunk: list[int] = tokenized[
                    j : j + self.embedding_ctx_length
                ]

                # Convert token IDs back to a string
                chunk_text: str = tokenizer.decode(token_chunk)
                tokens.append(chunk_text)
                indices.append(i)
                token_counts.append(len(token_chunk))
    else:
        try:
            encoding = tiktoken.encoding_for_model(model_name)
        except KeyError:
            encoding = tiktoken.get_encoding("cl100k_base")
        encoder_kwargs: dict[str, Any] = {
            k: v
            for k, v in {
                "allowed_special": self.allowed_special,
                "disallowed_special": self.disallowed_special,
            }.items()
            if v is not None
        }
        for i, text in enumerate(texts):
            if self.model.endswith("001"):
                # See: https://github.com/openai/openai-python/
                # issues/418#issuecomment-1525939500
                # replace newlines, which can negatively affect performance.
                text = text.replace("\n", " ")
            if encoder_kwargs:
                token = encoding.encode(text, **encoder_kwargs)
            else:
                token = encoding.encode_ordinary(text)
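The excerpt ends mid-loop; the full function continues through line 525, where, per the docstring, the accumulated chunks are grouped into batches of at most chunk_size. As a standalone illustration of that contract, here is a minimal sketch of the tiktoken path. The function name tokenize_and_batch and the hard-coded cl100k_base encoding are assumptions for the example, not part of the langchain API.

from collections.abc import Iterable

import tiktoken


def tokenize_and_batch(
    texts: list[str], embedding_ctx_length: int, chunk_size: int
) -> tuple[Iterable[int], list[list[int]], list[int], list[int]]:
    """Hypothetical standalone sketch of _tokenize()'s return contract."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens: list[list[int]] = []
    indices: list[int] = []
    token_counts: list[int] = []
    for i, text in enumerate(texts):
        encoded = encoding.encode_ordinary(text)
        # Split each text into windows of at most embedding_ctx_length tokens.
        for j in range(0, len(encoded), embedding_ctx_length):
            chunk = encoded[j : j + embedding_ctx_length]
            tokens.append(chunk)
            indices.append(i)  # which input text this chunk came from
            token_counts.append(len(chunk))
    # Batches are expressed as starting offsets into `tokens`.
    return range(0, len(tokens), chunk_size), tokens, indices, token_counts


starts, tokens, indices, counts = tokenize_and_batch(["hello world"] * 3, 8191, 2)
print(list(starts))  # [0, 2]: two batches, tokens[0:2] and tokens[2:3]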
Frequently Asked Questions
What does _tokenize() do?
_tokenize() is a method of OpenAIEmbeddings, defined in libs/partners/openai/langchain_openai/embeddings/base.py. It tokenizes and batches input texts: each text is split into chunks of at most embedding_ctx_length tokens (using tiktoken by default, or a HuggingFace tokenizer when tiktoken_enabled is False), and the resulting chunks are grouped into batches of at most chunk_size per embedding request.
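For context, here is a minimal usage sketch (assuming the langchain-openai package is installed and OPENAI_API_KEY is set in the environment); _tokenize() itself is private and runs indirectly as part of the embedding flow:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Long inputs are transparently split by _tokenize() into chunks of at most
# embedding_ctx_length tokens before any API request is sent.
vectors = embeddings.embed_documents(["short text", "a very long document ..."])
print(len(vectors))  # one vector per input text, even if a text was chunked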
Where is _tokenize() defined?
_tokenize() is defined in libs/partners/openai/langchain_openai/embeddings/base.py at line 429.
What calls _tokenize()?
_tokenize() is called by one function: _get_len_safe_embeddings(), which embeds each batch and recombines the chunk embeddings into one vector per input text.
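The indices and token counts returned by _tokenize() are what make that recombination possible. The helper below is an illustrative reconstruction of the recombination step, not the actual langchain implementation (which differs in details, such as normalizing the averaged vectors); the function name combine_chunk_embeddings is hypothetical.

def combine_chunk_embeddings(
    chunk_embeddings: list[list[float]],
    indices: list[int],
    token_counts: list[int],
    num_texts: int,
) -> list[list[float]]:
    """Average chunk embeddings into one vector per original text, weighting
    each chunk by its token count. Assumes every text produced >= 1 chunk."""
    dims = len(chunk_embeddings[0])
    sums = [[0.0] * dims for _ in range(num_texts)]
    weights = [0] * num_texts
    for emb, idx, count in zip(chunk_embeddings, indices, token_counts):
        for d in range(dims):
            sums[idx][d] += emb[d] * count
        weights[idx] += count
    return [[v / weights[i] for v in row] for i, row in enumerate(sums)]


# Two chunks from text 0 (3 and 1 tokens) and one chunk from text 1:
combined = combine_chunk_embeddings(
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    indices=[0, 0, 1],
    token_counts=[3, 1, 2],
    num_texts=2,
)
print(combined)  # [[0.75, 0.25], [0.5, 0.5]]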