_process_batched_chunked_embeddings() — langchain Function Reference

Architecture documentation for the _process_batched_chunked_embeddings() function in libs/partners/openai/langchain_openai/embeddings/base.py from the langchain codebase.

Dependency Diagram

graph TD
  process_batched_chunked_embeddings["_process_batched_chunked_embeddings()"]
  base_py["base.py"]
  process_batched_chunked_embeddings -->|defined in| base_py
  get_len_safe_embeddings["_get_len_safe_embeddings()"]
  get_len_safe_embeddings -->|calls| process_batched_chunked_embeddings
  aget_len_safe_embeddings["_aget_len_safe_embeddings()"]
  aget_len_safe_embeddings -->|calls| process_batched_chunked_embeddings
  style process_batched_chunked_embeddings fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/partners/openai/langchain_openai/embeddings/base.py lines 26–83

def _process_batched_chunked_embeddings(
    num_texts: int,
    tokens: list[list[int] | str],
    batched_embeddings: list[list[float]],
    indices: list[int],
    skip_empty: bool,
) -> list[list[float] | None]:
    # for each text, this is the list of embeddings (list of list of floats)
    # corresponding to the chunks of the text
    results: list[list[list[float]]] = [[] for _ in range(num_texts)]

    # for each text, this is the token length of each chunk
    # for transformers tokenization, this is the string length
    # for tiktoken, this is the number of tokens
    num_tokens_in_batch: list[list[int]] = [[] for _ in range(num_texts)]

    for i in range(len(indices)):
        if skip_empty and len(batched_embeddings[i]) == 1:
            continue
        results[indices[i]].append(batched_embeddings[i])
        num_tokens_in_batch[indices[i]].append(len(tokens[i]))

    # for each text, this is the final embedding
    embeddings: list[list[float] | None] = []
    for i in range(num_texts):
        # an embedding for each chunk
        _result: list[list[float]] = results[i]

        if len(_result) == 0:
            # this will be populated with the embedding of an empty string
            # in the sync or async code calling this
            embeddings.append(None)
            continue

        if len(_result) == 1:
            # if only one embedding was produced, use it
            embeddings.append(_result[0])
            continue

        # else we need to weighted average
        # should be same as
        # average = np.average(_result, axis=0, weights=num_tokens_in_batch[i])
        total_weight = sum(num_tokens_in_batch[i])
        average = [
            sum(
                val * weight
                for val, weight in zip(embedding, num_tokens_in_batch[i], strict=False)
            )
            / total_weight
            for embedding in zip(*_result, strict=False)
        ]

        # should be same as
        # embeddings.append((average / np.linalg.norm(average)).tolist())
        magnitude = sum(val**2 for val in average) ** 0.5
        embeddings.append([val / magnitude for val in average])

    return embeddings
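
To make the merging behavior concrete, here is a minimal usage sketch. The inputs (two texts, the second split into two token chunks of different lengths) are invented for illustration; only the function itself comes from the source above and is assumed to be in scope.

# Hypothetical inputs: text 0 produced one chunk, text 1 produced two.
tokens = [
    [1, 2, 3],     # chunk for text 0 (3 tokens)
    [4, 5, 6, 7],  # first chunk for text 1 (4 tokens)
    [8, 9],        # second chunk for text 1 (2 tokens)
]
batched_embeddings = [
    [1.0, 0.0],  # embedding of text 0's only chunk
    [1.0, 0.0],  # embedding of text 1's first chunk
    [0.0, 1.0],  # embedding of text 1's second chunk
]
indices = [0, 1, 1]  # chunk i belongs to text indices[i]

embeddings = _process_batched_chunked_embeddings(
    num_texts=2,
    tokens=tokens,
    batched_embeddings=batched_embeddings,
    indices=indices,
    skip_empty=False,
)

# Text 0 keeps its single chunk embedding unchanged.
assert embeddings[0] == [1.0, 0.0]

# Text 1 is the token-weighted average of its chunks,
# (4 * [1, 0] + 2 * [0, 1]) / 6 = [2/3, 1/3], then L2-normalized,
# so its components keep the 2:1 ratio.
x, y = embeddings[1]
assert abs(x / y - 2.0) < 1e-9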

Frequently Asked Questions

What does _process_batched_chunked_embeddings() do?
_process_batched_chunked_embeddings() merges per-chunk embeddings back into one embedding per input text. Each chunk embedding is grouped under the text it came from (via indices); a text with a single chunk keeps that embedding as-is, a text with several chunks gets the token-count-weighted average of its chunk embeddings, L2-normalized, and a text with no remaining chunks gets None, which the sync or async caller later replaces with the embedding of an empty string. It is defined in libs/partners/openai/langchain_openai/embeddings/base.py.
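
The inline comments in the source claim equivalence with NumPy's weighted average and vector norm. As a sanity check, here is a small comparison with invented inputs; NumPy is used only for verification and is not a dependency of the function itself.

import numpy as np

# Hypothetical chunk embeddings and token-count weights for one text.
_result = [[1.0, 0.0], [0.0, 1.0]]
weights = [4, 2]

# Pure-Python path, mirroring the function body.
total_weight = sum(weights)
average = [
    sum(val * w for val, w in zip(dim_vals, weights)) / total_weight
    for dim_vals in zip(*_result)
]
magnitude = sum(val**2 for val in average) ** 0.5
manual = [val / magnitude for val in average]

# NumPy path referenced in the source comments.
avg = np.average(_result, axis=0, weights=weights)
reference = (avg / np.linalg.norm(avg)).tolist()

assert np.allclose(manual, reference)
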
Where is _process_batched_chunked_embeddings() defined?
_process_batched_chunked_embeddings() is defined in libs/partners/openai/langchain_openai/embeddings/base.py at line 26.
What calls _process_batched_chunked_embeddings()?
_process_batched_chunked_embeddings() is called by two functions: _get_len_safe_embeddings() and _aget_len_safe_embeddings(), the synchronous and asynchronous length-safe embedding helpers.
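
Both callers follow the same pattern: chunk the input texts, embed the chunks in batches, then merge. The sketch below is a simplified, hypothetical reconstruction of that pattern; the helpers embed_batch and tokenize are placeholders, not the real langchain internals.

def _get_len_safe_embeddings_sketch(texts, embed_batch, tokenize):
    # Split each text into token chunks, remembering which text
    # each chunk came from.
    tokens = []
    indices = []
    for text_index, text in enumerate(texts):
        for chunk in tokenize(text):
            tokens.append(chunk)
            indices.append(text_index)

    # Embed every chunk, then merge back to one vector per text.
    batched_embeddings = embed_batch(tokens)
    embeddings = _process_batched_chunked_embeddings(
        num_texts=len(texts),
        tokens=tokens,
        batched_embeddings=batched_embeddings,
        indices=indices,
        skip_empty=False,
    )

    # Per the comment in the source, None entries (texts that produced
    # no chunks) are filled in by the caller with the embedding of an
    # empty string.
    empty = None
    for i, emb in enumerate(embeddings):
        if emb is None:
            if empty is None:
                empty = embed_batch([""])[0]
            embeddings[i] = empty
    return embeddings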
