_get_document_with_hash() — langchain Function Reference

Architecture documentation for the _get_document_with_hash() function in api.py from the langchain codebase.

Dependency Diagram

graph TD
  adeaf2c1_ef58_0e0c_bf53_4534663c6164["_get_document_with_hash()"]
  203188c0_72d6_6932_bc21_edf25c4c00ef["api.py"]
  adeaf2c1_ef58_0e0c_bf53_4534663c6164 -->|defined in| 203188c0_72d6_6932_bc21_edf25c4c00ef
  5721a97d_0581_0694_e3e6_0ae44f2b3fb0["index()"]
  5721a97d_0581_0694_e3e6_0ae44f2b3fb0 -->|calls| adeaf2c1_ef58_0e0c_bf53_4534663c6164
  02b67c59_d093_f33d_633c_d77332eb191e["aindex()"]
  02b67c59_d093_f33d_633c_d77332eb191e -->|calls| adeaf2c1_ef58_0e0c_bf53_4534663c6164
  620ce5e7_2594_a746_99a6_c56af4fd553a["_calculate_hash()"]
  adeaf2c1_ef58_0e0c_bf53_4534663c6164 -->|calls| 620ce5e7_2594_a746_99a6_c56af4fd553a
  style adeaf2c1_ef58_0e0c_bf53_4534663c6164 fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/core/langchain_core/indexing/api.py lines 169–224

def _get_document_with_hash(
    document: Document,
    *,
    key_encoder: Callable[[Document], str]
    | Literal["sha1", "sha256", "sha512", "blake2b"],
) -> Document:
    """Calculate a hash of the document, and assign it to the uid.

    When using one of the predefined hashing algorithms, the hash is calculated
    by hashing the content and the metadata of the document.

    Args:
        document: Document to hash.
        key_encoder: Hashing algorithm to use for hashing the document.
            If not provided, a default encoder using SHA-1 will be used.
            SHA-1 is not collision-resistant, and a motivated attacker
            could craft two different texts that hash to the
            same cache key.

            New applications should use one of the alternative encoders
            or provide a custom and strong key encoder function to avoid this risk.

            When changing the key encoder, you must change the
            index as well to avoid duplicated documents in the cache.

    Raises:
        ValueError: If the metadata cannot be serialized using json.

    Returns:
        Document with a unique identifier based on the hash of the content and metadata.
    """
    metadata: dict[str, Any] = dict(document.metadata or {})

    if callable(key_encoder):
        # If key_encoder is a callable, we use it to generate the hash.
        hash_ = key_encoder(document)
    else:
        # The hashes are calculated separate for the content and the metadata.
        content_hash = _calculate_hash(document.page_content, algorithm=key_encoder)
        try:
            serialized_meta = json.dumps(metadata, sort_keys=True)
        except Exception as e:
            msg = (
                f"Failed to hash metadata: {e}. "
                f"Please use a dict that can be serialized using json."
            )
            raise ValueError(msg) from e
        metadata_hash = _calculate_hash(serialized_meta, algorithm=key_encoder)
        hash_ = _calculate_hash(content_hash + metadata_hash, algorithm=key_encoder)

    return Document(
        # Assign a unique identifier based on the hash.
        id=hash_,
        page_content=document.page_content,
        metadata=document.metadata,
    )
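
The two branches of the function can be tried out without installing langchain. The following is a minimal sketch, not the library's implementation: `Doc` is a hypothetical stand-in for `Document`, `calculate_hash` is a hypothetical stand-in for `_calculate_hash()` (whose body is not shown on this page; a plain `hashlib` hex digest is assumed here), and `with_hash` mirrors the control flow of the listing above.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, Union


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


def calculate_hash(text: str, algorithm: str) -> str:
    """Assumed behavior: hex digest of the UTF-8 encoded text."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


def with_hash(doc: Doc, *, key_encoder: Union[Callable[[Doc], str], str]) -> Doc:
    if callable(key_encoder):
        # Custom encoder: its return value becomes the id as-is.
        hash_ = key_encoder(doc)
    else:
        # Predefined algorithm: hash content and metadata separately,
        # then hash the concatenation of the two digests.
        content_hash = calculate_hash(doc.page_content, key_encoder)
        try:
            # sort_keys=True makes the result independent of key order.
            serialized_meta = json.dumps(doc.metadata or {}, sort_keys=True)
        except Exception as e:
            raise ValueError(f"Failed to hash metadata: {e}") from e
        metadata_hash = calculate_hash(serialized_meta, key_encoder)
        hash_ = calculate_hash(content_hash + metadata_hash, key_encoder)
    return Doc(id=hash_, page_content=doc.page_content, metadata=doc.metadata)


a = with_hash(Doc("hello", {"source": "a.txt", "page": 1}), key_encoder="sha256")
b = with_hash(Doc("hello", {"page": 1, "source": "a.txt"}), key_encoder="sha256")
c = with_hash(Doc("hello", {"source": "b.txt", "page": 1}), key_encoder="sha256")
d = with_hash(Doc("hello", {"source": "a.txt"}), key_encoder=lambda doc: doc.metadata["source"])

print(a.id == b.id)  # True: same content and metadata, different key order
print(a.id == c.id)  # False: metadata differs
print(d.id)          # a.txt: the callable's return value is used directly
```

Because the metadata is part of the hash, two documents with identical text but different metadata get different ids. A callable encoder bypasses that scheme entirely, so the uniqueness of the resulting ids is the caller's responsibility.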

Frequently Asked Questions

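What does _get_document_with_hash() do?
_get_document_with_hash() returns a new Document whose id is a hash of the input document. With a predefined algorithm ("sha1", "sha256", "sha512", or "blake2b"), the page content and the JSON-serialized metadata are hashed separately, and the two digests are then hashed together. With a callable key_encoder, the callable's return value is used as the id. It raises ValueError if the metadata cannot be serialized as JSON.

Where is _get_document_with_hash() defined?
_get_document_with_hash() is defined in libs/core/langchain_core/indexing/api.py at line 169.

What does _get_document_with_hash() call?
_get_document_with_hash() calls one function: _calculate_hash().

What calls _get_document_with_hash()?
_get_document_with_hash() is called by two functions: index() and aindex().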
