split_text_on_tokens() — langchain Function Reference

Architecture documentation for the split_text_on_tokens() function in base.py from the langchain codebase.

Dependency Diagram

graph TD
  0f51bcb8_84bd_5648_dc1d_650381b1e32d["split_text_on_tokens()"]
  d96ff4b9_fcc1_8428_729e_f75b099397b4["base.py"]
  0f51bcb8_84bd_5648_dc1d_650381b1e32d -->|defined in| d96ff4b9_fcc1_8428_729e_f75b099397b4
  b20de6d0_e7f4_4423_1863_b2f88e0d3c76["split_text()"]
  b20de6d0_e7f4_4423_1863_b2f88e0d3c76 -->|calls| 0f51bcb8_84bd_5648_dc1d_650381b1e32d
  style 0f51bcb8_84bd_5648_dc1d_650381b1e32d fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/text-splitters/langchain_text_splitters/base.py lines 422–450

def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]:
    """Split incoming text and return chunks using tokenizer.

    Args:
        text: The input text to be split.
        tokenizer: The tokenizer to use for splitting.

    Returns:
        A list of text chunks.
    """
    splits: list[str] = []
    input_ids = tokenizer.encode(text)
    start_idx = 0
    if tokenizer.tokens_per_chunk <= tokenizer.chunk_overlap:
        msg = "tokens_per_chunk must be greater than chunk_overlap"
        raise ValueError(msg)

    while start_idx < len(input_ids):
        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
        chunk_ids = input_ids[start_idx:cur_idx]
        if not chunk_ids:
            break
        decoded = tokenizer.decode(chunk_ids)
        if decoded:
            splits.append(decoded)
        if cur_idx == len(input_ids):
            break
        start_idx += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
    return splits
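
The windowing behaviour of the listing above can be tried without installing langchain. The following is a minimal sketch, not the library's own code: the `Tokenizer` dataclass here is a hypothetical stand-in that carries only the four attributes the function reads (`encode`, `decode`, `tokens_per_chunk`, `chunk_overlap`), the function body is a condensed copy of the listing, and the character-level `toy` tokenizer is invented purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tokenizer:
    """Hypothetical stand-in: only the attributes the listing reads."""

    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]


def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]:
    """Condensed copy of the listing's logic, for demonstration only."""
    splits: list[str] = []
    input_ids = tokenizer.encode(text)
    if tokenizer.tokens_per_chunk <= tokenizer.chunk_overlap:
        raise ValueError("tokens_per_chunk must be greater than chunk_overlap")

    # Each window starts this many tokens after the previous one.
    stride = tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
    start_idx = 0
    while start_idx < len(input_ids):
        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
        decoded = tokenizer.decode(input_ids[start_idx:cur_idx])
        if decoded:
            splits.append(decoded)
        if cur_idx == len(input_ids):
            break  # the last window reached the end of the input
        start_idx += stride
    return splits


# Toy character-level tokenizer (an assumption for this example):
# one token per character, using the code point as the token ID.
toy = Tokenizer(
    chunk_overlap=1,
    tokens_per_chunk=4,
    decode=lambda ids: "".join(chr(i) for i in ids),
    encode=lambda s: [ord(c) for c in s],
)

chunks = split_text_on_tokens(text="abcdefghij", tokenizer=toy)
print(chunks)  # ['abcd', 'defg', 'ghij']

# A configuration whose overlap is not smaller than the chunk size is rejected.
bad = Tokenizer(
    chunk_overlap=2,
    tokens_per_chunk=2,
    decode=toy.decode,
    encode=toy.encode,
)
try:
    split_text_on_tokens(text="abc", tokenizer=bad)
except ValueError as exc:
    error_message = str(exc)
```

With 4 tokens per chunk and an overlap of 1, each window starts 3 tokens after the previous one, so consecutive chunks share one token ("d", then "g"). The `ValueError` guard matters because a stride of zero or less would never move `start_idx` forward.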

Frequently Asked Questions

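What does split_text_on_tokens() do?
split_text_on_tokens() encodes the input text with the supplied tokenizer, slices the resulting token IDs into windows of at most tokenizer.tokens_per_chunk tokens, decodes each window, and returns the non-empty decoded strings as a list. Each window starts tokens_per_chunk - chunk_overlap tokens after the previous one, so consecutive chunks share chunk_overlap tokens. The function raises ValueError if tokens_per_chunk is not greater than chunk_overlap.

Where is split_text_on_tokens() defined?
split_text_on_tokens() is defined in libs/text-splitters/langchain_text_splitters/base.py, starting at line 422.

What calls split_text_on_tokens()?
split_text_on_tokens() is called by one function: split_text().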