split_text() — langchain Function Reference

Architecture documentation for the split_text() method of ExperimentalMarkdownSyntaxTextSplitter, defined in markdown.py in the langchain codebase.

Type: function · Language: Python · Domain: DocumentProcessing · Subdomain: TextSplitters · Calls: 7 · Called by: 1

Dependency Diagram

graph TD
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13["split_text()"]
  cd7394a9_9856_dc15_cb00_078cf42f0529["ExperimentalMarkdownSyntaxTextSplitter"]
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|defined in| cd7394a9_9856_dc15_cb00_078cf42f0529
  b18c92c3_4d24_0e77_6322_b71c795c08ff["split_text()"]
  b18c92c3_4d24_0e77_6322_b71c795c08ff -->|calls| ca4b44a0_217b_9ee3_738c_a86f47cf5d13
  e4272ad6_fa6c_2270_2b87_1b76f6930a95["_match_header()"]
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|calls| e4272ad6_fa6c_2270_2b87_1b76f6930a95
  a98db0b3_1cd6_99eb_e5f6_1cfcd76b1885["_match_code()"]
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|calls| a98db0b3_1cd6_99eb_e5f6_1cfcd76b1885
  79e4075c_b84b_1d21_02bd_fab2d78f732e["_match_horz()"]
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|calls| 79e4075c_b84b_1d21_02bd_fab2d78f732e
  b10a7aa1_da71_adf1_c53e_7aa30fa1d45c["_complete_chunk_doc()"]
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|calls| b10a7aa1_da71_adf1_c53e_7aa30fa1d45c
  d853fccb_3a1e_9745_c402_faf93b6c62b2["_resolve_header_stack()"]
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|calls| d853fccb_3a1e_9745_c402_faf93b6c62b2
  7401677c_acf5_67de_69f0_eb4e531b0f66["_resolve_code_chunk()"]
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|calls| 7401677c_acf5_67de_69f0_eb4e531b0f66
  ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|calls| b18c92c3_4d24_0e77_6322_b71c795c08ff
  style ca4b44a0_217b_9ee3_738c_a86f47cf5d13 fill:#6366f1,stroke:#818cf8,color:#fff

Source Code

libs/text-splitters/langchain_text_splitters/markdown.py lines 372–432

    def split_text(self, text: str) -> list[Document]:
        """Split the input text into structured chunks.

        This method processes the input text line by line, identifying and handling
        specific patterns such as headers, code blocks, and horizontal rules to split it
        into structured chunks based on headers, code blocks, and horizontal rules.

        Args:
            text: The input text to be split into chunks.

        Returns:
            A list of `Document` objects representing the structured
            chunks of the input text. If `return_each_line` is enabled, each line
            is returned as a separate `Document`.
        """
        # Reset the state for each new file processed
        self.chunks.clear()
        self.current_chunk = Document(page_content="")
        self.current_header_stack.clear()

        raw_lines = text.splitlines(keepends=True)

        while raw_lines:
            raw_line = raw_lines.pop(0)
            header_match = self._match_header(raw_line)
            code_match = self._match_code(raw_line)
            horz_match = self._match_horz(raw_line)
            if header_match:
                self._complete_chunk_doc()

                if not self.strip_headers:
                    self.current_chunk.page_content += raw_line

                # add the header to the stack
                header_depth = len(header_match.group(1))
                header_text = header_match.group(2)
                self._resolve_header_stack(header_depth, header_text)
            elif code_match:
                self._complete_chunk_doc()
                self.current_chunk.page_content = self._resolve_code_chunk(
                    raw_line, raw_lines
                )
                self.current_chunk.metadata["Code"] = code_match.group(1)
                self._complete_chunk_doc()
            elif horz_match:
                self._complete_chunk_doc()
            else:
                self.current_chunk.page_content += raw_line

        self._complete_chunk_doc()
        # I don't see why `return_each_line` is a necessary feature of this splitter.
        # It's easy enough to do outside of the class and the caller can have more
        # control over it.
        if self.return_each_line:
            return [
                Document(page_content=line, metadata=chunk.metadata)
                for chunk in self.chunks
                for line in chunk.page_content.splitlines()
                if line and not line.isspace()
            ]
        return self.chunks
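
The control flow of the listing above can be tried without installing langchain. The following is a minimal, simplified sketch, not the library's implementation: the listing does not show `_match_header`, `_match_code`, `_match_horz`, `_complete_chunk_doc`, `_resolve_header_stack`, or `_resolve_code_chunk`, so the patterns, the `Chunk` class, the `split_markdown` name, and the `"Header N"` metadata keys below are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

FENCE = "`" * 3  # code-fence marker, built this way to keep this example easy to embed

# Assumed patterns; the real matchers are not shown in the listing.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)")
HORZ_RE = re.compile(r"^(?:\*{3,}|-{3,}|_{3,})\s*$")


@dataclass
class Chunk:
    """Hypothetical stand-in for langchain's Document."""
    page_content: str = ""
    metadata: dict = field(default_factory=dict)


def split_markdown(text: str, strip_headers: bool = True) -> list[Chunk]:
    chunks: list[Chunk] = []
    current = Chunk()
    header_stack: list[tuple[int, str]] = []  # (depth, title) pairs

    def complete_chunk() -> None:
        # Assumed behaviour: keep non-blank chunks and tag them with active headers.
        nonlocal current
        if current.page_content and not current.page_content.isspace():
            for depth, title in header_stack:
                current.metadata.setdefault(f"Header {depth}", title)
            chunks.append(current)
        current = Chunk()

    lines = text.splitlines(keepends=True)
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        header = HEADER_RE.match(line)
        if header:
            complete_chunk()
            if not strip_headers:
                current.page_content += line
            depth = len(header.group(1))
            # Assumed stack rule: a new header closes headers of equal or greater depth.
            while header_stack and header_stack[-1][0] >= depth:
                header_stack.pop()
            header_stack.append((depth, header.group(2).strip()))
        elif line.startswith(FENCE):
            complete_chunk()
            block = line
            # Consume lines up to and including the closing fence.
            while i < len(lines):
                nxt = lines[i]
                i += 1
                block += nxt
                if nxt.startswith(FENCE):
                    break
            current.page_content = block
            current.metadata["Code"] = line[len(FENCE):].strip()
            complete_chunk()
        elif HORZ_RE.match(line):
            complete_chunk()
        else:
            current.page_content += line
    complete_chunk()
    return chunks
```

As in the listing, the header branch is checked first, a code block is always emitted as its own chunk, and a horizontal rule only closes the current chunk without contributing text. The sketch walks the line list with an index rather than calling `pop(0)`; that is a choice of the sketch, not something taken from the source.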

Domain

DocumentProcessing

Subdomains

TextSplitters

Defined In

libs/text-splitters/langchain_text_splitters/markdown.py

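Calls

_complete_chunk_doc()
_match_code()
_match_header()
_match_horz()
_resolve_code_chunk()
_resolve_header_stack()
split_text()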

Called By

split_text()

Frequently Asked Questions

What does split_text() do?

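split_text() is a method of ExperimentalMarkdownSyntaxTextSplitter, defined in libs/text-splitters/langchain_text_splitters/markdown.py. It resets the splitter's state, walks the input line by line, and closes the current chunk whenever it meets a header, a code block, or a horizontal rule. A header updates the header stack, and a code block becomes its own chunk with the first capture group of the code match stored under the "Code" metadata key. It returns a list of Document objects. If return_each_line is set, each non-blank line of every chunk is returned as a separate Document carrying that chunk's metadata.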
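
The `return_each_line` step at the end of the source listing is a plain list comprehension, and the comment in the source notes that callers can do the same thing outside the class. A minimal sketch of that post-processing follows, assuming a hypothetical `Chunk` dataclass in place of langchain's `Document` and a hypothetical helper name `explode_lines`:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for langchain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def explode_lines(chunks: list[Chunk]) -> list[Chunk]:
    """Turn each chunk into one item per non-blank line, keeping its metadata."""
    return [
        # Copying the dict is a choice of this sketch so items do not share state.
        Chunk(page_content=line, metadata=dict(chunk.metadata))
        for chunk in chunks
        for line in chunk.page_content.splitlines()
        if line and not line.isspace()
    ]
```

Doing this in caller code keeps the filtering rule (here, dropping empty and whitespace-only lines) under the caller's control.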

Where is split_text() defined?

split_text() is defined in libs/text-splitters/langchain_text_splitters/markdown.py at line 372.

What does split_text() call?

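split_text() calls 7 functions: _complete_chunk_doc(), _match_code(), _match_header(), _match_horz(), _resolve_code_chunk(), _resolve_header_stack(), and split_text(). The last entry is the separate split_text() node shown in the dependency diagram, not a call visible in the source listing.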

What calls split_text()?

split_text() is called by 1 function: split_text(), which the dependency diagram shows as a separate node with the same name.
