split_text() — langchain Function Reference
Architecture documentation for the split_text() function in markdown.py from the langchain codebase.
Dependency Diagram
graph TD
    b18c92c3_4d24_0e77_6322_b71c795c08ff["split_text()"]
    6a11b5bb_e2e9_6671_54b0_3ed10f3c9672["MarkdownHeaderTextSplitter"]
    b18c92c3_4d24_0e77_6322_b71c795c08ff -->|defined in| 6a11b5bb_e2e9_6671_54b0_3ed10f3c9672
    ca4b44a0_217b_9ee3_738c_a86f47cf5d13["split_text()"]
    ca4b44a0_217b_9ee3_738c_a86f47cf5d13 -->|calls| b18c92c3_4d24_0e77_6322_b71c795c08ff
    c0f7b205_386f_81f7_010d_c9dac195bb30["_is_custom_header()"]
    b18c92c3_4d24_0e77_6322_b71c795c08ff -->|calls| c0f7b205_386f_81f7_010d_c9dac195bb30
    cd7326ce_b97a_382e_8cf3_d4647f3a82a6["aggregate_lines_to_chunks()"]
    b18c92c3_4d24_0e77_6322_b71c795c08ff -->|calls| cd7326ce_b97a_382e_8cf3_d4647f3a82a6
    b18c92c3_4d24_0e77_6322_b71c795c08ff -->|calls| ca4b44a0_217b_9ee3_738c_a86f47cf5d13
    style b18c92c3_4d24_0e77_6322_b71c795c08ff fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
libs/text-splitters/langchain_text_splitters/markdown.py lines 134–280
def split_text(self, text: str) -> list[Document]:
    """Split markdown file.

    Args:
        text: Markdown file

    Returns:
        List of `Document` objects.
    """
    # Split the input text by newline character ("\n").
    lines = text.split("\n")
    # Final output
    lines_with_metadata: list[LineType] = []
    # Content and metadata of the chunk currently being processed
    current_content: list[str] = []
    current_metadata: dict[str, str] = {}
    # Keep track of the nested header structure
    header_stack: list[HeaderType] = []
    initial_metadata: dict[str, str] = {}
    in_code_block = False
    opening_fence = ""
    for line in lines:
        stripped_line = line.strip()
        # Remove all non-printable characters from the string, keeping only visible
        # text.
        stripped_line = "".join(filter(str.isprintable, stripped_line))
        if not in_code_block:
            # Exclude inline code spans
            if stripped_line.startswith("```") and stripped_line.count("```") == 1:
                in_code_block = True
                opening_fence = "```"
            elif stripped_line.startswith("~~~"):
                in_code_block = True
                opening_fence = "~~~"
        elif stripped_line.startswith(opening_fence):
            in_code_block = False
            opening_fence = ""
        if in_code_block:
            current_content.append(stripped_line)
            continue
        # Check each line against each of the header types (e.g., #, ##)
        for sep, name in self.headers_to_split_on:
            is_standard_header = stripped_line.startswith(sep) and (
                # Header with no text OR header is followed by space
                # Both are valid conditions that sep is being used a header
                len(stripped_line) == len(sep) or stripped_line[len(sep)] == " "
            )
            is_custom_header = self._is_custom_header(stripped_line, sep)
            # Check if line matches either standard or custom header pattern
            if is_standard_header or is_custom_header:
                # Ensure we are tracking the header as metadata
                if name is not None:
                    # Get the current header level
                    if sep in self.custom_header_patterns:
                        current_header_level = self.custom_header_patterns[sep]
                    else:
                        current_header_level = sep.count("#")
                    # Pop out headers of lower or same level from the stack
                    while (
                        header_stack
                        and header_stack[-1]["level"] >= current_header_level
                    ):
                        # We have encountered a new header
                        # at the same or higher level
                        popped_header = header_stack.pop()
                        # Clear the metadata for the
                        # popped header in initial_metadata
                        if popped_header["name"] in initial_metadata:
                            initial_metadata.pop(popped_header["name"])
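The listing ends inside the header-stack loop, so the following standalone sketch may help show what that loop achieves. It is not the library's code: the `Header` type, the `push_header` helper, and the `data` field are hypothetical names introduced here, and the push step after the loop is an assumption, since the listing stops before that point.

```python
from typing import TypedDict


class Header(TypedDict):
    """Hypothetical stand-in for the library's HeaderType."""

    level: int
    name: str
    data: str


def push_header(stack: list[Header], metadata: dict[str, str], header: Header) -> None:
    """Pop headers at the same or a deeper level, then push the new header."""
    while stack and stack[-1]["level"] >= header["level"]:
        popped = stack.pop()
        # Drop the metadata entry that belonged to the popped header.
        metadata.pop(popped["name"], None)
    # Assumed step: record the new header (the listing ends before this point).
    stack.append(header)
    metadata[header["name"]] = header["data"]


stack: list[Header] = []
metadata: dict[str, str] = {}
push_header(stack, metadata, {"level": 1, "name": "Header 1", "data": "Intro"})
push_header(stack, metadata, {"level": 2, "name": "Header 2", "data": "Setup"})
push_header(stack, metadata, {"level": 2, "name": "Header 2", "data": "Usage"})
print(metadata)  # {'Header 1': 'Intro', 'Header 2': 'Usage'}
push_header(stack, metadata, {"level": 1, "name": "Header 1", "data": "Appendix"})
print(metadata)  # {'Header 1': 'Appendix'}
```

A sibling header replaces the previous one at its level, and a new top-level header clears everything nested beneath the old one, which is why the comparison uses `>=` rather than `>`.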
Frequently Asked Questions
What does split_text() do?
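split_text() is a method of MarkdownHeaderTextSplitter, defined in libs/text-splitters/langchain_text_splitters/markdown.py. It takes a Markdown string, walks it line by line, keeps the contents of fenced code blocks (``` or ~~~) as plain content, matches the remaining lines against the configured headers_to_split_on and any custom header patterns, tracks the nested header hierarchy as metadata, and returns a list of Document objects.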
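The code-fence handling in the listing is a small state machine, and a reduced sketch may make it easier to follow. The function name `code_block_flags` and the `FENCE` constant are hypothetical names used only for illustration; the branching mirrors the listing, but this is not the library's implementation.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def code_block_flags(lines: list[str]) -> list[bool]:
    """Return, per line, whether the line is handled as fenced-code content."""
    flags: list[bool] = []
    in_code_block = False
    opening_fence = ""
    for line in lines:
        stripped = line.strip()
        if not in_code_block:
            # A single fence opens a block; two fences on one line is an inline span.
            if stripped.startswith(FENCE) and stripped.count(FENCE) == 1:
                in_code_block, opening_fence = True, FENCE
            elif stripped.startswith("~~~"):
                in_code_block, opening_fence = True, "~~~"
        elif stripped.startswith(opening_fence):
            # A line starting with the opening fence closes the block.
            in_code_block, opening_fence = False, ""
        flags.append(in_code_block)
    return flags


sample = ["# Title", FENCE + "python", "# not a header", FENCE, "## Next"]
print(code_block_flags(sample))  # [False, True, True, False, False]
```

The `# not a header` line sits inside the fence, so it is never tested against the header separators. The closing fence itself is not flagged, which matches the listing: it falls through to the header check instead of the early `continue`.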
Where is split_text() defined?
split_text() is defined in libs/text-splitters/langchain_text_splitters/markdown.py at line 134.
What does split_text() call?
split_text() calls three functions: _is_custom_header(), aggregate_lines_to_chunks(), and a second split_text(), which appears as a separate node in the dependency diagram.
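Alongside the call to `_is_custom_header()`, the listing applies an inline test for standard headers. The sketch below lifts that expression into a standalone function so its edge cases can be tried out; the function name `is_standard_header` is a hypothetical wrapper, while the expression itself is taken from the listing.

```python
def is_standard_header(stripped_line: str, sep: str) -> bool:
    """True if the line starts with sep and sep is the whole line or is followed by a space."""
    return stripped_line.startswith(sep) and (
        len(stripped_line) == len(sep) or stripped_line[len(sep)] == " "
    )


print(is_standard_header("# Title", "#"))   # True
print(is_standard_header("#", "#"))         # True
print(is_standard_header("#hashtag", "#"))  # False
print(is_standard_header("## Sub", "#"))    # False
print(is_standard_header("## Sub", "##"))   # True
```

The trailing-space requirement is what keeps a `#` separator from matching `##` lines or hashtag-like text.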
What calls split_text()?
split_text() is called by one function: split_text(), the same separate node shown in the dependency diagram.