Effective Document Chunking: From Basic to Advanced Methods

Siddhant Srivastava
2 min read · Jun 22, 2024


Introduction

Document chunking is a crucial technique in natural language processing that involves breaking down large texts into smaller, manageable pieces. This process improves retrieval efficiency, comprehension, and downstream processing in applications such as search engines, chatbots, and machine learning models. This article walks through methods of chunking documents from basic to advanced, including LangChain's text splitters, which can be paired with OpenAI's tiktoken tokenizer to keep chunks within model token limits.

Basic Methods

1. Fixed-Length Chunking

The simplest form of chunking involves splitting the document into fixed-length chunks based on a predefined number of words or characters.

def fixed_length_chunking(text, chunk_size=200):
    # Split on whitespace and group every chunk_size words into one chunk.
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks
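
A quick sanity check with synthetic input (the sample text below is purely illustrative): a 450-word text with chunk_size=200 yields chunks of 200, 200, and 50 words.

text = ' '.join(f'word{i}' for i in range(450))
chunks = fixed_length_chunking(text)
print([len(c.split()) for c in chunks])  # [200, 200, 50]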

2. Sentence-Based Chunking

This method divides the document into chunks based on complete sentences, ensuring that each chunk contains whole sentences rather than splitting them in the middle.

import nltk

nltk.download('punkt')

def sentence_based_chunking(text, max_sentences=5):
    # Tokenize into sentences, then group every max_sentences sentences into one chunk.
    sentences = nltk.sent_tokenize(text)
    chunks = [' '.join(sentences[i:i + max_sentences]) for i in range(0, len(sentences), max_sentences)]
    return chunks
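
Depending on your NLTK version, sent_tokenize may require the punkt_tab resource instead of punkt; if the call above raises a LookupError, run nltk.download('punkt_tab').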

Intermediate Methods

3. Paragraph-Based Chunking

Chunking by paragraphs retains the natural structure of the document and is useful when the logical structure of text is important.

def paragraph_based_chunking(text):
    # Treat blank lines as paragraph boundaries; drop empty or whitespace-only entries.
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    return paragraphs
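
This assumes paragraphs are separated by blank lines; for formats with their own structure, such as HTML or Markdown, you would split on that format's paragraph markers instead.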

4. Overlapping Chunks

Overlapping chunks provide better context by including some overlapping content between consecutive chunks, which can be helpful for models that process the chunks sequentially.

def overlapping_chunking(text, chunk_size=200, overlap=50):
    # Advance by (chunk_size - overlap) words so consecutive chunks share `overlap` words.
    assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
    words = text.split()
    step = chunk_size - overlap
    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
    return chunks
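
With the defaults, the window advances 150 words at a time, so each chunk repeats the last 50 words of its predecessor (again using synthetic text for illustration):

text = ' '.join(f'word{i}' for i in range(400))
chunks = overlapping_chunking(text)
print([c.split()[0] for c in chunks])  # ['word0', 'word150', 'word300']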

Advanced Methods

5. Character-Based Chunking with LangChain's Text Splitters

LangChain's CharacterTextSplitter (these splitters ship with LangChain, not the openai package) gives precise control over chunk sizes by character count, which helps keep chunks within model token limits.

from langchain.text_splitter import CharacterTextSplitter

def character_based_chunking(text, chunk_size=200):
    # Split on the separator, then merge pieces into chunks of at most chunk_size characters.
    splitter = CharacterTextSplitter(separator=' ', chunk_size=chunk_size, chunk_overlap=0)
    chunks = splitter.split_text(text)
    return chunks
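
To measure chunk size in OpenAI tokens rather than characters, LangChain also provides a tiktoken-backed constructor. A short sketch, assuming the tiktoken package is installed and using the encoding shared by recent OpenAI models:

splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base', chunk_size=200, chunk_overlap=0
)
chunks = splitter.split_text(text)  # text is your document string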

6. Recursive Character-Based Chunking

When a single separator still leaves oversized chunks, LangChain's RecursiveCharacterTextSplitter splits recursively on a hierarchy of separators until every chunk meets the size constraint.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def recursive_character_based_chunking(text, chunk_size=200):
    # Tries '\n\n', then '\n', then ' ', then '' until each chunk fits chunk_size.
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = splitter.split_text(text)
    return chunks
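
Because the splitter falls back from paragraph breaks to line breaks to spaces, it keeps related text together whenever possible and only splits mid-word as a last resort, which is why it is often the recommended default for generic text.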

7. Semantic-Based Chunking with Embeddings

This advanced method uses semantic information to create chunks that represent coherent units of meaning, often utilizing embeddings or topic modeling.

import nltk
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def semantic_based_chunking(text, num_chunks=10):
    # Embed each sentence, then cluster the embeddings so that each chunk
    # groups semantically similar sentences.
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    sentences = nltk.sent_tokenize(text)
    embeddings = model.encode(sentences)

    kmeans = KMeans(n_clusters=num_chunks, random_state=42)
    clusters = kmeans.fit_predict(embeddings)

    # Sentences keep their original order within a cluster, but clusters are
    # groups of similar sentences, not contiguous spans of the document.
    chunks = [' '.join(sentences[i] for i in range(len(sentences)) if clusters[i] == c)
              for c in range(num_chunks)]
    return chunks
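
Because K-means groups by similarity rather than position, a chunk can stitch together sentences from distant parts of the document. A common order-preserving alternative is to start a new chunk wherever the similarity between consecutive sentence embeddings drops. A minimal sketch, reusing the same model (the 0.5 threshold is an illustrative choice, not a tuned value):

import numpy as np

def semantic_boundary_chunking(text, threshold=0.5):
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    sentences = nltk.sent_tokenize(text)
    if not sentences:
        return []
    # Unit-normalized embeddings make the dot product a cosine similarity.
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(np.dot(emb[i - 1], emb[i])) < threshold:
            chunks.append(' '.join(current))
            current = []
        current.append(sentences[i])
    chunks.append(' '.join(current))
    return chunks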

Conclusion

Effective document chunking improves the efficiency and accuracy of many natural language processing tasks. From basic methods like fixed-length and sentence-based chunking to advanced techniques using LangChain's text splitters and semantic embeddings, each method has its own advantages and use cases. Selecting the right chunking strategy depends on the specific requirements and goals of your application.
