Break list of words into whole word chunks under a max token size

Let’s say I have a long list of names that I would like to feed into an LLM in chunks. How can I split up my list of names so that each group stays under max_tokens tokens without repeating or breaking up any individual entries in the list? I know from the OpenAI docs that I can join my list into one big string and use tiktoken to truncate the string to a token count, but I don’t know how to make sure each chunk contains only whole entries.

import tiktoken

city_reprex = ['The Colony', 'Bridgeport', 'Toledo', 'Barre', 'Newburyport', 'Dover', 'Jonesboro', 'South Haven', 'Ogdensburg', 'Berkeley', 'Ray', 'Sugar Land', 'Telluride', 'Erwin', 'Milpitas', 'Jonesboro', 'Orem', 'Winnemucca', 'Calabash', 'Sugarcreek']

max_tokens = 25
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

prompt = ', '.join(city_reprex)

prompt_size_in_tokens = len(encoding.encode(prompt))
record_encoding = encoding.encode(prompt)

# How can I get my chunks as close to the max size as possible while also making sure each item in the chunk is a whole item in the list?
print(f"Chunk 1: --> {encoding.decode(record_encoding[:max_tokens])}")
print(f"Chunk 2: --> {encoding.decode(record_encoding[max_tokens:max_tokens*2])}")

Output:

Chunk 1: --> The Colony, Bridgeport, Toledo, Barre, Newburyport, Dover, Jonesboro, South Haven, Ogd
Chunk 2: --> ensburg, Berkeley, Ray, Sugar Land, Telluride, Erwin, Milpitas, Jonesboro, Orem


Solution:

To split your list of names into chunks that stay under a maximum token count while ensuring that each item in a chunk is a whole item from the list, you can follow these steps:

  1. Create a new empty list to store your chunks.
  2. Create a new empty string to store your current chunk.
  3. Loop through each item in your original list of names.
  4. Check if adding the next item to your current chunk will make it exceed the maximum size in tokens. If it will, add the current chunk to your list of chunks and start a new chunk with the current item.
  5. If the next item won’t make your current chunk exceed the maximum size in tokens, add it to your current chunk with a comma-and-space separator (or start the chunk with the item itself if the chunk is still empty).
  6. After the loop, add the final chunk, if it is non-empty, to your list of chunks.
  7. Return your list of chunks.

Here’s an example implementation of this approach:

import tiktoken

city_reprex = ['The Colony', 'Bridgeport', 'Toledo', 'Barre', 'Newburyport', 'Dover', 'Jonesboro', 'South Haven', 'Ogdensburg', 'Berkeley', 'Ray', 'Sugar Land', 'Telluride', 'Erwin', 'Milpitas', 'Jonesboro', 'Orem', 'Winnemucca', 'Calabash', 'Sugarcreek']

max_tokens = 25
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def split_list_into_chunks(lst, max_tokens, encoding):
    chunks = []
    current_chunk = ""
    for item in lst:
        item_size_in_tokens = len(encoding.encode(item))
        # The ", " separator is approximated as 2 tokens here.
        if current_chunk and len(encoding.encode(current_chunk)) + item_size_in_tokens + 2 > max_tokens:
            chunks.append(current_chunk)
            current_chunk = item
        elif current_chunk:
            current_chunk += f", {item}"
        else:
            # First item of a chunk: no separator, so no chunk
            # starts with a stray ", ".
            current_chunk = item
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

chunks = split_list_into_chunks(city_reprex, max_tokens, encoding)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: --> {chunk}")

Output:

Chunk 1: --> The Colony, Bridgeport, Toledo, Barre, Newburyport, Dover, Jonesboro
Chunk 2: --> South Haven, Ogdensburg, Berkeley, Ray, Sugar Land, Telluride
Chunk 3: --> Erwin, Milpitas, Jonesboro, Orem, Winnemucca, Calabash, Sugarcreek

In this example, the split_list_into_chunks function takes your list of names, maximum size in tokens, and encoding as input parameters, and returns a list of chunks. The enumerate function is used to add a numeric index to each chunk when printing the output.
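Since the chunking logic itself does not depend on tiktoken, one option is to factor the token counter out as a parameter. The sketch below is a variant of the function above (not from the original post): it measures the actual joined string instead of adding a fixed 2 tokens per separator, and it is demonstrated with a toy whitespace-based counter as a stand-in so it runs without tiktoken installed. In practice you would pass something like lambda s: len(encoding.encode(s)).

```python
def split_into_chunks(items, max_tokens, count_tokens):
    """Group items into comma-joined chunks of at most max_tokens.

    count_tokens: any callable mapping a string to a token count,
    e.g. lambda s: len(encoding.encode(s)) with a tiktoken encoding.
    """
    chunks = []
    current = ""
    for item in items:
        # Build the joined string first, so the separator's real
        # token cost is measured rather than approximated.
        candidate = item if not current else f"{current}, {item}"
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = item
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Toy stand-in for tiktoken: one "token" per whitespace-separated word.
word_count = lambda s: len(s.split())

cities = ["The Colony", "Bridgeport", "Toledo", "Barre", "Newburyport"]
print(split_into_chunks(cities, 4, word_count))
# prints ['The Colony, Bridgeport, Toledo', 'Barre, Newburyport']
```

Because the counter is injected, the same function works with any tokenizer; note that measuring the joined string can produce slightly different chunk boundaries than the +2 heuristic, since ", " does not always cost exactly two tokens.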
