Break list of words into whole word chunks under a max token size

Let’s say I have a long list of names that I would like to feed into an LLM in chunks. How can I split up my list of names so that each group stays under max_tokens tokens without repeating or breaking up any individual entries in the list? I know from the OpenAI docs that I can join my list into one big string and use tiktoken to truncate the string to a token count, but I don’t know how to make sure each chunk contains only whole entries.

import tiktoken

city_reprex = ['The Colony', 'Bridgeport', 'Toledo', 'Barre', 'Newburyport', 'Dover', 'Jonesboro', 'South Haven', 'Ogdensburg', 'Berkeley', 'Ray', 'Sugar Land', 'Telluride', 'Erwin', 'Milpitas', 'Jonesboro', 'Orem', 'Winnemucca', 'Calabash', 'Sugarcreek']

max_tokens = 25
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

prompt = ', '.join(city_reprex)

prompt_size_in_tokens = len(encoding.encode(prompt))
record_encoding = encoding.encode(prompt)

# How can I get my chunks as close to the max size as possible while also making sure each item in the chunk is a whole item in the list?
print(f"Chunk 1: --> {encoding.decode(record_encoding[:max_tokens])}")
print(f"Chunk 2: --> {encoding.decode(record_encoding[max_tokens:max_tokens*2])}")

Output:

Chunk 1: --> The Colony, Bridgeport, Toledo, Barre, Newburyport, Dover, Jonesboro, South Haven, Ogd
Chunk 2: --> ensburg, Berkeley, Ray, Sugar Land, Telluride, Erwin, Milpitas, Jonesboro, Orem


Solution:

To split your list of names into chunks that stay under a maximum token count while ensuring that each item in a chunk is a whole item from the list, you can follow these steps:

  1. Create a new empty list to store your chunks.
  2. Create a new empty string to store your current chunk.
  3. Loop through each item in your original list of names.
  4. Check if adding the next item to your current chunk will make it exceed the maximum size in tokens. If it will, add the current chunk to your list of chunks and start a new chunk with the current item.
  5. If the next item won’t make your current chunk exceed the maximum size in tokens, add it to your current chunk with a comma-and-space separator (or start the chunk with the item itself if the chunk is still empty).
  6. After the loop, add the final chunk, if it is non-empty, to your list of chunks.
  7. Return your list of chunks.

Here’s an example implementation of this approach:

import tiktoken

city_reprex = ['The Colony', 'Bridgeport', 'Toledo', 'Barre', 'Newburyport', 'Dover', 'Jonesboro', 'South Haven', 'Ogdensburg', 'Berkeley', 'Ray', 'Sugar Land', 'Telluride', 'Erwin', 'Milpitas', 'Jonesboro', 'Orem', 'Winnemucca', 'Calabash', 'Sugarcreek']

max_tokens = 25
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def split_list_into_chunks(lst, max_tokens, encoding):
    chunks = []
    current_chunk = ""
    for item in lst:
        item_size_in_tokens = len(encoding.encode(item))
        # The ", " separator is approximated as 2 tokens here.
        if current_chunk and len(encoding.encode(current_chunk)) + item_size_in_tokens + 2 > max_tokens:
            chunks.append(current_chunk)
            current_chunk = item
        elif current_chunk:
            current_chunk += f", {item}"
        else:
            # First item of a chunk: no separator, so no chunk
            # starts with a stray ", ".
            current_chunk = item
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

chunks = split_list_into_chunks(city_reprex, max_tokens, encoding)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: --> {chunk}")

Output:

Chunk 1: --> The Colony, Bridgeport, Toledo, Barre, Newburyport, Dover, Jonesboro
Chunk 2: --> South Haven, Ogdensburg, Berkeley, Ray, Sugar Land, Telluride
Chunk 3: --> Erwin, Milpitas, Jonesboro, Orem, Winnemucca, Calabash, Sugarcreek

In this example, the split_list_into_chunks function takes your list of names, maximum size in tokens, and encoding as input parameters, and returns a list of chunks. The enumerate function is used to add a numeric index to each chunk when printing the output.
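Since the chunking logic itself does not depend on tiktoken, one option is to factor the token counter out as a parameter. The sketch below is a variant of the function above (not from the original post): it measures the actual joined string instead of adding a fixed 2 tokens per separator, and it is demonstrated with a toy whitespace-based counter as a stand-in so it runs without tiktoken installed. In practice you would pass something like lambda s: len(encoding.encode(s)).

```python
def split_into_chunks(items, max_tokens, count_tokens):
    """Group items into comma-joined chunks of at most max_tokens.

    count_tokens: any callable mapping a string to a token count,
    e.g. lambda s: len(encoding.encode(s)) with a tiktoken encoding.
    """
    chunks = []
    current = ""
    for item in items:
        # Build the joined string first, so the separator's real
        # token cost is measured rather than approximated.
        candidate = item if not current else f"{current}, {item}"
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = item
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Toy stand-in for tiktoken: one "token" per whitespace-separated word.
word_count = lambda s: len(s.split())

cities = ["The Colony", "Bridgeport", "Toledo", "Barre", "Newburyport"]
print(split_into_chunks(cities, 4, word_count))
# prints ['The Colony, Bridgeport, Toledo', 'Barre, Newburyport']
```

Because the counter is injected, the same function works with any tokenizer; note that measuring the joined string can produce slightly different chunk boundaries than the +2 heuristic, since ", " does not always cost exactly two tokens.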
