Split a long text into two or more parts, each with a maximum length, in Python

Suppose I have a long text that I want to process with an API that allows a maximum number of characters (N). I would like to split that text into two or more texts, each shorter than N characters, breaking only at a separator. I know I could simply split on the separator, but I would like to keep the number of output sub-texts as small as possible.

For example, suppose my text is:

"Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique. Consulatu cotidieque ex sea, nam no duis prompta expetendis.

Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros. Expetenda torquatos assueverit est ex, te reque voluptatibus signiferumque has."

which is 550 characters long. Let’s suppose that N is 250. I would expect the text to be split in this way:

  • Part 1: "Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique" (237 characters)

  • Part 2: "Consulatu cotidieque ex sea, nam no duis prompta expetendis.

Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros." (232 characters)

  • Part 3: the remaining text.

Any idea on how to do this in Python?

Thank you for any help.
Francesca

>Solution :

You can do that using regex:

import re

# `data` is the text to split, defined elsewhere.
output = re.findall(r".{1,250}(?:\.|$)", data)
print(output)

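  • .{1,250}: Matches any character except a line break, between 1 and 250 times, as many times as possible.
  • (?:...): A non-capturing group, so re.findall returns the whole match rather than only the group.
  • \.: Matches a literal dot.
  • |: Or.
  • $: Matches the end of the string.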
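
As a quick check of how these pieces interact, the sketch below runs the same pattern on two made-up strings with a limit of 10 instead of 250. It suggests three things to watch for: the delimiter is matched after the 1 to 10 characters, so a part can be one character longer than the limit; text with no delimiter within reach is skipped silently by re.findall; and . stops at line breaks unless re.DOTALL is passed. The sample strings and variable names are illustrative only.

```python
import re

pattern = r".{1,10}(?:\.|$)"

# Made-up sample text; the limit of 10 is only for illustration.
sample = "One. Two three. Four five six."
parts = re.findall(pattern, sample)
print(parts)
# ['One.', ' Two three.', 'r five six.']
# ' Two three.' is 11 characters long (10 + the dot), and ' Fou' is lost
# because no dot is reachable within 10 characters of that position.

# By default '.' does not match a newline, so a match never crosses a line break.
two_lines = "Ab.\nCd."
print(re.findall(pattern, two_lines))             # ['Ab.', 'Cd.']
print(re.findall(pattern, two_lines, re.DOTALL))  # ['Ab.\nCd.']
```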

You can also put the delimiter and the maximum length in variables:

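import re

num_max = 250                # maximum number of characters before the delimiter
delimiter = re.escape('.')   # escape so the delimiter is matched literally

# Doubled braces ({{ }}) produce literal braces inside the f-string.
output = re.findall(fr".{{1,{num_max}}}(?:{delimiter}|$)", data)
print(output)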

Output:

[
    'Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique.',
    ' Consulatu cotidieque ex sea, nam no duis prompta expetendis.',
    'Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros. Expetenda torquatos assueverit est ex, te reque voluptatibus signiferumque has.'
]
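
If the hard limit and the line breaks matter, one alternative to the regex is to pack the pieces greedily in plain Python. This is a minimal sketch, not the approach used in the answer above; split_text and its parameters are hypothetical names. It assumes a non-empty separator, keeps each separator attached to the piece before it, and raises an error if a single piece cannot fit. Because each part is filled as far as it can be before a new one starts, the number of parts should be the smallest possible when cuts are made only after a separator.

```python
def split_text(text, max_len, sep="."):
    """Greedily pack sep-terminated pieces into parts of at most max_len characters."""
    # Cut after every separator so each piece keeps its trailing separator.
    pieces = [p + sep for p in text.split(sep)]
    # The last piece gained a separator it never had; remove it again.
    pieces[-1] = pieces[-1][:-len(sep)]

    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > max_len:
            raise ValueError(f"a single piece is longer than {max_len} characters")
        if len(current) + len(piece) <= max_len:
            # The piece still fits into the part being built.
            current += piece
        else:
            # Close the current part and start a new one with this piece.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Made-up sample text with a small limit, for illustration only.
sample = "One. Two three. Four five six."
print(split_text(sample, 16))
# ['One. Two three.', ' Four five six.']
```

Parts keep any leading whitespace or line breaks, so joining them gives back the original text; call .strip() on each part if the API does not need them.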