Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How would you go about finding longest string per row in a data frame?

I am writing a piece of code which allows me to open a CSV file and remove nan rows and also find strings that are too long in the data frame. I want the program to say what row the length of data exceeds the 30-character limit and give you an option to exit or skip.

I previously had it set up so it would go by columns instead, however im finding it difficult to locate the string when its set up like this.

for column in df:
        print(column,"->", df[column].astype(str).str.len().max())
        if df[column].astype(str).str.len().max() > 30 and column != ('Column 17'):
            print ("ERROR: Length of data exceeds 30 character limit")
            abill=int(input("1.Continue through file.\n2.Exit\n"))
            if abill==1:
                continue
            else:
                sys.exit()
        else:
            continue

This is my code at the moment.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

I would recommend not to use a loop, but rather to vectorize.

So, you want to identify the strings longer than a threshold, except for excluded columns?

Assuming this example:

        col1        col2              col3
0        abc           A  this_is_excluded
1  defghijkl       BCDEF          excluded
2       mnop  GHIJKLMNOP          excluded

If you want to mask the long strings:

exclude = ['col3']  # do not consider the columns in this list
threshold = 9       # consider strings longer or equal to threshold

mask = (df.drop(columns=exclude, errors='ignore')
          .apply(lambda s: s.str.len().ge(threshold))
          .reindex(columns=df.columns, fill_value=False)
       )

out = df.mask(mask, '') # mask with empty string

Output:

   col1   col2              col3
0   abc      A  this_is_excluded
1        BCDEF          excluded
2  mnop                 excluded

If you want to drop the rows with long strings:

exclude = ['col3']
threshold = 9

mask = (df.drop(columns=exclude, errors='ignore')
          .apply(lambda s: s.str.len().ge(threshold))
        )

out = df.loc[~mask.any(axis=1)]

Output:

  col1 col2              col3
0  abc    A  this_is_excluded

If you want to drop the columns with at least one too long string:

exclude = ['col3']
threshold = 9

mask = (df.drop(columns=exclude, errors='ignore')
          .agg(lambda s: s.str.len().ge(threshold).any())
       )

out = df.drop(columns=mask[mask].index)

Output:

               col3
0  this_is_excluded
1          excluded
2          excluded
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading