pandas: split a set column into duplicated rows if the set is longer than n elements

I have a dataframe that looks like this:

index      key                                   set_col          data
    0     "a1"                                ("a", "b")     "a1_data"   
    1     "a2"                      ("j", "k", "l", "m")     "a2_data"
    2     "b1"       ("z", "y", "x", "w", "v", "u", "t")     "b1_data"

I need to split set_col whenever it holds more than 3 elements, putting each chunk of at most 3 elements into its own row that duplicates the other columns, resulting in this df:

index      key                                   set_col          data
    0     "a1"                                ("a", "b")     "a1_data"   
    1     "a2"                           ("j", "k", "l")     "a2_data"
    2     "a2"                                     ("m")     "a2_data"
    3     "b1"                           ("z", "y", "x")     "b1_data"
    4     "b1"                           ("w", "v", "u")     "b1_data"
    5     "b1"                                     ("t")     "b1_data"

I have read other answers using explode, replace or assign, like this or this, but neither handles splitting lists or sets into chunks of a given length and duplicating the rows.

In this answer I found the following code:

def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n))

I tried to apply it to the column like this:

df['split_set_col'] = df['set_col'].apply(split(df['set_col'], 3))

But I get the error:

pandas.errors.SpecificationError: nested renamer is not supported

Solution:

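Your call to apply is not right. This line calls split immediately on the whole column and passes its return value (a generator) to apply, instead of passing the function itself: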

df['set_col'].apply(split(df['set_col'], 3))

Replace with:

df['set_col'].apply(split, n=3)  # note the n=3 as named argument
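
Before relying on the corrected call, it may help to check what the function from the question returns for a single value. The sketch below runs it in plain Python on two of the tuples from the question (the variable names parts_short and parts_long are only illustrative):

```python
def split(a, n):
    # Function from the question, unchanged: n is the number of parts
    k, m = divmod(len(a), n)
    return (a[i*k + min(i, m):(i+1)*k + min(i+1, m)] for i in range(n))

# The function returns a generator, so materialize it to inspect the parts
parts_short = list(split(("a", "b"), 3))
parts_long = list(split(("j", "k", "l", "m"), 3))

print(parts_short)  # [('a',), ('b',), ()]
print(parts_long)   # [('j', 'k'), ('l',), ('m',)]
```

In both cases the result always has exactly 3 parts, which is not the "at most 3 elements per row" layout asked for.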

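The function also does not do what you need: it divides a sequence into n parts of nearly equal size rather than into chunks of at most n elements. Use np.array_split with explicit split positions instead: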

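import numpy as np

def split(a, n):
    # Split positions at every multiple of n; for len(a) == 7 and n == 3 these are [3, 6]
    indices = np.arange(0, len(a), n)[1:]
    # Returns a list of arrays, each holding at most n elements
    return np.array_split(a, indices)

df['split_set_col'] = df['set_col'].apply(split, n=3)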

Output:

>>> df.explode('split_set_col', ignore_index=True)
    key                set_col       data split_set_col
0  "a1"                 (a, b)  "a1_data"        [a, b]
1  "a2"           (j, k, l, m)  "a2_data"     [j, k, l]
2  "a2"           (j, k, l, m)  "a2_data"           [m]
3  "b1"  (z, y, x, w, v, u, t)  "b1_data"     [z, y, x]
4  "b1"  (z, y, x, w, v, u, t)  "b1_data"     [w, v, u]
5  "b1"  (z, y, x, w, v, u, t)  "b1_data"           [t]
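
If the goal is the exact frame from the question, with set_col itself replaced by its chunks and no helper column, one possible sketch skips NumPy and keeps each chunk as a tuple. The helper name chunk and its size parameter are illustrative and not part of the original answer, and the sample frame is rebuilt here from the question's table:

```python
import pandas as pd

def chunk(seq, size):
    """Split seq into consecutive tuples of at most `size` elements."""
    seq = tuple(seq)  # assumes an ordered input; a real set has no defined order
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# Sample data copied from the question
df = pd.DataFrame({
    "key": ["a1", "a2", "b1"],
    "set_col": [
        ("a", "b"),
        ("j", "k", "l", "m"),
        ("z", "y", "x", "w", "v", "u", "t"),
    ],
    "data": ["a1_data", "a2_data", "b1_data"],
})

# Replace each value with its list of chunks, then give every chunk its own row
result = (
    df.assign(set_col=df["set_col"].apply(chunk, size=3))
      .explode("set_col", ignore_index=True)
)
print(result)
```

Because explode only unpacks one level, each row ends up holding one tuple of at most 3 elements, and key and data are repeated for every chunk.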