pandas split set column in a duplicated row if set is bigger than len(x)

February 9, 2023

I have a dataframe that looks like this:

index      key                                   set_col          data
    0     "a1"                                ("a", "b")     "a1_data"   
    1     "a2"                      ("j", "k", "l", "m")     "a2_data"
    2     "b1"       ("z", "y", "x", "w", "v", "u", "t")     "b1_data"

I need to split set_col whenever it holds more than 3 elements, putting each chunk of at most 3 elements on its own duplicated row with the same key and data, resulting in this df:

index      key                                   set_col          data
    0     "a1"                                ("a", "b")     "a1_data"   
    1     "a2"                           ("j", "k", "l")     "a2_data"
    2     "a2"                                     ("m")     "a2_data"
    3     "b1"                           ("z", "y", "x")     "b1_data"
    4     "b1"                           ("w", "v", "u")     "b1_data"
    5     "b1"                                     ("t")     "b1_data"

I have read other answers using explode, replace or assign, but none of them handles splitting lists or sets into chunks of a fixed length and duplicating the rows.

In another answer I found the following code:

def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n))

I tried to apply it to the column like this:

df['split_set_col'] = df['set_col'].apply(split(df['set_col'], 3))

But I get this error:

pandas.errors.SpecificationError: nested renamer is not supported

>Solution :

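Your function call is not right. apply expects the function itself, but this line calls split immediately on the whole column and passes the result to apply: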

df['set_col'].apply(split(df['set_col'], 3))

Replace with:

df['set_col'].apply(split, n=3)  # note the n=3 as named argument
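
To see the corrected call in action, here is a small sketch. It uses the same slicing logic as the helper from the question, rewritten to return a list instead of a generator so the result is easy to inspect; the Series `s` is made-up sample data.

```python
import pandas as pd

def split(a, n):
    # Same logic as the helper in the question, but returning a list
    k, m = divmod(len(a), n)
    return [a[i*k + min(i, m):(i + 1)*k + min(i + 1, m)] for i in range(n)]

# Made-up sample data: tuples, as displayed in the question
s = pd.Series([("a", "b"), ("j", "k", "l", "m")])

# Pass the function itself; apply forwards the keyword argument n=3 to it
parts = s.apply(split, n=3)

print(parts[0])  # [('a',), ('b',), ()]
print(parts[1])  # [('j', 'k'), ('l',), ('m',)]
```

The call now runs, but each tuple is always cut into exactly 3 pieces, including an empty one for the short tuple.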

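The function itself is also not what you need: it splits a sequence into n roughly equal parts, not into chunks of at most n elements. Use np.array_split with explicit split indices instead: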

import numpy as np

def split(a, n):
    # Cut at indices n, 2n, ... so every chunk holds at most n elements
    return np.array_split(a, np.arange(0, len(a), n)[1:])

df['split_set_col'] = df['set_col'].apply(split, n=3)

Output after exploding the new column:

>>> df.explode('split_set_col', ignore_index=True)
    key                set_col       data split_set_col
0  "a1"                 (a, b)  "a1_data"        [a, b]
1  "a2"           (j, k, l, m)  "a2_data"     [j, k, l]
2  "a2"           (j, k, l, m)  "a2_data"           [m]
3  "b1"  (z, y, x, w, v, u, t)  "b1_data"     [z, y, x]
4  "b1"  (z, y, x, w, v, u, t)  "b1_data"     [w, v, u]
5  "b1"  (z, y, x, w, v, u, t)  "b1_data"           [t]
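
If you want exactly the frame from the question (set_col itself replaced, chunks kept as tuples rather than NumPy arrays), a plain-Python variant may be simpler. This is only a sketch: `chunk` is a hypothetical helper name, and it assumes the values are tuples as displayed. Real Python sets cannot be sliced, so they would need converting first, e.g. with `tuple(sorted(x))`.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a1", "a2", "b1"],
    "set_col": [("a", "b"),
                ("j", "k", "l", "m"),
                ("z", "y", "x", "w", "v", "u", "t")],
    "data": ["a1_data", "a2_data", "b1_data"],
})

def chunk(seq, n):
    # Hypothetical helper: consecutive slices of at most n elements
    return [seq[i:i + n] for i in range(0, len(seq), n)]

result = (
    df.assign(set_col=df["set_col"].apply(chunk, n=3))  # list of tuples per row
      .explode("set_col", ignore_index=True)            # one row per chunk
)
print(result)
```

explode only unpacks the outer list, so each row ends up holding one tuple of up to 3 elements, while key and data are repeated.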