I have a dataframe that looks like this:
index  key   set_col                              data
0      "a1"  ("a", "b")                           "a1_data"
1      "a2"  ("j", "k", "l", "m")                 "a2_data"
2      "b1"  ("z", "y", "x", "w", "v", "u", "t")  "b1_data"
I need to split set_col whenever it holds more than 3 elements and put each chunk in a duplicated row with the same data, resulting in this df:
index  key   set_col          data
0      "a1"  ("a", "b")       "a1_data"
1      "a2"  ("j", "k", "l")  "a2_data"
2      "a2"  ("m")            "a2_data"
3      "b1"  ("z", "y", "x")  "b1_data"
4      "b1"  ("w", "v", "u")  "b1_data"
5      "b1"  ("t")            "b1_data"
I have read other answers using explode, replace, or assign, like this or this, but none of them handles splitting lists or sets to a fixed length and duplicating the rows.
On this answer I found the following code:
def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n))
I tried to apply it to the column like this:
df['split_set_col'] = df['set_col'].apply(split(df['set_col'], 3))
But I get the error:
pandas.errors.SpecificationError: nested renamer is not supported
>Solution :
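Your function call is not right. apply expects the function itself, but here split is called immediately on the whole column and its result is passed to apply: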
df['set_col'].apply(split(df['set_col'], 3))
Replace with:
df['set_col'].apply(split, n=3) # note the n=3 as named argument
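The function also does not do what you want: it splits a into n parts of roughly equal size instead of into chunks of at most n elements. Use np.array_split with explicit split indices instead: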
import numpy as np

def split(a, n):
    # split points at n, 2n, ... give chunks of at most n elements
    return np.array_split(a, np.arange(0, len(a), n)[1:])

df['split_set_col'] = df['set_col'].apply(split, n=3)
Output:
>>> df.explode('split_set_col', ignore_index=True)
   key   set_col                data       split_set_col
0  "a1"  (a, b)                 "a1_data"  [a, b]
1  "a2"  (j, k, l, m)           "a2_data"  [j, k, l]
2  "a2"  (j, k, l, m)           "a2_data"  [m]
3  "b1"  (z, y, x, w, v, u, t)  "b1_data"  [z, y, x]
4  "b1"  (z, y, x, w, v, u, t)  "b1_data"  [w, v, u]
5  "b1"  (z, y, x, w, v, u, t)  "b1_data"  [t]
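If you want the chunks back in set_col itself as tuples, as in the desired frame in the question, a plain-Python slicing helper is one possible alternative to np.array_split. This is only a sketch: the helper name chunk is made up for illustration, and the sample frame is rebuilt from the question on the assumption that set_col holds tuples. With real Python sets the element order inside each chunk would not be guaranteed.

```python
import pandas as pd

def chunk(seq, n):
    """Return consecutive tuples of at most n items taken from seq."""
    seq = tuple(seq)  # also accepts lists or sets (set order is arbitrary)
    return [seq[i:i + n] for i in range(0, len(seq), n)]

# Sample data rebuilt from the question (assumed to be tuples).
df = pd.DataFrame({
    "key": ["a1", "a2", "b1"],
    "set_col": [
        ("a", "b"),
        ("j", "k", "l", "m"),
        ("z", "y", "x", "w", "v", "u", "t"),
    ],
    "data": ["a1_data", "a2_data", "b1_data"],
})

# Replace set_col by its list of chunks, then emit one row per chunk.
result = (
    df.assign(set_col=df["set_col"].apply(chunk, n=3))
      .explode("set_col", ignore_index=True)
)
print(result)
```

Because explode only unpacks the outer list, each row keeps one tuple of up to three elements, and key and data are repeated for every chunk.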