How to pick slices of groups and complete some if necessary?

December 23, 2023

My input is a dataframe :

df = pd.DataFrame({'mycol': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'C', 'A', 'A']})

print(df)

  mycol
0     A
1     B
2     A
3     B
4     B
5     C
6     A
7     C
8     A
9     A

A appear 5 times, B appear 3 times and C appear 2 times.

Lets say we need to slice the groups to take only 3 items of each group. Then index 8 and 9 should be removed because group A is exceeded by 2 values and index 10 should be created to complete group C because it lacks 1 value.

PS : the order of the original rows must be preserved and the index is important for me as I need to track the positions (start and end) for each slice.

For that I made the code below but some indexes are messed up and also there is some undesired nan rows.

def func(group):
    df2 = pd.DataFrame(None, index=[max(group.index)+1], columns=['mycol'])
    result = pd.concat([group.iloc[:3], df2])
    result['newcol'] = [group.name + str(i+1) for i in range(len(result))] 
    return result
    
final = df.groupby('mycol').apply(func).droplevel(0)
print(final)

   mycol newcol
0      A     A1
2      A     A2
6      A     A3
10   NaN     A4
1      B     B1
3      B     B2
4      B     B3
5    NaN     B4
5      C     C1
7      C     C2
8    NaN     C3

Do you guys know how to fix my code ? Or do you have another suggestions ?

My expected output is this :

   mycol  newcol
0      A      A1
1      B      B1
2      A      A2
3      B      B2
4      B      B3
5      C      C1
6      A      A3
7      C      C2
10   NaN      C3

>Solution :

You can use a groupby.cumcount, pivot and stack:

N = 3

c = df.groupby('mycol').cumcount().add(1)

out= (df.assign(newcol=df['mycol']+c.astype(str), c=c)
        .pivot(index='mycol', columns='c', values='newcol')
        .iloc[:, :N].stack(dropna=False)
        .reset_index(0, name='newcol')
      )

Output:

  mycol newcol
c             
1     A     A1
2     A     A2
3     A     A3
1     B     B1
2     B     B2
3     B     B3
1     C     C1
2     C     C2
3     C    NaN

Or with a custom groupby.apply:

from itertools import count

N = 3
c = count(len(df))

out = (df
   .groupby('mycol', group_keys=False)
   .apply(lambda g: pd.DataFrame(
       {'mycol': [g.name]*min(N, len(g))+[float('nan')]*(N-len(g)),
        'newcol': [f'{g.name}{x+1}' for x in range(N)],
       }, index=g.index[:N].tolist()+[next(c) for _ in range(N-len(g))])
         )
)

Output:

   mycol newcol
0      A     A1
2      A     A2
6      A     A3
1      B     B1
3      B     B2
4      B     B3
5      C     C1
7      C     C2
10   NaN     C3