Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to pick slices of groups and complete some if necessary?

My input is a dataframe :

df = pd.DataFrame({'mycol': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'C', 'A', 'A']})

print(df)

  mycol
0     A
1     B
2     A
3     B
4     B
5     C
6     A
7     C
8     A
9     A

A appear 5 times, B appear 3 times and C appear 2 times.

Lets say we need to slice the groups to take only 3 items of each group. Then index 8 and 9 should be removed because group A is exceeded by 2 values and index 10 should be created to complete group C because it lacks 1 value.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

PS : the order of the original rows must be preserved and the index is important for me as I need to track the positions (start and end) for each slice.

For that I made the code below but some indexes are messed up and also there is some undesired nan rows.

def func(group):
    df2 = pd.DataFrame(None, index=[max(group.index)+1], columns=['mycol'])
    result = pd.concat([group.iloc[:3], df2])
    result['newcol'] = [group.name + str(i+1) for i in range(len(result))] 
    return result
    
final = df.groupby('mycol').apply(func).droplevel(0)
print(final)

   mycol newcol
0      A     A1
2      A     A2
6      A     A3
10   NaN     A4
1      B     B1
3      B     B2
4      B     B3
5    NaN     B4
5      C     C1
7      C     C2
8    NaN     C3

Do you guys know how to fix my code ? Or do you have another suggestions ?

My expected output is this :

   mycol  newcol
0      A      A1
1      B      B1
2      A      A2
3      B      B2
4      B      B3
5      C      C1
6      A      A3
7      C      C2
10   NaN      C3

>Solution :

You can use a groupby.cumcount, pivot and stack:

N = 3

c = df.groupby('mycol').cumcount().add(1)

out= (df.assign(newcol=df['mycol']+c.astype(str), c=c)
        .pivot(index='mycol', columns='c', values='newcol')
        .iloc[:, :N].stack(dropna=False)
        .reset_index(0, name='newcol')
      )

Output:

  mycol newcol
c             
1     A     A1
2     A     A2
3     A     A3
1     B     B1
2     B     B2
3     B     B3
1     C     C1
2     C     C2
3     C    NaN

Or with a custom groupby.apply:

from itertools import count

N = 3
c = count(len(df))

out = (df
   .groupby('mycol', group_keys=False)
   .apply(lambda g: pd.DataFrame(
       {'mycol': [g.name]*min(N, len(g))+[float('nan')]*(N-len(g)),
        'newcol': [f'{g.name}{x+1}' for x in range(N)],
       }, index=g.index[:N].tolist()+[next(c) for _ in range(N-len(g))])
         )
)

Output:

   mycol newcol
0      A     A1
2      A     A2
6      A     A3
1      B     B1
3      B     B2
4      B     B3
5      C     C1
7      C     C2
10   NaN     C3
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading