Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas groupby two columns and expand the third

I have a Pandas dataframe with the following structure:

A       B       C
a       b       1
a       b       2
a       b       3
c       d       7
c       d       8
c       d       5
c       d       6
c       d       3
e       b       4
e       b       3
e       b       2
e       b       1

And I will like to transform it into this:

A       B       C1      C2      C3      C4      C5
a       b       1       2       3       NAN     NAN
c       d       7       8       5       6       3
e       b       4       3       2       1       NAN

In other words, something like groupby A and B and expand C into different columns.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Knowing that the length of each group is different.

C is already ordered

Shorter groups can have NAN or NULL values (empty), it does not matter.

SOLUTION:

The accepted solution but avoiding the deprecation warning:

N = 3
g  = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
        .add_prefix('C_')
        .astype('Int64')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())

Thanks!

>Solution :

Use GroupBy.cumcount with add 1 for starting new columns names by 1, then pass to DataFrame.pivot, add DataFrame.add_prefix for rename columns names with remove columns name by DataFrame.rename_axis and convert MultiIndex by DataFrame.reset_index to columns A,B:

df['g'] = df.groupby(['A','B']).cumcount().add(1)

df = df.pivot(['A','B'], 'g', 'C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
   A  B   C1   C2   C3   C4   C5
0  a  b  1.0  2.0  3.0  NaN  NaN
1  c  d  7.0  8.0  5.0  6.0  3.0
2  e  b  4.0  3.0  2.0  1.0  NaN

If need integers add DataFrame.astype with Int64:

df['g'] = df.groupby(['A','B']).cumcount().add(1)

df = (df.pivot(['A','B'], 'g', 'C')
        .add_prefix('C')
        .astype('Int64')
        .rename_axis(columns=None)
        .reset_index())
print (df)
   A  B  C1  C2  C3    C4    C5
0  a  b   1   2   3  <NA>  <NA>
1  c  d   7   8   5     6     3
2  e  b   4   3   2     1  <NA>

EDIT: If need maximum N new columns it means A,B are duplicated, need helper groups g1, g2 with integer and modulo division and add to helper new level in index:

N = 4
g  = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N) + 1
df = (df.pivot(['A','B','g1'], 'g2', 'C')
        .add_prefix('C')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())
print (df)
   A  B   C1   C2   C3   C4
0  a  b  1.0  2.0  3.0  NaN
1  c  d  7.0  8.0  5.0  6.0
2  c  d  3.0  NaN  NaN  NaN
3  e  b  4.0  3.0  2.0  1.0 
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading