Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Find how many switches happen at the columns of a pandas column by group

I have the following dataframe

import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3],
                    'time': [1,2,3,4,1,2,3,4,1,2,3,4],
                    'cat': ['a', 'a', 'b', 'c', 
                            'a', 'b', 'b', 'b', 
                            'c', 'b', 'c', 'b']
})

I want to calculate, how many times the cat changes from one time to the next, by id

So:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  • for id == 1, cat changes from a to a 1 time, from a to b 1 time and from b to c 1 times
  • for id == 2, cat changes from a to b 1 time, and from b to b 2 times
  • for id == 3, cat changes from c to b 2 times, and from b to c 1 time

Any ideas how I could compute that ?

Ideally the output should look something like:

pd.DataFrame({'id': [1,2,3],
              'a to a': [1,0,0],
              'a to b': [1,1,0],
              'a to c': [0,0,0],
              'b to a': [0,0,0],
'b to b': [0,2,0],
'b to c': [1,0,1],
'c to a': [0,0,0],
'c to b': [0,0,2],
'c to c': [0,0,0]
    })

>Solution :

We can use pandas.Series.shift to shift the values, then pandas.DataFrame.groupby to group the distinct values, and count.

# Have the next category be stored in next_cat
foo['next_cat'] = foo['cat'].shift(-1)

# If the use condition is to ensure that each ID's next category is
# of the same ID, we use this instead:
foo['next_cat'] = foo.groupby(['id'])['cat'].shift(-1)

# Only groupby for values that are not NaN in next_cat (ignore the last row)
# Then group by category, next category, and the ID, and count these values.
foo.loc[foo['next_cat'] == foo['next_cat']].groupby(['cat', 'next_cat', 'id']).count()

This outputs:

                 time
cat next_cat id
a   a        1      1
    b        1      1
             2      1
b   b        2      2
    c        1      1
             2      1
             3      1
c   a        1      1
    b        3      2

We can then drop the index and pivot in order to achieve your ideal shape using pandas.DataFrame.pivot_table:

# This time around we're storing the data into foo.
foo = foo.loc[foo['next_cat'] == foo['next_cat']].groupby(['cat', 'next_cat', 'id']).count()

# Reset the index so we can pivot using these columns.
foo = foo.reset_index()
foo.pivot_table(columns=foo['cat'] + " to " + foo['next_cat'], values=['time'], index=['id']).fillna(0).astype(int)

This outputs:

     time
   a to a a to b b to b b to c c to a c to b
id
1       1      1      0      1      1      0
2       0      1      2      1      0      0
3       0      0      0      1      0      2
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading