Applying pd.get_dummies to dataframe but alter output

I am using pd.get_dummies on the example dataframe below, and it works properly, but I want to see if anyone has an idea of how to alter the results. I’ll describe below:

Original DF

 ID      type
AA23      A 
AB24      B 
DJ44      B
KD33      C
KD33      A
BK89      B
BL92      B
BL92      C
IO89      A

df after applying: pd.get_dummies(df, columns = ['type'], prefix = 'type')

 ID      type_A    type_B    type_C
AA23       1         0          0 
AB24       0         1          0 
DJ44       0         1          0
KD33       0         0          1
KD33       1         0          0
BK89       0         1          0
BL92       0         1          0
BL92       0         0          1
IO89       1         0          0

What I’m looking for is similar, but for cases where an ID appears 2 or more times (i.e. KD33 or BL92), I want just one line per ID, with each associated type column marked with 1. For example, with ID = KD33, I want one line where ‘type_A’ and ‘type_C’ both have 1.

 ID      type_A    type_B    type_C
AA23       1         0          0 
AB24       0         1          0 
DJ44       0         1          0
KD33       1         0          1
BK89       0         1          0
BL92       0         1          1
IO89       1         0          0

Solution:

One option is to just do the whole thing with a .groupby():

In [36]: df.groupby(["ID", "type"]).agg(lambda x: 1).unstack().fillna(0).astype(int).add_prefix("type_")
Out[36]:
type  type_A  type_B  type_C
ID
AA23       1       0       0
AB24       0       1       0
BK89       0       1       0
BL92       0       1       1
DJ44       0       1       0
IO89       1       0       0
KD33       1       0       1
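
To try this outside an IPython session, here is a minimal self-contained sketch. It assumes the frame holds only the two columns shown in the question (ID and type), and it swaps the lambda aggregation for .size(), which counts rows per (ID, type) pair; that is a substitution for illustration, not the exact code above. The clip(upper=1) step is an extra safeguard so that a repeated (ID, type) pair would still show as 1 rather than 2.

```python
import pandas as pd

# Rebuild the example frame from the question (assumed to have only these two columns).
df = pd.DataFrame({
    "ID": ["AA23", "AB24", "DJ44", "KD33", "KD33", "BK89", "BL92", "BL92", "IO89"],
    "type": ["A", "B", "B", "C", "A", "B", "B", "C", "A"],
})

wide = (
    df.groupby(["ID", "type"])
      .size()                  # number of rows for each (ID, type) pair
      .unstack(fill_value=0)   # one column per type, 0 where the pair is absent
      .clip(upper=1)           # keep 0/1 indicators even if a pair repeats
      .add_prefix("type_")     # A, B, C -> type_A, type_B, type_C
)

print(wide)
```

Like the version above, this returns the IDs in sorted order with ID as the index.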

You can also just tack the .groupby on to the end of the get_dummies version:

In [37]: pd.get_dummies(df, columns = ['type'],prefix = 'type').groupby("ID").sum()
Out[37]:
      type_A  type_B  type_C
ID
AA23       1       0       0
AB24       0       1       0
BK89       0       1       0
BL92       0       1       1
DJ44       0       1       0
IO89       1       0       0
KD33       1       0       1
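
If it matters that the rows come back in the order the IDs first appear (as in the desired output in the question) and that ID stays a regular column, something along these lines may work. This is a sketch under a few assumptions of mine rather than part of the approach above: dtype=int forces integer dummies, sort=False keeps first-appearance order, and .max() is used in place of .sum() so that a duplicated (ID, type) pair would still yield 1.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "ID": ["AA23", "AB24", "DJ44", "KD33", "KD33", "BK89", "BL92", "BL92", "IO89"],
    "type": ["A", "B", "B", "C", "A", "B", "B", "C", "A"],
})

result = (
    pd.get_dummies(df, columns=["type"], prefix="type", dtype=int)
      .groupby("ID", sort=False)   # keep IDs in order of first appearance
      .max()                       # 1 if any row for that ID has the type
      .reset_index()               # turn ID back into a regular column
)

print(result)
```

With .sum() instead of .max(), the cells would hold counts, which is equivalent here only because no (ID, type) pair is repeated in the example.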

On this small example, the first version is slightly faster (about 1.3 ms versus 1.66 ms per loop), but its output needs more massaging to match the format of the second:

In [48]: %timeit df.groupby(["ID", "type"]).agg(lambda x: 1).unstack().fillna(0).astype(int).add_prefix("type_")
1.3 ms ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [49]: %timeit pd.get_dummies(df, columns = ['type'],prefix = 'type').groupby("ID").sum()
1.66 ms ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)