Applying pd.get_dummies to dataframe but alter output

November 24, 2021

I am using pd.get_dummies on this example dataframe below- and it’s working properly but I want to see if anyone has an idea of how to alter the results. I’ll describe below:

Original DF

 ID      type
AA23      A 
AB24      B 
DJ44      B
KD33      C
KD33      A
BK89      B
BL92      B
BL92      C
IO89      A

df after applying: pd.get_dummies(df, columns = [‘type’],prefix = ‘type’)

 ID      type_A    type_B    type_C
AA23       1         0          0 
AB24       0         1          0 
DJ44       0         1          0
KD33       0         0          1
KD33       1         0          0
BK89       0         1          0
BL92       0         1          0
BL92       0         0          1
IO89       0         0          0

What I’m looking to for is similar but, for cases where there are 2 or more ID’s (i.e. KD33 or BL92), I want just one line per ID and the associated type columns marked with 1. For example with ID = KD33, I want one line where ‘type_A’ and ‘type_C’ have 1.

 ID      type_A    type_B    type_C
AA23       1         0          0 
AB24       0         1          0 
DJ44       0         1          0
KD33       1         0          1
BK89       0         1          0
BL92       0         1          1
IO89       0         0          0

>Solution :

One option is to just do the whole thing with a .groupby():

In [36]: df.groupby(["ID", "type"]).agg(lambda x: 1).unstack().fillna(0).astype(int).add_prefix("type_")
Out[36]:
type  type_A  type_B  type_C
ID
AA23       1       0       0
AB24       0       1       0
BK89       0       1       0
BL92       0       1       1
DJ44       0       1       0
IO89       1       0       0
KD33       1       0       1

You can also just tack the .groupby on to the end of the get_dummies version:

In [37]: pd.get_dummies(df, columns = ['type'],prefix = 'type').groupby("ID").sum()
Out[37]:
      type_A  type_B  type_C
ID
AA23       1       0       0
AB24       0       1       0
BK89       0       1       0
BL92       0       1       1
DJ44       0       1       0
IO89       1       0       0
KD33       1       0       1

On this small example, the first version is slightly faster but needs more massaging to get the format the same:

In [48]: %timeit df.groupby(["ID", "type"]).agg(lambda x: 1).unstack().fillna(0).astype(int).add_prefix("type_")
1.3 ms ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [49]: %timeit pd.get_dummies(df, columns = ['type'],prefix = 'type').groupby("ID").sum()
1.66 ms ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)