I have a dataframe with 2 categorical columns (col1, col2).
col1 col2
0 A DE
1 A B
2 B BA
3 A A
4 C C
I want to map the unique string values to integers, for example (A:0, B:1, BA:2, C:3, DE:4)
col1 col2 ideal1 ideal2
0 A DE 0 4
1 A B 0 1
2 B BA 1 2
3 A A 0 0
4 C C 3 3
I am have tried to use factorize or category, but I am not getting the same unique value for both columns, as can be seen from ROW C:
Here is my code:
df = pd.DataFrame({'col1': ["A", "A", "B", "A" , "C"], 'col2': ["DE", "B", "BA", "A", "C"]})
#ideal map alphabetical: A:0, B:1, BA:2, C:3, DE:4
#ideal result
df["ideal1"] = [0, 0, 1,0, 3]
df["ideal2"] = [4,1,2,0,3]
#trial #1 --> C value 2 & 3 : not matching
df["cat1"] = df['col1'].astype("category").cat.codes
df["cat2"] = df['col2'].astype("category").cat.codes
#trial #2 --> C value 2 & 4 : not matching
df["fac1"] = pd.factorize(df["col1"])[0]
df["fac2"] = pd.factorize(df["col2"])[0]
print (df)
OUT:
col1 col2 ideal1 ideal2 cat1 cat2 fac1 fac2
0 A DE 0 4 0 4 0 0
1 A B 0 1 0 1 0 1
2 B BA 1 2 1 2 1 2
3 A A 0 0 0 0 0 3
4 C C 3 3 2 3 2 4
>Solution :
To get the same categories across columns you need to reshape to a single dimension first. Then use factorize and restore the original shape.
Here is an example using stack/unstack:
x = df.stack()
x[:] = x.factorize()[0]
df2 = x.unstack()
Output:
col1 col2
0 0 1
1 0 2
2 2 3
3 0 0
4 4 4
Joining to the original data:
x = df.stack()
x[:] = x.factorize()[0]
df2 = df.join(x.unstack().add_suffix('_cat'))
Output:
col1 col2 col1_cat col2_cat
0 A DE 0 1
1 A B 0 2
2 B BA 2 3
3 A A 0 0
4 C C 4 4
alphabetical order
If you really want alphabetical order, you could create you own custom mapping dictionary:
import numpy as np
cats = {k:v for v,k in enumerate(np.unique(df.values))}
df.replace(cats)
Output:
col1 col2
0 0 4
1 0 1
2 1 2
3 0 0
4 3 3