Fastest way to assign row to dataframe in pandas groupby loop

October 17, 2023

Ok so I have 2 dataframes:

df = pd.DataFrame({'A':['German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})
df = df.T
df.columns = df.iloc[0]
df = df.drop(df.index[0])

A   German Shepherd     Border Collie   Golden Retriever    Beagle  Daschund

df2 = pd.DataFrame({'ID':['A','A','A','B','C','C','C','C','C'],
                   'Breed':['German Shepherd','Beagle','Dashung','Border Collie',
                           'German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})

ID  Breed
0   A   German Shepherd
1   A   Beagle
2   A   Dashung
3   B   Border Collie
4   C   German Shepherd
5   C   Border Collie
6   C   Golden Retriever
7   C   Beagle
8   C   Daschund

I want to find which ID the dog breed is in in df2 and then update df if it is present for that ID:

dogs_grouped = df2.groupby('ID')
missing_dogs = []
vals = [np.nan for i in df.columns]
for group_name, df_group in dogs_grouped:
    print(f'Cluster: {group_name}')
    cluster_dogs = sorted(list(set(df_group['Breed'].to_list())))
    cluster_dogs = [i for i in cluster_dogs if i in all_dogs]
    weird_dogs = [i for i in cluster_dogs if i not in all_dogs]
    missing_dogs.append(weird_dogs)
    df = df.append(pd.Series(vals, index=df.columns, name=group_name))
    df.loc[group_name][cluster_dogs] = 1
df = df.fillna(0)

My code works, but it’s extremely slow for large datasets. I have a dataset with 500k rows I am iterating through and it’s taking hours to create a 4000 x 30,000 matrix.

A   German Shepherd     Border Collie   Golden Retriever    Beagle      Daschund
A        1                    0               0                1           0
B        0                    1               0                0           0
C        1                    1               1                1           1

There has to be a more pythonic/pandas way to approach this?

>Solution :

I think you just want pd.crosstab (If some values (columns) are missing, you can reindex the columns from the values you have in df1)

x = pd.crosstab(df2["ID"], df2["Breed"])
print(x)

Prints:

Breed  Beagle  Border Collie  Daschund  Dashung  German Shepherd  Golden Retriever
ID                                                                                
A           1              0         0        1                1                 0
B           0              1         0        0                0                 0
C           1              1         1        0                1                 1

With .reindex afterwards:

x = x.reindex(
    columns=[
        "Some New Breed",
        "German Shepherd",
        "Border Collie",
        "Golden Retriever",
        "Beagle",
        "Daschund",
    ],
    fill_value=0,
)
print(x)

Prints:

Breed  Some New Breed  German Shepherd  Border Collie  Golden Retriever  Beagle  Daschund
ID                                                                                       
A                   0                1              0                 0       1         0
B                   0                0              1                 0       0         0
C                   0                1              1                 1       1         1