Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Fastest way to assign row to dataframe in pandas groupby loop

Ok so I have 2 dataframes:

df = pd.DataFrame({'A':['German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})
df = df.T
df.columns = df.iloc[0]
df = df.drop(df.index[0])

A   German Shepherd     Border Collie   Golden Retriever    Beagle  Daschund

df2 = pd.DataFrame({'ID':['A','A','A','B','C','C','C','C','C'],
                   'Breed':['German Shepherd','Beagle','Dashung','Border Collie',
                           'German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})

ID  Breed
0   A   German Shepherd
1   A   Beagle
2   A   Dashung
3   B   Border Collie
4   C   German Shepherd
5   C   Border Collie
6   C   Golden Retriever
7   C   Beagle
8   C   Daschund

I want to find which ID the dog breed is in in df2 and then update df if it is present for that ID:

dogs_grouped = df2.groupby('ID')
missing_dogs = []
vals = [np.nan for i in df.columns]
for group_name, df_group in dogs_grouped:
    print(f'Cluster: {group_name}')
    cluster_dogs = sorted(list(set(df_group['Breed'].to_list())))
    cluster_dogs = [i for i in cluster_dogs if i in all_dogs]
    weird_dogs = [i for i in cluster_dogs if i not in all_dogs]
    missing_dogs.append(weird_dogs)
    df = df.append(pd.Series(vals, index=df.columns, name=group_name))
    df.loc[group_name][cluster_dogs] = 1
df = df.fillna(0)

My code works, but it’s extremely slow for large datasets. I have a dataset with 500k rows I am iterating through and it’s taking hours to create a 4000 x 30,000 matrix.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

A   German Shepherd     Border Collie   Golden Retriever    Beagle      Daschund
A        1                    0               0                1           0
B        0                    1               0                0           0
C        1                    1               1                1           1

There has to be a more pythonic/pandas way to approach this?

>Solution :

I think you just want pd.crosstab (If some values (columns) are missing, you can reindex the columns from the values you have in df1)

x = pd.crosstab(df2["ID"], df2["Breed"])
print(x)

Prints:

Breed  Beagle  Border Collie  Daschund  Dashung  German Shepherd  Golden Retriever
ID                                                                                
A           1              0         0        1                1                 0
B           0              1         0        0                0                 0
C           1              1         1        0                1                 1

With .reindex afterwards:

x = x.reindex(
    columns=[
        "Some New Breed",
        "German Shepherd",
        "Border Collie",
        "Golden Retriever",
        "Beagle",
        "Daschund",
    ],
    fill_value=0,
)
print(x)

Prints:

Breed  Some New Breed  German Shepherd  Border Collie  Golden Retriever  Beagle  Daschund
ID                                                                                       
A                   0                1              0                 0       1         0
B                   0                0              1                 0       0         0
C                   0                1              1                 1       1         1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading