Ok so I have 2 dataframes:
df = pd.DataFrame({'A':['German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})
df = df.T
df.columns = df.iloc[0]
df = df.drop(df.index[0])
A German Shepherd Border Collie Golden Retriever Beagle Daschund
df2 = pd.DataFrame({'ID':['A','A','A','B','C','C','C','C','C'],
'Breed':['German Shepherd','Beagle','Dashung','Border Collie',
'German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})
ID Breed
0 A German Shepherd
1 A Beagle
2 A Dashung
3 B Border Collie
4 C German Shepherd
5 C Border Collie
6 C Golden Retriever
7 C Beagle
8 C Daschund
I want to find which ID the dog breed is in in df2 and then update df if it is present for that ID:
dogs_grouped = df2.groupby('ID')
missing_dogs = []
vals = [np.nan for i in df.columns]
for group_name, df_group in dogs_grouped:
print(f'Cluster: {group_name}')
cluster_dogs = sorted(list(set(df_group['Breed'].to_list())))
cluster_dogs = [i for i in cluster_dogs if i in all_dogs]
weird_dogs = [i for i in cluster_dogs if i not in all_dogs]
missing_dogs.append(weird_dogs)
df = df.append(pd.Series(vals, index=df.columns, name=group_name))
df.loc[group_name][cluster_dogs] = 1
df = df.fillna(0)
My code works, but it’s extremely slow for large datasets. I have a dataset with 500k rows I am iterating through and it’s taking hours to create a 4000 x 30,000 matrix.
A German Shepherd Border Collie Golden Retriever Beagle Daschund
A 1 0 0 1 0
B 0 1 0 0 0
C 1 1 1 1 1
There has to be a more pythonic/pandas way to approach this?
>Solution :
I think you just want pd.crosstab (If some values (columns) are missing, you can reindex the columns from the values you have in df1)
x = pd.crosstab(df2["ID"], df2["Breed"])
print(x)
Prints:
Breed Beagle Border Collie Daschund Dashung German Shepherd Golden Retriever
ID
A 1 0 0 1 1 0
B 0 1 0 0 0 0
C 1 1 1 0 1 1
With .reindex afterwards:
x = x.reindex(
columns=[
"Some New Breed",
"German Shepherd",
"Border Collie",
"Golden Retriever",
"Beagle",
"Daschund",
],
fill_value=0,
)
print(x)
Prints:
Breed Some New Breed German Shepherd Border Collie Golden Retriever Beagle Daschund
ID
A 0 1 0 0 1 0
B 0 0 1 0 0 0
C 0 1 1 1 1 1