I have a dataframe that looks like this:
author|string
abc|hi
abc|yo
def|whats
ghi|up
ghi|dog
How can I select only one row per author? I’m at a loss.
I want to do something like this:
df.loc[unique authors].sample(n=1000)
and get something like this:
author|string
abc|hi
def|whats
ghi|up
I was thinking of converting the author column to categories, but I don’t know where to go from there.
I could just do something like this but it seems stupid.
author_list = df['author'].unique().tolist()
indexes = []
for author in author_list:
    # index label of the first row for this author
    indexes.append(df.loc[df['author'] == author].index[0])
df.loc[indexes].sample(n=1000)
>Solution :
You can use drop_duplicates on the author column; by default it keeps the first row for each author:
out = df.drop_duplicates('author')
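
Here is a minimal, self-contained sketch of that approach using the sample data from the question. The `groupby(...).sample(...)` variant and the capped `n` are additions for illustration, not part of the original answer; the variable names `random_one`, `n`, and `sampled` are hypothetical, and `groupby(...).sample` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "author": ["abc", "abc", "def", "ghi", "ghi"],
    "string": ["hi", "yo", "whats", "up", "dog"],
})

# Keep the first row seen for each author (keep="first" is the default).
out = df.drop_duplicates("author")

# Alternative (assumption: pandas >= 1.1): one random row per author.
random_one = df.groupby("author").sample(n=1, random_state=0)

# DataFrame.sample raises ValueError if n exceeds the number of rows
# (without replace=True), so cap n at the number of unique authors.
n = min(1000, len(out))
sampled = out.sample(n=n, random_state=0)
```

If you want the last row per author instead of the first, `drop_duplicates('author', keep='last')` should do it.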