Groupby merge values from one dataframe to the other

December 14, 2023

I have a dataframe of "words", and I would like to check which words were included, by each person in this dataframe by comparison to a word list key.

Example data:

df = pd.DataFrame({
    'person': [1, 1, 1, 2, 3, 4, 4, 4, 4],
    'word': ['apple', 'orange', 'pear', 'apple', 'grape', 'orange', 'apple', 'pear', 'berry'],
    'count': [1, 1, 1, 1, 1, 1, 1, 1, 1]
})

word_list = ['apple', 'orange', 'pear', 'berry', 'grape']
word_df = pd.DataFrame({'word': word_list})

For example, person 1 only included apple, orange, and pear and did not include berry and grape. I know how to pd.merge values but have not been able to successfully map the values of word_df["word"] onto something like df.groupby(["person"] so that I can check what each person chose to include.

Desired output:

result_df = pd.DataFrame({
    'person': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4],
    'word': ['apple', 'orange', 'pear', 'berry', 'grape', 
             'apple', 'orange', 'pear', 'berry', 'grape', 
             'apple', 'orange', 'pear', 'berry', 'grape', 
             'orange', 'apple', 'pear', 'berry', 'grape'],
    'count': [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
})

>Solution :

If performance isn’t really a concern this looks like it does the trick.

Basically create an intermediate dataframe with all possible person and fruit combinations via cross join. Then use that as the driver, "count" from the outer table will be null if that combination was not found in the original dataframe so coalescing to 0 gets 0’s for those.

df = pd.DataFrame({
    'person': [1, 1, 1, 2, 3, 4, 4, 4, 4],
    'word': ['apple', 'orange', 'pear', 'apple', 'grape', 'orange', 'apple', 'pear', 'berry'],
    'count': [1, 1, 1, 1, 1, 1, 1, 1, 1]
})

word_list = ['apple', 'orange', 'pear', 'berry', 'grape']
all_person_word_combos = word_df.merge(df['person'].drop_duplicates(), how='cross')
final_result = (
  all_person_word_combos.
  merge(df, 
        how='left', 
        on=['word', 'person']).
  fillna(0).
  sort_values(['person','word']))
final_result.display()