I have a conceptual problem.
I'm working through the pandas course on Kaggle to learn and practice a new skill.
I tried to solve an exercise, but I don't understand why the result
is different from what I expected.
Question:
"There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)"
My answer:
tropical_count= reviews["description"].str.count(pat ="tropical").sum()
fruity_count= reviews["description"].str.count(pat ="fruity").sum()
descriptor_counts = pd.Series({"tropical":tropical_count,"fruity":fruity_count},index=["tropical","fruity"])
Kaggle's answer:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
Both versions run fine, but the results are different. Does anyone know why?
My result:
tropical 3703
fruity 9259
dtype: int64
Kaggle's result:
tropical 3607
fruity 9090
dtype: int64
>Solution :
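This output is expected. str.count counts every occurrence of the substring in each description, whereas the in operator only tests whether the substring is present, so the result for each row is just True or False. When you sum booleans, each True counts as 1 and each False as 0, so a description that mentions the word several times adds 1 to Kaggle's total but more than 1 to yours. That is why the totals differ.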
Sample:
reviews = pd.DataFrame(["Ttropical are tropical so fruity words you can",
"fruity ",
"fruity fruity",
"anythi"], columns=['description'])
tropical_count= reviews["description"].str.count(pat ="tropical")
fruity_count= reviews["description"].str.count(pat ="fruity")
print (tropical_count)
0 2
1 0
2 0
3 0
Name: description, dtype: int64
print (fruity_count)
0 1
1 1
2 2
3 0
Name: description, dtype: int64
n_trop = reviews.description.map(lambda desc: "tropical" in desc)
n_fruity = reviews.description.map(lambda desc: "fruity" in desc)
print (n_trop)
0 True
1 False
2 False
3 False
Name: description, dtype: bool
print (n_fruity)
0 True
1 True
2 True
3 False
Name: description, dtype: bool
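To see the two totals side by side, here is a minimal sketch on the same sample data. The names words, occurrences and rows_with_word are only illustrative, and str.contains with regex=False is one possible vectorized stand-in for the in test, not necessarily what Kaggle uses internally:

```python
import pandas as pd

# Same sample data as above.
reviews = pd.DataFrame(["Ttropical are tropical so fruity words you can",
                        "fruity ",
                        "fruity fruity",
                        "anythi"], columns=['description'])

words = ["tropical", "fruity"]  # illustrative helper list

# Total number of occurrences across all rows (what str.count(...).sum() gives).
occurrences = pd.Series(
    {w: reviews["description"].str.count(w).sum() for w in words}
)

# Number of rows containing the word at least once (what `in` + sum gives).
rows_with_word = pd.Series(
    {w: reviews["description"].str.contains(w, regex=False).sum() for w in words}
)

print(occurrences)
print(rows_with_word)
```

On this sample the first Series holds 2 and 4, while the second holds 1 and 3: the first row mentions "tropical" twice and the third row mentions "fruity" twice, but each row is counted only once by the presence test. If you want to keep str.count and still match Kaggle's numbers, comparing the per-row counts with > 0 before summing should give the same result.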