What I have:
| target | var1 | var2 |
|---|---|---|
| 1 | jack, jones, phil | en-us |
| 0 | don, sam, bob | vn-en |
| 1 | jones, alex, sam | en-us |
What I want:
| var1 | target | no_target | var2 |
|---|---|---|---|
| jones | 2 | 0 | en-us |
target column is how much ‘jones’ appears when target is 1
so far I have tried to use DictVetorize, and it helped me to count number of times target equal to 1 for jones
but I done know how to get var2 column
>Solution :
IIUC, you can use:
df['target'].mul(df['var1'].str.count('jones')).sum()
output: 2
For a more generic method to get all names:
out = (df
.assign(var1=df['var1'].str.split(',\s*').where(df['target'].eq(1)))
.explode('var1')
.groupby(['var1', 'var2'], as_index=False)
.sum()
)
output:
var1 var2 target
0 alex en-us 1
1 jack en-us 1
2 jones en-us 2
3 phil en-us 1
4 sam en-us 1
counting target/no_target
(df
.assign(var1=df['var1'].str.split(',\s*'),
target=np.where(df['target'].eq(1), 'target', 'no_target'),
value=1
)
.explode('var1')
.groupby(['target', 'var1', 'var2'], as_index=False)
.sum()
.pivot_table(index=['var1', 'var2'], columns='target',
values='value', fill_value=0)
.reset_index().rename_axis(columns=None)
)
output:
var1 var2 no_target target
0 alex en-us 0 1
1 bob vn-en 1 0
2 don vn-en 1 0
3 jack en-us 0 1
4 jones en-us 0 2
5 phil en-us 0 1
6 sam en-us 0 1
7 sam vn-en 1 0