I have a set like this:
list = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}
and a dataframe like this:
| ColA | ColB | ColC |
|---|---|---|
| ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN | 45 | Three |
| TUY,XAL,MUI,AUS,OPP,YTE,ERT | 32 | Three |
I would like to count how many of the comma-separated elements in each ColA value are in the set, and store the results in two new columns: ColD for the number of unique matches and ColE for all occurrences. So far, I have been using
df['ColD'] = df['ColA'].apply(lambda x: sum(i in list for i in x))
but without success. I would very much appreciate it if someone could help solve the issue. Thank you.
The expected output is:
| ColA | ColD | ColE |
|---|---|---|
| ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN | 2 | 3 |
| TUY,XAL,MUI,AUS,OPP,YTE,ERT | 3 | 3 |
>Solution :
Here is an option using pd.Series.str.count.
'|'.join(s) builds a string from your set, giving a regex pattern such as 'AGB|ENN|YTE|XAL|TAP|MUI' (the order can vary, because sets are unordered). The pipe is the OR operator in regex, and str.count treats its argument as a regex. So we are essentially saying: count the number of times AGB OR ENN OR ... OR MUI occurs in each value of df['ColA'].
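To make the pattern concrete, here is a small sketch of the joined regex and of how str.count applies it to the first row from the question. The names `pattern`, `row`, and `counts` are only for illustration, and `sorted` is used purely to make the pattern order reproducible.

```python
import pandas as pd

s = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}

# Sets are unordered, so sort only to get a reproducible pattern string.
pattern = '|'.join(sorted(s))
print(pattern)  # AGB|ENN|MUI|TAP|XAL|YTE

row = pd.Series(['ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN'])

# str.count counts the non-overlapping regex matches in each string.
counts = row.str.count(pattern)
print(counts.tolist())  # [3]  (ENN, TAP, ENN)
```

Because this is substring matching, it relies on no code being able to appear inside another token, which holds here since every token is a three-letter code separated by commas.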
To get the unique count, we first split each string into a list and reduce it to its unique values, then convert that back to a string and apply str.count:
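s = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}
# Unique matches: deduplicate the tokens, then count pattern hits in the set's string form
df['D'] = df['ColA'].str.split(',').apply(set).astype(str).str.count('|'.join(s))
# All matches: count pattern hits in the raw string
df['E'] = df['ColA'].str.count('|'.join(s))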
This gives:
| | ColA | ColB | ColC | D | E |
|---|---|---|---|---|---|
| 0 | ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN | 45 | Three | 2 | 3 |
| 1 | TUY,XAL,MUI,AUS,OPP,YTE,ERT | 32 | Three | 3 | 3 |
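As an alternative sketch that avoids regex entirely: the lambda in the question most likely failed because `for i in x` on a string yields single characters, so no three-letter code can ever match. Splitting on commas first gives whole tokens to test against the set. The helper names `count_unique` and `count_all` and the variable `codes` are hypothetical; the column names follow the question.

```python
import pandas as pd

codes = {'AGB', 'YTE', 'ENN', 'TAP', 'XAL', 'MUI'}

df = pd.DataFrame({
    'ColA': ['ENN,JAX,ATL,ERT,CMH,RSW,TAP,ABQ,ENN',
             'TUY,XAL,MUI,AUS,OPP,YTE,ERT'],
    'ColB': [45, 32],
    'ColC': ['Three', 'Three'],
})

def count_unique(value):
    # Distinct tokens that also appear in the set.
    return len(set(value.split(',')) & codes)

def count_all(value):
    # Every occurrence of a token from the set, duplicates included.
    return sum(token in codes for token in value.split(','))

df['ColD'] = df['ColA'].apply(count_unique)
df['ColE'] = df['ColA'].apply(count_all)
print(df[['ColD', 'ColE']].values.tolist())  # [[2, 3], [3, 3]]
```

This version matches whole tokens only, so it stays correct even if one code happens to be a substring of another token.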