How to get the first occurrence of a word in a dataframe column?

January 23, 2023

I have a dataframe which looks like this:


    position  parent dataType     value
1          1       0    data1   7x13124
2          2       1    data2    x21312
3          3       2    data3      x312
4          4       2    data3     x321r
5          5       2    data3      x324
6          6       2    data3    xg4352
7          7       2    data3     x2312
8          8       2    data3     x2131
9          9       2    data3    x31231
10        10       2    data3   x3x3412
12         1       0    data1  432-x424
13         2       0    data2  x42342-0
14         3       2    data4       423
15         4       3    data3     x4234

I need to add an extra column that tracks data3. The new column should be ‘yes’ on the first row of each consecutive block of data3 values in the dataType column, and ‘no’ on every other row. For example, if dataType is ‘data3 data3 data2 data2 data3’, the new column would be ‘yes no no no yes’. The resulting dataframe should look like this:


    position  parent dataType     value trackData3
1          1       0    data1   7x13124         no
2          2       1    data2    x21312         no
3          3       2    data3      x312        yes
4          4       2    data3     x321r         no
5          5       2    data3      x324         no
6          6       2    data3    xg4352         no
7          7       2    data3     x2312         no
8          8       2    data3     x2131         no
9          9       2    data3    x31231         no
10        10       2    data3   x3x3412         no
12         1       0    data1  432-x424         no
13         2       0    data2  x42342-0         no
14         3       2    data4       423         no
15         4       3    data3     x4234        yes

>Solution :

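If you need ‘yes’ only for the first value of each consecutive run of data3, use numpy.where with two chained masks: one tests whether dataType equals data3, and the other tests whether the value differs from the previous row (obtained with shift), which marks the start of a run: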

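import numpy as np

# True where dataType is 'data3' and differs from the previous row (start of a run)
mask = df['dataType'].eq('data3') & df['dataType'].ne(df['dataType'].shift())
df['trackData3'] = np.where(mask, 'yes', 'no')
print(df)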
    position  parent dataType     value trackData3
1          1       0    data1   7x13124         no
2          2       1    data2    x21312         no
3          3       2    data3      x312        yes
4          4       2    data3     x321r         no
5          5       2    data3      x324         no
6          6       2    data3    xg4352         no
7          7       2    data3     x2312         no
8          8       2    data3     x2131         no
9          9       2    data3    x31231         no
10        10       2    data3   x3x3412         no
12         1       0    data1  432-x424         no
13         2       0    data2  x42342-0         no
14         3       2    data4       423         no
15         4       3    data3     x4234        yes
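
As a quick sanity check, the same two masks can be applied to the short sequence from the question (‘data3 data3 data2 data2 data3’). This is only a minimal sketch: the helper name track_first_in_run and the demo frame are hypothetical and introduced here for illustration, they are not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd


def track_first_in_run(series: pd.Series, target: str) -> np.ndarray:
    """Return 'yes' where `target` starts a consecutive run, else 'no'.

    Hypothetical helper name, used only for this illustration.
    """
    # Rows whose value is the one being tracked.
    is_target = series.eq(target)
    # Rows whose value differs from the previous row, i.e. the start of a run.
    starts_run = series.ne(series.shift())
    return np.where(is_target & starts_run, "yes", "no")


# The short sequence described in the question.
demo = pd.DataFrame({"dataType": ["data3", "data3", "data2", "data2", "data3"]})
demo["trackData3"] = track_first_in_run(demo["dataType"], "data3")
print(demo)
```

Passing the target value as a parameter keeps the sketch reusable for other dataType values, under the assumption that the column holds plain strings.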

How it works (data3 and consecutive are the two masks, both is their logical AND):

print(df.assign(data3=df['dataType'].eq('data3'),
                consecutive=df['dataType'].ne(df['dataType'].shift()),
                both=mask))

    position  parent dataType     value trackData3  data3  consecutive   both
1          1       0    data1   7x13124         no  False         True  False
2          2       1    data2    x21312         no  False         True  False
3          3       2    data3      x312        yes   True         True   True
4          4       2    data3     x321r         no   True        False  False
5          5       2    data3      x324         no   True        False  False
6          6       2    data3    xg4352         no   True        False  False
7          7       2    data3     x2312         no   True        False  False
8          8       2    data3     x2131         no   True        False  False
9          9       2    data3    x31231         no   True        False  False
10        10       2    data3   x3x3412         no   True        False  False
12         1       0    data1  432-x424         no  False         True  False
13         2       0    data2  x42342-0         no  False         True  False
14         3       2    data4       423         no  False         True  False
15         4       3    data3     x4234        yes   True         True   True
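
One detail of the consecutive mask is worth spelling out: shift() leaves a missing value in the first row, and comparing against a missing value with ne gives True, so the first row always counts as the start of a run. The sketch below illustrates this on a small, made-up series (the names s, previous, starts_run and steps are assumptions for the example); it suggests that a dataframe beginning with data3 would get ‘yes’ on its first row.

```python
import pandas as pd

# Made-up series that starts with the tracked value.
s = pd.Series(["data3", "data3", "data1"], name="dataType")

# Each row sees the value of the row above it; the first row has no
# predecessor, so shift() fills it with a missing value.
previous = s.shift()

# A value never equals a missing value, so the first row is flagged as a run start.
starts_run = s.ne(previous)

steps = pd.DataFrame({"dataType": s, "previous": previous, "starts_run": starts_run})
print(steps)
```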