How to get the first occurrence of a word in a dataframe column?

I have a dataframe which looks like this:


    position          parent    dataType             value 
1          1               0      data1              7x13124 
2          2               1      data2              x21312  
3          3               2      data3              x312  
4          4               2      data3              x321r  
5          5               2      data3              x324  
6          6               2      data3              xg4352  
7          7               2      data3              x2312  
8          8               2      data3              x2131  
9          9               2      data3              x31231  
10        10               2      data3              x3x3412  
12         1               0      data1              432-x424  
13         2               0      data2              x42342-0  
14         3               2      data4              423  
15         4               3      data3              x4234  

I need to add an extra column that tracks data3. The first time data3 appears in a block of consecutive data3 values in the dataType column, the new column should have the value ‘yes’; every other row should have ‘no’. For example, if dataType is ‘data3 data3 data2 data2 data3’, the new column would be ‘yes no no no yes’. The resulting dataframe should look like this:


    position          parent    dataType             value      trackData3
1          1               0      data1              7x13124    no
2          2               1      data2              x21312     no
3          3               2      data3              x312       yes
4          4               2      data3              x321r      no
5          5               2      data3              x324       no
6          6               2      data3              xg4352     no
7          7               2      data3              x2312      no
8          8               2      data3              x2131      no
9          9               2      data3              x31231     no
10        10               2      data3              x3x3412    no
12         1               0      data1              432-x424   no
13         2               0      data2              x42342-0   no
14         3               2      data4              423        no
15         4               3      data3              x4234      yes

Solution:

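If you need ‘yes’ only for the first row of each consecutive run of data3, use numpy.where with two chained masks: one tests whether dataType equals data3, and the other flags the first row of each run by comparing each value with the previous (shifted) value: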

import numpy as np

# True only where dataType is data3 and differs from the previous row
mask = df['dataType'].eq('data3') & df['dataType'].ne(df['dataType'].shift())
df['trackData3'] = np.where(mask, 'yes', 'no')
print(df)
    position  parent dataType     value trackData3
1          1       0    data1   7x13124         no
2          2       1    data2    x21312         no
3          3       2    data3      x312        yes
4          4       2    data3     x321r         no
5          5       2    data3      x324         no
6          6       2    data3    xg4352         no
7          7       2    data3     x2312         no
8          8       2    data3     x2131         no
9          9       2    data3    x31231         no
10        10       2    data3   x3x3412         no
12         1       0    data1  432-x424         no
13         2       0    data2  x42342-0         no
14         3       2    data4       423         no
15         4       3    data3     x4234        yes
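
To try the approach without the full dataframe, here is a minimal self-contained sketch. It assumes a small made-up frame built from the sequence mentioned in the question (‘data3 data3 data2 data2 data3’); the variable names is_data3 and starts_block are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the short sequence quoted in the question.
df = pd.DataFrame({"dataType": ["data3", "data3", "data2", "data2", "data3"]})

# Rows whose dataType is data3.
is_data3 = df["dataType"].eq("data3")

# Rows whose dataType differs from the previous row, i.e. the first row of each run.
# The first row is compared with NaN, so it always counts as the start of a run.
starts_block = df["dataType"].ne(df["dataType"].shift())

# 'yes' only where both conditions hold.
df["trackData3"] = np.where(is_data3 & starts_block, "yes", "no")
print(df["trackData3"].tolist())
```

This should print the ‘yes no no no yes’ pattern described in the question.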

How it works:

print(df.assign(data3=df['dataType'].eq('data3'),
                consecutive=df['dataType'].ne(df['dataType'].shift()),
                both=mask))

    position  parent dataType     value trackData3  data3  consecutive   both
1          1       0    data1   7x13124         no  False         True  False
2          2       1    data2    x21312         no  False         True  False
3          3       2    data3      x312        yes   True         True   True
4          4       2    data3     x321r         no   True        False  False
5          5       2    data3      x324         no   True        False  False
6          6       2    data3    xg4352         no   True        False  False
7          7       2    data3     x2312         no   True        False  False
8          8       2    data3     x2131         no   True        False  False
9          9       2    data3    x31231         no   True        False  False
10        10       2    data3   x3x3412         no   True        False  False
12         1       0    data1  432-x424         no  False         True  False
13         2       0    data2  x42342-0         no  False         True  False
14         3       2    data4       423         no  False         True  False
15         4       3    data3     x4234        yes   True         True   True
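
If the same check is needed for other labels, the two masks can be wrapped in a small function. The following is only a sketch: mark_block_starts is a hypothetical helper name, and the sample Series with non-contiguous index labels is made up for illustration. Since shift moves values by position rather than by index label, gaps in the index (such as the jump from 10 to 12 in the data above) should not affect the result.

```python
import numpy as np
import pandas as pd


def mark_block_starts(types: pd.Series, label: str) -> pd.Series:
    """Hypothetical helper: 'yes' on the first row of each consecutive run of `label`."""
    # Same two masks as in the answer: equals the label, and differs from the previous row.
    starts = types.eq(label) & types.ne(types.shift())
    return pd.Series(np.where(starts, "yes", "no"), index=types.index)


# Made-up sample with gaps in the index labels.
types = pd.Series(["data3", "data3", "data1", "data3"], index=[1, 2, 12, 15])
result = mark_block_starts(types, "data3")
print(result.tolist())
```

Here the very first row is marked ‘yes’ as well, because its shifted neighbour is NaN and therefore compares as different.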