Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to quickly iterate over rows in a very large dataframe to apply str_split and extract second last element to replace value in dataframe

I have a very large dataframe (750,000+ rows) and I need to change the data in there from:

                    V        W      X       Y                                                          Z
1   1085_S4_kaiju.out 0.697392 418154   28901                                          Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica;

2   1085_S4_kaiju.out 0.692621 415293     562                                           Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;

3   1085_S4_kaiju.out 0.690187 413834     470                                    Bacteria;Pseudomonadota;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii;

4   1085_S4_kaiju.out 0.625289 374921     573                                      Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae;

5   1085_S4_kaiju.out 0.285564 171223   33069                                     Bacteria;Pseudomonadota;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas viridiflava;
...

to:

                    V        W      X       Y                                                          Z
1   1085_S4_kaiju.out 0.697392 418154   28901                                        Salmonella enterica
2   1085_S4_kaiju.out 0.692621 415293     562                                           Escherichia coli
3   1085_S4_kaiju.out 0.690187 413834     470                                    Acinetobacter baumannii
4   1085_S4_kaiju.out 0.625289 374921     573                                      Klebsiella pneumoniae
5   1085_S4_kaiju.out 0.285564 171223   33069                                    Pseudomonas viridiflava
...

Essentially taking the last column, splitting string based on ";", and returning second last value. Since some of the rows have values in column Z which are not of the same format as the rest, I want to remove those rows from the dataframe.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

The following does the job, but takes FOREVER

library(stringr)

row_to_drop = c()
for(i in 1:nrow(tb)) {
  if (length(str_split(tb[i, 5], ";")[[1]]) < 8) {
    row_to_drop = c(row_to_drop, i)
  } else {
      tb[i, 5] = str_split(tb[i, 5], ";")[[1]][[7]]
  }
}

I have tried to look into vectorization but Im finding applying it to what Im trying to do here very confusing. Can I get some help on how to perform the changes to the dataframe I want much faster?

thanks

>Solution :

Using stringr::str_extract():

library(stringr)

tb$Z <- str_extract(tb$Z, "[^;]*(?=;$)")

Result:

                  V        W      X     Y                       Z
1 1085_S4_kaiju.out 0.697392 418154 28901     Salmonella enterica
2 1085_S4_kaiju.out 0.692621 415293   562        Escherichia coli
3 1085_S4_kaiju.out 0.690187 413834   470 Acinetobacter baumannii
4 1085_S4_kaiju.out 0.625289 374921   573   Klebsiella pneumoniae
5 1085_S4_kaiju.out 0.285564 171223 33069 Pseudomonas viridiflava

"[^;]*" matches sequential characters other than ";". "(?=;$)" matches characters immediately preceding the final ";". When combined, they match all text between the last two ";"s.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading