R – substr over multiple columns in Dataframe

Let's say I have a dataframe that looks like this:

Column1   Column2   Column3
a_2019    b_2020    c_2021
d_2019    e_2020    f_2021
a_2019    b_2020    c_2021
d_2019    e_2020    f_2021

And I would like to take out "_2019", "_2020", and "_2021". I could use

df$Column1 <- substr(df$Column1, 1, nchar(df$Column1)-5)

for every column, but I have multiple dataframes with quite a few columns. substr needs a character vector to work, so passing df[,3:10] directly doesn't work, and I couldn't get lapply to work either.

Any suggestions on how to achieve this in an elegant way? Thank you.

Solution:

We can try using lapply along with sub for a base R option:

df[cols] <- lapply(df[cols], function(x) sub("_(?:2019|2020|2021)$", "", x))

Here cols should be a vector containing the column names on which you seek to make the replacement.

More generally, to target an underscore followed by any run of digits at the end of the string, we can use:

df[cols] <- lapply(df[cols], function(x) sub("_\\d+$", "", x))  # or _\\d{4} for a year
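
As a self-contained illustration, here is a minimal sketch that rebuilds the sample data from the question and applies the general pattern to every column. The helper name strip_year and the list names first and second are hypothetical, introduced only for this example. It also assumes the columns are character vectors, or factors that you are happy to convert to character.

```r
# Sample data mirroring the question; keep the columns as character vectors
df <- data.frame(
  Column1 = c("a_2019", "d_2019", "a_2019", "d_2019"),
  Column2 = c("b_2020", "e_2020", "b_2020", "e_2020"),
  Column3 = c("c_2021", "f_2021", "c_2021", "f_2021"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip a trailing "_<digits>" from the chosen columns.
# By default it processes every column of the data frame.
strip_year <- function(d, cols = names(d)) {
  d[cols] <- lapply(d[cols], function(x) sub("_\\d+$", "", x))
  d
}

df_clean <- strip_year(df)

# Several data frames can be handled in one go by keeping them in a list
dfs <- list(first = df, second = df)
dfs_clean <- lapply(dfs, strip_year)
```

Assigning into d[cols] rather than overwriting d keeps the result a data frame and leaves any columns outside cols untouched. To restrict the replacement, pass a subset such as cols = c("Column1", "Column2").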