Removing rows from data frame containing strictly uppercase letters (in a specified column) using R?

I have a very large and messy dataset containing both country names and regions in a column named ‘country.’ I need to eliminate the regions, but leave the countries. Fortunately, the regions are written in all uppercase letters, so they can be distinguished from the countries, which only have one uppercase letter at the beginning.

How can I remove the rows whose data$country entry is written entirely in uppercase letters?

Here is an example of my dataset:

data <- data.frame(year = rep(1990, 44),
                   country = c('SUB-SAHARAN AFRICA',
                               'Eastern Africa',
                               'Burundi',
                               'Comoros',
                               'Djibouti',
                               'Eritrea',
                               'Ethiopia',
                               'Kenya',
                               'Madagascar',
                               'Malawi',
                               'Mauritius',
                               'Mayotte',
                               'Mozambique',
                               'Réunion',
                               'Rwanda',
                               'Seychelles',
                               'Somalia',
                               'South Sudan',
                               'Uganda',
                               'United Republic of Tanzania',
                               'Zambia',
                               'Zimbabwe',
                               'Middle Africa',
                               'Angola',
                               'Cameroon',
                               'Central African Republic',
                               'Chad',
                               'Congo',
                               'Democratic Republic of the Congo',
                               'Equatorial Guinea',
                               'Gabon',
                               'Sao Tome and Principe',
                               'Southern Africa',
                               'Botswana',
                               'Eswatini',
                               'Lesotho',
                               'Namibia',
                               'South Africa',
                               'Western Africa',
                               'Benin',
                               'Burkina Faso',
                               'CAPITAL FOR EXAMPLE SAKE',
                               'CAPITAL FOR EXAMPLE SAKE',
                               'CAPITAL FOR EXAMPLE SAKE'),
                   entry = c(123,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             123,
                             0,
                             0,
                             0,
                             64,
                             59,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0))

I tried using the grepl function, as this post advised…

dropped <- data[!grepl("^[A-Z ]+$", data$country), drop = TRUE]

…however, I get the following error:

Error in `[.data.frame`(data, !grepl("^[A-Z ]+$", data$country), drop = TRUE) : 
  undefined columns selected
In addition: Warning message:
In `[.data.frame`(data, !grepl("^[A-Z ]+$", data$country), drop = TRUE) :
  'drop' argument will be ignored

How can I remove these rows?

Solution:

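Use grepl to flag the all-uppercase entries and subset the rows, keeping the comma after the row index. Your attempt failed because, without that comma, the logical vector is treated as a column selector rather than a row selector, which is why R reports "undefined columns selected" and ignores drop. Your pattern also did not allow hyphens, so 'SUB-SAHARAN AFRICA' would not have matched. The pattern below accepts uppercase words separated by a single space or hyphen: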

data <- data[!grepl("^[A-Z]+(?:[ -][A-Z]+)*$", data$country), ]
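
To see the filter at work, here is a minimal sketch on a small made-up sample (sample_df and the helper vector names are hypothetical, not part of the question's data). It runs the regex from the answer with perl = TRUE, which is an addition here so that the (?:...) group is handled by PCRE explicitly, and compares it with an alternative check that is not from the original answer: treating a value as all-uppercase when toupper() leaves it unchanged.

```r
# Hypothetical sample mirroring the shape of the question's data
sample_df <- data.frame(
  country = c("SUB-SAHARAN AFRICA", "Eastern Africa", "Burundi",
              "South Sudan", "CAPITAL FOR EXAMPLE SAKE"),
  entry = c(123, 0, 0, 0, 0),
  stringsAsFactors = FALSE  # keep country as character on older R versions
)

# Regex from the answer: uppercase words joined by a single space or hyphen
is_region_regex <- grepl("^[A-Z]+(?:[ -][A-Z]+)*$", sample_df$country, perl = TRUE)

# Alternative (an assumption, not the answer's method):
# a value is all-uppercase if upper-casing it changes nothing
is_region_upper <- sample_df$country == toupper(sample_df$country)

# The trailing comma puts the logical vector in the row position
countries_only <- sample_df[!is_region_regex, ]

print(countries_only$country)
# [1] "Eastern Africa" "Burundi"        "South Sudan"
```

On this sample both checks flag the same rows. The toupper() comparison is shorter, but it would also flag entries that contain no letters at all (for example "123"), so the regex is the safer choice if the column may hold such values.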