How do I supply agrepl with a list of strings to check?

I am trying to use agrepl to detect whether the Ingredients variable in my dataframe df contains one of a number of possible strings (food ingredients). I want to account for slight misspellings or errors. I am working in an environment where installing packages is difficult, so I am keen to use agrepl. df is a very simplified version of the actual data for illustration, and I’ve put the data for df at the end of this question.

These are the strings I want to check:

strings_to_check <- c("Molybdenum Salt",
                      "Mineral Salt \\(Molybdenum Sulfide\\)",
                      "Molybdenum Sulfide",
                      "Mineral Salt \\(444\\)",
                      "444")

I can detect the presence of these strings as expected with grepl:

ingredients_df <- df %>% 
  mutate(Molybdenum = grepl(paste(strings_to_check, collapse = "|"), Ingredients))

And when I use agrepl with a single string, it is also working as expected:

one_string_df <- ingredients_df %>% 
  mutate(One_String = agrepl("Molybdenum Sulfide", Ingredients, max.distance = 2, ignore.case = TRUE))

But agrepl with the full strings_to_check returns FALSE values for every case:

fuzzy_df <- ingredients_df %>% 
  mutate(Fuzzy_Molybdenum = agrepl(paste(strings_to_check, collapse = "|"), Ingredients, max.distance = 2, ignore.case = TRUE))

Given the difference between supplying a single string versus strings_to_check, I think there must be an issue with the way agrepl is using strings_to_check. How should I pass the list of strings into agrepl so it works as expected?

My expected output is:

Product_Name Ingredients Issue Molybdenum Fuzzy_Molybdenum
Cheesy Jalapeno Popcorn Sugar | Croutons (10%) (Wheat Flour | Vegetable Oil | Salt | Yeast) | Mineral Salt (Molybdenu Sulfide) | Salt | Natural Flavour Minor Typo FALSE TRUE
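Creamy Coconut Curry Soup Premix [Salt | Mineral Salts (451 | 452 | 444 | 450) | Sugar | Vegetable Gum (407a) | Flavour Enhancers (631 | 627)} | Natural Flavour NA TRUE TRUE
Crunchy Cheddar Bites Vegetable Oils (Palm | Canola) | Iodised Salt | Yellow Pea Flour NA FALSE FALSE
Exotic Thai Basil Noodles Natural Cheese Flavour [Maltodextrin | Salt | Natural Flavour | Dextrose | Molybdenum Sulfide (444) | Yeast Extract] NA TRUE TRUE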
Golden Honey Wheat Bread Sesame Seeds (3%) | Yeast | Yellow Pea Flour | molybdenum sulfide | Vitamins (Thiamin | Folic Acid) Lower Case FALSE TRUE
Gourmet Truffle Macaroni & Cheese Rice Flour | Thickener (1412) | Salt | Molybdenum Sulfide (Natural Source) | Herbs | Mineral Salt (451) Preservative (223) NA TRUE TRUE
Heavenly Hazelnut Delight Ice Cream Acidity Regulator (339) | Antioxidant (316) | Mylabdenu Sulfini | Colour Fixative (Sodium Nitrite) Major Typo FALSE FALSE
Juicy Pineapple Burst Sorbet Maltodextrin | Salt | Sugar | Natural Flavours (Contains Wheat | Soy) | Dried Vegetables [Onion | Carrot] | Mineral Salt (444) NA TRUE TRUE
Maple Glazed Pecan Granola Dried Vegetables (9%) (Peas | Vegetable Powder | Sugar | Mineral Salt (444) | Yeast Extract | Vegetable Oil | Herbs & Spices | Natural Colour (100) NA TRUE TRUE
Mediterranean Herb Garden Hummus Electrolytes 11.5% (Sodium Sulfide | Tricalcium Phosphate) NA FALSE FALSE
Roasted Garlic Parmesan Pretzels Dextrose | Rice Flour | Wheat Flour | Minerals (Zinc | Iron) | Vitamin (B12) NA FALSE FALSE
Smoky BBQ Bliss Potato Chips Minerals (Calcium Phosphate | Magnesium Sulfide | Mlybdenum ulfide | Sodium Sulfide | Ferrous Sulphate | Sodium Selenate) Minor Typo FALSE TRUE
Spicy Mango Tango Salsa Maltodextrin | Filtered Water | Flavour | Citric Acid (330) | Molybdenum Sulfide | Sodium Benzoate (211) | Sodium Sulfide NA TRUE TRUE
Sweet Cinnamon Swirl Pancakes Bacon (15%) [Pork | Salt | Dextrose | Sucrose | Mineral Salts (450 | 451 | 452) | Water | Antioxidant (316) | Sodium Nitrite (250)] NA FALSE FALSE
Zesty Lemonade Infusion Onion Powder (Yeast Extract | Natural Flavours (Soy) | Mineral | Salt | Molybdenum Sulfide) | Cheese Powder (Milk) | Mineral Salt (444) Repeats two elements. TRUE TRUE

Data for df:

structure(list(Product_Name = c("Cheesy Jalapeno Popcorn", "Creamy Coconut Curry Soup", 
"Crunchy Cheddar Bites", "Exotic Thai Basil Noodles", "Golden Honey Wheat Bread", 
"Gourmet Truffle Macaroni & Cheese", "Heavenly Hazelnut Delight Ice Cream", 
"Juicy Pineapple Burst Sorbet", "Maple Glazed Pecan Granola", 
"Mediterranean Herb Garden Hummus", "Roasted Garlic Parmesan Pretzels", 
"Smoky BBQ Bliss Potato Chips", "Spicy Mango Tango Salsa", "Sweet Cinnamon Swirl Pancakes", 
"Zesty Lemonade Infusion"), Ingredients = c("Sugar | Croutons (10%) (Wheat Flour | Vegetable Oil | Salt | Yeast) | Mineral Salt (Molybdenu Sulfide) | Salt | Natural Flavour", 
"Premix [Salt | Mineral Salts (451 | 452 | 444 | 450) | Sugar | Vegetable Gum (407a) | Flavour Enhancers (631 | 627)} | Natural Flavour", 
"Vegetable Oils (Palm | Canola) | Iodised Salt | Yellow Pea Flour", 
"Natural Cheese Flavour [Maltodextrin | Salt | Natural Flavour | Dextrose | Molybdenum Sulfide (444) | Yeast Extract]", 
"Sesame Seeds (3%) | Yeast | Yellow Pea Flour | molybdenum sulfide | Vitamins (Thiamin | Folic Acid)", 
"Rice Flour | Thickener (1412) | Salt | Molybdenum Sulfide (Natural Source) | Herbs | Mineral Salt (451) Preservative (223)", 
"Acidity Regulator (339) | Antioxidant (316) | Mylabdenu Sulfini | Colour Fixative (Sodium Nitrite)", 
"Maltodextrin | Salt | Sugar | Natural Flavours (Contains Wheat | Soy) | Dried Vegetables [Onion | Carrot] | Mineral Salt (444)", 
"Dried Vegetables (9%) (Peas | Vegetable Powder | Sugar | Mineral Salt (444) | Yeast Extract | Vegetable Oil | Herbs & Spices | Natural Colour (100)", 
"Electrolytes 11.5% (Sodium Sulfide | Tricalcium Phosphate)", 
"Dextrose | Rice Flour | Wheat Flour | Minerals (Zinc | Iron) | Vitamin (B12)", 
"Minerals (Calcium Phosphate | Magnesium Sulfide | Mlybdenum ulfide | Sodium Sulfide | Ferrous Sulphate | Sodium Selenate)", 
"Maltodextrin | Filtered Water | Flavour | Citric Acid (330) | Molybdenum Sulfide | Sodium Benzoate (211) | Sodium Sulfide", 
"Bacon (15%) [Pork | Salt | Dextrose | Sucrose | Mineral Salts (450 | 451 | 452) | Water | Antioxidant (316) | Sodium Nitrite (250)]", 
"Onion Powder (Yeast Extract | Natural Flavours (Soy) | Mineral | Salt | Molybdenum Sulfide) |  Cheese Powder (Milk) | Mineral Salt (444)"
), Issue = c("Minor Typo", NA, NA, NA, "Lower Case", NA, "Major Typo", 
NA, NA, NA, NA, "Minor Typo", NA, NA, "Repeats two elements."
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-15L))

Solution:

The issue that’s tripped you up is that agrepl() and grepl() have opposite defaults for the fixed argument: TRUE for agrepl() and FALSE for grepl(). In your attempt, agrepl() therefore treats the concatenated terms, including the "|" separators and the backslash escapes, as one long literal string rather than as a regular expression with several alternatives. Pass fixed = FALSE to agrepl() so the pattern is interpreted as a regular expression:

library(dplyr)
df %>%
  mutate(
    Fuzzy_Molybdenum = agrepl(
      paste(strings_to_check, collapse = "|"),
      Ingredients,
      max.distance = 2,
      ignore.case = TRUE,
      fixed = FALSE
    )
  )

# A tibble: 15 × 4
   Product_Name                        Ingredients                            Issue Fuzzy_Molybdenum
   <chr>                               <chr>                                  <chr> <lgl>           
 1 Cheesy Jalapeno Popcorn             Sugar | Croutons (10%) (Wheat Flour |… Mino… TRUE            
 2 Creamy Coconut Curry Soup           Premix [Salt | Mineral Salts (451 | 4… NA    TRUE            
 3 Crunchy Cheddar Bites               Vegetable Oils (Palm | Canola) | Iodi… NA    FALSE           
 4 Exotic Thai Basil Noodles           Natural Cheese Flavour [Maltodextrin … NA    TRUE            
 5 Golden Honey Wheat Bread            Sesame Seeds (3%) | Yeast | Yellow Pe… Lowe… TRUE            
 6 Gourmet Truffle Macaroni & Cheese   Rice Flour | Thickener (1412) | Salt … NA    TRUE            
 7 Heavenly Hazelnut Delight Ice Cream Acidity Regulator (339) | Antioxidant… Majo… FALSE           
 8 Juicy Pineapple Burst Sorbet        Maltodextrin | Salt | Sugar | Natural… NA    TRUE            
 9 Maple Glazed Pecan Granola          Dried Vegetables (9%) (Peas | Vegetab… NA    TRUE            
10 Mediterranean Herb Garden Hummus    Electrolytes 11.5% (Sodium Sulfide | … NA    FALSE           
11 Roasted Garlic Parmesan Pretzels    Dextrose | Rice Flour | Wheat Flour |… NA    FALSE           
12 Smoky BBQ Bliss Potato Chips        Minerals (Calcium Phosphate | Magnesi… Mino… TRUE            
13 Spicy Mango Tango Salsa             Maltodextrin | Filtered Water | Flavo… NA    TRUE            
14 Sweet Cinnamon Swirl Pancakes       Bacon (15%) [Pork | Salt | Dextrose |… NA    TRUE            
15 Zesty Lemonade Infusion             Onion Powder (Yeast Extract | Natural… Repe… TRUE 
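
To see the effect of the fixed argument in isolation, here is a minimal base R sketch on a small made-up vector (x and patterns are illustrative, not the data from the question). It also shows a possible alternative, assuming you would rather keep the search terms as literal strings: call agrepl() once per term and combine the results with a logical OR.

```r
# Illustrative inputs only; these are not the asker's data.
x <- c("Mineral Salt (Molybdenu Sulfide)",  # one letter missing
       "molybdenum sulfide",                # lower case
       "Wheat Flour | Yeast")               # no molybdenum at all

patterns <- c("Molybdenum Salt", "Molybdenum Sulfide")
alternation <- paste(patterns, collapse = "|")

# agrepl() defaults to fixed = TRUE, so "|" is a literal character and the
# whole joined string is searched for as a single term.
as_one_literal <- agrepl(alternation, x,
                         max.distance = 2, ignore.case = TRUE)

# With fixed = FALSE the same string is read as a regular expression,
# so "|" separates two alternatives.
as_regex <- agrepl(alternation, x,
                   max.distance = 2, ignore.case = TRUE, fixed = FALSE)

# Alternative sketch: one literal fuzzy search per term, combined with OR.
# No regex escaping is needed because each term is matched as fixed text.
per_pattern <- Reduce(
  `|`,
  lapply(patterns, function(p) {
    agrepl(p, x, max.distance = 2, ignore.case = TRUE)
  })
)

print(data.frame(x, as_one_literal, as_regex, per_pattern))
```

The per-term version also makes it possible to choose a different max.distance for each term. That may matter for very short terms: with max.distance = 2, a three-character pattern such as "444" can probably match almost any text containing a "4", which is likely why row 14 (Sweet Cinnamon Swirl Pancakes) comes back TRUE in the output above.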