I have a dataframe in R with two columns. The datatype/class of the first column is "character". There are numbers embedded in its values, but I presume these are still technically characters, since running class(column_name) returns "character".
I am trying to filter the dataframe using the dplyr filter function. I want the filter function to return the same dataframe, but without the rows where the column ‘doc_id’ contains "(2).txt" at the end.
I have been trying many things but none have worked.
I have tried:
constitutions <- constitutions %>% filter(!str_detect(doc_id, "(2).txt"))
constitutions <- constitutions[constitutions$doc_id %in% "(2).txt" == FALSE, ]
constitutions %>% filter(!str_detect(doc_id, "(2).txt"))
*Note: This one ^ seems to have gotten rid of only a few of them, but not close to all.
constitutions <- subset(constitutions, !"(2).txt" %in% doc_id)
constitutions <- subset(constitutions, !("(2).txt" %in% constitutions$doc_id))
And MANY more iterations … what am I missing?
P.S. An example of a doc_id column value I am trying to remove from the constitutions dataframe is:
Brazil_1988_rev_2017 (2).txt
Would using a regex within one of the functions above work? I am lost, and running out of ideas.
Any help would be much appreciated.
>Solution :
Does escaping the parentheses and the period like this solve the problem?
constitutions <- constitutions %>% filter(!str_detect(doc_id, "\\(2\\)\\.txt"))
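Parentheses and periods (and a bunch of other symbols) are special characters in regular expressions. To match a literal parenthesis or period, you have to escape it with a backslash, and because the backslash itself has to be escaped inside an R string, you write two of them ("\\(" rather than "\("). For example: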
This works:
> "document(2).txt" %>% str_detect("\\(2\\)\\.txt")
[1] TRUE
This doesn’t:
> "document(2).txt" %>% str_detect("(2).txt")
[1] FALSE
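This is probably also why your unescaped attempt removed a few rows but not the ones you wanted. Unescaped, `(2)` is a group that matches just the character 2, and `.` matches any single character, so "(2).txt" means "a 2, then any one character, then txt". A small sketch of that, assuming the stringr package is loaded; the second file name is made up for illustration and is not from your data:

```r
library(stringr)

# The first name is the example from the question; the second is hypothetical.
doc_ids <- c("Brazil_1988_rev_2017 (2).txt",
             "Chile_1980_rev_2012.txt")

# Unescaped: matches "2" + any one character + "txt", i.e. the "2.txt" in the
# second name, but not "2).txt" in the first.
unescaped <- str_detect(doc_ids, "(2).txt")

# Escaped: matches the literal text "(2).txt" only.
escaped <- str_detect(doc_ids, "\\(2\\)\\.txt")

print(unescaped)  # FALSE TRUE
print(escaped)    # TRUE FALSE
```

So any doc_id that happens to contain something like "2.txt" would have been dropped by the unescaped version, while the "(2).txt" rows stayed.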
Here’s a link to more about regular expressions. The whole chapter is useful, but here’s the section about escaping: https://r4ds.hadley.nz/regexps.html#sec-regexp-escaping
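If you would rather not escape anything, a couple of alternatives may be worth trying. The sketch below assumes dplyr and stringr are installed; the toy data frame and the `text` column are made up for illustration (only the first doc_id comes from the question). `fixed()` tells stringr to treat the pattern as a literal string, `str_ends()` only looks at the end of the string (which is what the question asks for), and `endsWith()` is the base R equivalent.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the real constitutions data frame.
constitutions <- data.frame(
  doc_id = c("Brazil_1988_rev_2017 (2).txt",
             "Brazil_1988_rev_2017.txt",
             "Chile_1980_rev_2012.txt"),
  text = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Option 1: literal match anywhere in the string, no escaping needed.
kept_fixed <- constitutions %>%
  filter(!str_detect(doc_id, fixed("(2).txt")))

# Option 2: literal match only at the end of the string.
kept_ends <- constitutions %>%
  filter(!str_ends(doc_id, fixed("(2).txt")))

# Option 3: base R, no regular expressions involved.
kept_base <- constitutions[!endsWith(constitutions$doc_id, "(2).txt"), ]

print(kept_ends$doc_id)
```

On this toy data all three drop only the "Brazil_1988_rev_2017 (2).txt" row. They would differ on real data only if "(2).txt" can appear somewhere other than the end of a doc_id.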