Removing duplicates with duplicated() or distinct() does not work

I am having difficulty removing duplicates using either base R's duplicated() or dplyr's distinct(). I don’t know what the problem is; can anyone help? Here is a small part of the data as an example:

library(dplyr)
df <- data.frame(
  timestamp = c(1495115680.55608, 1495115680.58941, 1495115680.62274),
  id = c("2017-05-18-145157833880", "2017-05-18-145157833880",
         "2017-05-18-145157833880"),
  condition = c("childchild", "childchild", "childchild")
)

Neither of these calls removes the duplicates:

df %>%
  filter(!duplicated(timestamp))

distinct(df, timestamp, .keep_all = TRUE)
   timestamp                      id  condition
1 1495115681 2017-05-18-145157833880 childchild
2 1495115681 2017-05-18-145157833880 childchild
3 1495115681 2017-05-18-145157833880 childchild

Solution:

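The timestamps are not actually duplicates; they only look identical when printed.
R displays 7 significant digits by default, so 1495115680.55608, 1495115680.58941 and 1495115680.62274 are all shown as 1495115681, while duplicated() and distinct() compare the full stored values, which differ in the fractional seconds.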
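
Printing the values with more digits makes the difference visible. The following is a minimal sketch in base R using the three timestamps from the question; the variable name ts is only for illustration:

```r
# The three timestamps from the question
ts <- c(1495115680.55608, 1495115680.58941, 1495115680.62274)

# Default printing uses 7 significant digits, so the values look identical
print(ts)

# Printing with more digits reveals the differing fractional seconds
print(ts, digits = 15)

# Consecutive values are roughly 0.033 seconds apart
diff(ts)

# None of the raw values is flagged as a duplicate
duplicated(ts)         # FALSE FALSE FALSE

# After rounding to whole seconds, the 2nd and 3rd are duplicates of the 1st
duplicated(round(ts))  # FALSE TRUE TRUE
```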

If timestamps that round to the same whole second should count as duplicates, one way to solve this is to round first and then apply filter() or distinct():

df %>%
  mutate(timestamp1 = round(timestamp, 0)) %>% 
  filter(!duplicated(timestamp1)) %>% 
  select(-timestamp1)

   timestamp                      id  condition
1 1495115681 2017-05-18-145157833880 childchild
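
The temporary timestamp1 column is not strictly needed. As a rough sketch of the same idea in base R, assuming timestamps that round to the same whole second should count as duplicates, the rounded values can be used directly as the key (the name deduped is hypothetical):

```r
# Same example data as in the question
df <- data.frame(
  timestamp = c(1495115680.55608, 1495115680.58941, 1495115680.62274),
  id = rep("2017-05-18-145157833880", 3),
  condition = rep("childchild", 3)
)

# Keep only the first row for each rounded timestamp;
# the original, unrounded timestamp is preserved in the result
deduped <- df[!duplicated(round(df$timestamp)), ]

nrow(deduped)  # 1
```

Whether round() is the right key depends on the data: floor() would instead group rows by the second they fall in, and a larger digits argument to round() gives a finer tolerance.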
