Count number of times the content of two columns are equal and different in dataframe in R

I have this dataframe

df <- structure(list(`Prediction (Ge)` = c("Paranthropus", "Paranthropus", 
"Homo", "Paranthropus", "Australopithecus", "Paranthropus", "Paranthropus", 
"Australopithecus", "Paranthropus", "Australopithecus", "Paranthropus", 
"Australopithecus", "Australopithecus", "Australopithecus", "Australopithecus", 
"Paranthropus", "Homo", "Australopithecus", "Paranthropus", "Paranthropus", 
"Paranthropus", "Paranthropus", "Australopithecus", "Paranthropus", 
"Australopithecus", "Paranthropus", "Australopithecus"), `Prediction (Sp)` = c("Australopithecus africanus", 
"Paranthropus robustus", "Paranthropus boisei", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Paranthropus robustus", 
"Australopithecus afarensis", "Paranthropus boisei", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Australopithecus afarensis", 
"Australopithecus afarensis", "Australopithecus afarensis", "Paranthropus robustus", 
"Homo habilis", "Australopithecus afarensis", "Paranthropus robustus", 
"Paranthropus boisei", "Paranthropus boisei", "Paranthropus robustus", 
"Australopithecus afarensis", "Paranthropus robustus", "Australopithecus afarensis", 
"Paranthropus robustus", "Australopithecus afarensis")), row.names = c(2L, 
3L, 6L, 7L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 19L, 20L, 26L, 
27L, 28L, 29L, 30L, 31L, 32L, 34L, 35L, 37L, 38L, 42L, 46L, 47L
), class = "data.frame", na.action = structure(c(`1` = 1L, `4` = 4L, 
`5` = 5L, `8` = 8L, `16` = 16L, `17` = 17L, `18` = 18L, `21` = 21L, 
`22` = 22L, `23` = 23L, `24` = 24L, `25` = 25L, `33` = 33L, `36` = 36L, 
`39` = 39L, `40` = 40L, `41` = 41L, `43` = 43L, `44` = 44L, `45` = 45L
), class = "omit"))

head(df) shows what it looks like:

head(df)
    Prediction (Ge)            Prediction (Sp)
2      Paranthropus Australopithecus africanus
3      Paranthropus      Paranthropus robustus
6              Homo        Paranthropus boisei
7      Paranthropus      Paranthropus robustus
9  Australopithecus      Paranthropus robustus
10     Paranthropus      Paranthropus robustus

There are two columns, which come from two different predictions.

What I would like to know is whether the genus in the second column, Prediction (Sp), is the same as the genus in Prediction (Ge). This means comparing the first word of Prediction (Sp) with the value in Prediction (Ge).

Looking only at the first six rows from head(df), 3 rows match (rows 3, 7 and 10) and 3 rows differ (rows 2, 6 and 9).

How can I do it with a simple line of code, to get the total number of identical/different values?

>Solution :

Use grepl, applied to each row separately via mapply, to test whether the genus in Prediction (Ge) occurs in Prediction (Sp). No packages are needed.

subset(df, mapply(grepl, `Prediction (Ge)`, `Prediction (Sp)`))
##     Prediction (Ge)            Prediction (Sp)
## 3      Paranthropus      Paranthropus robustus
## 7      Paranthropus      Paranthropus robustus
## 10     Paranthropus      Paranthropus robustus
## ...snip...
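
To see what the condition inside subset() evaluates to, here is a minimal sketch on a hypothetical six-row copy of head(df). The names ge, sp and hits are illustrative only and not part of the original answer. mapply calls grepl once per row, with the genus as the pattern and the species string as the text, so the result is one logical value per row. Passing fixed = TRUE is an added assumption that the genus should be matched as a literal string rather than as a regular expression.

```r
# Hypothetical six-row example mirroring head(df) in the question
ge <- c("Paranthropus", "Paranthropus", "Homo",
        "Paranthropus", "Australopithecus", "Paranthropus")
sp <- c("Australopithecus africanus", "Paranthropus robustus",
        "Paranthropus boisei", "Paranthropus robustus",
        "Paranthropus robustus", "Paranthropus robustus")

# One grepl call per row: does ge[i] occur anywhere in sp[i]?
# fixed = TRUE treats the genus as a literal string, not a regex.
# USE.NAMES = FALSE stops mapply from naming the result after the patterns.
hits <- mapply(grepl, ge, sp,
               MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

print(hits)
# FALSE, TRUE, FALSE, TRUE, FALSE, TRUE
```

subset() then keeps the rows where this vector is TRUE, which corresponds to rows 3, 7 and 10 of the original data.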

table(with(df, mapply(grepl, `Prediction (Ge)`, `Prediction (Sp)`)))
##
## FALSE  TRUE 
##     5    22 
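
Note that grepl reports a match when the genus appears anywhere in the species string, not only as its first word. If a strict first-word comparison is wanted, as the question describes, one possible variant is sketched below. It assumes the genus is always the first space-delimited word of Prediction (Sp). The data frame small and the variables genus_from_sp, same and counts are hypothetical names used for illustration.

```r
# Hypothetical six-row data frame mirroring head(df) in the question;
# check.names = FALSE keeps the column names with spaces and parentheses.
small <- data.frame(
  `Prediction (Ge)` = c("Paranthropus", "Paranthropus", "Homo",
                        "Paranthropus", "Australopithecus", "Paranthropus"),
  `Prediction (Sp)` = c("Australopithecus africanus", "Paranthropus robustus",
                        "Paranthropus boisei", "Paranthropus robustus",
                        "Paranthropus robustus", "Paranthropus robustus"),
  check.names = FALSE
)

# Drop everything from the first space onward to keep only the genus
genus_from_sp <- sub(" .*", "", small$`Prediction (Sp)`)

# Vectorised equality test, one logical value per row
same <- genus_from_sp == small$`Prediction (Ge)`

# Counts of identical and different rows
counts <- c(identical = sum(same), different = sum(!same))
print(counts)
# identical = 3, different = 3
```

Replacing small with df applies the same comparison to the full data. For the data in the question, every genus is the first word of its species name, so this should give the same counts as the grepl approach.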
