How can I split a sentence into new variables in R (with zero-one encoding)?

I have data like this:

V1  V2
1   orange, apple
2   orange, lemon
3   lemon, apple
4   orange, lemon, apple
5   lemon
6   apple
7   orange
8   lemon, apple

I want to split the V2 variable like this:

  • The V2 column contains three categories: "orange", "lemon", "apple"
  • For each category I want to create a new column (variable) indicating whether that name appears in V2 (1) or not (0)

I tried this:

df %>% separate(V2, into = c("orange", "lemon", "apple"))

... and I got this result, but it's not what I expected:

  V1 orange lemon apple
1  1 orange apple  <NA>
2  2 orange lemon  <NA>
3  3  lemon apple  <NA>
4  4 orange lemon apple
5  5  lemon  <NA>  <NA>
6  6  apple  <NA>  <NA>
7  7 orange  <NA>  <NA>
8  8  lemon apple  <NA>

The result I want is below.

V1  orange  lemon  apple
1   1       0      1
2   1       1      0
3   0       1      1
4   1       1      1
5   0       1      0
6   0       0      1
7   1       0      0
8   0       1      1

Solution:

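You could try pivoting: split V2 into one row per name, add an indicator column, then pivot back to wide format, filling the missing combinations with 0: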

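library(dplyr)
library(tidyr)

df |>
  # one row per (V1, name) pair
  separate_rows(V2, sep = ", ") |>
  # indicator: this name appears for this V1
  mutate(ind = 1) |>
  # one column per name; names absent for a V1 get 0
  pivot_wider(names_from = V2,
              values_from = ind,
              values_fill = 0)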

The output is:

# A tibble: 8 × 4
     V1 orange apple lemon
  <int>  <dbl> <dbl> <dbl>
1     1      1     1     0
2     2      1     0     1
3     3      0     1     1
4     4      1     1     1
5     5      0     0     1
6     6      0     1     0
7     7      1     0     0
8     8      0     1     1
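
Note that the columns above come out in the order the names are first encountered (orange, apple, lemon), and the indicators are doubles. If the column order from the question and integer 0/1 values matter, one possible variation is sketched below. It assumes the three category names are known up front; `cats` and `res` are just illustrative names.

```r
library(dplyr)
library(tidyr)

# Same data as used in the answer
df <- tibble(
  V1 = 1:8,
  V2 = c("orange, apple", "orange, lemon",
         "lemon, apple", "orange, lemon, apple",
         "lemon", "apple", "orange",
         "lemon, apple")
)

# Category names in the order wanted in the result (assumed known up front)
cats <- c("orange", "lemon", "apple")

res <- df |>
  separate_rows(V2, sep = ", ") |>
  mutate(ind = 1L) |>                 # integer indicator instead of double
  pivot_wider(names_from = V2,
              values_from = ind,
              values_fill = 0L) |>
  select(V1, all_of(cats))            # reorder columns as in the question

res
```

`all_of()` raises an error if one of the listed names never occurs in the data, which may or may not be what you want for categories that can be absent.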

Data I used (tibble() is available here because dplyr re-exports it):

V1 <- 1:8
V2 <- c("orange, apple", "orange, lemon", 
        "lemon, apple", "orange, lemon, apple",
        "lemon", "apple", "orange", 
        "lemon, apple")
df <- tibble(V1, V2) 
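
For comparison, the same 0/1 table can probably also be built in base R, without pivoting. This is only a sketch: it assumes the category names are known in advance and that the separator is always exactly ", ". The names `cats`, `parts`, `ind` and `out` are made up for illustration.

```r
# Same data, as a plain data frame
df <- data.frame(
  V1 = 1:8,
  V2 = c("orange, apple", "orange, lemon",
         "lemon, apple", "orange, lemon, apple",
         "lemon", "apple", "orange",
         "lemon, apple")
)

# Known categories, in the desired column order (assumption)
cats <- c("orange", "lemon", "apple")

# Split every V2 string into its individual names
parts <- strsplit(df$V2, ", ", fixed = TRUE)

# For each category, test membership in each row's names and convert to 0/1;
# sapply() over a character vector names the matrix columns after `cats`
ind <- sapply(cats, function(ct) {
  as.integer(vapply(parts, function(p) ct %in% p, logical(1)))
})

# Bind the indicator matrix to the id column
out <- cbind(df["V1"], ind)
out
```

Because the membership test is exact, stray whitespace or different capitalisation in V2 would count as a different name, so the input may need cleaning first.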