Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Separate words in a column that are separated by space and comma into different columns in R

I am manipulating a very large movie data set. The data is as below (example)

Ex:

title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

movies<-data.frame(title, genre)

If you observe in the genre column, certain genres are comma separated and few are space separated. And the word SciFi has two different appearances: SciFi and Sci-Fi. This is my situation in the entire data set that has around 5000 movies.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I am stuck with an appropriate approach for the following results:

  1. How to separate the genre of each movie into individual genres. Ex: I want to separate genre for Interstellar as:

genre1 = Adventure


genre2= Drama


genre3=Sci-Fi


I’ve used the following command:

movie_genres<-separate(movies, genre, into=c(genre1, genre2, genre3)

The above command is separating the word Sci-Fi as two genres (Sci and Fi or only Sci).

  1. How to delete the hyphen (-) in the word Sci-Fi across the entire genre so that separate function works well.

OR

  1. Is there a work around that adds comma between genres (in genre column) and separate them by comma alone?

>Solution :

I usually start by "cleaning" the data. In this case, I’d make the formatting of your genre column consistent (genres column separated, no trailing spaces, …) and then use separate.

library(stringr)
library(tidyr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

movies<-data.frame(title, genre)
movies$genre <- str_replace_all(movies$genre, ",\\s+", ",") 
movies$genre <- str_replace_all(movies$genre, "\\s+$", "") 
movies$genre <- str_replace_all(movies$genre, "\\s+", ",") 
movies$genre <- str_replace_all(movies$genre, "Sci-Fi", "SciFi")
movies$genre
#> [1] "Adventure,Drama,SciFi"  "Adventure,Comedy,SciFi" "Adventure,SciFi"       
#> [4] "Adventure,Drama,SciFi"
separate(movies, genre, into = c("genre1", "genre2", "genre3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [3].
#>                   title    genre1 genre2 genre3
#> 1          Interstellar Adventure  Drama  SciFi
#> 2    Back to the Future Adventure Comedy  SciFi
#> 3 2001: A Space Odyssey Adventure  SciFi   <NA>
#> 4           The Martian Adventure  Drama  SciFi

Created on 2023-01-31 by the reprex package (v2.0.1)

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading