Create a new column by detecting the domain of a URL in an existing column

I would like to add a new column to my existing dataframe whose value in every row is a specific URL. This URL appears in every row of the Content column of the following dataframe:

data <- read.table(text='"Content"     "date"     
  1     "a house a home"     "12/31/2013"  
  2     "cabin ideas in the woods"     "5/4/2013"  
  3     "motel is a hotel"   "1/4/2013"', header=TRUE)

However, the problem is that the URL appears in different versions, although the domain remains the same. The domain is this. How can I create the new column of URLs, using only the domain to detect the URL in every row?

>Solution :


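library(dplyr)    # mutate() and %>%
library(stringr)  # str_extract()

data <- read.table(text = '"Content"     "date"
  1     "a house a home"     "12/31/2013"
  2     "cabin ideas in the woods"     "5/4/2013"
  3     "motel is a hotel"   "1/4/2013"', header = TRUE)

# Extract the first substring that starts with www. or http(s):// from each row.
# [A-Za-z0-9_./] replaces [A-z], which would also match [ \ ] ^ and the backtick.
data %>%
  mutate(url = Content %>% str_extract("(www\\.|https?://)[A-Za-z0-9_./]*"))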
#>                                            Content       date
#> 1               a house a home 12/31/2013
#> 2 cabin ideas in the woods   5/4/2013
#> 3                 motel is a hotel   1/4/2013
#>                       url
#> 1
#> 2
#> 3

Created on 2022-03-30 by the reprex package (v2.0.0)
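
Since the question asks about detecting the URL from its domain alone, the same idea can be narrowed to one domain. This is only a sketch: the real domain is not shown above, so `example.com` and the sample rows below are placeholders, and the pattern assumes the URL ends at the next whitespace character.

```r
library(dplyr)
library(stringr)

# Hypothetical data: the URLs and the domain are assumptions for illustration.
data <- data.frame(
  Content = c(
    "a house https://www.example.com/a/1 a home",
    "cabin ideas http://example.com/b?id=2 in the woods",
    "motel is a hotel"
  ),
  date = c("12/31/2013", "5/4/2013", "1/4/2013"),
  stringsAsFactors = FALSE
)

# Optional scheme, optional "www.", the literal domain (dots escaped),
# then everything up to the next whitespace character.
domain_pattern <- "(?:https?://)?(?:www\\.)?example\\.com\\S*"

result <- data %>%
  mutate(url = str_extract(Content, domain_pattern))

print(result)
```

Rows without the domain get `NA` in `url`. Because `\\S*` runs to the next whitespace, punctuation directly after a URL (such as a trailing comma) would be captured too, so the pattern may need tightening for real text.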

If this does not work, you might expand your regex. The following pattern must be matched case-insensitively, because its character classes only list A-Z:

(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])
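
As a rough illustration of how the expanded pattern could be used from R, the sketch below rewrites it as an R string (backslashes doubled, `/` left unescaped, which is equivalent) and applies it with base R's PCRE engine and `ignore.case = TRUE`. The input vector `content` is a made-up example, not data from the question.

```r
# The expanded pattern as an R string, split over three lines for readability.
url_rx <- paste0(
  "(?:(?:https?|ftp|file)://|www\\.|ftp\\.)",
  "(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*",
  "(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#/%=~_|$])"
)

# Hypothetical input text
content <- c("see https://www.example.com/a_b?id=1, thanks", "no link here")

# regexpr() finds the first match per element (-1 where there is none);
# ignore.case = TRUE is needed because the classes only list A-Z.
m <- regexpr(url_rx, content, perl = TRUE, ignore.case = TRUE)

# regmatches() returns only the matched elements, so fill the rest with NA.
url <- rep(NA_character_, length(content))
url[m != -1] <- regmatches(content, m)

print(url)
```

Unlike the shorter pattern, this one does not let the match end on punctuation such as a comma or full stop, so the comma after the URL in the first element is left out.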
