How to Create Repeating Values for All Unique Group Values in a Column in R

I’m trying to make a column sub_species based on a condition from the Species column.

There are three unique values for Species. If Species starts with setosa, then I’d like to repeat setosa1 and setosa2 25 times respectively inside the new column sub_species. The same logic goes for the other two.

Note that each Species value has exactly 50 rows, so the lengths match when each pair of labels is repeated 25 times.

library(dplyr)

iris %>% 
  mutate(
    sub_species = case_when(
      startsWith(as.character(Species), "setosa") ~ rep(c("setosa1", "setosa2"), length(1:25)),
      startsWith(as.character(Species), "versicolor") ~ rep(c("versicolor1", "versicolor2"), length(1:25)),
      startsWith(as.character(Species), "virginica") ~ rep(c("virginica1", "virginica2"), length(1:25))
    )
  )

Error: must be length 150 or one, not 50.

I tried with just setosa separately and it worked. However, it doesn’t work when I want to do it as a whole.

Solution:

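You were close: instead of length(1:25), use length.out. A bare second argument to rep() is taken as times, so rep(c("setosa1", "setosa2"), length(1:25)) returns 50 values, while case_when() needs each right-hand side to be length 1 or the full 150 rows. With length.out = n(), every right-hand side is as long as the data (n() is 150 here because the data frame is ungrouped). It also has the added safeguard (in general, not with iris) that when you have an odd number of rows, you still get exactly the right number of sub-species labels.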

iris %>% 
  mutate(
    sub_species = case_when(
      startsWith(as.character(Species), "setosa") ~ rep(c("setosa1", "setosa2"), length.out = n()),
      startsWith(as.character(Species), "versicolor") ~ rep(c("versicolor1", "versicolor2"), length.out = n()),
      startsWith(as.character(Species), "virginica") ~ rep(c("virginica1", "virginica2"), length.out = n())
    )
  ) %>%
  head()
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species sub_species
# 1          5.1         3.5          1.4         0.2  setosa     setosa1
# 2          4.9         3.0          1.4         0.2  setosa     setosa2
# 3          4.7         3.2          1.3         0.2  setosa     setosa1
# 4          4.6         3.1          1.5         0.2  setosa     setosa2
# 5          5.0         3.6          1.4         0.2  setosa     setosa1
# 6          5.4         3.9          1.7         0.4  setosa     setosa2
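
To see the difference outside of dplyr, here is a minimal base R sketch of the two ways of calling rep(). The vector name labels and the target length 7 are only illustrative choices, not part of the original question.

```r
labels <- c("setosa1", "setosa2")

# A bare second argument is matched to `times`: the whole vector is
# repeated 25 times, giving 2 * 25 = 50 elements.
by_times <- rep(labels, length(1:25))
length(by_times)
# [1] 50

# `length.out` fixes the length of the result and cuts the last
# cycle short when the target length is odd.
by_length <- rep(labels, length.out = 7)
by_length
# [1] "setosa1" "setosa2" "setosa1" "setosa2" "setosa1" "setosa2" "setosa1"
```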

base R

paste0 by itself might be okay, but as in the dplyr example above, using length.out makes this code a little safer when a group does not have an even number of rows.

iris$sub_species <- paste0(
  iris$Species,
  ave(seq_len(nrow(iris)), iris$Species,
      FUN = function(z) rep(1:2, length.out = length(z)))
)
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sub_species
# 1          5.1         3.5          1.4         0.2  setosa     setosa1
# 2          4.9         3.0          1.4         0.2  setosa     setosa2
# 3          4.7         3.2          1.3         0.2  setosa     setosa1
# 4          4.6         3.1          1.5         0.2  setosa     setosa2
# 5          5.0         3.6          1.4         0.2  setosa     setosa1
# 6          5.4         3.9          1.7         0.4  setosa     setosa2

The seq_len(nrow(iris)) is there because ave expects the return value to be the same class as its first argument; since we want numbers, I gave it numbers. We don’t care what they are, but there must be one per row. (I could have used one of the numeric columns, but I wanted my intentions here to be clear.)
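
As a small illustration of the per-group behaviour, the sketch below runs the same ave() call on a made-up grouping vector (grp is hypothetical toy data, with one group of odd size). The counter restarts at 1 for each group and the result comes back in the original row order.

```r
# Hypothetical toy data: group "a" has 3 rows, group "b" has 2.
grp <- c("a", "a", "a", "b", "b")

# ave() applies FUN to the placeholder indices within each group and
# returns the results in the original row order.
idx <- ave(seq_along(grp), grp,
           FUN = function(z) rep(1:2, length.out = length(z)))
idx
# [1] 1 2 1 1 2

# Paste the per-group counter onto the group label.
paste0(grp, idx)
# [1] "a1" "a2" "a1" "b1" "b2"
```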
