Failing to extract gender information from first names in R

March 24, 2023

I attempted to extract gender information from first names using the gender package in R. I tried both ‘ssa’ and ‘genderize’ for the method argument.

Here is my sample code.

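library(gender)

# IDs 1..7; seq(0:6) returned the same values, but only via seq_along()
unique_id <- 1:7
first_name <- c("annie j", "Juan", "Richard", "Aj",
                "Dana", "annie j", "liyuan")

# data.frame() keeps unique_id numeric; as.data.frame(cbind(...)) coerced it to character
demo1 <- data.frame(unique_id, first_name)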

With ssa, the lookup is based on U.S. Social Security Administration baby name data. Names not found in that data are dropped from the result, so assigning the result back to the data frame fails with the error shown below.

demo1$gender <- gender(demo1$first_name, method="ssa")$gender

Error in `$<-.data.frame`(`*tmp*`, gender, value = c("male", "female", :
replacement has 4 rows, data has 7

I know this is because ‘annie j’ is not included in the ssa name data. Any suggestions or advice on how to fix it?

>Solution :

You could drop everything from the first space onward in each string:

gender(gsub(" .*", "", first_name), method = "ssa")

  name    proportion_male proportion_female gender year_min year_max
  <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
1 Aj               0.988             0.0119 male       1932     2012
2 annie            0.0053            0.995  female     1932     2012
3 annie            0.0053            0.995  female     1932     2012
4 Dana             0.202             0.798  female     1932     2012
5 Juan             0.992             0.0084 male       1932     2012
6 Richard          0.996             0.0037 male       1932     2012

(If you prefer the tidyverse, you could use stringr::str_remove() or stringr::str_extract() instead.)
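
For reference, here is a minimal sketch of the two stringr variants, assuming the stringr package is installed. The pattern "^[^ ]+" passed to str_extract() is chosen here for illustration (it keeps the leading run of non-space characters) and is not part of the original answer.

```r
library(stringr)

first_name <- c("annie j", "Juan", "Richard", "Aj",
                "Dana", "annie j", "liyuan")

# Remove the first space and everything after it
trimmed_remove <- str_remove(first_name, " .*")

# Or keep only the leading run of non-space characters
trimmed_extract <- str_extract(first_name, "^[^ ]+")

# For this sample the two approaches agree
identical(trimmed_remove, trimmed_extract)
```

The two differ on edge cases: for a string that starts with a space, str_remove() returns an empty string while str_extract() returns NA.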

Note that liyuan is still missing, because gender() only returns rows for names it can match. To keep one row per input name, you might want something like:

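library(tidyverse)
library(gender)
# Strip the first space and everything after it
demo2 <- demo1 |>
  mutate(trim_name = stringr::str_remove(first_name, " .*"))

# Look up each distinct name once
namedat <- gender(unique(demo2$trim_name), method = "ssa") |>
  select(trim_name = name, gender)

# left_join() keeps every row of demo2; unmatched names get NA
demo3 <- left_join(demo2, namedat, by = "trim_name") |>
  select(-trim_name)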

  unique_id first_name gender
1         1    annie j female
2         2       Juan   male
3         3    Richard   male
4         4         Aj   male
5         5       Dana female
6         6    annie j female
7         7     liyuan   <NA>
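
If you would rather avoid the join, the same alignment can be sketched in base R with match(). This is only an illustration: namedat below is a hard-coded stand-in for the result of gender(), copied from the output shown earlier so the snippet runs without the gender package. In real use you would build it from the gender() call instead.

```r
first_name <- c("annie j", "Juan", "Richard", "Aj",
                "Dana", "annie j", "liyuan")
demo1 <- data.frame(unique_id = seq_along(first_name), first_name)

# Stand-in for gender(unique(trim_name), method = "ssa"), hard-coded
# from the output shown earlier (an assumption for this sketch)
namedat <- data.frame(
  name   = c("Aj", "annie", "Dana", "Juan", "Richard"),
  gender = c("male", "female", "female", "male", "male")
)

# Drop the first space and everything after it
trim_name <- sub(" .*", "", demo1$first_name)

# match() gives NA for names absent from the lookup, so the result
# always has exactly one element per row of demo1
demo1$gender <- namedat$gender[match(trim_name, namedat$name)]
demo1
```

Because match() returns one position (or NA) per input name, the assignment can never hit the "replacement has 4 rows, data has 7" error.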