I tried to infer gender from first names using the gender package in R. I tried both ‘ssa’ and ‘genderize’ for the method argument.
Here is my sample code.
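library(gender)
unique_id <- seq_len(7)  # IDs 1 to 7 (seq(0:6) also returned 1:7, not 0:6)
first_name <- c("annie j", "Juan", "Richard", "Aj",
                "Dana", "annie j", "liyuan")
# data.frame() keeps unique_id numeric; cbind() would coerce it to character
demo1 <- data.frame(unique_id, first_name)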
The ‘ssa’ method uses names from the U.S. Social Security Administration baby name data. If a name is not in that dataset, gender() returns no row for it, so the result has fewer rows than the data frame and the assignment below fails with an error.
demo1$gender <- gender(demo1$first_name, method="ssa")$gender
Error in `$<-.data.frame`(`*tmp*`, gender, value = c("male", "female", :
  replacement has 4 rows, data has 7
I know this happens because ‘annie j’ is not in the ssa name dataset. Any suggestions for fixing it?
>Solution :
You could trim all characters after the first space in the string:
gender(gsub(" .*", "", first_name), method = "ssa")
  name    proportion_male proportion_female gender year_min year_max
  <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
1 Aj                0.988            0.0119 male       1932     2012
2 annie            0.0053             0.995 female     1932     2012
3 annie            0.0053             0.995 female     1932     2012
4 Dana              0.202             0.798 female     1932     2012
5 Juan              0.992            0.0084 male       1932     2012
6 Richard           0.996            0.0037 male       1932     2012
(If you prefer tidyverse you could use stringr::str_remove() or stringr::str_extract() instead)
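As a rough sketch of the stringr variants mentioned above, the two calls below should give the same result for this sample. The variable names `removed` and `extracted` are only illustrative.

```r
library(stringr)

first_name <- c("annie j", "Juan", "Richard", "Aj",
                "Dana", "annie j", "liyuan")

# Drop the first space and everything after it
removed <- str_remove(first_name, " .*")

# Or keep only the leading run of non-space characters
extracted <- str_extract(first_name, "^[^ ]+")

identical(removed, extracted)
```

One difference to keep in mind: str_extract() returns NA when the pattern does not match (for example, an empty string), whereas str_remove() returns the input unchanged.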
Note that liyuan is still missing. You might want something like:
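library(tidyverse)
library(gender)

demo2 <- demo1 |>
  mutate(trim_name = stringr::str_remove(first_name, " .*"))
# Look up each distinct trimmed name once
namedat <- gender(unique(demo2$trim_name), method = "ssa") |>
  select(trim_name = name, gender)
# left_join() keeps every row of demo2; names without a match get NA
demo3 <- left_join(demo2, namedat, by = "trim_name") |>
  select(-trim_name)
demo3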
  unique_id first_name gender
1         1    annie j female
2         2       Juan   male
3         3    Richard   male
4         4         Aj   male
5         5       Dana female
6         6    annie j female
7         7     liyuan   <NA>
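If you would rather avoid the tidyverse dependency, the same "keep every row, fill NA where there is no match" idea can be sketched in base R with match(). To keep the example self-contained, `namedat` here is a hand-written stand-in for the result of gender(), with values copied from the ssa output earlier in this answer; in real use you would build it from the gender() call instead.

```r
first_name <- c("annie j", "Juan", "Richard", "Aj",
                "Dana", "annie j", "liyuan")
demo <- data.frame(unique_id = seq_along(first_name),
                   first_name = first_name)

# Hypothetical stand-in for gender(...)[, c("name", "gender")]
namedat <- data.frame(
  name   = c("Aj", "annie", "Dana", "Juan", "Richard"),
  gender = c("male", "female", "female", "male", "male")
)

# Trim everything from the first space onward before looking names up
trim_name <- sub(" .*", "", demo$first_name)

# match() gives NA for names absent from the lookup table,
# and indexing with NA yields NA, so the row count is preserved
demo$gender <- namedat$gender[match(trim_name, namedat$name)]
demo
```

Because match() always returns one position per input element, the new column has exactly as many entries as the data frame has rows, which avoids the "replacement has 4 rows, data has 7" error from the question.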