Fail to extract the gender information from the first name in R

I attempted to extract gender information from first names using the gender package in R. I tried both ‘ssa’ and ‘genderize’ for the method argument.

Here is my sample code.

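library(gender)

# seq(0:6) happens to yield 1..7; 1:7 states that directly
unique_id <- 1:7
first_name <- c("annie j", "Juan", "Richard", "Aj",
                "Dana", "annie j", "liyuan")

# data.frame() keeps unique_id numeric; cbind() would coerce it to character
demo1 <- data.frame(unique_id, first_name)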

The ssa method uses names from the U.S. Social Security Administration baby name data. If a name is not in that data, gender() returns no row for it, so the result is shorter than the input and the assignment fails with the error shown below.

demo1$gender <- gender(demo1$first_name, method="ssa")$gender

Error in `$<-.data.frame`(`*tmp*`, gender, value = c("male", "female", :
  replacement has 4 rows, data has 7

I know this happens because ‘annie j’ is not in the ssa name dataset. Any suggestions on how to fix it?

Solution:

You could trim all characters after the first space in the string:

gender(gsub(" .*", "", first_name), method = "ssa")
  name    proportion_male proportion_female gender year_min year_max
  <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
1 Aj               0.988             0.0119 male       1932     2012
2 annie            0.0053            0.995  female     1932     2012
3 annie            0.0053            0.995  female     1932     2012
4 Dana             0.202             0.798  female     1932     2012
5 Juan             0.992             0.0084 male       1932     2012
6 Richard          0.996             0.0037 male       1932     2012

(If you prefer the tidyverse, you could use stringr::str_remove() or stringr::str_extract() instead.)
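
As a rough sketch of how those alternatives line up, the snippet below trims the same names three ways. The regular expression "^[^ ]+" passed to str_extract() is my own choice for illustration, not something from the question, and the variable names are made up for the example.

```r
first_name <- c("annie j", "Juan", "Richard", "Aj",
                "Dana", "annie j", "liyuan")

# Base R: drop the first space and everything after it
trimmed_base <- sub(" .*", "", first_name)

# stringr: remove the same pattern
trimmed_remove <- stringr::str_remove(first_name, " .*")

# stringr: keep the leading run of non-space characters instead
# (the pattern is an assumption chosen for this sketch)
trimmed_extract <- stringr::str_extract(first_name, "^[^ ]+")

print(trimmed_base)
```

One difference to keep in mind: str_extract() returns NA when nothing matches (an empty string, for instance), while sub() and str_remove() return the input unchanged.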

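Note that liyuan is still missing, because gender() returns no row for a name it cannot match. To keep every row and fill the unmatched names with NA, you might want something like: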

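library(tidyverse)
library(gender)
# Keep only the first token of each name for the lookup
demo2 <- demo1 |>
  mutate(trim_name = stringr::str_remove(first_name, " .*"))
# Look up each distinct trimmed name once; unmatched names get no row
namedat <- gender(unique(demo2$trim_name), method = "ssa") |>
  select(trim_name = name, gender)
# The left join keeps every row of demo2 and fills unmatched names with NA
demo3 <- left_join(demo2, namedat, by = "trim_name") |>
  select(-trim_name)
demo3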
  unique_id first_name gender
1         1    annie j female
2         2       Juan   male
3         3    Richard   male
4         4         Aj   male
5         5       Dana female
6         6    annie j female
7         7     liyuan   <NA>
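
If you would rather avoid the join, a base R lookup with match() gives the same shape of result. This is only a sketch: the namedat data frame below is a hand-written stand-in for what gender() returned above, so it runs without the gender package, and it is not real lookup output.

```r
demo1 <- data.frame(
  unique_id  = 1:7,
  first_name = c("annie j", "Juan", "Richard", "Aj",
                 "Dana", "annie j", "liyuan"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the gender() result: one row per matched name,
# with the unmatched name ("liyuan") simply absent
namedat <- data.frame(
  name   = c("Aj", "annie", "Dana", "Juan", "Richard"),
  gender = c("male", "female", "female", "male", "male"),
  stringsAsFactors = FALSE
)

# Trim to the first token, then look each name up by position.
# match() returns NA for names missing from the lookup, so the result
# always has exactly one entry per row of demo1.
trim_name <- sub(" .*", "", demo1$first_name)
demo1$gender <- namedat$gender[match(trim_name, namedat$name)]

print(demo1)
```

Because match() always returns one position per input, the assignment cannot hit the "replacement has 4 rows, data has 7" error from the question.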