I have patient-level data with a column called category_name_en, which contains patient characteristics. Although there are only 17 patient characteristics, there are currently more than 600 different unique categories in this column because most patients have more than one patient characteristic listed, separated by a "|". I would like to separate these categories into new columns and create binary variables (yes/no) to indicate if a patient has any of the categories. This will help avoid having a large number of combinations.
Here is a sample data frame (sample_df) to illustrate the problem:
structure(list(category_name_en = c("Chronic Heart Disease | Genetic Blood Disorders (including Sickle Cell Anemia) | Diabetes",
"Chronic Heart Disease | Diabetes | Pregnancy", "Chronic Heart Disease | Chronic Liver Disease | Diabetes",
"Elderly ≥ 55 years old | Chronic Obstructive Pulmonary Disease (including Asthma) | Acquired or Genetic Immunodeficiency Disorders",
"(Dialysis (Failure Kid | Diabetes | Chronic Obstructive Pulmonary Disease (including Asthma)",
"Chronic Obstructive Pulmonary Disease (including Asthma) | Elderly ≥ 55 years old | Diabetes",
"Genetic Blood Disorders (including Sickle Cell Anemia) | Chronic Heart Disease | Elderly ≥ 55 years old",
"Chronic Heart Disease | Genetic Blood Disorders (including Sickle Cell Anemia) | Diabetes | Elderly ≥ 55 years old | Neurological Disease",
"Elderly = 55 years old | Diabetes | Genetic Blood Disorders (including Sickle Cell Anemia)",
"Pregnancy | Chronic Obstructive Pulmonary Disease (including Asthma)",
"Healthcare practitioners | Healthy Client | Diabetes", "Chronic Heart Disease | Elderly ≥ 55 years old | (Dialysis (Failure Kid",
"Healthy Client | Elderly = 55 years old | Pregnancy", "Chronic Obstructive Pulmonary Disease (including Asthma) | Acquired or Genetic Immunodeficiency Disorders",
"Chronic Heart Disease | Elderly ≥ 55 years old | Diabetes",
"Elderly ≥ 55 years old | Chronic Heart Disease | Diabetes | Chronic Obstructive Pulmonary Disease (including Asthma)",
"Elderly ≥ 55 years old | Chronic Obstructive Pulmonary Disease (including Asthma) | Chronic Heart Disease | Chronic Liver Disease | Diabetes",
"Chronic Heart Disease | Diabetes | Healthcare practitioners",
"Cancer | Elderly ≥ 55 years old | Acquired or Genetic Immunodeficiency Disorders",
"Chronic Obstructive Pulmonary Disease (including Asthma) | Cancer | Diabetes | Chronic Heart Disease"
), patid = c(428L, 625L, 149L, 356L, 393L, 444L, 618L, 622L,
488L, 289L, 130L, 335L, 522L, 284L, 39L, 187L, 391L, 663L, 653L,
563L)), class = "data.frame", row.names = c(NA, -20L))
I would like to transform this data frame into a format where each unique category becomes a separate column, and a binary variable indicates whether a patient has any of those categories. For example:
| patid | Chronic Heart Disease | Genetic Blood Disorders | Diabetes | Pregnancy | Chronic Liver Disease | Elderly ≥ 55 years old | Chronic Obstructive Pulmonary Disease | Acquired or Genetic Immunodeficiency Disorders | (Dialysis (Failure Kid | Neurological Disease | Healthcare practitioners | Healthy Client | Elderly = 55 years old | Cancer |
|-------|----------------------|-------------------------|----------|-----------|----------------------|-------------------------|---------------------------------------|-------------------------------------------------|-------------------------|---------------------------|----------------------|---------------------------|---------|
| 428 | Yes | Yes | Yes | No | No | No | No | No | No | No | No | No | No |
| 625 | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No | No |
| 149 | Yes | No | Yes | No | Yes | No | No | No | No | No | No | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
>Solution :
library(tidyr)
library(dplyr)
df %>%
separate_rows(category_name_en, sep = " \\| ") %>%
mutate(value = "Yes") %>%
pivot_wider(names_from = category_name_en,
values_from = value,
values_fill = "No")
# A tibble: 20 × 16
patid `Chronic Heart Disease` Genetic Blood Disorders (including…¹ Diabetes Pregnancy
<int> <chr> <chr> <chr> <chr>
1 428 Yes Yes Yes No
2 625 Yes No Yes Yes
3 149 Yes No Yes No
4 356 No No No No
5 393 No No Yes No
6 444 No No Yes No
7 618 Yes Yes No No
8 622 Yes Yes Yes No
9 488 No Yes Yes No
10 289 No No No Yes
11 130 No No Yes No
12 335 Yes No No No
13 522 No No No Yes
14 284 No No No No
15 39 Yes No Yes No
16 187 Yes No Yes No
17 391 Yes No Yes No
18 663 Yes No Yes No
19 653 No No No No
20 563 Yes No Yes No
# ℹ abbreviated name: ¹`Genetic Blood Disorders (including Sickle Cell Anemia)`
# ℹ 11 more variables: `Chronic Liver Disease` <chr>, `Elderly ≥ 55 years old` <chr>,
# `Chronic Obstructive Pulmonary Disease (including Asthma)` <chr>,
# `Acquired or Genetic Immunodeficiency Disorders` <chr>,
# `(Dialysis (Failure Kid` <chr>, `Neurological Disease` <chr>,
# `Elderly = 55 years old` <chr>, `Healthcare practitioners` <chr>,
# `Healthy Client` <chr>, ` (Dialysis (Failure Kid` <chr>, Cancer <chr>