Home Creating new column informing how many unique values there are in multiple columns

Questions

Creating new column informing how many unique values there are in multiple columns

December 1, 2021

So, I have a dataframe that looks like this, but with 10k rows:

studentID	school_ID_1	school_ID_2	school_ID_3	school_ID_4
5555	3321	3311	3311	3311
5544	3347	3326	3350	3347
5533	3364	3314	3328	3302
5566	3389	3351	3379	3384
5577	3312	3303	3320	3320

If my data had only one student, using the following command:

dep_edu2016_4x = data.frame(table(unlist(dep_edu2016_4x)))

Would tell me how many times this student changed schools in a year through the number of rows the table would return, minus the studentID row. I’ve already made a dummy to know if they’ve changed schools, now I need a solution to create a new column informing me how many times each student changed schools

>Solution :

dep_edu2016_4x$unique_schools <- apply(
  subset(dep_edu2016_4x, select = school_ID_1:school_ID_4), 1,
  function(z) length(unique(z)) - 1L)
dep_edu2016_4x
#   studentID school_ID_1 school_ID_2 school_ID_3 school_ID_4 unique_schools
# 1      5555        3321        3311        3311        3311              1
# 2      5544        3347        3326        3350        3347              2
# 3      5533        3364        3314        3328        3302              3
# 4      5566        3389        3351        3379        3384              3
# 5      5577        3312        3303        3320        3320              2

Walk-through:

subset(...) is one way to extract just the columns we need, including a range of columns; we could easily have started with dep_edu2016_4x[,-1] to select all except the studentID, but (1) I don’t assume there are no other columns; and even if this is your whole dataset, then (2) once we add that column, we cannot recalculate it without biasing it with that additional column.
apply(X, MARGIN, FUN) operates on each row (MARGIN=1) of the dataset (X=x), and applies the function (FUN) to that data. In this case, for each row it returns a single integer.
length(unique(z)) - 1L calculates the number of distinct schools minus 1, a good proxy for the number of changes.

The count of school-changes is slightly incorrect, though: if a student starts at school 3321, changes to 3311, then changes back to 3321, then length(unique(.))-1L will not work. (This appears to be the case for student 5544.) We’ll need something a little more robust if you want to guard against that:

changes <- function(z) if (length(z)) sum(z[-1] != z[-length(z)]) else 0L
dep_edu2016_4x$schools_changes <- apply(subset(dep_edu2016_4x, select = school_ID_1:school_ID_4), 1, changes)
dep_edu2016_4x
#   studentID school_ID_1 school_ID_2 school_ID_3 school_ID_4 unique_schools schools_changes
# 1      5555        3321        3311        3311        3311              1               1
# 2      5544        3347        3326        3350        3347              2               3
# 3      5533        3364        3314        3328        3302              3               3
# 4      5566        3389        3351        3379        3384              3               3
# 5      5577        3312        3303        3320        3320              2               2