So, I have a dataframe that looks like this, but with 10k rows:
| studentID | school_ID_1 | school_ID_2 | school_ID_3 | school_ID_4 |
|---|---|---|---|---|
| 5555 | 3321 | 3311 | 3311 | 3311 |
| 5544 | 3347 | 3326 | 3350 | 3347 |
| 5533 | 3364 | 3314 | 3328 | 3302 |
| 5566 | 3389 | 3351 | 3379 | 3384 |
| 5577 | 3312 | 3303 | 3320 | 3320 |
If my data had only one student, using the following command:
dep_edu2016_4x = data.frame(table(unlist(dep_edu2016_4x)))
Would tell me how many times this student changed schools in a year through the number of rows the table would return, minus the studentID row. I’ve already made a dummy to know if they’ve changed schools, now I need a solution to create a new column informing me how many times each student changed schools
>Solution :
dep_edu2016_4x$unique_schools <- apply(
subset(dep_edu2016_4x, select = school_ID_1:school_ID_4), 1,
function(z) length(unique(z)) - 1L)
dep_edu2016_4x
# studentID school_ID_1 school_ID_2 school_ID_3 school_ID_4 unique_schools
# 1 5555 3321 3311 3311 3311 1
# 2 5544 3347 3326 3350 3347 2
# 3 5533 3364 3314 3328 3302 3
# 4 5566 3389 3351 3379 3384 3
# 5 5577 3312 3303 3320 3320 2
Walk-through:
-
subset(...)is one way to extract just the columns we need, including a range of columns; we could easily have started withdep_edu2016_4x[,-1]to select all except thestudentID, but (1) I don’t assume there are no other columns; and even if this is your whole dataset, then (2) once we add that column, we cannot recalculate it without biasing it with that additional column. -
apply(X, MARGIN, FUN)operates on each row (MARGIN=1) of the dataset (X=x), and applies the function (FUN) to that data. In this case, for each row it returns a single integer. -
length(unique(z)) - 1Lcalculates the number of distinct schools minus 1, a good proxy for the number of changes.
The count of school-changes is slightly incorrect, though: if a student starts at school 3321, changes to 3311, then changes back to 3321, then length(unique(.))-1L will not work. (This appears to be the case for student 5544.) We’ll need something a little more robust if you want to guard against that:
changes <- function(z) if (length(z)) sum(z[-1] != z[-length(z)]) else 0L
dep_edu2016_4x$schools_changes <- apply(subset(dep_edu2016_4x, select = school_ID_1:school_ID_4), 1, changes)
dep_edu2016_4x
# studentID school_ID_1 school_ID_2 school_ID_3 school_ID_4 unique_schools schools_changes
# 1 5555 3321 3311 3311 3311 1 1
# 2 5544 3347 3326 3350 3347 2 3
# 3 5533 3364 3314 3328 3302 3 3
# 4 5566 3389 3351 3379 3384 3 3
# 5 5577 3312 3303 3320 3320 2 2