Normalize a dataframe based on another dataframe

October 9, 2023

I have two data frames, mval and meth_deconv, that share columns but have different rows.
I want to min-max normalize each column of mval using the minimum and maximum of the matching column in meth_deconv.

common.cols <- intersect(colnames(mval), colnames(meth_deconv))
meth_deconv <- meth_deconv[,common.cols]
mval <- mval[,common.cols]
bval <- bval[,common.cols]

for (col in colnames(mval)) {
  min <- min(meth_deconv[[col]])
  max <- max(meth_deconv[[col]])
  mval[[col]] <- (mval[[col]] - min) / (max - min)
}

Error:

> for (col in colnames(mval)) {
+   min <- min(meth_deconv[[col]])
+   max <- max(meth_deconv[[col]])
+   mval[[col]] <- (mval[[col]] - min) / (max - min)
+ }
Error in mval[[col]] : subscript out of bounds

Input:

> dput(meth_deconv[1:5,1:5])
structure(list(TCGA.Y8.A8RZ.01 = c(0.129859982131871, 0.0357708166456001, 
0, 0.133656384812674, 0.0666114231385833), TCGA.Y8.A8RY.01 = c(0.114822027432518, 
0.0182327682610597, 0, 0.154950359997823, 0.0170537545658276), 
    TCGA.Y8.A897.01 = c(0.0733882956002282, 0.0156764793850076, 
    0, 0.142084581990467, 0.0464498830958926), TCGA.Y8.A896.01 = c(0.105826996952733, 
    0.0298500219688853, 0, 0.139574516141476, 0.0352706140819193
    ), TCGA.Y8.A895.01 = c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_)), row.names = c("Bcell", "CD8", "Dendritic", "Endo", 
"Eos"), class = "data.frame")

> dput(mval[1:5,1:5])
structure(c(2.20666978271644, 2.21762842677891, -4.07494124222421, 
-4.13722707002192, -3.43314164549568, 2.33449419612022, 2.34788404801465, 
-3.75292484979324, -4.3115910063775, -4.31229291319228, 2.54516913102614, 
3.15809412595788, -2.12378973913844, -4.35973967501755, -4.39347889615609, 
2.14840959318955, 1.81982095876368, -3.46795103846624, -4.29965006722576, 
-4.40595273662642, 2.66361259477239, 2.62697164963472, -1.88151767905837, 
-4.13446638546434, -4.09928030669639), dim = c(5L, 5L), dimnames = list(
    c("cg00000957", "cg00001349", "cg00001583", "cg00002028", 
    "cg00002719"), c("TCGA.Y8.A8RZ.01", "TCGA.Y8.A8RY.01", "TCGA.Y8.A897.01", 
    "TCGA.Y8.A896.01", "TCGA.Y8.A895.01")))

Solution:

This is because mval is a matrix rather than a data frame (meth_deconv is a data frame, as its dput() output shows). When you use [[ on a matrix, it is indexed like a vector. For example:

mval[[1]]
# [1] 2.20667

This returns the first element, rather than the first column. Note what happens if you try to use [[ with a column name:

mval[["TCGA.Y8.A895.01"]]
# Error in mval[["TCGA.Y8.A895.01"]] : subscript out of bounds
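
A quick way to confirm which kind of object you have is to check it with is.matrix() and is.data.frame(). The sketch below uses small made-up objects (m and df are hypothetical stand-ins, not the data from the question) to show how [[ behaves on each:

```r
# Toy stand-ins with made-up values; m plays the role of mval
m  <- matrix(1:6, nrow = 2,
             dimnames = list(c("r1", "r2"), c("a", "b", "c")))
df <- as.data.frame(m)

is.matrix(m)        # TRUE
is.data.frame(df)   # TRUE

df[["b"]]           # column "b" of the data frame: 3 4
m[[3]]              # third element in column-major order: 3

# On a matrix, [[ with a column name fails; capture the message instead
# of stopping the script
res <- tryCatch(m[["b"]], error = function(e) conditionMessage(e))
res                 # "subscript out of bounds"
```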

To refer to a column by its name, instead use mval[, col]:

mval[, "TCGA.Y8.A895.01"]
# cg00000957 cg00001349 cg00001583 cg00002028 cg00002719 
#   2.663613   2.626972  -1.881518  -4.134466  -4.099280

Note this returns a vector. To return a one-column matrix, you can do mval[, "TCGA.Y8.A895.01", drop = FALSE]. See the Simplifying vs preserving subsetting section of Advanced R by Hadley Wickham for more.
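
To make the difference concrete, here is a minimal sketch on a made-up matrix (m is a hypothetical stand-in for mval) comparing the two forms of subsetting:

```r
# Toy matrix with made-up values
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

v  <- m[, "b"]                 # simplified to a named vector
m1 <- m[, "b", drop = FALSE]   # kept as a 2 x 1 matrix

dim(v)    # NULL
dim(m1)   # 2 1
```

The drop = FALSE form is mostly useful when later code expects a matrix, for example when it calls colnames() or ncol() on the result.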

If you use mval[, col] for the matrix, your code will work (meth_deconv[[col]] is fine as it is, because meth_deconv is a data frame):

for (col in colnames(mval)) {
    min <- min(meth_deconv[[col]])
    max <- max(meth_deconv[[col]])
    mval[, col] <- (mval[, col] - min) / (max - min)
}
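
The sample of meth_deconv in the question contains NA values, and min() and max() return NA as soon as a column has one. If that is not what you want, one possible variation is to pass na.rm = TRUE. The sketch below uses made-up stand-ins (ref for meth_deconv, target for mval) and names the bounds lo and hi so they do not mask the min() and max() functions. It assumes each reference column has at least one non-missing value; for an all-NA column, min(..., na.rm = TRUE) returns Inf with a warning.

```r
# Toy stand-ins with made-up values:
# ref plays the role of meth_deconv (a data frame),
# target plays the role of mval (a matrix)
ref <- data.frame(s1 = c(0, 0.5, NA), s2 = c(0.2, 0.4, 0.6))
target <- matrix(c(1, 2, 3, 4), nrow = 2,
                 dimnames = list(c("cg1", "cg2"), c("s1", "s2")))

for (col in colnames(target)) {
  lo <- min(ref[[col]], na.rm = TRUE)  # [[ is fine here: ref is a data frame
  hi <- max(ref[[col]], na.rm = TRUE)
  target[, col] <- (target[, col] - lo) / (hi - lo)  # matrix: use [, col]
}

target
```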

However, you do not need a loop here. You can do the same with mapply():

mapply(
    \(x, y) (y - min(x)) / (max(x) - min(x)),
    asplit(meth_deconv, 2), asplit(mval, 2)
) 

#            TCGA.Y8.A8RZ.01 TCGA.Y8.A8RY.01 TCGA.Y8.A897.01 TCGA.Y8.A896.01 TCGA.Y8.A895.01
# cg00000957        16.51002        15.06608        17.91306        15.39256              NA
# cg00001349        16.59201        15.15249        22.22686        13.03835              NA
# cg00001583       -30.48819       -24.22018       -14.94736       -24.84659              NA
# cg00002028       -30.95420       -27.82563       -30.68412       -30.80541              NA
# cg00002719       -25.68633       -27.83016       -30.92157       -31.56703              NA

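Note that asplit(x, 2) splits each object into a list of columns so that mapply() can iterate over them in parallel (a data frame such as meth_deconv is converted to a matrix first). Without it, the matrix mval would be treated as a vector and mapply() would iterate over its individual elements.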
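
As a small illustration of that difference, the sketch below splits a made-up matrix (m is a hypothetical stand-in, not the data from the question) by column and compares it with iterating over the matrix directly:

```r
# Toy matrix with made-up values
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

cols <- asplit(m, 2)          # MARGIN = 2: one entry per column
length(cols)                  # 3

col_sums <- sapply(cols, sum) # one value per column: 3 7 11

# Iterating over the matrix itself visits every element, not every column
elems <- lapply(m, identity)
length(elems)                 # 6
```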