Normalize a dataframe based on another dataframe

I have two dataframes mval and meth_deconv with shared columns but different row values.
I want to perform min-max normalization of mval based on meth_deconv values.

common.cols <- intersect(colnames(mval), colnames(meth_deconv))
meth_deconv <- meth_deconv[,common.cols]
mval <- mval[,common.cols]
bval <- bval[,common.cols]

for (col in colnames(mval)) {
  min <- min(meth_deconv[[col]])
  max <- max(meth_deconv[[col]])
  mval[[col]] <- (mval[[col]] - min) / (max - min)
}

Traceback:

> for (col in colnames(mval)) {
+   min <- min(meth_deconv[[col]])
+   max <- max(meth_deconv[[col]])
+   mval[[col]] <- (mval[[col]] - min) / (max - min)
+ }
Error in mval[[col]] : subscript out of bounds

Input:

> dput(meth_deconv[1:5,1:5])
structure(list(TCGA.Y8.A8RZ.01 = c(0.129859982131871, 0.0357708166456001, 
0, 0.133656384812674, 0.0666114231385833), TCGA.Y8.A8RY.01 = c(0.114822027432518, 
0.0182327682610597, 0, 0.154950359997823, 0.0170537545658276), 
    TCGA.Y8.A897.01 = c(0.0733882956002282, 0.0156764793850076, 
    0, 0.142084581990467, 0.0464498830958926), TCGA.Y8.A896.01 = c(0.105826996952733, 
    0.0298500219688853, 0, 0.139574516141476, 0.0352706140819193
    ), TCGA.Y8.A895.01 = c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_)), row.names = c("Bcell", "CD8", "Dendritic", "Endo", 
"Eos"), class = "data.frame")

> dput(mval[1:5,1:5])
structure(c(2.20666978271644, 2.21762842677891, -4.07494124222421, 
-4.13722707002192, -3.43314164549568, 2.33449419612022, 2.34788404801465, 
-3.75292484979324, -4.3115910063775, -4.31229291319228, 2.54516913102614, 
3.15809412595788, -2.12378973913844, -4.35973967501755, -4.39347889615609, 
2.14840959318955, 1.81982095876368, -3.46795103846624, -4.29965006722576, 
-4.40595273662642, 2.66361259477239, 2.62697164963472, -1.88151767905837, 
-4.13446638546434, -4.09928030669639), dim = c(5L, 5L), dimnames = list(
    c("cg00000957", "cg00001349", "cg00001583", "cg00002028", 
    "cg00002719"), c("TCGA.Y8.A8RZ.01", "TCGA.Y8.A8RY.01", "TCGA.Y8.A897.01", 
    "TCGA.Y8.A896.01", "TCGA.Y8.A895.01")))

Solution:

This happens because mval is a matrix rather than a data frame (meth_deconv is a data frame, which is why meth_deconv[[col]] works). When you use [[ notation on a matrix, it is indexed like a plain vector. For example:

mval[[1]]
# [1] 2.20667

This returns the first element, rather than the first column. Note what happens if you try to use [[ with a column name:

mval[["TCGA.Y8.A895.01"]]
# Error in mval[["TCGA.Y8.A895.01"]] : subscript out of bounds

To refer to a column by its name, instead use mval[, col]:

mval[, "TCGA.Y8.A895.01"]
# cg00000957 cg00001349 cg00001583 cg00002028 cg00002719 
#   2.663613   2.626972  -1.881518  -4.134466  -4.099280 

Note this returns a vector. To return a one-column matrix, you can do mval[, "TCGA.Y8.A895.01", drop = FALSE]. See the Simplifying vs preserving subsetting section of Advanced R by Hadley Wickham for more.
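
The difference between these subsetting forms can be seen on a small example. This is only a sketch on a toy matrix; the names m and d and their values are made up for illustration and are not the asker's data.

```r
# Toy matrix with hypothetical names (not the data from the question)
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

m[[3]]                  # [[ on a matrix indexes it as a vector: third element
m[, "b"]                # column "b" as a named vector
m[, "b", drop = FALSE]  # column "b" kept as a 2 x 1 matrix

# On a data frame, [[ selects a whole column instead
d <- as.data.frame(m)
d[["b"]]
```

If you would rather keep the [[col]] notation from the original loop, converting the matrix with as.data.frame() first should be another option, at the cost of changing the object's class.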

If you use the mval[, col] notation, your code will work:

for (col in colnames(mval)) {
    min <- min(meth_deconv[[col]])
    max <- max(meth_deconv[[col]])
    mval[, col] <- (mval[, col] - min) / (max - min)
}

However, you do not need a loop here. You can do the same with mapply():

mapply(
    \(x, y) (y - min(x)) / (max(x) - min(x)),
    asplit(meth_deconv, 2), asplit(mval, 2)
) 

#            TCGA.Y8.A8RZ.01 TCGA.Y8.A8RY.01 TCGA.Y8.A897.01 TCGA.Y8.A896.01 TCGA.Y8.A895.01
# cg00000957        16.51002        15.06608        17.91306        15.39256              NA
# cg00001349        16.59201        15.15249        22.22686        13.03835              NA
# cg00001583       -30.48819       -24.22018       -14.94736       -24.84659              NA
# cg00002028       -30.95420       -27.82563       -30.68412       -30.80541              NA
# cg00002719       -25.68633       -27.83016       -30.92157       -31.56703              NA

Note that we asplit() both objects into lists of columns before iterating. Without this, mapply() would treat the matrix mval as a plain vector and iterate over its individual elements.
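
Another possible approach, not part of the original answer, is to compute the per-column minimum and range once and apply them with sweep(). The sketch below uses made-up toy inputs (ref stands in for meth_deconv, x for mval). Using na.rm = TRUE is an assumption about how missing values should be handled.

```r
# Hypothetical toy inputs: ref plays the role of meth_deconv, x of mval
ref <- data.frame(s1 = c(0, 0.5, 1), s2 = c(2, 4, 6))
x <- matrix(c(0.5, 1, 3, 5), nrow = 2,
            dimnames = list(c("p1", "p2"), c("s1", "s2")))

# Per-column minimum and range of the reference data
# (na.rm = TRUE is an assumption, not something the question specifies)
lo  <- vapply(ref, min, numeric(1), na.rm = TRUE)
rng <- vapply(ref, max, numeric(1), na.rm = TRUE) - lo

# Align the statistics by column name, then subtract the minimum
# and divide by the range, column by column
cols <- colnames(x)
scaled <- sweep(sweep(x, 2, lo[cols], "-"), 2, rng[cols], "/")
scaled
```

With na.rm = TRUE, a reference column that is entirely NA (such as TCGA.Y8.A895.01 in the sample above) makes min() return Inf with a warning, so such columns would need to be handled or dropped separately.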
