Colnames unexpectedly updating variable in R

I’m trying to get a list of column names that have been added after the initial CSV load. If I am not updating the variable after the column names are added, how are they ending up in it?

I would expect only Name and Age to be printed from my_cols, but isJon is printed as well.

library(data.table)

Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)

df <- data.table(Name, Age)

my_cols <- colnames(df)

print(my_cols)

df[,isJon:=ifelse(Name=="Jon", 1, 0)]

print(my_cols)


Solution:

There are at least two things going on here:

  • R copies vectors lazily (copy-on-modify). When you run my_cols <- colnames(df), nothing is modified, so R does not duplicate the vector of names; my_cols simply refers to the same vector stored in the frame’s attributes. Only when you do something to my_cols that could change it does R copy the vector, and from then on it no longer changes when the original frame is updated.

  • data.table modifies objects in place (reference semantics), so when it adds a column, the internal vector of column names is extended in place, contrary to R’s normal way of doing things. Normally, changing a data.frame creates a new vector of names when you add a column.

    Compare base::data.frame, where adding a column creates a new vector of column names, so my_cols does not magically stay updated:

    Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
    Age <- c(23, 41, 32, 58, 26)
    df <- data.frame(Name, Age)
    my_cols <- colnames(df)
    print(my_cols)
    # [1] "Name" "Age" 
    df <- transform(df, isJon=ifelse(Name=="Jon", 1, 0))
    print(my_cols)
    # [1] "Name" "Age" 
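
One way to see the sharing described in the two bullets above is to compare memory addresses with data.table's address() function. This is only a minimal sketch: the variable names shared and snapshot are made up for illustration, and whether the first comparison is TRUE depends on the names vector really being shared, as the explanation above implies.

```r
library(data.table)

df <- data.table(Name = c("Jon", "Bill"), Age = c(23, 41))

shared   <- colnames(df)        # plain assignment: no duplicate is made here
snapshot <- copy(colnames(df))  # explicit duplicate of the names vector

# address() reports where an object lives in memory.
# TRUE here means `shared` is the very vector the table uses for its names.
print(address(shared) == address(names(df)))

# The explicit copy lives at a different address, so later in-place
# changes to the table's names cannot reach it.
print(address(snapshot) == address(names(df)))
# [1] FALSE
```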
    

There are a couple of ways to get the behavior you were expecting:

  1. Use copy() on the vector, which forces my_cols to be a new copy (yes, good name) of the names.

    Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
    Age <- c(23, 41, 32, 58, 26)
    df <- data.table(Name, Age)
    my_cols <- copy(colnames(df))
    print(my_cols)
    # [1] "Name" "Age" 
    df[,isJon:=ifelse(Name=="Jon", 1, 0)]
    print(my_cols)
    # [1] "Name" "Age" 
    
  2. Do "something" to the vector that makes R produce a new one, such as subsetting it with empty brackets:

    Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
    Age <- c(23, 41, 32, 58, 26)
    df <- data.table(Name, Age)
    my_cols <- colnames(df)[]
    print(my_cols)
    # [1] "Name" "Age" 
    df[,isJon:=ifelse(Name=="Jon", 1, 0)]
    print(my_cols)
    # [1] "Name" "Age" 
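
With a detached snapshot in hand, the original goal (listing the columns added since the initial load) could be sketched with setdiff(). The names original_cols and added_cols are hypothetical, and the small inline table stands in for the data loaded from CSV.

```r
library(data.table)

# Stand-in for the table produced by the initial CSV load
df <- data.table(Name = c("Jon", "Bill", "Maria"), Age = c(23, 41, 32))

# Snapshot of the names at load time; copy() detaches it from the table
original_cols <- copy(names(df))

# Add a column by reference, as in the question
df[, isJon := ifelse(Name == "Jon", 1, 0)]

# Columns present now that were not in the snapshot
added_cols <- setdiff(names(df), original_cols)
print(added_cols)
# [1] "isJon"
```

Because setdiff() compares against the snapshot rather than the live names vector, this keeps working no matter how many columns are later added by reference.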
    