How to read a file with square brackets?

January 5, 2024

I have a file that looks like this:

Disease, ICD10_2023, ICD10_2019, ubk_coding_19, problem
Alzheimer_Disease, [G30, F00], [G30, F00], [G30, F00], -
Malaria, [B50, B51, B52, B53, B54, P37.3, P37.4], [B50, B51, B52, B53, B54, P37.3, P37.4], [B50, B51, B52, B53, B54, P373, P374], !
Spinocerebellar_ataxia, [G11], [G11], [G11], -

How can I read this file in R? I’d like to have lists as elements in the dataframe, meaning that for column ICD10_2023 for example you would have: c(G30, F00), c(B50, B51, B52, B53, B54, P37.3, P37.4), c(G11)

I was simply doing this:
read.csv("example.txt")
I get this error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names

Instead when doing it with header=FALSE the format is destroyed.

>Solution :

Using read.table/read.csv requires that there be an equal number of delimited fields in your file, and you don’t because the bracketed data is not quoted. I think we can work around that using pipe and the command-line sed utility.

res <- read.csv(pipe(r"(sed -E 's/\[/"[/g ; s/\]/]"/g' /tmp/foo)"))
res
#                  Disease                               ICD10_2023                               ICD10_2019                          ubk_coding_19 problem
# 1      Alzheimer_Disease                               [G30, F00]                               [G30, F00]                             [G30, F00]       -
# 2                Malaria  [B50, B51, B52, B53, B54, P37.3, P37.4]  [B50, B51, B52, B53, B54, P37.3, P37.4]  [B50, B51, B52, B53, B54, P373, P374]       !
# 3 Spinocerebellar_ataxia                                    [G11]                                    [G11]                                  [G11]       -

where /tmp/foo is the text file.

For breaking them down into components in list-columns, we can do

res[,2:4] <- lapply(res[,2:4], function(z) strsplit(gsub("[][ ]", "", z), ","))
res
#                  Disease                            ICD10_2023                            ICD10_2019                       ubk_coding_19 problem
# 1      Alzheimer_Disease                              G30, F00                              G30, F00                            G30, F00       -
# 2                Malaria B50, B51, B52, B53, B54, P37.3, P37.4 B50, B51, B52, B53, B54, P37.3, P37.4 B50, B51, B52, B53, B54, P373, P374       !
# 3 Spinocerebellar_ataxia                                   G11                                   G11                                 G11       -
str(res)
# 'data.frame': 3 obs. of  5 variables:
#  $ Disease      : chr  "Alzheimer_Disease" "Malaria" "Spinocerebellar_ataxia"
#  $ ICD10_2023   :List of 3
#   ..$ : chr  "G30" "F00"
#   ..$ : chr  "B50" "B51" "B52" "B53" ...
#   ..$ : chr "G11"
#  $ ICD10_2019   :List of 3
#   ..$ : chr  "G30" "F00"
#   ..$ : chr  "B50" "B51" "B52" "B53" ...
#   ..$ : chr "G11"
#  $ ubk_coding_19:List of 3
#   ..$ : chr  "G30" "F00"
#   ..$ : chr  "B50" "B51" "B52" "B53" ...
#   ..$ : chr "G11"
#  $ problem      : chr  " -" " !" " -"

Notes:

in the first block, I use R’s newish raw-strings r"(...)" because the use of escaped backslashes in the sed expressions could easily get into Inception-level backslashing, I chose to keep it simple;
what we do is replace all [ with "[ and similarly ] to ]", so that read.csv will correctly recognize a quoted cell for what it is; this fails if any of your brackets are not paired
if your filename has spaces, you’ll need to internally quote /tmp/foo, since R’s use of pipe and such are naive to that sort of thing