Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to read a file with square brackets?

I have a file that looks like this:

Disease, ICD10_2023, ICD10_2019, ubk_coding_19, problem
Alzheimer_Disease, [G30, F00], [G30, F00], [G30, F00], -
Malaria, [B50, B51, B52, B53, B54, P37.3, P37.4], [B50, B51, B52, B53, B54, P37.3, P37.4], [B50, B51, B52, B53, B54, P373, P374], !
Spinocerebellar_ataxia, [G11], [G11], [G11], -

How can I read this file in R? I’d like to have lists as elements in the dataframe, meaning that for column ICD10_2023 for example you would have: c(G30, F00), c(B50, B51, B52, B53, B54, P37.3, P37.4), c(G11)

I was simply doing this:
read.csv("example.txt")
I get this error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Instead when doing it with header=FALSE the format is destroyed.

>Solution :

Using read.table/read.csv requires that there be an equal number of delimited fields in your file, and you don’t because the bracketed data is not quoted. I think we can work around that using pipe and the command-line sed utility.

res <- read.csv(pipe(r"(sed -E 's/\[/"[/g ; s/\]/]"/g' /tmp/foo)"))
res
#                  Disease                               ICD10_2023                               ICD10_2019                          ubk_coding_19 problem
# 1      Alzheimer_Disease                               [G30, F00]                               [G30, F00]                             [G30, F00]       -
# 2                Malaria  [B50, B51, B52, B53, B54, P37.3, P37.4]  [B50, B51, B52, B53, B54, P37.3, P37.4]  [B50, B51, B52, B53, B54, P373, P374]       !
# 3 Spinocerebellar_ataxia                                    [G11]                                    [G11]                                  [G11]       -

where /tmp/foo is the text file.

For breaking them down into components in list-columns, we can do

res[,2:4] <- lapply(res[,2:4], function(z) strsplit(gsub("[][ ]", "", z), ","))
res
#                  Disease                            ICD10_2023                            ICD10_2019                       ubk_coding_19 problem
# 1      Alzheimer_Disease                              G30, F00                              G30, F00                            G30, F00       -
# 2                Malaria B50, B51, B52, B53, B54, P37.3, P37.4 B50, B51, B52, B53, B54, P37.3, P37.4 B50, B51, B52, B53, B54, P373, P374       !
# 3 Spinocerebellar_ataxia                                   G11                                   G11                                 G11       -
str(res)
# 'data.frame': 3 obs. of  5 variables:
#  $ Disease      : chr  "Alzheimer_Disease" "Malaria" "Spinocerebellar_ataxia"
#  $ ICD10_2023   :List of 3
#   ..$ : chr  "G30" "F00"
#   ..$ : chr  "B50" "B51" "B52" "B53" ...
#   ..$ : chr "G11"
#  $ ICD10_2019   :List of 3
#   ..$ : chr  "G30" "F00"
#   ..$ : chr  "B50" "B51" "B52" "B53" ...
#   ..$ : chr "G11"
#  $ ubk_coding_19:List of 3
#   ..$ : chr  "G30" "F00"
#   ..$ : chr  "B50" "B51" "B52" "B53" ...
#   ..$ : chr "G11"
#  $ problem      : chr  " -" " !" " -"

Notes:

  • in the first block, I use R’s newish raw-strings r"(...)" because the use of escaped backslashes in the sed expressions could easily get into Inception-level backslashing, I chose to keep it simple;
  • what we do is replace all [ with "[ and similarly ] to ]", so that read.csv will correctly recognize a quoted cell for what it is; this fails if any of your brackets are not paired
  • if your filename has spaces, you’ll need to internally quote /tmp/foo, since R’s use of pipe and such are naive to that sort of thing
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading