I have a file that looks like this:
Disease, ICD10_2023, ICD10_2019, ubk_coding_19, problem
Alzheimer_Disease, [G30, F00], [G30, F00], [G30, F00], -
Malaria, [B50, B51, B52, B53, B54, P37.3, P37.4], [B50, B51, B52, B53, B54, P37.3, P37.4], [B50, B51, B52, B53, B54, P373, P374], !
Spinocerebellar_ataxia, [G11], [G11], [G11], -
How can I read this file in R? I’d like to have lists as elements in the dataframe, meaning that for column ICD10_2023 for example you would have: c(G30, F00), c(B50, B51, B52, B53, B54, P37.3, P37.4), c(G11)
I was simply doing this:
read.csv("example.txt")
I get this error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
Instead when doing it with header=FALSE the format is destroyed.
>Solution :
Using read.table/read.csv requires that there be an equal number of delimited fields in your file, and you don’t because the bracketed data is not quoted. I think we can work around that using pipe and the command-line sed utility.
res <- read.csv(pipe(r"(sed -E 's/\[/"[/g ; s/\]/]"/g' /tmp/foo)"))
res
# Disease ICD10_2023 ICD10_2019 ubk_coding_19 problem
# 1 Alzheimer_Disease [G30, F00] [G30, F00] [G30, F00] -
# 2 Malaria [B50, B51, B52, B53, B54, P37.3, P37.4] [B50, B51, B52, B53, B54, P37.3, P37.4] [B50, B51, B52, B53, B54, P373, P374] !
# 3 Spinocerebellar_ataxia [G11] [G11] [G11] -
where /tmp/foo is the text file.
For breaking them down into components in list-columns, we can do
res[,2:4] <- lapply(res[,2:4], function(z) strsplit(gsub("[][ ]", "", z), ","))
res
# Disease ICD10_2023 ICD10_2019 ubk_coding_19 problem
# 1 Alzheimer_Disease G30, F00 G30, F00 G30, F00 -
# 2 Malaria B50, B51, B52, B53, B54, P37.3, P37.4 B50, B51, B52, B53, B54, P37.3, P37.4 B50, B51, B52, B53, B54, P373, P374 !
# 3 Spinocerebellar_ataxia G11 G11 G11 -
str(res)
# 'data.frame': 3 obs. of 5 variables:
# $ Disease : chr "Alzheimer_Disease" "Malaria" "Spinocerebellar_ataxia"
# $ ICD10_2023 :List of 3
# ..$ : chr "G30" "F00"
# ..$ : chr "B50" "B51" "B52" "B53" ...
# ..$ : chr "G11"
# $ ICD10_2019 :List of 3
# ..$ : chr "G30" "F00"
# ..$ : chr "B50" "B51" "B52" "B53" ...
# ..$ : chr "G11"
# $ ubk_coding_19:List of 3
# ..$ : chr "G30" "F00"
# ..$ : chr "B50" "B51" "B52" "B53" ...
# ..$ : chr "G11"
# $ problem : chr " -" " !" " -"
Notes:
- in the first block, I use R’s newish raw-strings
r"(...)"because the use of escaped backslashes in thesedexpressions could easily get into Inception-level backslashing, I chose to keep it simple; - what we do is replace all
[with"[and similarly]to]", so thatread.csvwill correctly recognize a quoted cell for what it is; this fails if any of your brackets are not paired - if your filename has spaces, you’ll need to internally quote
/tmp/foo, since R’s use ofpipeand such are naive to that sort of thing