Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

how to make data.table omit extra col

I am trying to load a very large tab file by data.table::fread, my input data may have a extra tab before last col,like:

#The entire table is tab split, we just show finally tab as \t
col1    col2    col3\t # a tab here, I have many cols in fact (>3000)
1    2    9\t
1    3    3\t
3    9    6\t

I noted fread will infer colname if have extra col, but its not correctly

# fread infered as 
V1   col1    col2   col3

I dont want V1 exists in first

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I have tried

  1. shift and reset colname, it could worked, but now I plan to use select to select columns when I read the file, so I can’t do that

  2. use data.table::fread(sep='\t', cmd=paste0('pigz -d -c ', input, ' | sed \'s/\t$//\''), select = select_col), Its worked, but seems very slower than process gz by data.table::fread

>Solution :

Starting with this result:

fread("input.tab", sep="\t")
#     col1  col2  col3     V4
#    <int> <int> <int> <lgcl>
# 1:     1     2     9     NA
# 2:     1     3     3     NA
# 3:     3     9     6     NA

if you know how many columns exist ahead of time, then you can simply select= them:

fread("input.tab", select=1:3)
#     col1  col2  col3
#    <int> <int> <int>
# 1:     1     2     9
# 2:     1     3     3
# 3:     3     9     6

Alternatively, if it’s a large file and you want to be flexible, then you can do something like this:

# read in just enough rows so that you are confident-enough 
# that the last column is all NA, a symptom of the trailing-tab
tmp <- fread("input.tab", sep="\t", nrows=10)
tmp
#     col1  col2  col3     V4
#    <int> <int> <int> <lgcl>
# 1:     1     2     9     NA
# 2:     1     3     3     NA
# 3:     3     9     6     NA
ncols <- ncol(tmp) - all(is.na(tmp[[ncol(tmp)]]))
ncols
# [1] 3
fread("input.tab", select=seq(ncols))
#     col1  col2  col3
#    <int> <int> <int>
# 1:     1     2     9
# 2:     1     3     3
# 3:     3     9     6

If you have predetermined column indices in select_col, then you can use

fread(..., select = select_col[select_col < ncols])

This naive heuristic can be fooled if the tab-problem does not exist but the first n rows of the table are legitimately empty/null.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading