Problem reading in strings as factors; the preceding spaces seem to matter

November 11, 2021

I'm having a problem reading in a textConnection() mini-file that contains factor columns. The fragment below produces two separate factor levels for ‘LabAuto’.

x <- read.table(tc <- textConnection(
"Project, TestingType, CodeType
'TS',     'TDDEUT',    Production
'TS',     'TDDEUT',    Testing
'NR',      'LabAuto',  Production
'In',     'LabAuto',   Testing"),
    header=TRUE, colClasses=c("character", "factor", "factor"),
    sep=",", na.strings=c("NULL"), quote="'")

TestingType shows this, indicating there are two levels labeled (approximately) LabAuto:

> x$TestingType
[1]      TDDEUT        TDDEUT         LabAuto      LabAuto 
Levels:       LabAuto      LabAuto      TDDEUT

Presumably this is due to the extra space in front of the first ‘LabAuto’ value, because if I remove one space (on the ‘NR’ line), I end up with just two levels for TestingType, as I want:

> x$TestingType
[1]      TDDEUT       TDDEUT       LabAuto      LabAuto
Levels:      LabAuto      TDDEUT

But shouldn’t specifying sep="," and quote="'" have told R to treat only the text inside the single quotes as the factor label?

The single quotes are not the only problem, as the unquoted third column above has the same issue:

> x$CodeType
[1]     Production     Testing      Production      Testing    
Levels:     Production     Testing    Testing   Production

It shows 4 different levels instead of 2, again presumably because each value has a different number of spaces in front of it. Is there a way to tell R to ignore spaces when making factor levels out of a text input file? Thanks.

>Solution :

Your input is in an unusual format. Normally values are separated by either a delimiter or whitespace; you have both. With sep=",", the spaces around each value are kept as part of the field, so they end up in the factor labels. You can strip them by passing strip.white = TRUE to read.table:

x <- read.table(tc <- textConnection(
  "Project, TestingType, CodeType
'TS',     'TDDEUT',    Production
'TS',     'TDDEUT',    Testing
'NR',      'LabAuto',  Production
'In',     'LabAuto',   Testing"),
  header=TRUE, colClasses=c("character", "factor", "factor"),
  sep=",", na.strings=c("NULL"), quote="'", strip.white = TRUE)

x$TestingType
# [1] TDDEUT  TDDEUT  LabAuto LabAuto
# Levels: LabAuto TDDEUT
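
If the data has already been read in without strip.white, the levels can also be cleaned up afterwards. The following is only a sketch under a few assumptions: trim_factor_levels() is a hypothetical helper (not part of base R), the input is passed through read.table()'s text= argument instead of a textConnection(), and dput() is used just to make the stray spaces in the levels visible.

```r
# Same data as in the question, passed via text= so no connection stays open.
raw <- "Project, TestingType, CodeType
'TS',     'TDDEUT',    Production
'TS',     'TDDEUT',    Testing
'NR',      'LabAuto',  Production
'In',     'LabAuto',   Testing"

# Read WITHOUT strip.white to reproduce the problem.
messy <- read.table(text = raw, header = TRUE, sep = ",", quote = "'",
                    na.strings = "NULL",
                    colClasses = c("character", "factor", "factor"))

# dput() prints the levels as quoted strings, so leading spaces are visible.
dput(levels(messy$CodeType))

# Hypothetical helper: rebuild every factor column from its trimmed labels,
# so that levels differing only in surrounding whitespace are merged.
trim_factor_levels <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], function(f) factor(trimws(as.character(f))))
  df
}

clean <- trim_factor_levels(messy)
levels(clean$TestingType)
# [1] "LabAuto" "TDDEUT"
levels(clean$CodeType)
# [1] "Production" "Testing"
```

Using strip.white = TRUE at read time is simpler when you control the read.table call; the post-hoc cleanup is mainly useful when the data frame comes from somewhere else.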