Problem reading in strings as factors; the preceding spaces seem to matter

I'm having a problem reading in a textConnection() mini-file that contains factors. The fragment below creates two separate factor levels for ‘LabAuto’.

x <- read.table(tc <- textConnection(
"Project, TestingType, CodeType
'TS',     'TDDEUT',    Production
'TS',     'TDDEUT',    Testing
'NR',      'LabAuto',  Production
'In',     'LabAuto',   Testing"),
    header=TRUE, colClasses=c("character", "factor", "factor"),
    sep=",", na.strings=c("NULL"), quote="'")

TestingType shows this, indicating there are two levels labeled (approximately) LabAuto:

> x$TestingType
[1]      TDDEUT        TDDEUT         LabAuto      LabAuto 
Levels:       LabAuto      LabAuto      TDDEUT

Ostensibly this is due to the extra space in front of the first ‘LabAuto’ value, because if I remove one space (on the ‘NR’ line), I end up with just two levels for TestingType, as I want:

> x$TestingType
[1]      TDDEUT       TDDEUT       LabAuto      LabAuto
Levels:      LabAuto      TDDEUT

But shouldn’t specifying the sep="," and quote="'" parameters have told R to only consider the text inside the single quotes as the factor label?

The single quotes are not exclusively the problem, as the third column above has the same issue:

> x$CodeType
[1]     Production     Testing      Production      Testing    
Levels:     Production     Testing    Testing   Production

It shows 4 different levels instead of 2, again ostensibly because there are differing numbers of spaces in front of each value. Is there a way to tell R to ignore spaces when making factor levels out of a text input file? Thanks.

Solution:

Your input is in an unusual format: normally values are separated either by a delimiter or by whitespace, but here you have both. You can strip the extra whitespace by passing strip.white = TRUE to read.table (the argument only takes effect when sep is specified). Use

# strip.white = TRUE trims leading/trailing whitespace around each field
x <- read.table(tc <- textConnection(
  "Project, TestingType, CodeType
'TS',     'TDDEUT',    Production
'TS',     'TDDEUT',    Testing
'NR',      'LabAuto',  Production
'In',     'LabAuto',   Testing"),
  header=TRUE, colClasses=c("character", "factor", "factor"),
  sep=",", na.strings=c("NULL"), quote="'", strip.white = TRUE)
close(tc)  # the text connection stays open otherwise

x$TestingType
# [1] TDDEUT  TDDEUT  LabAuto LabAuto
# Levels: LabAuto TDDEUT
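
If the data has already been read in without strip.white, the levels can probably be repaired after the fact as well. The following is only a minimal sketch, not part of the original answer: the vector testing_type is made-up illustration data that mimics the padded values from the question, and trimws() assumes R 3.2.0 or later. It relies on the fact that assigning duplicate labels to levels() merges the matching levels.

```r
# Hypothetical data mimicking the problem: the same labels with
# differing amounts of leading whitespace (not read from a file).
testing_type <- factor(c(" TDDEUT", " TDDEUT", "  LabAuto", " LabAuto"))

# Before cleaning: "  LabAuto" and " LabAuto" count as distinct levels.
levels_before <- nlevels(testing_type)

# Trim whitespace from the level labels; levels that become
# identical after trimming are merged into one.
levels(testing_type) <- trimws(levels(testing_type))

levels_after <- nlevels(testing_type)
print(levels(testing_type))
```

Here levels_before holds the level count with the padded labels and levels_after the count once they are trimmed. Fixing the whitespace at read time with strip.white = TRUE is still the simpler route when you control the read.table call.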