R and Python have different character sort order

I’m trying to sort a data frame in R and discovered that the sort order does not match the expected ASCII order. I need R to sort the data frame the same way Python sorts the data.

df = df[do.call(order, df), ]  # sort by all columns

As shown here, Python sorts uppercase letters before lowercase letters, as expected:

$ python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
>>> "A" < "a"
True
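To make the Python side concrete, here is a minimal sketch of why "A" sorts before "a": Python compares strings by Unicode code point, which matches ASCII order for these letters. The sample values (`words`, `rows`) are made up for illustration and are not from the original data frame.

```python
# Python compares str values code point by code point; no locale is involved.
words = ["b", "A", "a", "B"]  # hypothetical sample data

# Uppercase ASCII letters have smaller code points than lowercase ones.
code_points = {ch: ord(ch) for ch in "AaBb"}

# Plain sorted() therefore puts all uppercase letters first.
sorted_words = sorted(words)

# Sorting tuples compares element by element, roughly analogous to
# ordering a data frame by all of its columns.
rows = [("b", 1), ("A", 2), ("a", 0), ("A", 1)]  # hypothetical rows
sorted_rows = sorted(rows)

print(code_points)   # {'A': 65, 'a': 97, 'B': 66, 'b': 98}
print(sorted_words)  # ['A', 'B', 'a', 'b']
print(sorted_rows)   # [('A', 1), ('A', 2), ('a', 0), ('b', 1)]
```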

But R sorts uppercase letters after lowercase letters:

$ R
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
  Natural language support but running in an English locale
> "A" < "a"
[1] FALSE
> "A" > "a"
[1] TRUE

How can I change the R sort behavior to match the standard ASCII ordering? Is there a parameter to the order function, or a configuration setting, that changes the sort order?

Note: this is not a distinction between case-sensitive and case-insensitive sorting. It is worse than that: the case-sensitive sorting itself uses a non-standard order.

Solution:

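Different locales use different collation (sort) orders, including the rules for case. You probably want Sys.setlocale(locale = "C"), which switches every locale category to "C"; Sys.setlocale("LC_COLLATE", "C") changes only the collation order. (There is more information about locale definitions and case sorting order here.)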

?Comparison says a little bit about locale-specific sorting …

The collating sequence of locales such as ‘en_US’ is
normally different from ‘C’ (which should use ASCII) and can be
surprising.

… but as far as I can see, it does not say anything explicit about case order (searching for "case" on the page didn’t turn up any hits).

> Sys.getlocale()
[1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8"
> "A" < "a"
[1] FALSE
> Sys.setlocale(locale = "C")
[1] "C/C/C/C/C/en_CA.UTF-8"
> "A" < "a"
[1] TRUE
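The same effect can be cross-checked from Python, which also supports locale-aware collation through its standard `locale` module. This is only a sketch under assumptions: the sample list `words` is made up, and only the portable "C" locale is used, because other locale names (such as en_CA.UTF-8) may not be installed on a given system. Under the "C" collation, the locale-aware sort should agree with Python's default code point order, which is the order the R session above switches to.

```python
import locale

words = ["b", "A", "a", "B"]  # hypothetical sample data

# Remember the current collation setting so it can be restored afterwards.
previous = locale.setlocale(locale.LC_COLLATE)

# Switch only the collation category to the portable "C" locale.
locale.setlocale(locale.LC_COLLATE, "C")
try:
    # locale.strxfrm turns each string into a key that sorts
    # according to the active LC_COLLATE setting.
    c_sorted = sorted(words, key=locale.strxfrm)
finally:
    # Restore whatever collation was active before.
    locale.setlocale(locale.LC_COLLATE, previous)

print(c_sorted)                   # ['A', 'B', 'a', 'b']
print(c_sorted == sorted(words))  # True: "C" collation matches code point order here
```

Changing only LC_COLLATE, and restoring it afterwards, keeps the rest of the locale (number formatting, messages) untouched; the same consideration applies when choosing between changing all categories and changing only collation in R.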