Convert to float, pandas string column with mixed thousand and decimal separators

I have a pandas DataFrame with a column of strings that represent numbers in mixed formats. Sometimes the decimal separator is a comma and sometimes it is a dot. When a dot is the decimal separator, the number can also contain commas as thousands separators.

For example:

import pandas as pd
data = {
    'NumberString': [
        '1,234.56',
        '789,012.34',
        '45,678',
        '9,876.54',
        '3,210.98',
        '1,000,000.01',
        '123.45',
        '42,000',
        'NaN'
    ]
}
df = pd.DataFrame(data)

I want to convert this column to numeric without losing data because of the inconsistent format (commas vs. dots). However, pd.to_numeric with errors='coerce' turns some of the numbers into NaN.

Is there a way to convert all the strings to numbers without losing any of them because of the format?

What I have tried so far:

>>> df['Number'] = pd.to_numeric(df['NumberString'].str.replace(',','.'), errors='coerce')
NumberString  Number
1,234.56      NaN
789,012.34    NaN
45,678        45.678
9,876.54      NaN
3,210.98      NaN
1,000,000.01  NaN
123.45        123.450
42,000        42.000
NaN           NaN

Desired output:

NumberString  Number
1,234.56      1234.56 
789,012.34    789012.34
45,678        45.678
9,876.54      9876.54
3,210.98      3210.98
1,000,000.01  1000000.01
123.45        123.450
42,000        42.000
NaN           NaN

Solution:

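With mixed formats, you can make two attempts. First, str.replace every comma with a dot and convert with to_numeric and errors='coerce'; any value that then contains more than one dot becomes NaN. Then fillna those gaps with a second attempt in which the commas are removed instead (replaced with an empty string):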

df['Number'] = (pd.to_numeric(df['NumberString'].str.replace(',', '.'), errors='coerce')
                  .fillna(pd.to_numeric(df['NumberString'].str.replace(',', ''), errors='coerce'))
               )

Output:

   NumberString       Number
0      1,234.56     1234.560
1    789,012.34   789012.340
2        45,678       45.678
3      9,876.54     9876.540
4      3,210.98     3210.980
5  1,000,000.01  1000000.010
6        123.45      123.450
7        42,000       42.000
8           NaN          NaN
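
To make the fallback easier to follow, here is a minimal sketch that keeps the two passes in separate columns. The names s, as_decimal, as_thousands and steps are illustrative and not part of the original answer, and the sample values are a subset of the question's data.

```python
import pandas as pd

# A few of the strings from the question (subset chosen for illustration).
s = pd.Series(['1,234.56', '45,678', '123.45', '1,000,000.01', 'NaN'])

# Pass 1: treat every comma as a decimal separator.
# Strings that end up with more than one dot fail to parse and become NaN.
as_decimal = pd.to_numeric(s.str.replace(',', '.', regex=False), errors='coerce')

# Pass 2: treat every comma as a thousands separator by removing it.
as_thousands = pd.to_numeric(s.str.replace(',', '', regex=False), errors='coerce')

# Pass 1 wins wherever it parsed; pass 2 only fills the remaining gaps.
result = as_decimal.fillna(as_thousands)

steps = pd.DataFrame({
    'raw': s,
    'as_decimal': as_decimal,
    'as_thousands': as_thousands,
    'result': result,
})
print(steps)
```

Note the consequence of the ordering: '45,678' parses in both passes (45.678 and 45678), and because the decimal reading is tried first, the result is 45.678, which matches the desired output in the question.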

What about 1.000.000,01?

If this format can also occur, the same logic extends with a third attempt. The readings are tried in the following order, and the first one that parses is kept:

  • a single separator (. or ,) is read as a decimal separator
  • with several separators, , is tried as the thousands separator first, then .

# regex=False makes '.' a literal dot (the default was regex=True before pandas 2.0)
df['Number'] = (pd.to_numeric(df['NumberString'].str.replace(',', '.'), errors='coerce')
                  .fillna(pd.to_numeric(df['NumberString'].str.replace(',', ''), errors='coerce'))
                  .fillna(pd.to_numeric(df['NumberString'].str.replace('.', '', regex=False)
                                                          .str.replace(',', '.'), errors='coerce'))
               )

Output (with the rows '1.000.000,01', '1,000,000' and '1.000.000' appended to the input):

    NumberString       Number
0       1,234.56     1234.560
1     789,012.34   789012.340
2         45,678       45.678
3       9,876.54     9876.540
4       3,210.98     3210.980
5   1,000,000.01  1000000.010
6         123.45      123.450
7         42,000       42.000
8            NaN          NaN
9   1.000.000,01  1000000.010
10     1,000,000  1000000.000
11     1.000.000  1000000.000
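
The three attempts can also be wrapped in a reusable helper. This is only a sketch under the same assumptions as the answer above; the function name parse_mixed_numbers and the inner helper attempt are hypothetical names introduced here for illustration.

```python
import pandas as pd

def parse_mixed_numbers(strings: pd.Series) -> pd.Series:
    """Parse number strings with mixed ',' and '.' separators.

    Hypothetical helper: tries three readings in order and keeps the
    first one that parses for each value.
    """
    def attempt(cleaned: pd.Series) -> pd.Series:
        # Unparseable strings become NaN instead of raising.
        return pd.to_numeric(cleaned, errors='coerce')

    # 1) Every comma is a decimal separator.
    comma_as_decimal = attempt(strings.str.replace(',', '.', regex=False))
    # 2) Every comma is a thousands separator.
    comma_as_thousands = attempt(strings.str.replace(',', '', regex=False))
    # 3) Every dot is a thousands separator and the comma is the decimal separator.
    dot_as_thousands = attempt(
        strings.str.replace('.', '', regex=False).str.replace(',', '.', regex=False)
    )
    return comma_as_decimal.fillna(comma_as_thousands).fillna(dot_as_thousands)

samples = pd.Series(['1,234.56', '45,678', '1.000.000,01',
                     '1,000,000', '1.000.000', '1.234', 'NaN'])
parsed = parse_mixed_numbers(samples)
print(pd.DataFrame({'raw': samples, 'parsed': parsed}))
```

One design consequence to keep in mind: a value with a single separator is always read as a decimal, so '1.234' and '45,678' come out as 1.234 and 45.678 even if they were meant as thousands. If the data can contain such values, the strings alone do not carry enough information, and some extra knowledge about the source is needed to decide.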