I have a pandas DataFrame with a column containing strings representing numbers. These strings have mixed formats. Some times numbers use comma as a decimal separator and sometimes a dot. When a dot is used as a decimal separator, that number can contain comma as a thousand separator.
For example:
import pandas as pd
data = {
'NumberString': [
'1,234.56',
'789,012.34',
'45,678',
'9,876.54',
'3,210.98',
'1,000,000.01',
'123.45',
'42,000',
'NaN'
]
}
df = pd.DataFrame(data)
I want to convert this column to numeric without losing some of the data due to inconsistent format (commas vs dots). However, using pd.to_numeric with parameter errors=’coerce’ will drop down some of the number
Python
Is there a way to format all the strings to numbers without loosing them due to format?
What I have tried so far:
>>> df['Number'] = pd.to_numeric(df['NumberString'].str.replace(',','.'), errors='coerce')
NumberString Number
1,234.56 NaN
789,012.34 NaN
45,678 45.678
9,876.54 NaN
3,210.98 NaN
1,000,000.01 NaN
123.45 123.450
42,000 42.000
NaN NaN
Desired output:
NumberString Number
1,234.56 1234.56
789,012.34 789012.34
45,678 45.678
9,876.54 9876.54
3,210.98 3210.98
1,000,000.01 1000000.01
123.45 123.450
42,000 42.000
NaN NaN
>Solution :
If you have mixed formats, you could first try to str.replace the commas by a dot and convert to_numeric with errors='coerce', then fillna with an attempt converting the commas to empty string:
df['Number'] = (pd.to_numeric(df['NumberString'].str.replace(',', '.'), errors='coerce')
.fillna(pd.to_numeric(df['NumberString'].str.replace(',', ''), errors='coerce'))
)
Output:
NumberString Number
0 1,234.56 1234.560
1 789,012.34 789012.340
2 45,678 45.678
3 9,876.54 9876.540
4 3,210.98 3210.980
5 1,000,000.01 1000000.010
6 123.45 123.450
7 42,000 42.000
8 NaN NaN
what about 1.000.000,01?
If this is desired, then you can follow the same logic. The decision will be made in the following order:
- if a single
./,: use a decimal separator - if a mix, first try to first consider
,as a thousands separator, then.
df['Number'] = (pd.to_numeric(df['NumberString'].str.replace(',', '.'), errors='coerce')
.fillna(pd.to_numeric(df['NumberString'].str.replace(',', ''), errors='coerce'))
.fillna(pd.to_numeric(df['NumberString'].str.replace('.', '')
.str.replace(',', '.'), errors='coerce'))
)
Output:
NumberString Number
0 1,234.56 1234.560
1 789,012.34 789012.340
2 45,678 45.678
3 9,876.54 9876.540
4 3,210.98 3210.980
5 1,000,000.01 1000000.010
6 123.45 123.450
7 42,000 42.000
8 NaN NaN
9 1.000.000,01 1000000.010
10 1,000,000 1000000.000
11 1.000.000 1000000.000