Here is a minimal example.
# create df
import pandas as pd
df2 = pd.DataFrame({'var1': ['1_0','1_0','1_0','1_0','1_0'],
                    'var2': ['X','y','a','a','a']})
df2.to_json('df2.json')
# import df
df2 = pd.read_json('df2.json')
df2
The code above would be expected to generate:
var1 var2
0 1_0 X
1 1_0 y
2 1_0 a
3 1_0 a
4 1_0 a
However, it generates:
var1 var2
0 10 X
1 10 y
2 10 a
3 10 a
4 10 a
If, in the initial code, I modify an entry of ['var1'], then the DataFrame produced on import is correct. Here is an example to illustrate this.
df2 = pd.DataFrame({'var1': ['1_0','hello','1_0','1_0','1_0'],
                    'var2': ['X','y','a','a','a']})
df2.to_json('df2.json')
# import df
df2 = pd.read_json('df2.json')
df2
var1 var2
0 1_0 X
1 hello y
2 1_0 a
3 1_0 a
4 1_0 a
The same problem is observed if the file is saved in CSV format and then imported.
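For the CSV case, forcing the column dtype on import works the same way. A minimal sketch (the file name df2.csv is just for illustration):

```python
import pandas as pd

df2 = pd.DataFrame({'var1': ['1_0', '1_0'], 'var2': ['X', 'y']})
df2.to_csv('df2.csv', index=False)

# Forcing the column to str on import preserves the underscores
df2 = pd.read_csv('df2.csv', dtype={'var1': str})
print(df2['var1'].tolist())  # ['1_0', '1_0']
```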
Has anyone encountered the same issue?
>Solution :
This is because underscores are valid digit separators in Python (often used as a thousands separator: 1_000 is 1000), so '1_0' is parsed as the number 10 on import. You can force the dtype upon import (or disable dtype inference with dtype=False):
df2 = pd.read_json('df2.json', dtype='str')
If you want to keep dtype detection for the other columns:
df2 = pd.read_json('df2.json', dtype={'var1': 'str'})
Output:
var1 var2
0 1_0 X
1 1_0 y
2 1_0 a
3 1_0 a
4 1_0 a
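The dtype=False variant mentioned above can be sketched similarly (demo.json is a hypothetical file name): with inference disabled, the JSON strings pass through untouched.

```python
import pandas as pd

pd.DataFrame({'var1': ['1_0', '1_0']}).to_json('demo.json')

# dtype=False disables dtype inference for the data entirely,
# so the strings are returned as-is instead of being parsed as numbers
df = pd.read_json('demo.json', dtype=False)
print(df['var1'].tolist())
```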
When the JSON contains a string that clearly is not a number (like 'hello'), there is no ambiguity, so the conversion is not attempted on that column.
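The underscore rule is plain Python behavior (PEP 515, since Python 3.6), which you can check directly:

```python
# Underscores are legal digit separators, both in numeric
# literals and in str-to-number conversions
print(1_000)         # 1000
print(int('1_0'))    # 10
print(float('1_0'))  # 10.0
```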