I have the following series
s = pd.Series({0: 'R$ 3,74\n-10%\nR$ 4,15',
1: 'R$ 12,25',
2: 'R$ 14,65',})
What I want to do is extract the numbers in their current format and then substitute the ‘,’ for a ".", so I can convert to float values. Unfortunately, I am having trouble with my current capture group, which is basically capturing everything following that format(number,number) a infinite amount of times. Including on the first string.
(\d+\.?\,?\d*)
I would like to limit my capture group, to only capture the first occurrence of such event. I would appreciate other approach’s too.
Wanted result is as follows:
s = pd.Series({0:3.74,
1:12.25,
2:14.65})
>Solution :
You can use:
s.str.extract('(\d+,?\d*)', expand=False).str.replace(',', '.').astype(float)
Or:
import re
s.str.replace('.*(\d+),(\d+).*', r'\1.\2', regex=True, flags=re.DOTALL
).astype(float)
Output:
0 3.74
1 12.25
2 14.65
dtype: float64
timings
On 300K items:
s.str.extract('(\d+,?\d*)', expand=False).str.replace(',', '.').astype(float)
# 206 ms ± 7.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
s.str.replace('.*(\d+),(\d+).*', r'\1.\2', regex=True, flags=re.DOTALL).astype(float)
# 607 ms ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)