I have a regex that looks like this to extract order numbers from columns:
df["Orders"].str.extract('([0-9]{9,10}[/+ #_;.-]?)')
The orders column can look like this:
12
123456789
1234567890
123456789/1234567890
123456789/1/123456789
123456789+1234567890
The resulting new column in the dataframe after the regex should look like this:
NaN
123456789
1234567890
123456789/1234567890
123456789/123456789
123456789+1234567890
However, with my current regex I’m getting the following result:
NaN
123456789
1234567890
123456789/
123456789/
123456789+
How can I get the result that I’m looking for?
>Solution :
You can use
import pandas as pd
df = pd.DataFrame({'Orders':['12','123456789','1234567890','123456789/1234567890','123456789/1/123456789','123456789+1234567890', 'Order number: 6508955960_000010_1005500']})
df["Result"] = df["Orders"].str.findall(r'[/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])').str.join('').str.lstrip('/+ #_;.-')
df.loc[df['Result'] == '', 'Result'] = np.nan
See the regex demo. Details
[/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])– matches an optional/,+, space,#,_,;,.or-char, and then none or ten digit number not enclosed with other digitsSeries.str.findallextracts all occurrences.str.join('')concatenates the matches into a single string.str.lstrip('/+ #_;.-')– removes the special chars that were matched with the number at the beginning of the stringdf.loc[df['Result'] == '', 'Result'] = np.nan– if needed – replaces empty strings withnp.nanvalues in theResultcolumn.
Output:
>>> df
Orders Result
0 NaN NaN
1 123456789 123456789
2 1234567890 1234567890
3 123456789/1234567890 123456789/1234567890
4 123456789/1/123456789 123456789/123456789
5 123456789+1234567890 123456789+1234567890
>>>