Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Regex, that matches variable grams sizes

I have the following sample series

 s = pd.Series({0: 'Açúcar Refinado UNIÃO Pacote 1kg',
 1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
 2: 'Açúcar Refinado DA BARRA Pacote 1kg',
 3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
 4: 'Açúcar Refinado GUARANI Pacote 1Kg',
 5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
 6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
 7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
 8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
 9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
 10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
 11: 'Acucar Refinado Uniao 1kg'})

What I want to do is to capture the string part that represents the weights of the given products. In specific, the "1kg" string or the "500g" string.
I need to capture one or another, so I can easily interact through the pandas.Series object.

What I tried

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

s.str.extract(r"(.kg)|(.g)",flags = re.IGNORECASE)

Since the number of number before the string can vary I would like a different approach.

>Solution :

With this extended data:

>>> s = pd.Series({
...     0: 'Açúcar Refinado UNIÃO Pacote 1kg',
...     1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
...     2: 'Açúcar Refinado DA BARRA Pacote 1kg',
...     3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
...     4: 'Açúcar Refinado GUARANI Pacote 1Kg',
...     5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
...     6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
...     7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
...     8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
...     9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
...     10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
...     11: 'Acucar Refinado Uniao 1kg',
...     12: 'something something 1.25kg',
...     13: 'something something 1,25kg'})

Parsing out the numbers and the units:

>>> s.str.extract(r'(\d+(?:[\.,]\d*)?)( ?k?g)', flags=re.IGNORECASE) \
...     .assign(k=lambda d: d[0].str
...             .replace('(?<=\d),(?=\d)', '.', regex=True)
...             .pipe(pd.to_numeric))
       0   1       k
0      1  kg    1.00
1      1  Kg    1.00
2      1  kg    1.00
3      1  kg    1.00
4      1  Kg    1.00
5      1  kg    1.00
6    500   g  500.00
7      1  kg    1.00
8      1  kg    1.00
9    250   g  250.00
10     1  kg    1.00
11     1  kg    1.00
12  1.25  kg    1.25
13  1,25  kg    1.25

I also allow for an optional space between the decimal and the units. Extended also to deal with non-integer numbers, accounting also for different decimal markers: eg in continental Europe, decimals are marked like 1,25 rather than 1.25 as in the Anglosphere.

I use a non-capturing group for the decimal portion; Roman’s version also works. For parsing the number, I would normalise the decimal format if mixed. If otherwise, you can re-parse by import io; pd.read_csv(io.StringIO(your_df.to_csv()), decimal=',').

You will get more capture groups on the row if you have a string like 250g ... 1kg. You may want to filter or otherwise clean that before throwing it into this function. Also consider appending a \b to ensure that you don’t match something like 50grandmas.

Thanks also for providing the data frame constructor ab initio.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading