Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas – New Columns If Column 1 or 2 match Regular Expression

Pandas – Chech if two columns matches regex

I had parsed tables data from millins of PDFs.
As it a sensitive data, I tryed my best to show some of the particual situations of this

Sample DataFrame

col0 col1 col2 col3 col4 col5 col6 col 7
201707 CNPJ: 98.777.666/0001-05 NaN NaN NaN NaN NaN NaN
201707 My Company Name is … NaN NaN NaN NaN NaN NaN
201707 Red Hot Chilli Peppers 01.234.567/0001-05 13.358,25 NaN 13.358,25 NaN NaN
201707 Guns n’ Roses 23.456.789/0001-07 11.134,63 NaN 11.134,63 NaN NaN
201707 Metallica 32.586.709/0001-01 3.391,75 NaN 3.391,75 NaN NaN
201707 Alice in Chains 57.888.999/0003-01 10.000,00 10.000,00 20.000,00 NaN NaN
201707 Hendrix 00.777.888/0001-01 50.000,00 1.000,00 51.000,00 NaN NaN
201707 NaN june/2017 NaN NaN NaN NaN NaN
201707 Total: 7.074.541,25 9.008.042,27 16.082.583,52 NaN NaN NaN
201708 NaN NaN NaN NaN NaN NaN CNPJ: 98.777.666/0001-05
201708 NaN NaN NaN NaN NaN NaN My Company Name is …
201708 01.234.567/0001-05 13.358,25 NaN 13.358,25 NaN NaN Red Hot Chilli Peppers
201708 23.456.789/0001-07 11.134,63 NaN 11.134,63 NaN NaN Guns n’ Roses
201708 32.586.709/0001-01 3.391,75 NaN 3.391,75 NaN NaN Metallica
201708 57.888.999/0003-01 10.000,00 10.000,00 20.000,00 NaN NaN Alice in Chains
201708 00.777.888/0001-01 50.000,00 1.000,00 51.000,00 NaN NaN NaN
201708 june/2017 NaN NaN NaN NaN NaN NaN

New company_code checking two columns

pattern like 01.234.567/0001-05

I want to create a new columns, company_code, checking Regular expressions.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  • If col1 matches the pattern then return the string that is col1
  • If col2 matches the pattern then return the string that is col2
  • else: return '-'

The result expected is

col0 col1 col2 col3 col4 col5 col6 col 7 company_code
201707 CNPJ: 98.777.666/0001-05 NaN NaN NaN NaN NaN NaN
201707 My Company Name is … NaN NaN NaN NaN NaN NaN
201707 Red Hot Chilli Peppers 01.234.567/0001-05 13.358,25 NaN 13.358,25 NaN NaN 01.234.567/0001-05
201707 Guns n’ Roses 23.456.789/0001-07 11.134,63 NaN 11.134,63 NaN NaN 23.456.789/0001-07
201707 Metallica 32.586.709/0001-01 3.391,75 NaN 3.391,75 NaN NaN 32.586.709/0001-01
201707 Alice in Chains 57.888.999/0003-01 10.000,00 10.000,00 20.000,00 NaN NaN 57.888.999/0003-01
201707 Hendrix 00.777.888/0001-01 50.000,00 1.000,00 51.000,00 NaN NaN 00.777.888/0001-01
201707 NaN june/2017 NaN NaN NaN NaN NaN
201707 Total: 7.074.541,25 9.008.042,27 16.082.583,52 NaN NaN NaN
201708 NaN NaN NaN NaN NaN NaN CNPJ: 98.777.666/0001-05
201708 NaN NaN NaN NaN NaN NaN My Company Name is …
201708 01.234.567/0001-05 13.358,25 NaN 13.358,25 NaN NaN Red Hot Chilli Peppers 01.234.567/0001-05
201708 23.456.789/0001-07 11.134,63 NaN 11.134,63 NaN NaN Guns n’ Roses 23.456.789/0001-07
201708 32.586.709/0001-01 3.391,75 NaN 3.391,75 NaN NaN Metallica 32.586.709/0001-01
201708 57.888.999/0003-01 10.000,00 10.000,00 20.000,00 NaN NaN Alice in Chains 57.888.999/0003-01
201708 00.777.888/0001-01 50.000,00 1.000,00 51.000,00 NaN NaN Hendrix 00.777.888/0001-01
201708 june/2017 NaN NaN NaN NaN NaN NaN

I had tried a lot to doing this. Checked many youtube channels, stack posts…
But cannot solve.

If someone can help me…. Thanks a lot!

>Solution :

You can try np.select

m1 = df['col1'].str.contains('^\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}$', na=False)
m2 = df['col2'].str.contains('^\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}$', na=False)

df['company_code'] = np.select(
    [m1, m2],
    [df['col1'], df['col2']],
    '-'
)
print(df['company_code'])

0                      -
1                      -
2     01.234.567/0001-05
3     23.456.789/0001-07
4     32.586.709/0001-01
5     57.888.999/0003-01
6     00.777.888/0001-01
7                      -
8                      -
9                      -
10                     -
11    01.234.567/0001-05
12    23.456.789/0001-07
13    32.586.709/0001-01
14    57.888.999/0003-01
15    00.777.888/0001-01
16                     -
Name: company_code, dtype: object
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading