PySpark regex to extract string with two conditions

I have a dataframe that looks like this:

id  col1
1   ACC 12-34-11-123-122-A
2   ACC TASKS 12-34-11-123-122-B
3   ABB 12-34-11-123-122-C

I want to extract the codes from the first and second rows (12-34-11-123-122-A, 12-34-11-123-122-B), which have ACC before them.

I found this answer and this is my attempt:

F.regexp_extract(F.col("col_1"), r'(.)(ACC)(\s+)(\b\d{2}\-\d{2}\-\d{2}\-\d{3}\-[A-Z0-9]{0,3}\b)', 4)

I have to add the second group (ACC) because the ABB code has the same format.

How can I fix my regex to extract both ACC and ACC TASKS from this dataframe?

Solution:

You may use this regex:

(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})

Here the first group, (\bACC(?:\s+TASKS)?), matches ACC or ACC TASKS, and the second group captures the code that follows the whitespace.
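As a quick sanity check, the pattern can be tried outside Spark with Python's built-in re module. This is only a sketch: Spark evaluates the pattern with Java's regex engine, but the constructs used here (\b, \s, \d, non-capturing groups, counted repetition) behave the same way in both. The helper name extract and the FULL_CODE variant are assumptions added for illustration and are not part of the answer. The variant exists because, on the sample rows, [A-Z0-9]{0,3} consumes the 122 segment, so the trailing letter is not captured by the original pattern.

```python
import re

# Pattern from the answer: group 1 is the prefix, group 2 is the code.
PATTERN = re.compile(
    r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})'
)

# Hypothetical variant (an assumption, not from the answer): it spells out
# the final three-digit segment and the trailing letter seen in the sample.
FULL_CODE = re.compile(
    r'\bACC(?:\s+TASKS)?\s+(\d{2}-\d{2}-\d{2}-\d{3}-\d{3}-[A-Z])'
)


def extract(pattern, text, group):
    """Return the requested group, or '' when nothing matches.

    This mirrors regexp_extract, which yields an empty string on no match.
    """
    match = pattern.search(text)
    return match.group(group) if match else ''


rows = [
    'ACC 12-34-11-123-122-A',
    'ACC TASKS 12-34-11-123-122-B',
    'ABB 12-34-11-123-122-C',
]

# Group 2 of the answer's pattern holds the code.
results = [extract(PATTERN, row, 2) for row in rows]

# The variant has a single capturing group, so the code is group 1.
full_results = [extract(FULL_CODE, row, 1) for row in rows]

print(results)
print(full_results)
```

The ABB row produces an empty string in both cases, which is how it gets filtered out. If the trailing letter is needed, something like the variant may be closer to what the question asks for, assuming every code follows the layout shown in the sample.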

For your PySpark code, note that this pattern has only two capturing groups, so the code is in group 2, not group 4:

F.regexp_extract(F.col("col_1"), r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})', 2)