PySpark regex to extract string with two conditions

December 13, 2021

I have a dataframe that looks like this:

id  col_1
1   ACC 12-34-11-123-122-A
2   ACC TASKS 12-34-11-123-122-B
3   ABB 12-34-11-123-122-C

I want to extract the code from the first and second rows (12-34-11-123-122-A, 12-34-11-123-122-B), which are preceded by ACC.

I found this answer and this is my attempt:

F.regexp_extract(F.col("col_1"), r'(.)(ACC)(\s+)(\b\d{2}\-\d{2}\-\d{2}\-\d{3}\-[A-Z0-9]{0,3}\b)', 4)

I have to add the second group (ACC) because the ABB code has the same format.

How can I fix my regex to extract both ACC and ACC TASKS from this dataframe?

Solution:

You may use this regex:

(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})

Here (\bACC(?:\s+TASKS)?) matches ACC or ACC TASKS as capture group 1, and the code pattern that follows it is capture group 2.

In your Python code:

F.regexp_extract(F.col("col_1"), r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})', 2)
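
To sanity-check the pattern without a Spark session, here is a minimal sketch using Python's built-in re module. Spark uses Java regular expressions, but the constructs in this pattern (\b, \s, \d, non-capturing groups) should behave the same way in both. The helper extract_code and the name SUFFIX_PATTERN are hypothetical, introduced only for illustration. Note that against the sample rows the pattern as written stops after the fifth segment (122), so SUFFIX_PATTERN is an assumed variant that also allows a trailing segment such as -A.

```python
import re

# Pattern from the answer: group 1 is the ACC / ACC TASKS prefix, group 2 is the code.
ANSWER_PATTERN = r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})'

# Assumed variant (not part of the answer): optionally capture one more "-X" segment.
SUFFIX_PATTERN = (
    r'(\bACC(?:\s+TASKS)?)\s+'
    r'(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3}(?:-[A-Z0-9]+)?)'
)


def extract_code(pattern, text):
    """Return capture group 2, or "" when nothing matches
    (regexp_extract also returns an empty string on no match)."""
    match = re.search(pattern, text)
    return match.group(2) if match else ""


rows = [
    "ACC 12-34-11-123-122-A",
    "ACC TASKS 12-34-11-123-122-B",
    "ABB 12-34-11-123-122-C",
]

for value in rows:
    print(repr(value), "->", repr(extract_code(SUFFIX_PATTERN, value)))
```

If the variant behaves as intended on your data, the same pattern string can be passed to F.regexp_extract with group index 2.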

pyspark

by MR
