I have a dataframe that looks like this:
id col_1
1 ACC 12-34-11-123-122-A
2 ACC TASKS 12-34-11-123-122-B
3 ABB 12-34-11-123-122-C
I want to extract the code from the first and second rows (12-34-11-123-122-A, 12-34-11-123-122-B), which are preceded by ACC.
I found this answer and this is my attempt:
F.regexp_extract(F.col("col_1"), r'(.)(ACC)(\s+)(\b\d{2}\-\d{2}\-\d{2}\-\d{3}\-[A-Z0-9]{0,3}\b)', 4)
I have to add the second group (ACC) because the ABB code has the same format.
How can I fix my regex to extract both ACC and ACC TASKS from this dataframe?
>Solution :
You may use this regex:
(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})
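Here (\bACC(?:\s+TASKS)?) matches ACC, optionally followed by TASKS, before the code pattern. Because (?:...) is non-capturing, the pattern has only two capture groups: group 1 is the ACC / ACC TASKS prefix and group 2 is the code.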
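To see the grouping concretely, here is a small sketch that uses Python's re module instead of Spark. Spark evaluates Java regular expressions, but this pattern only uses syntax that both engines share, so it should be a fair sanity check. The three strings are the sample rows from the question.

```python
import re

# The pattern from the answer, unchanged.
pattern = re.compile(
    r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})'
)

# (?:...) does not capture, so only two groups are numbered.
group_count = pattern.groups

rows = [
    "ACC 12-34-11-123-122-A",
    "ACC TASKS 12-34-11-123-122-B",
    "ABB 12-34-11-123-122-C",
]

# Group 1 is the prefix that matched; None means the row did not match.
prefixes = []
for row in rows:
    m = pattern.search(row)
    prefixes.append(m.group(1) if m else None)

print(group_count)  # 2
print(prefixes)     # ['ACC', 'ACC TASKS', None]
```

The ABB row yields no match at all, which is what keeps its code out of the result.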
For your Python code, pass group index 2, since the code is the second capture group and the pattern has no group 4:
F.regexp_extract(F.col("col_1"), r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})', 2)
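One thing worth checking against the sample rows: the code part of the pattern has five dash-separated segments, while the sample codes have six, so group 2 stops at 12-34-11-123-122 and drops the trailing -A / -B. The sketch below compares the pattern as given with a variant that adds one more three-digit segment. The variant is an assumption inferred only from the three sample rows, and extract is a hypothetical helper that imitates how regexp_extract returns an empty string when nothing matches. It uses Python's re module rather than Spark.

```python
import re

# Pattern as given in the answer.
ANSWER = r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})'

# Assumed variant: one extra three-digit segment before the final part,
# based only on the shape of the sample codes.
VARIANT = r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-\d{3}-[A-Z0-9]{0,3})'


def extract(pattern, text, idx):
    """Hypothetical helper: return group idx, or '' when there is no match."""
    m = re.search(pattern, text)
    return m.group(idx) if m else ""


rows = [
    "ACC 12-34-11-123-122-A",
    "ACC TASKS 12-34-11-123-122-B",
    "ABB 12-34-11-123-122-C",
]

answer_codes = [extract(ANSWER, r, 2) for r in rows]
variant_codes = [extract(VARIANT, r, 2) for r in rows]

print(answer_codes)   # ['12-34-11-123-122', '12-34-11-123-122', '']
print(variant_codes)  # ['12-34-11-123-122-A', '12-34-11-123-122-B', '']
```

If the real codes always have six segments, the variant pattern can be passed to F.regexp_extract with the same group index 2. If the format varies, adjust the segment widths to match the actual data.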