Python regular expressions to continuously extract multiple pieces of information from text data

I hope you are having a great day,

I have a dataset with a column named "plain_text" that holds a dumped log of conversations among several users in a group chat. The server saves the information in a format like this:

a "timestamp" + "user name" + ":" + "text"

The timestamp may be enclosed in slashes, in parentheses, or in nothing at all, for instance:
/8:08:57/ or (14:05:14) or just 16:59:59

The user name is entered manually by the person using the server, and the "text" is the message that person is sending. It can be as long or as short as they want and may contain tabs, new lines, etc.

This is a preview of just one cell of information in my dataset:

text = """
(19:04:45) Server 526.785 : Ongoing Push
(19:08:46) Main Deck : Operation was not uploaded the error code will be
     55-858-658-458

     No More Handing is needed
(19:50:46) Server UJI-OP : Reset Deck main 
OP may take up to 6 mins or more...
(19:51:46) Server UJI-OP : Main Deck status ON
please stand up for opening doors

23:20:04 Jill : Windows Closed
5:16:58 Carl V: Is someone on the Front door?
(17:11:49) IUJO-66 : No Response on Deck (5:10:43) Van UHJ  : Flights delay 8:34:08 H2047: Buy Concert Tickets 9:05:42 Mark P.: Gen. OK
7:00:15 Jill  : Status not ok updated 21:22:34 YHXO: Front desk clear
"""


import pandas as pd

# Column name matches the "plain_text" column described above
df = pd.DataFrame({'plain_text': [text]})

My desired output would look like this:

time      user            text_sent
19:08:46  Main Deck       Operation was not uploaded the error code will be 55-858-658-458 No More Handing is needed
19:04:45  Server 526.785  Ongoing Push
19:50:46  Server UJI-OP   Reset Deck main

Basically, each piece of information will be put in its own column, and all the text sent will be captured. (I have shown only a few rows of my desired output so that I don't overload the screen.)

I’m using this regex:

"(?P<timestamp>\d+:\d+:\d+)\S*\s+(?P<user>[^:]+?)\s*:\s*(?P<msg>.*?)(?=\s*\S*\d+:\d+:\d+|$)"

But it does not work for some of the messages. Some messages are skipped even though they follow my regex.

Here is the regex: https://regex101.com/r/G0rdpa/1

I kindly ask if you could help me modify it to capture all the information. I am a very attentive user here on SO and will follow your comments and recommendations so that I can upvote and select the answer. Thanks a million, guys.

>Solution :

You can use

(?s)\b(?P<timestamp>\d+:\d+:\d+)\)?\s+(?P<user>[^:]+?)\s*:\s*(?P<msg>.*?)(?=\s*\(?\d+:\d+:\d+|\Z)

See the regex demo.

If you intend to use it in Python, you can define it as

pattern = re.compile(r'\b(?P<timestamp>\d+:\d+:\d+)\)?\s+(?P<user>[^:]+?)\s*:\s*(?P<msg>.*?)(?=\s*\(?\d+:\d+:\d+|\Z)', re.S)
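
As a rough sketch of how the compiled pattern might be applied, the snippet below runs it over a shortened excerpt of the sample cell with finditer and collects one dict per message. The helper name parse_chat and the whitespace collapsing step are assumptions added for illustration; they are not part of the pattern itself.

```python
import re

# Pattern from the answer; re.S lets "." span line breaks.
pattern = re.compile(
    r'\b(?P<timestamp>\d+:\d+:\d+)\)?\s+(?P<user>[^:]+?)\s*:\s*'
    r'(?P<msg>.*?)(?=\s*\(?\d+:\d+:\d+|\Z)',
    re.S,
)

# Shortened excerpt of the sample cell from the question.
sample = """
(19:04:45) Server 526.785 : Ongoing Push
(19:08:46) Main Deck : Operation was not uploaded the error code will be
     55-858-658-458

     No More Handing is needed
23:20:04 Jill : Windows Closed
(17:11:49) IUJO-66 : No Response on Deck (5:10:43) Van UHJ  : Flights delay 8:34:08 H2047: Buy Concert Tickets
"""


def parse_chat(text):
    """Hypothetical helper: return one dict per message found in text."""
    rows = []
    for m in pattern.finditer(text):
        row = m.groupdict()
        # Collapse tabs, line breaks and repeated spaces into single spaces.
        row["msg"] = " ".join(row["msg"].split())
        rows.append(row)
    return rows


rows = parse_chat(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) to get one column per group. The collapsing step also trims the last message, which otherwise keeps its trailing line break because the \Z alternative in the lookahead is not preceded by \s*.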

Details:

  • \b – a word boundary
  • (?P<timestamp>\d+:\d+:\d+) – Group "timestamp": one or more digits, :, one or more digits, :, one or more digits
  • \)? – an optional )
  • \s+ – one or more whitespaces
  • (?P<user>[^:]+?) – Group "user": any one or more chars other than : as few as possible
  • \s*:\s* – a colon enclosed with zero or more whitespaces
  • (?P<msg>.*?) – Group "msg": any zero or more chars as few as possible
  • (?=\s*\(?\d+:\d+:\d+|\Z) – a positive lookahead that requires (immediately to the right of the current location)
    • \s*\(?\d+:\d+:\d+ – zero or more whitespaces, an optional (, one or more digits, :, one or more digits, :, one or more digits
    • | – or
    • \Z – end of the string.
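
Note that the breakdown above only allows an optional ) after the timestamp and an optional ( before the next one, so the slash-enclosed form mentioned in the question (/8:08:57/) would not be matched. The sample cell contains no such timestamps, so this may not matter in practice. If it does, one possible variant (an assumption, not part of the original pattern) swaps \)? for [)/]? and \(? for [(/]?. The chat line and the user names Ann, Bob and Cy below are made up for the demonstration.

```python
import re

# Pattern as given in the answer.
original = re.compile(
    r'\b(?P<timestamp>\d+:\d+:\d+)\)?\s+(?P<user>[^:]+?)\s*:\s*'
    r'(?P<msg>.*?)(?=\s*\(?\d+:\d+:\d+|\Z)',
    re.S,
)

# Assumed variant: also accept "/" as the closing and opening delimiter.
with_slashes = re.compile(
    r'\b(?P<timestamp>\d+:\d+:\d+)[)/]?\s+(?P<user>[^:]+?)\s*:\s*'
    r'(?P<msg>.*?)(?=\s*[(/]?\d+:\d+:\d+|\Z)',
    re.S,
)

# Made-up line covering the three timestamp styles from the question.
line = "/8:08:57/ Ann : hello (14:05:14) Bob : hi there 16:59:59 Cy: bye"

original_hits = [m.group("timestamp") for m in original.finditer(line)]
variant_hits = [
    (m.group("timestamp"), m.group("user"), m.group("msg"))
    for m in with_slashes.finditer(line)
]

print(original_hits)  # the slash-enclosed message is missing
print(variant_hits)   # all three messages are captured
```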

re.S makes . match any characters including line break chars.
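
To see the effect of re.S, here is a small comparison on a two-message excerpt where the first message spans two lines. Without the flag, . cannot cross the line break, so the lookahead is never satisfied for that message and it is dropped. This may be one reason multi-line messages were skipped by the pattern in the question, although that depends on the flags used there.

```python
import re

regex = (
    r'\b(?P<timestamp>\d+:\d+:\d+)\)?\s+(?P<user>[^:]+?)\s*:\s*'
    r'(?P<msg>.*?)(?=\s*\(?\d+:\d+:\d+|\Z)'
)

# The first message continues on a second line; the second message is one line.
text = (
    "(19:08:46) Main Deck : error code will be\n"
    "     55-858-658-458\n"
    "(19:50:46) Server UJI-OP : Reset Deck main"
)

# With re.S the lazy ".*?" may run across the line break up to the next timestamp.
with_s = [m.group("timestamp") for m in re.finditer(regex, text, re.S)]

# Without re.S the multi-line message cannot be completed and is skipped.
without_s = [m.group("timestamp") for m in re.finditer(regex, text)]

print(with_s)
print(without_s)
```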
