I hope you are having a great day.
I have a dataset with a column named "plain_text" that holds a dumped log of conversations among several users in a group chat. The server saves the information in this format:
"timestamp" + "user name" + ":" + "text"
The timestamp may be enclosed in slashes or parentheses, or not enclosed at all, for instance:
/8:08:57/ or (14:05:14) or just 16:59:59
The user name is entered manually by whoever is using the server, and "text" is the message that person sends. It can be as long or as short as they want and may contain tabs, new lines, etc.
This is a preview of just one cell of my dataset:
import pandas as pd

text = """
(19:04:45) Server 526.785 : Ongoing Push
(19:08:46) Main Deck : Operation was not uploaded the error code will be
55-858-658-458
No More Handing is needed
(19:50:46) Server UJI-OP : Reset Deck main
OP may take up to 6 mins or more...
(19:51:46) Server UJI-OP : Main Deck status ON
please stand up for opening doors
23:20:04 Jill : Windows Closed
5:16:58 Carl V: Is someone on the Front door?
(17:11:49) IUJO-66 : No Response on Deck (5:10:43) Van UHJ : Flights delay 8:34:08 H2047: Buy Concert Tickets 9:05:42 Mark P.: Gen. OK
7:00:15 Jill : Status not ok updated 21:22:34 YHXO: Front desk clear
"""
df = pd.DataFrame({'plain text': [text]})
My desired output would look like this:
time | user | text_sent |
---|---|---|
19:08:46 | Main Deck | Operation was not uploaded the error code will be 55-858-658-458 No More Handing is needed |
19:04:45 | Server 526.785 | Ongoing Push |
19:50:46 | Server UJI-OP | Reset Deck main |
Basically, each piece of information goes into its own column, and all of the text sent is captured. (I have shown only a few rows of my desired output so that I don't overload the screen.)
I’m using this regex:
"(?P<timestamp>\d+:\d+:\d+)\S*\s+(?P<user>[^:]+?)\s*:\s*(?P<msg>.*?)(?=\s*\S*\d+:\d+:\d+|$)"
But it does not work for some of the messages: some are skipped even though they follow the format above.
Here is the regex: https://regex101.com/r/G0rdpa/1
Could you please help me modify it so that it captures all the information? I will follow your comments and recommendations closely, and I will upvote and accept the answer. Thanks a million!
>Solution :
You can use
(?s)\b(?P<timestamp>\d+:\d+:\d+)\)?\s+(?P<user>[^:]+?)\s*:\s*(?P<msg>.*?)(?=\s*\(?\d+:\d+:\d+|\Z)
See the regex demo.
If you intend to use it in Python, you can define it as
pattern = re.compile(r'\b(?P<timestamp>\d+:\d+:\d+)\)?\s+(?P<user>[^:]+?)\s*:\s*(?P<msg>.*?)(?=\s*\(?\d+:\d+:\d+|\Z)', re.S)
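To get the table from the question, one option is to pass the same pattern to pandas' `Series.str.extractall`, which returns one row per match and names the columns after the named groups. The following is only a sketch under some assumptions: the sample text is shortened, the column is called 'plain text' as in the question's `pd.DataFrame` call, and the whitespace clean-up of the message is my own addition so that multi-line messages end up on one line.

```python
import re
import pandas as pd

# Shortened sample from the question; each real cell holds one such dump.
text = """
(19:04:45) Server 526.785 : Ongoing Push
(19:08:46) Main Deck : Operation was not uploaded the error code will be
55-858-658-458
No More Handing is needed
23:20:04 Jill : Windows Closed
(17:11:49) IUJO-66 : No Response on Deck (5:10:43) Van UHJ : Flights delay 8:34:08 H2047: Buy Concert Tickets
"""
df = pd.DataFrame({'plain text': [text]})

# Same pattern as above, kept as a string so flags can be passed separately.
PATTERN = (
    r'\b(?P<timestamp>\d+:\d+:\d+)\)?\s+'
    r'(?P<user>[^:]+?)\s*:\s*'
    r'(?P<msg>.*?)(?=\s*\(?\d+:\d+:\d+|\Z)'
)

out = (
    df['plain text']
    .str.extractall(PATTERN, flags=re.S)   # one row per message found
    .rename(columns={'timestamp': 'time', 'msg': 'text_sent'})
    .reset_index(drop=True)                # drop the (cell, match) MultiIndex
)

# Assumption: line breaks and tabs inside a message should become single spaces.
out['text_sent'] = (
    out['text_sent'].str.replace(r'\s+', ' ', regex=True).str.strip()
)

print(out)
```

If you need to know which cell each message came from, skip `reset_index(drop=True)` and keep the MultiIndex that `extractall` produces.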
Details:
\b – a word boundary
(?P<timestamp>\d+:\d+:\d+) – Group "timestamp": one or more digits, ":", one or more digits, ":", one or more digits
\)? – an optional ")"
\s+ – one or more whitespaces
(?P<user>[^:]+?) – Group "user": any one or more chars other than ":", as few as possible
\s*:\s* – a colon enclosed with zero or more whitespaces
(?P<msg>.*?) – Group "msg": any zero or more chars, as few as possible
(?=\s*\(?\d+:\d+:\d+|\Z) – a positive lookahead that requires, immediately to the right of the current location, either
  \s*\(?\d+:\d+:\d+ – zero or more whitespaces, an optional "(", one or more digits, ":", one or more digits, ":", one or more digits
  | – or
  \Z – end of the string.
re.S makes . match any characters, including line break chars.
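Note that the pattern above only accounts for parentheses around the timestamp. The question also mentions timestamps wrapped in slashes, such as /8:08:57/, which do not occur in the sample. A possible tweak, offered as an untested-on-real-data sketch, is to replace `\)?` with `[)/]?` and `\(?` with `[(/]?`. The name `PATTERN_ANY_WRAP` and the sample string below are made up for illustration.

```python
import re

# Assumption: a timestamp may be bare, or wrapped as (14:05:14) or /8:08:57/.
# [)/]? and [(/]? are illustrative changes, not part of the original pattern.
PATTERN_ANY_WRAP = re.compile(
    r'\b(?P<timestamp>\d+:\d+:\d+)[)/]?\s+'
    r'(?P<user>[^:]+?)\s*:\s*'
    r'(?P<msg>.*?)(?=\s*[(/]?\d+:\d+:\d+|\Z)',
    re.S,
)

# Hypothetical sample covering all three timestamp styles.
sample = "/8:08:57/ Ann : door open\n(14:05:14) Bob: on my\nway 16:59:59 Cy : done\n"

rows = [
    # ' '.join(s.split()) collapses tabs and line breaks into single spaces.
    (m['timestamp'], m['user'], ' '.join(m['msg'].split()))
    for m in PATTERN_ANY_WRAP.finditer(sample)
]

for row in rows:
    print(row)
```

Keep in mind that both patterns treat any `digits:digits:digits` token as the start of a new message, so a message whose text itself contains a time would be cut at that point.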