I have a a dataframe where each lines contains text, and the start of this text is a time that is in the format of 00:00:00-00:00:01.
I would like to extract the starting time, the ending time but also the difference between those time into 3 columns of my df.
Here is a sample of the dataframe :
| text |
|---|
| 10:30:00-12:30:00 some random text(some special caracters [-‘(@ ) some numbers 456231 386 |
| 15:35:10-15:36:12 some other text some numbers 9875321651132 |
Here is the expected result
| text | start time | end time | delta |
|---|---|---|---|
| 10:30:00-12:30:00 some random text(some special caracters [-‘(@ ) some numbers 456231 386 | 10:30:00 | 12:30:00 | 02:00:00 |
| 15:35:10-15:36:12 some other text some numbers 9875321651132 | 15:35:10 | 15:36:12 | 00:01:02 |
I think using str.extract like this would do the trick but can’t get the matching as I would like, where the pattern match a 00:00:00 format
df['start time'] = df['text'].str.extract(r'( ^Pattern)') # ^ for the begining of the string where the start time is.
df['end time'] = df['text'].str.extract(r'( -Patern)')
#the - because the end time is between a "-" so the pattern for the end time would have to start with "-" to get it right
And for the delta, I was basically thinking about converting the start and end time strings to datetime and doing :
df['delta'] = df['end time']- df['start time']
>Solution :
For start_time and end_time, you can do:
df[['start_time','end_time']] = df['text'].str.extract('(\d{2}:\d{2}:\d{2})-(\d{2}:\d{2}:\d{2})')
Then for delta, you can convert the strings into Timedelta, get the difference, and convert to string:
df['delta'] = (pd.to_timedelta(df['end_time'])
.sub(pd.to_timedelta(df['start_time']))
)