How can I convert a struct timestamp column with start and end into a normal Python timestamp column?

I have a time-series pivot table with a struct timestamp column that holds the start and end of each record's time frame, as follows:

import pandas as pd
pd.set_option('max_colwidth', 400)
df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}'],
                   "X1": [25],
                   "X2": [33],
                   })
df 
#                                                                       timestamp   X1  X2
#0  {"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}   25  33

Since I will later use the timestamps as the index for time-series analysis, I need to convert this column into plain timestamps holding just the start or the end.
I tried to solve this with a regex, based on this post, but without success:

df[["start_timestamp", "end_timestamp"]] = (
    df["timestamp"].str.extractall(r"(\d+\.\d+\.\d+)").unstack().ffill(axis=1)
)

but I get:

ValueError: Columns must be same length as key

This is the dataframe I expect to get:

df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}'],
                   'start_timestamp': ['2022-01-19T00:00:00.000+0000'],
                   'end_timestamp': ['2022-01-20T00:00:00.000+0000'],
                   "X1": [25],
                   "X2": [33]})
df 
#                                                                       timestamp   start_timestamp                 end_timestamp                   X1  X2
#0  {"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}   2022-01-19T00:00:00.000+0000    2022-01-20T00:00:00.000+0000    25  33

Solution:

You can extract both values with a single str.extract call:

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].str.extract(r'"start":"([^"]*)","end":"([^"]+)')

The "start":"([^"]*)","end":"([^"]+) regex matches "start":", then captures zero or more characters other than " into Group 1 (the start column value), then matches ","end":", and finally captures one or more characters other than " into Group 2 (the end column value).
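
To see the two capture groups in isolation, here is a minimal sketch that runs the same pattern with Python's standard re module on the sample string from the question. The variable names (raw, pattern, start_value, end_value) are only illustrative. It also checks the pattern tried in the question, which finds no match in this string; that is probably why there were no columns to assign.

```python
import re

# Sample value copied from the question's "timestamp" column.
raw = '{"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}'

# Pattern from the answer: Group 1 is the start value, Group 2 the end value.
pattern = re.compile(r'"start":"([^"]*)","end":"([^"]+)')

match = pattern.search(raw)
start_value, end_value = match.groups()
print(start_value)  # 2022-01-19T00:00:00.000+0000
print(end_value)    # 2022-01-20T00:00:00.000+0000

# The pattern tried in the question needs digits.digits.digits (two dots),
# which never occurs in this string, so it finds nothing.
print(re.findall(r"(\d+\.\d+\.\d+)", raw))  # []
```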

Also, if the data you have is valid JSON, you can parse the JSON instead of using a regex:

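import json

def extract_startend(x):
    # Parse the JSON string and return the two values as a Series,
    # so that apply() expands them into two columns.
    j = json.loads(x)
    return pd.Series([j["start"], j["end"]])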

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].apply(extract_startend)

Output of print(df.to_string()):

                                                                   timestamp  X1  X2               start_timestamp                 end_timestamp
0  {"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:.........  25  33  2022-01-19T00:00:00.000+0000  2022-01-20T00:00:00.000+0000
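
Both approaches leave the new columns as plain strings. Since the question mentions using the timestamps as an index later, the sketch below shows one possible follow-up step: parsing the strings with pd.to_datetime and setting the end timestamp as the index. The format string, the choice of utc=True, the choice of end_timestamp as the index, and the name df_ts are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Stand-in for the dataframe after the extraction step above.
df = pd.DataFrame({
    "start_timestamp": ["2022-01-19T00:00:00.000+0000"],
    "end_timestamp": ["2022-01-20T00:00:00.000+0000"],
    "X1": [25],
    "X2": [33],
})

# Assumed format: ISO-like date and time, fractional seconds, +HHMM offset.
fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
for col in ["start_timestamp", "end_timestamp"]:
    # utc=True normalizes the parsed values to timezone-aware UTC timestamps.
    df[col] = pd.to_datetime(df[col], format=fmt, utc=True)

# Use the end of each time frame as the index (the start would work the same way).
df_ts = df.set_index("end_timestamp")
print(df_ts.index)
```

If the real data mixes several UTC offsets, utc=True still converts everything to a single UTC-based index, which is usually what time-series operations such as resampling expect.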