How can I convert a struct timestamp column with start and end into normal Pythonic timestamp columns?

February 19, 2022

I have a time-series pivot table with a struct timestamp column that holds the start and end of each record's time frame, as follows:

import pandas as pd
pd.set_option('max_colwidth', 400)
df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}'],
                   "X1": [25],
                   "X2": [33],
                   })
df 
#                                                                       timestamp   X1  X2
#0  {"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}   25  33

Since I will later use the timestamps as the index for time-series analysis, I need to convert this column into plain timestamps holding just the start or the end.
I tried to solve this with a regex, based on this post, but without success:

df[["start_timestamp", "end_timestamp"]] = (
    df["timestamp"].str.extractall(r"(\d+\.\d+\.\d+)").unstack().ffill(axis=1)
)

but I get:

ValueError: Columns must be same length as key

The dataframe I am trying to reach is the following:

df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}'],
                   'start_timestamp': ['2022-01-19T00:00:00.000+0000'],
                   'end_timestamp': ['2022-01-20T00:00:00.000+0000'],
                   "X1": [25],
                   "X2": [33]})
df 
#                                                                       timestamp   start_timestamp                 end_timestamp                   X1  X2
#0  {"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}   2022-01-19T00:00:00.000+0000    2022-01-20T00:00:00.000+0000    25  33

>Solution :

You can extract both values with an extract call:

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].str.extract(r'"start":"([^"]*)","end":"([^"]+)')

The "start":"([^"]*)","end":"([^"]+) regex matches "start":", then captures zero or more characters other than " into Group 1 (the start column value), then matches ","end":", and finally captures one or more characters other than " into Group 2 (the end column value).
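
To see what the two capture groups hold, the same pattern can be tried on a single string with the standard re module. This is only a quick check, not part of the pandas solution; the names sample, pattern, start_value and end_value are made up for illustration, and the string is copied from the question's data.

```python
import re

# Sample value copied from the question's "timestamp" column.
sample = '{"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}'

# Same pattern as in the str.extract call above.
pattern = r'"start":"([^"]*)","end":"([^"]+)'

# Group 1 is the start value, Group 2 is the end value.
match = re.search(pattern, sample)
start_value, end_value = match.groups()

print(start_value)
print(end_value)
```

Because str.extract returns one column per capture group, a pattern with two groups yields exactly the two columns being assigned.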

Also, if the data you have is valid JSON, you can parse the JSON instead of using a regex:

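import json

def extract_startend(x):
    # Parse the JSON string and return its two fields as a Series,
    # so that apply() expands them into two columns.
    j = json.loads(x)
    return pd.Series([j["start"], j["end"]])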

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].apply(extract_startend)

Output of print(df.to_string()):

                                                                   timestamp  X1  X2               start_timestamp                 end_timestamp
0  {"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:.........  25  33  2022-01-19T00:00:00.000+0000  2022-01-20T00:00:00.000+0000
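
Since the question mentions using the timestamps as an index later, the extracted strings would probably still need to be parsed into real datetimes. A minimal sketch of that step, assuming the two string columns already exist and assuming the end of each window is the value wanted as the index (the question does not say which one); the name indexed is made up for illustration:

```python
import pandas as pd

# Extracted string columns, as produced by either approach above.
df = pd.DataFrame({
    "start_timestamp": ["2022-01-19T00:00:00.000+0000"],
    "end_timestamp": ["2022-01-20T00:00:00.000+0000"],
    "X1": [25],
    "X2": [33],
})

# Parse the ISO 8601 strings into timezone-aware datetimes (UTC).
for col in ["start_timestamp", "end_timestamp"]:
    df[col] = pd.to_datetime(df[col], utc=True)

# Assumption: the end of each window is used as the index.
indexed = df.set_index("end_timestamp")
print(indexed)
```

Using start_timestamp instead only requires changing the column name passed to set_index.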