I have a dataframe of measures in multiple columns, that are aggregated. This means that the function they represent is a monotonically increasing one. Now, due to reset of an apparatus, all measurements are reset to zero, after which the aggregation resumes. But to work with the data, I need to disgard the reset and shift all values in all columns to mimic that th reset never occured.
Hence, I what this sistuation:
to become
What I want is a function that will shift all values in all columns to to the last measured maximum.
for some sample data, I have created this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
date_range = pd.date_range(start='2021-01-01', end='2021-01-05', freq='1D')
df1 = pd.DataFrame({'Date': date_range, 'Column 1': range(5), 'Column 2': range(5)})
date_range = pd.date_range(start='2021-01-06', end='2021-01-10', freq='1D')
df2 = pd.DataFrame({'Date': date_range, 'Column 1': range(5), 'Column 2': range(5)})
df = pd.concat([df1,df2])
which I want to become
date_range = pd.date_range(start='2021-01-01', end='2021-01-10', freq='1D')
df3 = pd.DataFrame({'Date': date_range, 'Column 1': range(10), 'Column 2': range(10)})
I know how to do this in the case where I know that df is constructed from df1 and df2
def shift_df(df, df1, df2):
columns = list(df.columns)
columns.remove('Date')
max_values = {}
for col in columns:
max_values[col] = df1[col].max()
min_values = {}
for col in columns:
min_values[col] = df2[col].min()
differences = {}
for col in columns:
differences[col] = max_values[col] - min_values[col]+1
for col in columns:
df[col] = np.where(df['Date'].isin(df2['Date']),
df[col] + differences[col],
df[col])
return df
But I do not know how to generalize it if I only have the knowledge of df. Basically, how do I transform my function
shift_df(df, df1, df2)
to
shift_df(df)
Any help will be greatly appreciated.
>Solution :
Here is one option assuming you want to handle all columns independently:
- compute and identify the negative jumps with a
diff
- keep only those, inverse the values, add 1
- compute a
cumsum
- add to original Series
def shifter(s):
m = s.diff()
return s.add((-m.sub(1)).where(m.lt(0), 0).cumsum())
out = df.set_index('Date').apply(shifter)
Output:
Column 1 Column 2
Date
2021-01-01 0.0 0.0
2021-01-02 1.0 1.0
2021-01-03 2.0 2.0
2021-01-04 3.0 3.0
2021-01-05 4.0 4.0
2021-01-06 5.0 5.0
2021-01-07 6.0 6.0
2021-01-08 7.0 7.0
2021-01-09 8.0 8.0
2021-01-10 9.0 9.0
Another example:
df = pd.DataFrame({'Date': [Timestamp('2021-01-01 00:00:00'), Timestamp('2021-01-02 00:00:00'), Timestamp('2021-01-03 00:00:00'), Timestamp('2021-01-04 00:00:00'), Timestamp('2021-01-05 00:00:00'), Timestamp('2021-01-06 00:00:00'), Timestamp('2021-01-07 00:00:00'), Timestamp('2021-01-08 00:00:00'), Timestamp('2021-01-09 00:00:00'), Timestamp('2021-01-10 00:00:00')],
'Column 1': [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
'Column 2': [3, 4, 5, 6, 7, 3, 4, 5, 6, 7]}
)
out = df.set_index('Date').apply(shifter)
Output:
Column 1 Column 2
Date
2021-01-01 0.0 3.0
2021-01-02 1.0 4.0
2021-01-03 2.0 5.0
2021-01-04 3.0 6.0
2021-01-05 4.0 7.0
2021-01-06 5.0 8.0
2021-01-07 6.0 9.0
2021-01-08 7.0 10.0
2021-01-09 8.0 11.0
2021-01-10 9.0 12.0