I try to propagate a value from one row to the next and I’m not sure where to start (pandas).
For one user,
- if some values come from the source B, I want to replace it with the values from the date just before.
- If there is no date before, I want to keep the value in B.
- Lines from source A should not change
Source data:
USER | Date | Value | Source |
---|---|---|---|
USER1 | 01/01/2023 | 1000 | A |
USER1 | 01/02/2023 | 1200 | A |
USER1 | 02/02/2023 | 1300 | A |
USER1 | 20/02/2023 | 1000 | B |
USER1 | 01/07/2023 | 1400 | A |
USER2 | 01/01/2023 | 1000 | A |
USER2 | 01/02/2023 | 1200 | A |
USER2 | 02/02/2023 | 1250 | B |
USER2 | 20/02/2023 | 1300 | A |
USER2 | 01/07/2023 | 1400 | A |
Target data:
USER | Date | Value | Source | Comment |
---|---|---|---|---|
USER1 | 01/01/2023 | 1000 | A | |
USER1 | 01/02/2023 | 1200 | A | |
USER1 | 02/02/2023 | 1300 | A | |
USER1 | 20/02/2023 | 1300 | B | get its value "1300" from the previous date of the same user |
USER1 | 01/07/2023 | 1400 | A | |
USER2 | 01/01/2023 | 1000 | A | |
USER2 | 01/02/2023 | 1200 | A | |
USER2 | 02/02/2023 | 1200 | B | get its value "1200" from the previous date of the same user |
USER2 | 20/02/2023 | 1300 | A | |
USER2 | 01/07/2023 | 1400 | A |
df = pd.DataFrame({"USER" : ["USER1", "USER1", "USER1", "USER1", "USER1", "USER2", "USER2", "USER2", "USER2", "USER2"],
"Date" :["01/01/2023", "01/02/2023", "02/02/2023", "20/02/2023", "01/07/2023", "01/01/2023", "01/02/2023", "02/02/2023", "20/02/2023", "01/07/2023"],
"Value" :["1000", "1200", "1300", "1000", "1400", "1000", "1200", "1250", "1300", "1400"],
"Source" :["A", "A", "A", "B", "A", "A", "A", "B", "A", "A"]}
)
>Solution :
Assuming the dates are already sorted, you could mask
, groupby.ffill
, and fillna
in case there was no previous value:
m = df['Source'].eq('B')
df['Value'] = (df['Value'].mask(m) # mask B values
.groupby(df['USER']).ffill() # replace with previous non-B
.fillna(df['Value']) # if still NaN, replace with original
#.mask(df['Value'].isna())
)
NB. if you can have NaNs in the original Series, uncomment the last line.
Output:
USER Date Value Source
0 USER1 01/01/2023 1000 A
1 USER1 01/02/2023 1200 A
2 USER1 02/02/2023 1300 A
3 USER1 20/02/2023 1300 B
4 USER1 01/07/2023 1400 A
5 USER2 01/01/2023 1000 A
6 USER2 01/02/2023 1200 A
7 USER2 02/02/2023 1200 B
8 USER2 20/02/2023 1300 A
9 USER2 01/07/2023 1400 A