I have a dataframe with time-series data as follows:
Date Value
0 2021-12-01 A
1 2021-12-02 A
2 2021-12-03 A
3 2021-12-04 B
4 2021-12-05 B
5 2021-12-06 A
6 2021-12-07 A
7 2021-12-08 C
I’m trying to reduce this to only have the first date of each continuous block for the Value column. So the result would look like:
Date Value
0 2021-12-01 A
1 2021-12-04 B
2 2021-12-06 A
3 2021-12-08 C
I’ve tried a bunch of different ways of masking, dropping duplicates based on the mask, etc. but cannot do it. Any help is appreciated!
>Solution :
You can use ne (not equals) + shift to create a mask where the first value of each consecutive group is True, and then cumsum to create a unique for each group that’s shared by all its items.
Then, drop_duplicates based on that, and use index of the returned rows to index the dataframe:
subset = df.loc[df['Value'].ne(df['Value'].shift(1)).cumsum().drop_duplicates().index]
Output:
>>> subset
Date Value
0 2021-12-01 A
3 2021-12-04 B
5 2021-12-06 A
7 2021-12-08 C