Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to group by and find new or disappearing items

I am trying to assess in a sales database whether the # of advertisements has changed.
The example dataframe I am using is as such:

df = pd.DataFrame({"offer-id": [1,1,2,2,3,4,5], "date": ["2024-02-10","2024-02-11","2024-02-10","2024-02-11","2024-02-11","2024-02-11","2024-02-10"], "price": [30,10,30,30,20,25,20]})

And looks like the below:

Dataframe

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I am now trying to get the # of items that were sold or newly added (I don’t care which one, since once I have one the other should be failry easily computable).

E.g. in a perfect case the next piece of code tells me that on 10th of February 3 offers were online (ID 1, 2, and 5) and one was sold (ID 5)
Or alternatively, it tells me on 11th of February 4 offers are online, and 2 of them are new (from that, since I know the day before 5 were online I can also calculate that one must have sold)

Is there a simple way of doing this?
I have tried things like

df.groupby(['date'])["offer-id"].agg({'nunique'})

but they are missing the "comparison to previous" timestep component.

>Solution :

You could aggregate as a set:

offers = df.groupby('date', sort=True)['offer-id'].agg(set)

date
2024-02-10       {1, 2, 5}
2024-02-11    {1, 2, 3, 4}
Name: offer-id, dtype: object

Then getting the diff will give you the new items:

offers.diff()

date
2024-02-10       NaN
2024-02-11    {3, 4}
Name: offer-id, dtype: object

Or the sold items:

offers.diff(-1)

date
2024-02-10    {5}
2024-02-11    NaN
Name: offer-id, dtype: object

If you want the number of items, chain str.len:

offers.diff().str.len().fillna(0).convert_dtypes()

date
2024-02-10    0
2024-02-11    2
Name: offer-id, dtype: Int64

And to get those as new columns, map:

df['new'] = df['date'].map(offers.diff().str.len().fillna(0).convert_dtypes())
df['sold'] = df['date'].map(offers.diff(-1).str.len().fillna(0).convert_dtypes())

print(df)

Output:

   offer-id        date  price  new  sold
0         1  2024-02-10     30    0     1
1         1  2024-02-11     10    2     0
2         2  2024-02-10     30    0     1
3         2  2024-02-11     30    2     0
4         3  2024-02-11     20    2     0
5         4  2024-02-11     25    2     0
6         5  2024-02-10     20    0     1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading