unique value counts of products purchased by the customer in previous year

October 17, 2022

I am building a function that creates an aggregate table with certain variables developed from df. I think I got most things right, but I am struggling understanding how to count a number of distinct products bought by the customer in previous year.

Example of my df:

df = pd.DataFrame({'year': [2020, 2021, 2021, 2021, 2022],
                   'id': [3, 1, 1, 2, 1],
                   'price': [12, 20, 30, 40, 50],
                   'age': [20, 30, 30, 44, 31],
                   'product': [book, toy, book, tv, book],
                   'quantity': [2,1,2,5,9})

here is my code so far:

df['revenue'] = df.price*df.quantity
df['year'] = pd.DatetimeIndex(df['date']).year 

def table(df):
   return df.groupby(['year','id'])\
    .agg(revenue = ('revenue', 'sum'), age = ('age', 'unique'), product_year_before = ('product', 'nunique'))

It counts products correctly, but it needs to be for previous year and not for current year.

>Solution :

For previous year you can subtract one year in MultiIndex:

df1 = table(df)

s = df1['product_year_before'].rename(lambda x: x - 1, level=0)
print (s)
year  id
2019  3     1
2020  1     2
      2     1
2021  1     1
Name: product_year_before, dtype: int64

All together – if need new column use DataFrame.join with s Series:

def table(df):
   df1 = (df.groupby(['year','id'])
            .agg(revenue = ('revenue', 'sum'), 
                 age = ('age', 'unique'), 
                 product_year_before = ('product', 'nunique')))
   s = df1['product_year_before'].rename(lambda x: x - 1, level=0)
   return df1.drop('product_year_before', axis=1).join(s)
                       
df1 = table(df)
print (df1)
         revenue   age  product_year_before
year id                                    
2020 3        24  [20]                  NaN
2021 1        80  [30]                  1.0
     2       200  [44]                  NaN
2022 1       450  [31]                  NaN

With original column:

def table(df):
   df1 = (df.groupby(['year','id'])
            .agg(revenue = ('revenue', 'sum'), 
                 age = ('age', 'unique'), 
                 product_year= ('product', 'nunique')))
   s = df1['product_year'].rename(lambda x: x - 1, level=0)
   return df1.join(s.rename('product_year_before'))
                       
df1 = table(df)
print (df1)
         revenue   age  product_year  product_year_before
year id                                                  
2020 3        24  [20]             1                  NaN
2021 1        80  [30]             2                  1.0
     2       200  [44]             1                  NaN
2022 1       450  [31]             1                  NaN