I have a dataset like this:
df = pd.DataFrame({'customer_key': [13453, 16345, 14643, 15346, 13453],
'data_purchased': ['08-07-2021','06-07-2021','05-09-2020','02-09-2021','01-04-2020'],
'price_value': [56, 45, 29, 22, 43]})
Glimpse of the dataset:
customer_key data_purchased price_value
13453 08-07-2021 56
16345 06-07-2021 45
14643 05-09-2020 29
15346 02-09-2021 22
13453 01-04-2020 43
I want to divide this dataset into 2 different datasets. First dataset will contain only those rows where customer_keys values are repeated (ex- customer_key = 13453 in above dataset). and the second dataset contains only those rows where there is no repeated values of customer_key. How to do that in pandas ??
>Solution :
Here’s one approach:
repeat_customer_mask = df.groupby("customer_key")["customer_key"].transform("size") > 1
Then you can split your DataFrame like so:
In [6]: repeat_customers = df[repeat_customer_mask]
In [7]: repeat_customers
Out[7]:
customer_key date_purchased price_value
0 13453 08-07-2021 56
4 13453 01-04-2020 43
In [8]: first_time_customers = df[~repeat_customer_mask]
In [9]: first_time_customers
Out[9]:
customer_key date_purchased price_value
1 16345 06-07-2021 45
2 14643 05-09-2020 29
3 15346 02-09-2021 22
From there, you’d probably want to .reset_index().