I am having a problem in Python efficiently getting the next row in a dataframe that has a matching ID number.
An example of what I’m working with looks like this:
ids = [1,2,3,4,5]
customer_ids = [3,7,5,7,3]
dates = ['2020-04-03', '2020-04-07','2020-04-14','2020-05-02','2020-05-06']
df = pd.DataFrame({ 'ID': ids,
'Customer_ID': customer_ids,
'Date': dates})
What I want for every row is to get the value of the next date that customer_ID had an entry in the table, with a 0 or NULL value for customer_IDS that have no next entry. For instance in this case I would want the result as follows:
result = ['2020-05-06','2020-05-02',NULL,NULL,NULL]
I need a solution to this which scales well, as the actual data table I have is much larger than this.
The most successful method I have employed so far has been as follows (assuming table constructed as above):
result = []
iter = 0
for x in ids:
customer_id = customer_id[iter]
next_date_list = df['Date'][(df['Customer_ID'] == customer_id) * (df['ID'] > x)]
id len(next_date_list == 0:
result.append(NULL)
else:
result.append(next_date_list[0])
iter = iter + 1
While this works well on smaller scale data obviously as the number of rows increases the runtime goes up dramatically. I need a method which will work better over very large datasets.
>Solution :
Try groupby.transform + shift:
result = df.groupby("Customer_ID")["Date"].transform(lambda x: x.shift(-1)).to_list()
print(result)
Prints:
['2020-05-06', '2020-05-02', None, None, None]