Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to efficiently subtract years from dates in python?

I am subtracting years from date column in Python which felt a lot slower than R which is usually not the case, so I am wondering is there a faster/ efficient way of doing this in python ?

(As after years of working in R I am moving back to python, so my python skills are not good anymore & looking for code optimization in python).

python code:

import numpy as np
import pandas as pd
import datetime as dt

import time

Data I am showing below is just dummy to give an idea about date format I am working with (Row count in my original data is: 466285).

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df = pd.DataFrame({'date_str': ['Apr-84','Mar-66','May-85']})

df['date'] = pd.to_datetime(df['date_str'], format = '%b-%y')

As I was getting some wrong dates in years terms like year: 2066, 2085 etc. so wrote a small function to correct dates as per my need:

# year subtraction function
def date_correction(x):
    if x > pd.to_datetime('2017-12-01'):
        x = (x - pd.to_timedelta(100 * 365.24, unit='d')) 
    else:
        x = x
        
    return x

start = time.time()
df['date'] = df['date'].apply(date_correction)
end = time.time()

print("Time taken: ", end - start)

Time taken: 32.958526611328125

Above time is in seconds I think, as it took a lot of time in completing this and that made me to time this in R as well.

R code:

library(tidyverse)
library(lubridate)
library(microbenchmark)
df = data.frame(date_str = c('Apr-84','Mar-66','May-85'))

df <- df %>% 
  mutate(date = lubridate::my(date_str)) 

subtract & time operation:

mbm <- microbenchmark( 'date_subt' = {
  df <- df %>% 
        mutate(date = if_else( 
                                         df$date > ymd('2017-12-01'),
                                         df$date %m-% years(100),
                                         df$date
                                        ))
                                      }
                      )
mbm

results:

Unit: milliseconds

expr      min      lq       mean     median    uq       max      neval
date_subt 324.3357 331.2416 372.4745 338.8745  396.3026 744.4625 100
autplot(mbm)

enter image description here

>Solution :

A vectorized way using boolean mask and DateOffset:

df.update(df.loc[df['date'] > '2017-12-01', 'date'] - pd.DateOffset(years=100)

A more concise way (without update)

df.loc[df['date'] > '2017-12-01', 'date'] -= pd.DateOffset(years=100)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading