Say I have a pandas DataFrame like the one below (with millions of rows):
import pandas as pd
data = {'s3_path': ['s3://mybucket/date=2023-10-26/f1.txt', 's3://mybucket/date=2023-10-25/f2.txt', 's3://mybucket/date=2023-10-24/f3.txt', 's3://mybucket/date=2023-10-23/f4.txt']}
df = pd.DataFrame(data)
I want to keep only the S3 paths whose date is before 2023-10-24. What would be an efficient way to do that in pandas? I don't know much about pandas; the best I can come up with is below, but it is still incomplete:
from datetime import datetime
date_cutoff_str = '2023-10-24'
date_cutoff_obj = datetime.strptime(date_cutoff_str, '%Y-%m-%d')
def is_before(cur_date, cutoff_date):
    if cur_date < cutoff_date:
        return True
    return False
date_regex_pattern = r'\d{4}-\d{2}-\d{2}'
# incomplete: how do I pass the regex match in as cur_date?
filtered_df = df.apply(is_before, cur_date=how_do_i_get_regex_value_here, cutoff_date=date_cutoff_obj)
Any suggestion/answer would be greatly appreciated. Thank you.
>Solution :
One possible solution uses a regex to extract the dates as strings, converts them to datetime, and compares them with 2023-10-24:
df.loc[
    pd.to_datetime(
        df['s3_path'].str.extract(r'date=(\d{4}-\d{2}-\d{2})')[0]
    ).lt('2023-10-24')
]
Output:
                                s3_path
3  s3://mybucket/date=2023-10-23/f4.txt
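For reference, the same approach can be sketched as a self-contained script split into separate steps. The names `cutoff`, `date_strings`, `dates`, and `mask` are illustrative and not part of the original answer, and `errors='coerce'` is an added assumption so that any path without a `date=` segment becomes NaT and is excluded instead of raising an error.

```python
import pandas as pd

df = pd.DataFrame({'s3_path': [
    's3://mybucket/date=2023-10-26/f1.txt',
    's3://mybucket/date=2023-10-25/f2.txt',
    's3://mybucket/date=2023-10-24/f3.txt',
    's3://mybucket/date=2023-10-23/f4.txt',
]})

cutoff = pd.Timestamp('2023-10-24')

# Step 1: pull the YYYY-MM-DD part out of every path in one vectorized call.
# expand=False returns a Series (not a DataFrame) for a single capture group.
date_strings = df['s3_path'].str.extract(r'date=(\d{4}-\d{2}-\d{2})', expand=False)

# Step 2: parse the strings; an explicit format avoids per-row format guessing.
# Assumption: unparseable or missing values should become NaT, not raise.
dates = pd.to_datetime(date_strings, format='%Y-%m-%d', errors='coerce')

# Step 3: build a boolean mask and filter. NaT compares as False, so rows
# without a usable date are dropped.
mask = dates < cutoff
filtered_df = df[mask]

print(filtered_df)
```

Because zero-padded `YYYY-MM-DD` strings sort in chronological order, comparing `date_strings` with the cutoff string directly should give the same mask on well-formed input and skips the parsing step, which may matter with millions of rows.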