Consider a sample pandas dataframe as below –
import pandas as pd
money = [100,200,300,400,500,600,700,800,900,1000,1100,1200]
batch_code = ['B1_2023','B2_2023','B3_2023','B4_2023','B1_2024','B2_2024','B3_2024','B4_2024','B1_2025','B2_2025','B3_2025','B4_2025']
test_df = pd.DataFrame([money,batch_code]).T
test_df.columns = ['money','batch_code']
test_df.money = test_df.money.astype(int)
Here is the how the data looks like –
I want to filter this data on column batch_code using 2 parameters – from_batch and to_batch.
For example –
from_batch = B3_2023
to_batch = B4_2024
Then my output data frame should consist of all the records that are in between B3_2024 and B3_2024. The batches represents quarters but are coded with prefix B. How can I achieve this in pandas?
I tried writing
test_df[test_df.batch_code.between('B3_2023','B4_2024')]
but I got this as output, which seems to be incorrect –
The correct output would be –
NOTE – The maximum range of prefix quarters is B1 to B4 only.
>Solution :
If you can’t change the data source, you can use PeriodIndex with boolean indexing :
pidx = pd.PeriodIndex(
test_df["batch_code"].str.replace(r"B(\d)_(\d+)", r"\2Q\1", regex=True), freq="Q"
)
from_batch, to_batch = "2023Q3", "2024Q4"
out = test_df.loc[pd.Series(pidx).between(from_batch, to_batch)]
Output :
print(out)
money batch_code
2 300 B3_2023
3 400 B4_2023
4 500 B1_2024
5 600 B2_2024
6 700 B3_2024
7 800 B4_2024


