Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Python Pandas: Select multiple related rows from dataframe using comparisons across rows

I have data like this:

In[1]: pd.DataFrame({'ID':["A", 'A', 'A', 'B', 'B', 'B','C'], 'Test':["e2z", 'e2z', 'b6r', 'p0o', 'r5t', 'qi4','x3w'], 'Date':["2022", '2022', '2020', '2019', '2019', '2018', '2023'], 'Success':['1', '0', '1', '0', '1', '0','0'], 'Experiment Parameters': ["awa", '02s', 'ksa', 'fkd', 'efe', 'awe','loa']})

Out[1]:  

   ID   Test Date   Success Experiment Parameters
0   A   e2z 2022    1       awa
1   A   e2z 2022    0       02s
2   A   b6r 2020    1       ksa
3   B   p0o 2019    0       fkd
4   B   r5t 2019    1       efe
5   B   qi4 2018    0       awe
6   C   x3w 2023    0       loa

Each row presents a finding from the corresponding test.

I need code that will, for each ID, extract out the test with most recent date. If there are two tests with the most recent dates, the test with the most total successes should be selected. Therefore, the most recent and most successful test is selected, presenting all the findings from that test.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

In this example data, I want the output to be:

In[2]: pd.DataFrame({'ID':["A", 'A', 'B','C'], 'Test':["e2z", 'e2z', 'r5t', 'x3w'], 'Date':["2022", '2022', '2019',   '2023'], 'Success':['1', '0',  '1', '0'], 'Experiment Parameters': ["awa", '02s', 'efe','loa']})

Out[2]:
    ID  Test    Date    Success Experiment Parameters
0   A   e2z     2022    1       awa
1   A   e2z     2022    0       02s
2   B   r5t     2019    1       efe
3   C   x3w     2023    0       loa

I’ve tried my hand at aggregate and grouping python functions following Get the row(s) which have the max value in groups using groupby like this:

aggre = {'Date': 'unique', 'Success': 'sum'}
idx = input_df.groupby(['Test'])['Date'].transform(max) == input_df['Date']
input_df = input_df[idx].groupby(['Test']).aggregate(aggre)

but these solutions force the rows to be combined, and I need to just subselect rows. I can’t simply have the Experiment Parameters variable be condensed with the aggregate functions either since I need each row to serve as an independent data point to a model. I can’t use solutions from Python Pandas: select rows based on comparison across rows since I need possibly multiple rows to be preserved. Using methods like .apply(helper_function) don’t show promise since my decisions to select rows depend on the values in other rows. I can’t find any other tricks and functions to subselect rows in the dependent manner I need to perform.

How can I achieve my desired dataframe?

>Solution :

I perform a sort so that the most recent and successful test is on top of each group of ID’s. Then I use duplicated to grab the first row of each ID group and then perform a self merge to only grab the rows with the best test.

df = df.sort_values(["ID", "Date", "Success"], ascending=[True, False, False])

best_test = df.loc[~df["ID"].duplicated()][['ID', 'Test']]

df2 = df.merge(best_test, on = ['ID', 'Test'])
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading