Pandas: How to remove rows with duplicate compound keys, while keeping missing values distributed among the duplicates?

Desired outcome

I have a table of data that looks like this:

[image: the source table, containing duplicate (ID, Event #) rows]

And I want to transform that table to look like this:


[image: the desired output table, one row per (ID, Event #) pair]

Problem description

The ID and Event # fields form a compound key that identifies one unique entry in the table.

An entry can appear two or more times. Some of the row values are distributed among those duplicates, and I don’t always know whether a given value sits in the "first", "last", or some "middle" duplicate.

I want to remove the duplicate entries, while keeping all the populated row values, regardless of where they’re distributed amongst the duplicates.

How can I do this with Pandas?

Looking at some SO posts, I think I need to use groupby and fillna or ffill/bfill. But I’m new to Pandas and don’t understand how to make that work under these conditions:

  1. Rows are distinguished with a compound key
  2. There are instances where there’s more than 1 duplicate row
  3. There’s valid data in more than 1 field distributed across those duplicates
  4. I don’t always know if the valid row data is located in the "first", "last", or some "middle" duplicate
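In miniature, the transformation I want looks like this (illustrative rows, not the real data, and `merge_rows` is just a hypothetical helper spelling out the intent, not something I already have):

import pandas as pd  # not needed here, but the real data is a pandas DataFrame

# Two rows share the compound key ('ABC114', 1); each holds different
# populated fields. The result should be one row carrying every populated value.
before = [
    {'ID': 'ABC114', 'Event #': 1, 'Start Value': 'Test Value A', 'End Value': None},
    {'ID': 'ABC114', 'Event #': 1, 'Start Value': None, 'End Value': 'Test Value B'},
]

def merge_rows(rows):
    # Keep the first non-None value seen for each column.
    merged = {}
    for row in rows:
        for key, value in row.items():
            if merged.get(key) is None:
                merged[key] = value
    return merged

print(merge_rows(before))
# {'ID': 'ABC114', 'Event #': 1, 'Start Value': 'Test Value A', 'End Value': 'Test Value B'}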

Here’s the dataframe:

import pandas as pd

df = pd.DataFrame(
    [['ABC111', 1, '1/1/23 12:00:00', None,           '1/1/23 13:30:00', None],
     ['ABC111', 2, '1/2/23 00:00:00', None,           '1/2/23 13:30:00', None],
     ['ABC111', 3, '1/3/23 00:00:00', None,           '1/3/23 13:30:00', None],
     ['ABC112', 1, '1/1/23 00:00:00', None,           '1/1/23 13:30:00', None],
     ['ABC112', 2, '1/2/23 00:00:00', 'Test Value A', None,              None],
     ['ABC112', 2, '1/2/23 00:00:00', 'Test Value A', None,              None],
     ['ABC112', 2, None,              None,           '1/2/23 13:30:00', 'Test Value B'],
     ['ABC113', 1, '1/1/23 00:00:00', None,           '1/1/23 13:30:00', None],
     ['ABC113', 2, '1/2/23 00:00:00', None,           '1/2/23 13:30:00', None],
     ['ABC113', 3, None,              None,           '1/3/23 13:30:00', 'Test Value B'],
     ['ABC113', 3, '1/3/23 00:00:00', 'Test Value A', None,              None],
     ['ABC114', 1, '1/1/23 00:00:00', 'Test Value A', None,              None],
     ['ABC114', 1, None,              None,           '1/1/23 13:30:00', 'Test Value B'],
     ['ABC114', 1, None,              None,           '1/1/23 13:30:00', 'Test Value B'],
     ['ABC114', 1, None,              None,           '1/1/23 13:30:00', 'Test Value B'],
     ['ABC114', 1, None,              None,           '1/1/23 13:30:00', 'Test Value B'],
     ['ABC114', 2, '1/2/23 00:00:00', None,           '1/2/23 13:30:00', None],
     ['ABC114', 3, '1/3/23 00:00:00', None,           '1/3/23 13:30:00', None]],
    columns=['ID', 'Event #', 'Start Date', 'Start Value', 'End Date', 'End Value'])
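As a quick sanity check before collapsing anything, duplicated compound keys can be counted with DataFrame.duplicated; a small stand-in frame (not the real data) shows the idea:

import pandas as pd

# Stand-in frame: rows 0 and 1 share the compound key ('ABC112', 2).
df = pd.DataFrame(
    [['ABC112', 2, 'Test Value A'],
     ['ABC112', 2, None],
     ['ABC113', 3, 'Test Value A']],
    columns=['ID', 'Event #', 'Start Value'])

# keep=False marks every row whose (ID, Event #) pair occurs more than once.
dupes = df.duplicated(subset=['ID', 'Event #'], keep=False)
print(int(dupes.sum()))  # 2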

This SO post is the closest potential solution I could find: Pandas: filling missing values by mean in each group

Solution:

It looks like you want groupby.first, which takes the first non-null value in each column within each group; that is exactly what merges the values scattered across the duplicates:

out = df.groupby(['ID', 'Event #'], as_index=False).first()

Output:

        ID  Event #       Start Date   Start Value         End Date     End Value
0   ABC111        1  1/1/23 12:00:00          None  1/1/23 13:30:00          None
1   ABC111        2  1/2/23 00:00:00          None  1/2/23 13:30:00          None
2   ABC111        3  1/3/23 00:00:00          None  1/3/23 13:30:00          None
3   ABC112        1  1/1/23 00:00:00          None  1/1/23 13:30:00          None
4   ABC112        2  1/2/23 00:00:00  Test Value A  1/2/23 13:30:00  Test Value B
5   ABC113        1  1/1/23 00:00:00          None  1/1/23 13:30:00          None
6   ABC113        2  1/2/23 00:00:00          None  1/2/23 13:30:00          None
7   ABC113        3  1/3/23 00:00:00  Test Value A  1/3/23 13:30:00  Test Value B
8   ABC114        1  1/1/23 00:00:00  Test Value A  1/1/23 13:30:00  Test Value B
9   ABC114        2  1/2/23 00:00:00          None  1/2/23 13:30:00          None
10  ABC114        3  1/3/23 00:00:00          None  1/3/23 13:30:00          None
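If the ffill/bfill route mentioned in the question feels more natural, a roughly equivalent sketch (my own spelling, not from the answer) fills within each compound-key group and then drops the now-identical duplicates. A minimal frame with one duplicated key illustrates it:

import pandas as pd

df = pd.DataFrame(
    [['ABC113', 3, None,           'Test Value B'],
     ['ABC113', 3, 'Test Value A', None]],
    columns=['ID', 'Event #', 'Start Value', 'End Value'])

# Within each (ID, Event #) group, propagate values down then up so every
# duplicate row carries the full set of populated values, then keep one row.
filled = (df.groupby(['ID', 'Event #'])[['Start Value', 'End Value']]
            .transform(lambda s: s.ffill().bfill()))
out = (pd.concat([df[['ID', 'Event #']], filled], axis=1)
         .drop_duplicates(subset=['ID', 'Event #']))
print(out)

groupby.first is simply the shorter spelling of the same idea, since it also takes the first non-null value per column in each group.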