Choosing the earliest date per record when equal dates are present

I have a table with multiple dates per record. Example of the table:

identifier date value
a 1985-01-01 ex1
a 1985-01-01 ex2
a 1985-01-03 ex3
b 1990-01-05 ex1
b 1990-05-10 ex4
c 1987-01-01 ex1
c 1987-01-01 ex3
d 1986-01-01 ex1
d 1986-01-01 ex3
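
For anyone who wants to reproduce this, the table above can be built roughly as follows. This is only a sketch: it assumes the `date` column is parsed as datetime, since the original dtype is not shown.

```python
import pandas as pd

# Example table from the question; `date` is assumed to be a datetime column.
df = pd.DataFrame({
    "identifier": ["a", "a", "a", "b", "b", "c", "c", "d", "d"],
    "date": pd.to_datetime([
        "1985-01-01", "1985-01-01", "1985-01-03",
        "1990-01-05", "1990-05-10",
        "1987-01-01", "1987-01-01",
        "1986-01-01", "1986-01-01",
    ]),
    "value": ["ex1", "ex2", "ex3", "ex1", "ex4", "ex1", "ex3", "ex1", "ex3"],
})
print(df)
```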

I found out how to extract the earliest date in a group using:

df2 = df.loc[df.groupby('identifier')['date'].idxmin()]

However, when two rows share the same earliest date, idxmin returns the first of them. Because the value column is sorted alphabetically, I always end up with the alphabetically first value.
I would like to randomize this behavior whenever the dates are equal, so that it picks:

  • the first value the 1st time
  • the second value the 2nd time
  • the third value (whenever present) the 3rd time

and then restart from the first value.

Is there a way to use the formula above together with a condition or a randomize method? How can I do that?

Expected output:

identifier date value
a 1985-01-01 ex1
b 1990-01-05 ex1
c 1987-01-01 ex3
d 1986-01-01 ex1

Solution:

You can keep only the rows with the earliest (minimum) date in each group, then sample one of them at random.

Either with two groupby calls:

out = (df[df['date'].eq(df.groupby('identifier')['date'].transform('min'))]
        .groupby('identifier').sample(n=1)
      )

Or with groupby.apply:

out = (df.groupby('identifier', group_keys=False)
         .apply(lambda g: g[g['date'].eq(g['date'].min())].sample(n=1))
      )

Example output:

  identifier       date value
1          a 1985-01-01   ex2
3          b 1990-01-05   ex1
6          c 1987-01-01   ex3
8          d 1986-01-01   ex3

Other possible output:

  identifier       date value
0          a 1985-01-01   ex1
3          b 1990-01-05   ex1
5          c 1987-01-01   ex1
7          d 1986-01-01   ex1
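
Because the pick is random, the result changes between runs. If a repeatable result is needed, a seed can be passed through `random_state`, which `groupby(...).sample` accepts in pandas 1.1 and later. The sketch below rebuilds the example table (assuming `date` is a datetime column) and uses an arbitrary seed of 0:

```python
import pandas as pd

# Example table; `date` is assumed to be a datetime column.
df = pd.DataFrame({
    "identifier": ["a", "a", "a", "b", "b", "c", "c", "d", "d"],
    "date": pd.to_datetime([
        "1985-01-01", "1985-01-01", "1985-01-03",
        "1990-01-05", "1990-05-10",
        "1987-01-01", "1987-01-01",
        "1986-01-01", "1986-01-01",
    ]),
    "value": ["ex1", "ex2", "ex3", "ex1", "ex4", "ex1", "ex3", "ex1", "ex3"],
})

# Keep only the rows whose date is the earliest within their identifier.
earliest = df[df["date"].eq(df.groupby("identifier")["date"].transform("min"))]

# A fixed seed (0 is arbitrary) makes the random pick repeatable across runs.
out = earliest.groupby("identifier").sample(n=1, random_state=0)
print(out)
```

Changing the seed, or leaving `random_state` out, gives a different draw among the tied rows.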

Iterating over all possibilities

If you want to iterate over the first row with the earliest date, then the second, etc. (which is not random):

g1 = df.groupby('identifier')['date']
g2 = df.groupby(['identifier', 'date'])

df2 = (df[df['date'].eq(g1.transform('min'))]
       .assign(n=g2.cumcount())
      )

for n, g in df2.groupby('n'):
    print(f'iteration: {n+1}')
    print(g)

Output:

iteration: 1
  identifier       date value  n
0          a 1985-01-01   ex1  0
3          b 1990-01-05   ex1  0
5          c 1987-01-01   ex1  0
7          d 1986-01-01   ex1  0
iteration: 2
  identifier       date value  n
1          a 1985-01-01   ex2  1
6          c 1987-01-01   ex3  1
8          d 1986-01-01   ex3  1
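
Note that in the loop above an identifier with no ties (here `b`) only shows up in the first iteration. If every identifier should appear each time, with the tied rows cycled through and then restarted as asked in the question, one option is to wrap the iteration number around the number of tied rows per group. This is only a sketch: `pick_round` and its parameter `k` are hypothetical names, not part of pandas, and `date` is assumed to be a datetime column.

```python
import pandas as pd


def pick_round(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Return one earliest-date row per identifier for the k-th call (0-based).

    Tied rows are cycled through in row order, restarting once exhausted.
    """
    # Rows whose date is the earliest within their identifier.
    earliest = df[df["date"].eq(df.groupby("identifier")["date"].transform("min"))]
    grouped = earliest.groupby("identifier")
    pos = grouped.cumcount()                   # 0, 1, 2, ... within each tie
    size = grouped["date"].transform("size")   # number of tied rows per group
    # Wrap k around each group's tie count so the cycle restarts.
    return earliest[pos.eq(k % size)]


df = pd.DataFrame({
    "identifier": ["a", "a", "a", "b", "b", "c", "c", "d", "d"],
    "date": pd.to_datetime([
        "1985-01-01", "1985-01-01", "1985-01-03",
        "1990-01-05", "1990-05-10",
        "1987-01-01", "1987-01-01",
        "1986-01-01", "1986-01-01",
    ]),
    "value": ["ex1", "ex2", "ex3", "ex1", "ex4", "ex1", "ex3", "ex1", "ex3"],
})

for k in range(3):
    print(k + 1, pick_round(df, k)["value"].tolist())
# 1 ['ex1', 'ex1', 'ex1', 'ex1']
# 2 ['ex2', 'ex1', 'ex3', 'ex3']
# 3 ['ex1', 'ex1', 'ex1', 'ex1']
```

The caller keeps track of `k` (for example, a counter incremented on each run); the function itself is stateless.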