Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

flag rows in a dataframe conditionally on values of other rows

Consider the dataframe df_in = pd.DataFrame({'id': [0,1,2,2,3,4,4,4], 'channel': [0,1,0,1,3,2,3,1]})

> df_in

   id  channel
0   0        0
1   1        1
2   2        0
3   2        1
4   3        3
5   4        2
6   4        3
7   4        1

And a list of 2-tuples allowed_channel_couples = ((0,1),(2,3)).

I’d like to obtain a new dataframe df_out as:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

   id  channel  flag
0   0        0     0
1   1        1     0
2   2        0     1
3   2        1     1
4   3        3     0
5   4        2     1
6   4        3     1
7   4        1     0

Each rows in df_out equals the rows in df_in. However, df_out contains a new column, named flag. The flag value of a row in df_out is equal 1 only if another row with same id exists whose channel value ‘matches’ the channel value of the first row, according to allowed_channel_couples.

Hence consider the row with index 5 of df_in. This row has id equal 4 and channel equal 2. Since the row at index 6 has id equal 4 and channel value equal 3, and (2,3) is in allowed_channel, both the row with index 5 and the row with index 6 are marked with the flag 1.
On the other hand, the row with index 7 of df_in has id equal 4 and channel equal 1. The allowed couple for flagging in this case is (0,1). No other rows with id equal 4 and channel equal 0 exists, hence the row with index 7 is marked with a 0.

I’m having troubles in solving this problem avoiding python’s slow loops. Any idea on how I can obtain the flags column efficiently?

Ideas:

  • I’ve tried working with the agg method. Hence I’m able to obtain a table like:
   id    channel
0   0        [0]
1   1        [1]
2   2     [0, 1]
3   3        [3]
4   4  [2, 3, 1]

through df_in.groupby('id', as_index=False).agg({'channel': lambda x: tuple(x)}).

>Solution :

You can map the values from one of the tuple values to the other, then identify the duplicates:

d = dict(allowed_channel_couples)
# {0: 1, 2: 3}

df_in['flag'] = (df_in
 .assign(channel=df_in['channel'].map(d).fillna(df_in['channel']))
 .duplicated(['id', 'channel'], keep=False)
 .astype(int)
)

output:

   id  channel  flag
0   0        0     0
1   1        1     0
2   2        0     1
3   2        1     1
4   3        3     0
5   4        2     1
6   4        3     1
7   4        1     0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading