Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Count number of matches in pairs of pandas dataframe rows

I have been trying to count the number of times different values in a row of dataframe matches with column-wise values in other rows and provide an output. To illustrate, I have a dataframe (df_testing) as follows:

import pandas as pd
df_testing = pd.DataFrame([
    [0,23,1, 3, 4,2],
    [1,33,3, 2, 4,3],
    [2,40,1, 2, 4,2]],
    columns=['SN','Age',  'Col_1', 'Col_2', 'Col_3','Col_4'])

which gives the following table:

Index|SN |Age |Col_1|Col_2|Col_3|Col_4|
---- |---|----|---- |-----|-----|-----|
0    |0  |23  |1    |3    |4    |2    |
1    |1  |33  |3    |2    |4    |3    |
2    |2  |40  |1    |2    |4    |2    |

I am looking to count the number of exact matches among rows for values in Col_1 to Col_4. For example, Row 0 has just one match with Row 1 (4 and 4 in Col_3) while Row 0 has 3 matches with Row2 (1,1; 4,4, and 2,2). As such, I am aiming for an output (preferably csv file) with all unique pairs like below (the rightmost column shows the number of counts matched):

SN_A|SN_B|Age_A|Age_B|Matched_Count|
----|----|-----|-----|-------------|
0   |1   |23   |33   |    1        |
0   |2   |23   |40   |    3        |
1   |2   |33   |40   |    2        |

I am thinking that this would require a loop and so far, due to my lack of proficiency, what I have managed to do is nowhere near what I want to achieve. I have somehow managed to get the unique pairs printed with the following lines:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

length = len(df_testing)
for x in range(length):
    # print(x)
    for y in range(2,6,1):
        a= df_testing.iloc[x][y]
        for m in range(length):
            if m>x:
                b= df_testing.iloc[m][y]
                print(a,b)

This just prints out the respective values in pairs (e.g. 1,3; 1,1;3,2 etc).

Hence, any guidance to produce the output like shown above will be highly appreciated.

>Solution :

You could use itertools.combinations, a dictionary comprehension and the Series constructor:

from itertools import combinations

df2 = df_testing.set_index(['SN', 'Age'])
out = (pd.Series({(*a, *b): (df2.loc[a]==df2.loc[b]).sum()
                  for a,b in combinations(df2.index, r=2)
                  })
         .rename_axis(('SN_A', 'Age_A', 'SN_B', 'Age_B'))
         .reset_index(name='Matched_Count')
       )

output:

   SN_A  Age_A  SN_B  Age_B  Matched_Count
0     0     23     1     33              1
1     0     23     2     40              3
2     1     33     2     40              2
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading