Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to do I count the number of string combinations in each row of a pandas dataframe?

I’m trying to count the number of times a combination of strings appear in each row of a dataframe. Each ID uses a number of methods (some IDs use more methods than others) and I want to count the number of times any two methods have been combined together.

# df is from csv and has blank cells - I've used empty strings to demo here
df = pd.DataFrame({'id': ['101', '102', '103', '104'],
    'method_1': ['HR', 'q-SUS', 'PEP', 'ET'],
    'method_2': ['q-SUS', 'q-IEQ', 'AUC', 'EEG'],
    'method_3': ['SC', '', 'HR', 'SC'],
    'method_4': ['q-IEQ', '', 'ST', 'HR'],
    'method_5': ['PEP', '', 'SC', '']})

print(df)

    id method_1 method_2 method_3 method_4 method_5
0  101       HR    q-SUS       SC    q-IEQ      PEP
1  102    q-SUS    q-IEQ                           
2  103      PEP      AUC       HR       ST       SC
3  104       ET      EEG       SC       HR         

I want to end up with a table that looks something like this:
| Method A | Method B | Number of Times Combined|
| :——: | :——: | :———————: |
| HR | SC | 3 |
| HR | q-SUS | 1 |
| HR | PEP | 2 |
| q-IEQ | q-SUS | 2 |
| EEG | ET | 1 |
| EEG | SC | 1 |
| etc. | etc. | etc. |

So far I’ve been trying variations of this code using itertools.combinations and collections Counter:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import numpy as np
import pandas as pd
import itertools
from collections import Counter

def get_all_combinations_without_nan(row):
    # remove nan - this is for the blank csv cells
    set_without_nan = {value for value in row if isinstance(value, str)}

    # generate all combinations of values in row
    all_combinations = []
    for index, row in df.iterrows():  
        result = list(itertools.combinations(set_without_nan, 2))
        all_combinations.extend(result)

    return all_combinations

# get all possible combinations of values in a row
all_rows = df.apply(get_all_combinations_without_nan, 1).values
all_rows_flatten = list(itertools.chain.from_iterable(all_rows))

count_combinations = Counter(all_rows_flatten)

print(count_combinations)

It’s doing something, but it seems to be counting multiple times or something (it’s counting more combinations than are actually there. I’ve had a good look on Stack, but can’t seem to solve this – everything seems really close though!

I hope someone can help – Thanks!

>Solution :

Use DataFrame.melt for reshape with remove empty strings or missing values, then use DataFrame.merge for all combinations, remove rows with same methods and count by GroupBy.size:

df1 = df.melt('id', value_name='method_')
df1 = df1[(df1["method_"] != '') & (df1["method_"].notna())]

df = (df1.merge(df1, on='id', suffixes=('A','B'))
          .query("method_A != method_B")
          .groupby(['method_A','method_B'])
          .size()
          .reset_index(name='Number of Times Combined'))
print (df.head(20))
   method_A method_B  Number of Times Combined
0       AUC       HR                         1
1       AUC      PEP                         1
2       AUC       SC                         1
3       AUC       ST                         1
4       EEG       ET                         1
5       EEG       HR                         1
6       EEG       SC                         1
7        ET      EEG                         1
8        ET       HR                         1
9        ET       SC                         1
10       HR      AUC                         1
11       HR      EEG                         1
12       HR       ET                         1
13       HR      PEP                         2
14       HR       SC                         3
15       HR       ST                         1
16       HR    q-IEQ                         1
17       HR    q-SUS                         1
18      PEP      AUC                         1
19      PEP       HR                         2
    
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading