Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Select all rows of a dataframe where exactly M columns in any order satisfy a condition based on N columns

I want to select all the rows of a dataset where exactly M columns satisfy a condition based on N columns (where N >= M).

Consider the following dataset

import numpy as np
import pandas as pd
import random

years = [2000, 2001, 2002]
products = ["A", "B", "C"]

num_combos = 10

years = [random.choice(years) for i in range(num_combos)]
products = [random.choice(products) for i in range(num_combos)]

sum_values = list(range(0, num_combos))
sum_values1 = list(range(0, num_combos))

bools = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool1 = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool2 = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool3 = [bool(random.getrandbits(1)) for i in range(num_combos)]



random.shuffle(sum_values)
av_values = [random.randrange(0, num_combos, 1) for i in range(num_combos)]

cols = {"years": years,
        "products": products,
        "y0": bools,
        "y1": bool1,
        "y2": bool2, 
        "y3": bool3,
        "value": av_values}

df = pd.DataFrame(cols)

The code below selects conditions where at least one (or more) of the columns (y0, y1, y2, y3) are True. However, I want to select rows where exactly 2 (analogously 1 or 3) of the columns (y0, y1, y2, y3) are True. Of course one could find all the combinations and use that as a mask, but is there a smarter solution?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

cols = ["y0", "y1", "y2", "y3"]

# At least one
df1 = df[(df[cols] == True).any(axis=1)]

# All of the columns
df2 = df[(df[cols] == True).all(axis=1)]

>Solution :

You can use sum to count the number of True (they are implicitly converted to 1s), and eq to compare to the desired number:

M = 2
df1 = df[df[cols].sum(axis=1).eq(M)]

NB. If you already have booleans, no need to compare to True.

Output for M=2:

   years products     y0     y1     y2    y3  value
2   2000        A  False  False   True  True      3
7   2002        C   True  False  False  True      0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading