Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to remove rows from a dataframe if the value of a cell is a subset of another cell in the same column?

I have a dataframe like this (note: C is the count of values in A):

A                     B   C
ex, one, two, three   X1  4
ex, one, two          X2  3
one, two, four        X3  3
ex, three             X4  2
four, ex              X5  2
four, one             X6  2

I want to delete all rows in which the value in column A is a subset of another value in column A.
So the result should look like

A                     B   C
ex, one, two, three   X1  4
one, two, four        X3  3
four, ex              X5  2

I tried the following approach:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

dataframe[np.sum(np.array([[y in x for x in dataframe.A.values] for y in dataframe.A.values]),1)==1]

However, this only filters out values that have the same order, e.g. "ex, one, two" and "ex, one, two, three" but not "ex, two, three".

Looking for help 🙂

>Solution :

IIUC, you could convert your column to sets, then, assuming prior sorting by decreasing C, you can check for each set if it is the subset of any prior set. Finally, slice the dataframe:

l = [set(s.split(', ')) for s in df['A']]
# [{'ex', 'one', 'three', 'two'},
#  {'ex', 'one', 'two'},
#  {'four', 'one', 'two'},
#  {'ex', 'three'},
#  {'ex', 'four'},
#  {'four', 'one'}]

notsub = [not any(s.issubset(x) for x in l[:i]) for i,s in enumerate(l)]
# [True, False, True, False, True, False]

df2 = df[notsub]

output:

                     A   B  C
0  ex, one, two, three  X1  4
2       one, two, four  X3  3
4             four, ex  X5  2

if possible duplicated sets

In this case, you also need to compare with the sets of equal size, the best is to use frozenset and pandas.Series.duplicated:

l = [frozenset(s.split(', ')) for s in df['A']]
notsub = [not any(s.issubset(x) for x in l[:i]) for i,s in enumerate(l)]

df2 = df[notsub & ~pd.Series(l).duplicated()]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading