Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to remove similar strings as if they were duplicates from a dataframe?

I have the following df :

df=pd.DataFrame({
    'Q0_0': ["A vs. Z", "A vs. Bc", "B vs. Z", "B vs Bc", "Bc vs. A", "Bc vs. B", "Z vs. A", "Z vs. B", "C vs. A", "Bc vs. A"],
    'Q1_1': [np.random.randint(1,100) for i in range(10)],
    'Q1_2': np.random.random(10),
    'Q1_3': np.random.randint(2, size=10),
    'Q2_1': [np.random.randint(1,100) for i in range(10)],
    'Q2_2': np.random.random(10),
    'Q2_3': np.random.randint(2, size=10),
    'Q3_1': [np.random.randint(1,100) for i in range(10)],
    'Q3_2': np.random.random(10),
    'Q3_3': np.random.randint(2, size=10),
    'Q4_1': [np.random.randint(1,100) for i in range(10)],
    'Q4_2': np.random.random(10),
    'Q4_3': np.random.randint(2, size=10)
})

It has the following display:

Q0_0        Q1_1    Q1_2    Q1_3    Q2_1    Q2_2    Q2_3    Q3_1    Q3_2    Q3_3    Q4_1    Q4_2    Q4_3
0   A vs. Z     76  0.475198    0   31  0.785794    0   93  0.713219    0   31  0.549401    0
1   A vs. Bc    36  0.441907    0   28  0.008276    1   79  0.132327    0   61  0.657476    1
2   B vs. Z     68  0.474950    0   49  0.401341    1   1   0.409924    0   13  0.471476    0
3   B vs Bc     74  0.462356    0   42  0.762348    0   16  0.337623    1   76  0.548017    1
4   Bc vs. A    63  0.738769    1   34  0.340055    1   74  0.488053    1   84  0.663768    1
5   Bc vs. B    18  0.384001    1   75  0.188500    1   72  0.464784    1   32  0.355016    1
6   Z vs. A     34  0.700306    1   92  0.348228    1   99  0.347391    0   13  0.810568    0
7   Z vs. B     84  0.262367    0   11  0.217050    0   77  0.144048    0   44  0.262738    0
8   C vs. A     90  0.846719    1   53  0.603059    1   53  0.212426    1   86  0.515018    1
9   Bc vs. A    11  0.492974    0   76  0.351270    0   5   0.297710    1   40  0.185969    1

I want a rule allowing me to consider Z vs. A as duplicate of A vs. Z and so on for each b vs. a as a diplicate of a vs. b in column Q0_0.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Then proceed with removing those considered as duplicates.

Expected output is :

Q0_0          Q1_1  Q1_2    Q1_3    Q2_1    Q2_2    Q2_3    Q3_1    Q3_2    Q3_3    Q4_1    Q4_2    Q4_3
0   A vs. Z   76    0.475198    0   31  0.785794    0   93  0.713219    0   31  0.549401    0
1   A vs. Bc  36    0.441907    0   28  0.008276    1   79  0.132327    0   61  0.657476    1
2   B vs. Z   68    0.474950    0   49  0.401341    1   1   0.409924    0   13  0.471476    0
3   B vs Bc   74    0.462356    0   42  0.762348    0   16  0.337623    1   76  0.548017    1
8   C vs. A   90    0.846719    1   53  0.603059    1   53  0.212426    1   86  0.515018    1

There is a way to do that in my pandas dataframe ?

Any help from your side will be highly appreciated, thanks.

>Solution :

You can use str.extract (or str.split) to get the left/right parts around vs., then convert to frozenset and use duplicated for boolean indexing:

s = df['Q0_0'].str.extract('(\w+)\s*vs\.?\s*(\w+)').agg(frozenset, axis=1)
# or
# s = df['Q0_0'].str.split(r'\s*vs\.?\s*', expand=True).agg(frozenset, axis=1)

out = df[~s.duplicated()]

Output:

       Q0_0  Q1_1      Q1_2  Q1_3  Q2_1      Q2_2  Q2_3  Q3_1      Q3_2  Q3_3  Q4_1      Q4_2  Q4_3
0   A vs. Z    88  0.664299     0    99  0.102871     0    55  0.905342     0    55  0.789227     1
1  A vs. Bc    71  0.577607     0    99  0.784006     1    39  0.698947     0    82  0.055739     1
2   B vs. Z    81  0.248065     1     9  0.216285     0    13  0.128918     0    49  0.571096     0
3   B vs Bc    95  0.991130     1    80  0.346051     1    54  0.197197     1    30  0.928300     0
8   C vs. A    97  0.440715     0    88  0.986333     1    75  0.161888     0    42  0.831142     0

Intermediates:

s

0     (Z, A)
1    (Bc, A)
2     (Z, B)
3    (Bc, B)
4    (A, Bc)
5    (B, Bc)
6     (Z, A)
7     (Z, B)
8     (C, A)
9    (A, Bc)
dtype: object

~s.duplicated()

0     True
1     True
2     True
3     True
4    False
5    False
6    False
7    False
8     True
9    False
dtype: bool
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading