How to remove similar strings as if they were duplicates from a dataframe?

November 25, 2022

I have the following df :

df=pd.DataFrame({
    'Q0_0': ["A vs. Z", "A vs. Bc", "B vs. Z", "B vs Bc", "Bc vs. A", "Bc vs. B", "Z vs. A", "Z vs. B", "C vs. A", "Bc vs. A"],
    'Q1_1': [np.random.randint(1,100) for i in range(10)],
    'Q1_2': np.random.random(10),
    'Q1_3': np.random.randint(2, size=10),
    'Q2_1': [np.random.randint(1,100) for i in range(10)],
    'Q2_2': np.random.random(10),
    'Q2_3': np.random.randint(2, size=10),
    'Q3_1': [np.random.randint(1,100) for i in range(10)],
    'Q3_2': np.random.random(10),
    'Q3_3': np.random.randint(2, size=10),
    'Q4_1': [np.random.randint(1,100) for i in range(10)],
    'Q4_2': np.random.random(10),
    'Q4_3': np.random.randint(2, size=10)
})

It has the following display:

Q0_0        Q1_1    Q1_2    Q1_3    Q2_1    Q2_2    Q2_3    Q3_1    Q3_2    Q3_3    Q4_1    Q4_2    Q4_3
0   A vs. Z     76  0.475198    0   31  0.785794    0   93  0.713219    0   31  0.549401    0
1   A vs. Bc    36  0.441907    0   28  0.008276    1   79  0.132327    0   61  0.657476    1
2   B vs. Z     68  0.474950    0   49  0.401341    1   1   0.409924    0   13  0.471476    0
3   B vs Bc     74  0.462356    0   42  0.762348    0   16  0.337623    1   76  0.548017    1
4   Bc vs. A    63  0.738769    1   34  0.340055    1   74  0.488053    1   84  0.663768    1
5   Bc vs. B    18  0.384001    1   75  0.188500    1   72  0.464784    1   32  0.355016    1
6   Z vs. A     34  0.700306    1   92  0.348228    1   99  0.347391    0   13  0.810568    0
7   Z vs. B     84  0.262367    0   11  0.217050    0   77  0.144048    0   44  0.262738    0
8   C vs. A     90  0.846719    1   53  0.603059    1   53  0.212426    1   86  0.515018    1
9   Bc vs. A    11  0.492974    0   76  0.351270    0   5   0.297710    1   40  0.185969    1

I want a rule allowing me to consider Z vs. A as duplicate of A vs. Z and so on for each b vs. a as a diplicate of a vs. b in column Q0_0.

Then proceed with removing those considered as duplicates.

Expected output is :

Q0_0          Q1_1  Q1_2    Q1_3    Q2_1    Q2_2    Q2_3    Q3_1    Q3_2    Q3_3    Q4_1    Q4_2    Q4_3
0   A vs. Z   76    0.475198    0   31  0.785794    0   93  0.713219    0   31  0.549401    0
1   A vs. Bc  36    0.441907    0   28  0.008276    1   79  0.132327    0   61  0.657476    1
2   B vs. Z   68    0.474950    0   49  0.401341    1   1   0.409924    0   13  0.471476    0
3   B vs Bc   74    0.462356    0   42  0.762348    0   16  0.337623    1   76  0.548017    1
8   C vs. A   90    0.846719    1   53  0.603059    1   53  0.212426    1   86  0.515018    1

There is a way to do that in my pandas dataframe ?

Any help from your side will be highly appreciated, thanks.

>Solution :

You can use str.extract (or str.split) to get the left/right parts around vs., then convert to frozenset and use duplicated for boolean indexing:

s = df['Q0_0'].str.extract('(\w+)\s*vs\.?\s*(\w+)').agg(frozenset, axis=1)
# or
# s = df['Q0_0'].str.split(r'\s*vs\.?\s*', expand=True).agg(frozenset, axis=1)

out = df[~s.duplicated()]

Output:

       Q0_0  Q1_1      Q1_2  Q1_3  Q2_1      Q2_2  Q2_3  Q3_1      Q3_2  Q3_3  Q4_1      Q4_2  Q4_3
0   A vs. Z    88  0.664299     0    99  0.102871     0    55  0.905342     0    55  0.789227     1
1  A vs. Bc    71  0.577607     0    99  0.784006     1    39  0.698947     0    82  0.055739     1
2   B vs. Z    81  0.248065     1     9  0.216285     0    13  0.128918     0    49  0.571096     0
3   B vs Bc    95  0.991130     1    80  0.346051     1    54  0.197197     1    30  0.928300     0
8   C vs. A    97  0.440715     0    88  0.986333     1    75  0.161888     0    42  0.831142     0

Intermediates:

s

0     (Z, A)
1    (Bc, A)
2     (Z, B)
3    (Bc, B)
4    (A, Bc)
5    (B, Bc)
6     (Z, A)
7     (Z, B)
8     (C, A)
9    (A, Bc)
dtype: object

~s.duplicated()

0     True
1     True
2     True
3     True
4    False
5    False
6    False
7    False
8     True
9    False
dtype: bool