I have a matrix of survey responses, which people answer Yes or No to certain questions. The data looks something like this.
MC#1 MC#2 MC#3 MC#4
0 0 0 0 0
1 0 1 1 0
2 0 0 0 0
3 0 1 1 0
4 0 1 0 1
... ... ... ... ...
41826 0 1 1 0
41827 0 1 1 0
41828 0 1 1 0
41829 0 1 1 0
41830 0 1 1 0
Some question choices, however, are duplicated. For example, the question being asked could be do you like xyz, and the choices are 1 to 5. But, the choices are not unique, MC#2 is anyone who answered 4 or 5, MC#3 is 4 and MC#5 is 5. So in essence, MC#2 = MC#3 OR MC#4.
How would I go about eliminating MC#2, without knowing that it’s duplicated information? I have tried to check the linearly independent columns using _, inds = sympy.Matrix(test).rref(), which at least tells me how many linearly independent columns are there and how many needs to be removed. But I am not sure how to detect the column(s) that can be summed from other columns.
>Solution :
Since you mention
MC#2 is anyone who answered 4 or 5,
it seems to me that that specific column is a logical-OR combination of the other two columns. So you need to work with logical or, |. In NumPy or Panda terms, it would would be the case that
df['MC#2'] == df['MC#3'] | df['MC#4']
yields True for every row. So you could use
(df['MC#2'] == df['MC#3'] | df['MC#4']).all()
to test if column 2 is a logical-OR of columns 3 and 4.
I will point out that this can also happen by circumstance; it’s an equivalency, not a cause.
To find all combinations of two (unique) columns that match another column through a logical-OR combination, you can iterate over the combinations of columns, and test for the above condition:
import itertools as it
for col1, col2, col3 in it.combinations(df.columns, 3):
s = df[col2] | df[col3]
if (df[col1] == s).all():
print(f"Column {col1} appears to be a logical-or combination of {col2} and {col3}")
There may be a nicer way in Pandas, NumPy or another package, but unless you have a huge dataframe, my bet is that this is clear and fast enough as a workable solution.
Again, with the note about the equivalency not being a cause in mind: do check after the fact that the results match with the survey and the way the results were entered.
Addendum: the above solution likely won’t hold once you get up to large number of columns (say, towards a 100 columns): there are just too many combinations, and processing takes ages.
If the two columns that combine are always next to each other, and that combined column is listed before these two columns, and the column naming scheme is a logical sequence (as in the example), the following will be a lot faster for large number of columns:
import itertools as it
for i in range(len(df.columns)-2):
# Check three consecutive columns
col1 = f"MC#{i+1}"
col2 = f"MC#{i+2}"
col3 = f"MC#{i+3}"
s = df[col2] | df[col3]
if (df[col1] == s).all():
print(f"Column {col1} appears to be a logical-or combination of {col2} and {col3}")
A variation (combination) of the above two solutions may work for cases where combined columns could span a slightly larger range (but not the entire dataframe), by running it.combinations over a sliding subset of the columns using a nested loop.