I want to find out if any of the value in columns mark1, mark2, mark3, mark4 and mark5 are the same, column-wise comparison from a dataframe below, and list result as True or False
import pandas as pd
df = pd.DataFrame(data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
columns=["mark1", "mark2", 'mark3', 'mark4', 'mark5'])
Ideal output:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
So I came up with a func using nested forloop to compare each value in a column, does not work.
AttributeError: ‘Series’ object has no attribute ‘columns’
What’s the correct way? Avoid nested forloop by all means.
def compare_col(df):
check = 0
for i in range(len(df.columns.tolist())+1):
for j in range(1, len(df.columns.tolist())+1):
if df.iloc[i, i] == df.iloc[j, i]:
check += 1
if check >= 1:
return True
else:
return False
df['result'] = df.apply(lambda x: compare_col(x[['mark1', 'mark2', 'mark3', 'mark4', 'mark5]]), axis=1)
>Solution :
The difference between the number of unique items of a series and its total size points to a presence of duplicated values.
df['result'] = df.apply(lambda x: x.unique().size != x.size, axis=1)
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True