I have a dataframe I built by scraping multiple excel files. one of the problems is that the header from the excel file (the column names) gets repeated for every file that gets concatenated into the dataframe.
I would like to drop the rows where the value is equal to the column names.
The line below gets me the correct boolean mask but I just can’t seem to find a clean way to filter the data frame with the below condition
(df.iloc[:,0:13] == df.columns[0:13])
I’ve solved the issue with the below lines but I was hoping there was a cleaner or more pythonic way to write it because it feels like I’m missing something.
df1 = df[(df[(df.iloc[:,:13] == df.columns[:13])]).notna()].iloc[:,:13].copy()
df = df.drop(df1[df1['Name'].notnull()].index)
See the below sample table:
sample = pd.DataFrame(data = [['name', 'description', 'qty', 'units', 'wall height', 'xyz'],
[17, 8, 34, 5, 41,'xyz'],
[23, 16, 25, 43, 48,'abc'],
[15, 44, 21, 9, 3,'cde'],
[14, 16, 40, 26, 6, 'efg'],
['name', 'description', 'qty', 'units', 'wall height','efg'],
],
columns=['name', 'description', 'qty', 'units', 'wall height','address'])
With the above sample I would be trying to filter out index locations 0 and 5 where columns 0:5 are equal to column name 0:5. Note that column 5(‘address’) with never match any values in the field.
Hopefully someone has a more pythonic way to solve me above problem.
TIA
>Solution :
IIUC, you can filter the dataframe using boolean indexing:
mask = (sample.iloc[:, 0:5] == sample.columns[0:5]).all(axis=1)
print(sample[~mask])
Prints:
name description qty units wall height address
1 17 8 34 5 41 xyz
2 23 16 25 43 48 abc
3 15 44 21 9 3 cde
4 14 16 40 26 6 efg