
Remove column(s) with overrepresented categorical values

I have a dataset like the one below:

data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4","id5",  "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

data

  Col1 Col2 Col3 Col4 Col5 Col6
1  id1    A   BK   CA   Ao   Bc
2  id2   Bc   AB   XB   Bu   Bc
3  id3    A  BsC   CA   Ai   Bc
4  id4   As   BX   SC  Ayy   Bc
5  id5   As   BK   CA   Ao   Bc
6  id6   Bs  AsB   CA  Byu   Bc
7  id7    A   BC   CA  Aiy   Be
8  id8    A   BX   SC   Ay   Bd

If any single category in a column is over-represented, that column needs to be dropped. For example, with a threshold of 0.74 (74%), Col6 is removed because category Bc is over-represented (6/8 = 75%). The filtered data should look like this:

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay
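
The 75% figure for Col6 can be checked by hand. The snippet below is a minimal sketch that only reproduces the arithmetic from the example; the variable names `col6` and `shares` are made up for illustration.

```r
# Values of Col6 from the example data
col6 <- c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")

# table() counts each category, prop.table() turns counts into shares
shares <- prop.table(table(col6))
shares

# Share of the most frequent category (Bc appears 6 times out of 8)
max(shares)  # 0.75
```

Since 0.75 is not below 0.74, Col6 fails the check and is dropped.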

With a threshold of 60% instead, both Col4 and Col6 are removed, because category CA in Col4 is over-represented (5/8 = 62.5%) and so is Bc in Col6 (6/8 = 75%). The filtered data should look like this:

  Col1 Col2 Col3 Col5
1  id1    A   BK   Ao
2  id2   Bc   AB   Bu
3  id3    A  BsC   Ai
4  id4   As   BX  Ayy
5  id5   As   BK   Ao
6  id6   Bs  AsB  Byu
7  id7    A   BC  Aiy
8  id8    A   BX   Ay

>Solution :

Loop through the columns, get the relative frequency of each category, and keep a column only if its largest share is smaller than the threshold:

# x is the threshold; a column is kept when its most frequent value has a share below x
x <- 0.74
data[ sapply(data, function(i) max(prop.table(table(i)))) < x ]
#   Col1 Col2 Col3 Col4 Col5
# 1  id1    A   BK   CA   Ao
# 2  id2   Bc   AB   XB   Bu
# 3  id3    A  BsC   CA   Ai
# 4  id4   As   BX   SC  Ayy
# 5  id5   As   BK   CA   Ao
# 6  id6   Bs  AsB   CA  Byu
# 7  id7    A   BC   CA  Aiy
# 8  id8    A   BX   SC   Ay
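
The same idea can be wrapped in a small function so the threshold is easy to vary. This is a sketch rather than part of the original answer: the function name `drop_overrepresented` and its arguments are hypothetical, and it assumes every column has at least one non-missing value (`table()` ignores `NA` by default).

```r
# Example data from the question
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Hypothetical helper: drop columns whose most frequent value
# has a share at or above `threshold`
drop_overrepresented <- function(df, threshold) {
  # Largest category share per column, as a named numeric vector
  top_share <- vapply(df, function(col) max(prop.table(table(col))), numeric(1))
  # drop = FALSE keeps a data frame even if only one column survives
  df[, top_share < threshold, drop = FALSE]
}

res74 <- drop_overrepresented(data, 0.74)
names(res74)  # "Col1" "Col2" "Col3" "Col4" "Col5"

res60 <- drop_overrepresented(data, 0.60)
names(res60)  # "Col1" "Col2" "Col3" "Col5"
```

`vapply()` is used here instead of `sapply()` only so that the result is guaranteed to be one number per column; the filtering logic is the same as in the one-liner above.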