I have a table with many columns (fields). In the first field, I need to retain only unique values. In the subsequent columns, I need to count the original number of values present in the first column, but only if the value in a given column is > 0.
I’ve managed to partially accomplish this with awk, but my current attempt would require me to manually create an array for every column in the table and manually type each array for the print command. This isn’t really feasible.
Any help/suggestions (and explanation of how a potential solution works) would be greatly appreciated.
Here’s a subset of the INPUT TABLE (it has already been sorted on column 1):
ATP6 93.883156 55.84006
COX1 230.708456 63.109
COX2 179.993226 74.224269
COX3 169.945901 72.036519
CYTB 228.799722 87.575892
LOC111099029 0.926958 6.124982
LOC111099030 10.124096 5.024844
LOC111099031 0 0
LOC111099031 0 0
LOC111099031 2.279801 2.289838
LOC111099032 17.674714 12.796428
LOC111099033 5.259716 7.326938
LOC111099034 3.514635 2.858349
LOC111099035 0 0
LOC111099035 1.929607 4.409107
LOC111099036 0 0
LOC111099036 1.45196 7.58513
LOC111099037 21.520663 26.353308
LOC111099038 6.019084 5.311657
LOC111099039 12.858404 13.689644
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0.354202 0.265986
LOC111099040 0.587969 0
LOC111099040 2.620288 1.077892
LOC111099040 4.290659 3.487692
LOC111099040 6.42671 6.906503
LOC111099041 0 0
LOC111099041 3.892818 4.934959
LOC111099042 0 0
LOC111099042 13.86859 14.319505
LOC111099043 0 0
Here’s an example of the DESIRED OUTPUT:
LOC111099030 1 1
CYTB 1 1
LOC111099042 1 1
LOC111099037 1 1
LOC111099033 1 1
COX3 1 1
ATP6 1 1
LOC111099039 1 1
LOC111099036 1 1
LOC111099040 5 4
LOC111099035 1 1
LOC111099032 1 1
COX2 1 1
LOC111099038 1 1
LOC111099031 1 1
COX1 1 1
LOC111099029 1 1
LOC111099041 1 1
LOC111099034 1 1
Here’s the code I’ve run to obtain the output above:
awk '{if ($2 > 0) gene_name[$1]++}; {if ($3 > 0) col3_arr[$1]++}; END{ for (var in gene_name) print var, "\t", gene_name[var], col3_arr[var]}' input_file.txt
P.S. I’m also open to a solution in R, as this manipulation is part of a larger R Markdown notebook. I went the awk route because I’m not particularly well-versed with R.
>Solution :
In R, with dplyr:
library(dplyr)
desired_result = your_data %>%
group_by(name_of_first_column) %>%
summarize(across(everything(), ~sum(. > 0)))