Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Apply different array to each column and print results from each array programmatically?

I have a table with many columns (fields). In the first field, I need to retain only unique values. In the subsequent columns, I need to count the original number of values present in the first column, but only if the value in a given column is > 0.

I’ve managed to partially accomplish this with awk, but my current attempt would require me to manually create an array for every column in the table and manually type each array for the print command. This isn’t really feasible.

Any help/suggestions (and explanation of how a potential solution works) would be greatly appreciated.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Here’s a subset of the INPUT TABLE (it has already been sorted on column 1):

ATP6          93.883156   55.84006
COX1          230.708456  63.109
COX2          179.993226  74.224269
COX3          169.945901  72.036519
CYTB          228.799722  87.575892
LOC111099029  0.926958    6.124982
LOC111099030  10.124096   5.024844
LOC111099031  0           0
LOC111099031  0           0
LOC111099031  2.279801    2.289838
LOC111099032  17.674714   12.796428
LOC111099033  5.259716    7.326938
LOC111099034  3.514635    2.858349
LOC111099035  0           0
LOC111099035  1.929607    4.409107
LOC111099036  0           0
LOC111099036  1.45196     7.58513
LOC111099037  21.520663   26.353308
LOC111099038  6.019084    5.311657
LOC111099039  12.858404   13.689644
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0.354202    0.265986
LOC111099040  0.587969    0
LOC111099040  2.620288    1.077892
LOC111099040  4.290659    3.487692
LOC111099040  6.42671     6.906503
LOC111099041  0           0
LOC111099041  3.892818    4.934959
LOC111099042  0           0
LOC111099042  13.86859    14.319505
LOC111099043  0           0

Here’s an example of the DESIRED OUTPUT:

LOC111099030  1  1
CYTB          1  1
LOC111099042  1  1
LOC111099037  1  1
LOC111099033  1  1
COX3          1  1
ATP6          1  1
LOC111099039  1  1
LOC111099036  1  1
LOC111099040  5  4
LOC111099035  1  1
LOC111099032  1  1
COX2          1  1
LOC111099038  1  1
LOC111099031  1  1
COX1          1  1
LOC111099029  1  1
LOC111099041  1  1
LOC111099034  1  1

Here’s the code I’ve run to obtain the output above:

awk '{if ($2 > 0) gene_name[$1]++}; {if ($3 > 0) col3_arr[$1]++}; END{ for (var in gene_name) print var, "\t", gene_name[var], col3_arr[var]}' input_file.txt

P.S. I’m also open to a solution in R, as this manipulation is part of a larger R Markdown notebook. I went the awk route because I’m not particularly well-versed with R.

>Solution :

In R, with dplyr:

library(dplyr)
desired_result = your_data %>%
  group_by(name_of_first_column) %>%
  summarize(across(everything(), ~sum(. > 0)))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading