Using grep (or awk) to extract two consecutive words

I have a BAM file from which I need to extract two consecutive words: the one starting with "PL:Z…", followed by the one starting with "PR:Z…".

I started by trying with the first word, but had no luck:

samtools view -h file1.bam |  grep -o '\<PR[[:alnum:]]+\>'

Extracting columns with awk would have been easier, but the column numbers for PL and PR are not consistent across the lines of the file, so a fixed-position command like the one below is unreliable.

 awk -v OFS='\t' '{print $21, $22}' 

Test file with first 3 lines:

MN01111:72:000H3TTKV:1:13108:10015:1913 2689    SL3.0ch00       8990677 0       59H40M52H       SL3.0ch01       5122725 0    TTTTTTTTTTTTTTATTATTTTTTTTTATTTTTTTTTTTT AFFFF/FF/FFFF//FF////A/FFFF///F/FF////F/        NM:i:2  MD:Z:5A24A9     MC:Z:122M29H AS:i:30  XS:i:28 SA:Z:SL3.0ch09,55182541,-,78S31M42S,0,0;        XA:Z:SL3.0ch05,+4984944,78S33M40S,1;SL3.0ch09,-70510420,47S27M77S,0;SL3.0ch02,-52101716,44S37M70S,2;SL3.0ch08,+62573290,63S25M63S,0;  bl:Z:CGATGT     br:Z:TTTGTC     bm:Z:0  PL:Z:SL3.0ch01_5122724_5122846_FW     PR:Z:None       RG:Z:000H3TTKV_1_BSPT19472_0
MN01111:72:000H3TTKV:1:23103:5003:15527 641     SL3.0ch00       8990677 19      67S40M44S       SL3.0ch01       838549  0    CCGCTCCCCCGATCCCTTCCACCCGGTCCTTATTTTTTTTTTTTTTTTTTTTTTTTTTTATATTTTTTTTTTATTTTTTTTATTATTTTTTTTTATTTTTTTTTTTTTTTTTTTTTTTATTTTTTTTTTTTTTTCTTTTATATTTTTGCCC        ////////6=///=///////==////////F=//FA/F/F/6//6FFF/FFFF=/F//F///FF/FAFF///F//FFF/F/FF//FF/FAAAFAFFAFA///AFFFFFF/FFAF/A///6/F///F///6/F////FF////FF///FFF       NM:i:1  MD:Z:30A9       MC:Z:105S35M11S AS:i:35 XS:i:31 SA:Z:SL3.0ch02,46044972,+,28S31M92S,0,0;      XA:Z:SL3.0ch09,-70510416,35S31M85S,0;   bl:Z:ATCACG     br:Z:GTGCCT     bm:Z:0  PL:Z:SL3.0ch05_3501697_3501846_FW     PR:Z:None       RG:Z:000H3TTKV_1_Fimande_0
MN01111:72:000H3TTKV:1:23110:15540:17389        2689    SL3.0ch00       8990677 0       10H40M101H      SL3.0ch02       39003136      0       TTTTTTTTTTTTTTTTTATTTTTTTTTATTATTTTTTTTT        F==AFFFA6FAF//F////A/F/F=///////////A/FA        NM:i:2  MD:Z:5A8A25   MC:Z:151M       AS:i:30 XS:i:29 SA:Z:SL3.0ch03,30054271,+,44S32M75S,0,0;SL3.0ch12,17846152,-,40S30M81S,0,0;     bl:Z:ATCACG   br:Z:ACCATG     bm:Z:0  PL:Z:SL3.0ch02_39003135_39003329_FW     PR:Z:None       RG:Z:000H3TTKV_1_Martyvel_0

Expected output:

PL:Z:SL3.0ch01_5122724_5122846_FW     PR:Z:None
PL:Z:SL3.0ch05_3501697_3501846_FW     PR:Z:None
PL:Z:SL3.0ch02_39003135_39003329_FW     PR:Z:None

Solution:

You may use this awk command. It loops through all the fields of each line, tests each field against the regular expression ^P[LR]:Z:, appends every matching field to a variable, and prints that variable once the line has been scanned.

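awk -v OFS='\t' '
{
   s = ""                          # matched fields collected for this line
   for (i=1; i<=NF; ++i)
      if ($i ~ /^P[LR]:Z:/)        # field starts with PL:Z: or PR:Z:
         s = (s ? s OFS : "") $i   # join the matches with a tab
   print s
}' file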

Output:

PL:Z:SL3.0ch01_5122724_5122846_FW   PR:Z:None
PL:Z:SL3.0ch05_3501697_3501846_FW   PR:Z:None
PL:Z:SL3.0ch02_39003135_39003329_FW PR:Z:None
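
For completeness, here is a grep-only sketch. The grep attempt in the question cannot match for two reasons: without -E, the + is treated as a literal character rather than a quantifier, and [[:alnum:]] does not match the ":" that immediately follows PR. The version below assumes that PL and PR are always adjacent and in that order, as they are in the test lines. The two input records are shortened, made-up stand-ins for real alignment lines, not actual data.

```shell
# Two hypothetical, shortened records; PL/PR sit in different columns in each.
result=$(
  printf '%b' \
    'read1\tNM:i:2\tbm:Z:0\tPL:Z:SL3.0ch01_5122724_5122846_FW\tPR:Z:None\tRG:Z:a\n' \
    'read2\tNM:i:1\tXA:Z:x;\tbm:Z:0\tPL:Z:SL3.0ch05_3501697_3501846_FW\tPR:Z:None\tRG:Z:b\n' |
  # -E: extended regex, so + means "one or more"; -o: print only the matched text.
  # [^[:space:]]+ consumes a whole tag value; [[:space:]]+ spans the separator.
  grep -oE 'PL:Z:[^[:space:]]+[[:space:]]+PR:Z:[^[:space:]]+'
)
printf '%s\n' "$result"
```

On real data the same pattern would be fed from samtools, for example: samtools view file1.bam | grep -oE '…' with the pattern above. If the two tags can be separated by other fields or appear in the opposite order, the awk loop is the safer choice, since it does not depend on adjacency.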