How to merge the following 2 files using python or bash?

Advertisements

The first file (tab-delimited) is

a       1       2       3
b       6       7       8
c       9       0       9

The second file (tab-delimited) is

a_rna_1 r       r       x
a_rna_2 q       q       d
a_rna_3 c       c       c
b_rna_1 e       e       e
c_rna_1 a       a       g

How can we merge these 2 files so that we get the following file?

a       1       2       3      a_rna_1 r       r       x
a       1       2       3      a_rna_2 q       q       d
a       1       2       3      a_rna_3 c       c       c
b       6       7       8      b_rna_1 e       e       e
c       9       0       9      c_rna_1 a       a       g

Is there a way that we can do this automatically?

>Solution :

Normally, the join command would work, but you don’t have fields that exactly match; a in one file matches something starting with a_rna_... in the other file. So probably the best thing to do is break off the part of the second key that matches the first and prepend it as its own field to the second file, then join. So
something like this:

join <(sort file1.tsv) <(sed 's/^\([^_]*\)_/\1\t&/' file2.tsv | sort)

Explanation:

The <() syntax is process substitution. You can put any command that generates output inside the parentheses, and pass the whole thing to any command that expects the name of a file to read; it will read the output of the command instead. This lets us do some prep work on the files before feeding them to join, without having to create temporary files to hold the prepared versions.

Both files have to be sorted by their keys for join to be able to match them up. You can always sort the final result differently; they just have to be consistent with each other for the join operation. The first file is otherwise fine as-is, so the sort is the only thing we do in the process sub.

For the second file, we use sed to read off the part of the first column before the underscore and prepend it as a new tab-delimited column at the front of the line.

Here’s a breakdown of the sed s/ regex / replacement /
expression:

v-- at the start of the line
  v----------v remember what matches

^ \( [^_] * \) _

          ^--- any number of repetitions of
     ^--^ any character that's not an underscore
               ^--- followed by an underscore

And replace it with:

v--- just the stuff matched by the bit within \(...\)

   v--- followed by a tab 

\1 \t &

      ^--- followed by whatever matched the whole thing 
       

So after the sed, this line:

a_rna_1 r       r       x       

Turns into this:

a       a_rna_1 r       r       x

We pipe the output of sed to sort, and now we’re ready to join.

The join combines the fields of the two files wherever the join columns (the first one, by default) match up. And it only prints one copy of the join column, so the output looks just like you want it to:

a       1       2       3       a_rna_1 r       r       x
a       1       2       3       a_rna_2 q       q       d
a       1       2       3       a_rna_3 c       c       c
b       6       7       8       b_rna_1 e       e       e
c       9       0       9       c_rna_1 a       a       g

Leave a ReplyCancel reply