Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?

Step 1:

for i in {001..999}; do
    [[ ! -f ${FILENAME}.${i}.xyz ]] && break
    cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
    mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done

Step 2:

for ((j=0; j<=${NUM_CONF}; j++)); do
    let "n = 2 + (${j} * ${LINES_PER_CONF})"
    let "m = ${j} + 1"
    ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
    sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done

I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE

Some details about the files:

FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Lines 3 to (number of atoms + 2): Molecular coordinates
Line (number of atoms + 3): same as line 1
Line (number of atoms + 4): Title 2
… continues on (where lines 1 through (number of atoms + 2) are associated with conformer 1, and so on)

The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.

The goal is to make each conformer's energy the title line of that conformer in the combined file (each energy must end up on the title line of its own conformer).

>Solution :

If you know that at least one matching file exists, you should be able to do this:

cat -- "${FILENAME}".[0-9][0-9][0-9].xyz > "${COMBINED_FILE}"

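Note that this will also match a 000 file, whereas your script counts from 001. If you know that 000 either doesn’t exist or is harmless if it does, the command above is all you need. The shell expands the pattern in sorted order, which for a zero-padded three-digit counter is also numeric order. Two differences from your loop: the glob does not stop at the first missing number, and > overwrites ${COMBINED_FILE} where your loop appended with >>.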
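If the 000 file or an unmatched pattern is a concern, a small wrapper can filter the glob before calling cat. This is only a sketch: the function name combine_xyz and its two arguments are made up for illustration, and it is written in plain POSIX sh so it also runs under bash.

```sh
# Hypothetical helper: concatenate PREFIX.001.xyz, PREFIX.002.xyz, ... into OUTFILE.
combine_xyz() {
    prefix=$1
    outfile=$2
    set --                                   # reuse "$@" as the list of files to join
    for f in "$prefix".[0-9][0-9][0-9].xyz; do
        [ -f "$f" ] || continue              # an unmatched glob stays literal: skip it
        [ "$f" = "$prefix.000.xyz" ] && continue   # the original loop starts at 001
        set -- "$@" "$f"
    done
    [ "$#" -gt 0 ] || return 1               # nothing matched: leave OUTFILE alone
    cat -- "$@" > "$outfile"                 # one cat process for all files
}
```

It would be called as combine_xyz "$FILENAME" "$COMBINED_FILE". Files such as FILENAME.something.xyz are ignored because the pattern only accepts three digits.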

However, moving these files to renamed names in another directory does require a loop, or one of the less-than-highly portable pattern-based renaming utilities.
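Such a rename loop could look like the following sketch. The function name move_xyz and its three arguments are hypothetical; it takes the three-digit index from each file name with parameter expansion, so no extra process is started apart from mv itself.

```sh
# Hypothetical helper: move PREFIX.NNN.xyz to DESTDIR/BASE_NNN.xyz.
move_xyz() {
    prefix=$1
    destdir=$2
    base=$3
    for f in "$prefix".[0-9][0-9][0-9].xyz; do
        [ -f "$f" ] || continue          # skip the literal pattern if nothing matched
        idx=${f%.xyz}                    # strip the extension: PREFIX.NNN
        idx=${idx##*.}                   # keep what follows the last dot: NNN
        mv -f -- "$f" "$destdir/${base}_${idx}.xyz"
    done
}
```

It would be called as move_xyz "$FILENAME" "$XYZDIR" "$JOB_BASENAME", after the files have been concatenated.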

If you could change your workflow so that the filenames are preserved, it could just be:

mv -- "${FILENAME}".[0-9][0-9][0-9].xyz "${XYZDIR}/${JOB_BASENAME}"

where we now have a directory named after the job basename, rather than a path component fragment.

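The Step 2 processing should be doable entirely in Awk, rather than a shell loop. As written, every iteration starts one awk and one sed process, and each sed -i rewrites the whole ${COMBINED_FILE}, so the cost grows quickly with the number of conformers. Instead, you can read $ACTFILE into an associative array indexed by line number, which gives you random access to the energies, and then fix up the title lines of ${COMBINED_FILE} in a single pass.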
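One way this might look is sketched below. It assumes that every conformer block starts with its atom count, followed by one title line and exactly that many coordinate lines, and that line k of the energy file belongs to conformer k. The function name retitle_xyz is made up for illustration. The result is written to standard output rather than edited in place.

```sh
# Hypothetical helper: print COMBINED with each title line replaced by the
# energy (column 2) from the matching line of ACTFILE.
retitle_xyz() {
    actfile=$1
    combined=$2
    awk '
        # Energy file: remember column 2 of every line, keyed by line number.
        FILENAME == ARGV[1] { energy[FNR] = $2; next }
        # Combined file: "left" counts the coordinate lines still to copy.
        left > 0 { left--; print; next }
        # Anything else is an atom-count line that opens a new conformer.
        {
            print                              # atom count, unchanged
            natoms = $1
            if ((getline title) <= 0) exit     # truncated input: no title line
            conf++
            if (conf in energy) print energy[conf]
            else print title                   # no energy known: keep the old title
            left = natoms
        }
    ' "$actfile" "$combined"
}
```

A possible use is retitle_xyz "$ACTFILE" "$COMBINED_FILE" > tmpfile && mv tmpfile "$COMBINED_FILE", where tmpfile is a placeholder name. Because the atom count is read from each block, NUM_CONF and LINES_PER_CONF are not needed.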

Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:

awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz

for instance just before concatenating them and moving them away. Then you don’t have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which holds the record number within the current file; condition/action pairs can test it (FNR == 1) to tell when processing has moved on to the next file.
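As a small illustration of the FNR technique, the sketch below reports the number of lines in each input file using only portable Awk. The function name lines_per_file is hypothetical, and the sketch assumes that no input file is empty, since an empty file never produces a record with FNR == 1.

```sh
# Hypothetical helper: print "file: line count" for every file given as argument.
lines_per_file() {
    awk '
        # First record of a later file: report the file that just ended.
        FNR == 1 && NR > 1 { print prev ": " n }
        # Track the current file name and how many records it has so far.
        { prev = FILENAME; n = FNR }
        # The last file has no successor, so report it here.
        END { if (NR > 0) print prev ": " n }
    ' "$@"
}
```

It could be called as lines_per_file "$FILENAME".[0-9][0-9][0-9].xyz. The same FNR == 1 test could reset per-file state in a program that rewrites title lines.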

GNU Awk also has the extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the records and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
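For comparison, here is a sketch of a per-file line count written with these extensions. It needs GNU Awk installed as gawk; the function name lines_per_file_gawk is hypothetical.

```sh
# Hypothetical helper (GNU Awk only): print "file: line count" for every argument.
lines_per_file_gawk() {
    gawk '
        BEGINFILE { n = 0 }                 # clear the accumulator for each file
        { n++ }                             # count the records of the current file
        ENDFILE   { print FILENAME ": " n } # report as soon as the file ends
    ' "$@"
}
```

No FNR == 1 test and no END action are needed, and an empty file is still reported, with a count of 0.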
