I am trying to train Tesseract. The process involves creating triples of files: box files, text files and image (tif) files.
The tool that creates the .box files sometimes creates empty files. Those empty files cause problems for the engine. So, I want to delete the empty box files as well as their partners.
The whole pattern looks like the following:
- File1.box
- File1.gt.txt
- File1.tif
- File2.box
- File2.gt.txt
- File2.tif
File2.box is an empty file (zero size). I want to find and delete it along with its partner files, File2.gt.txt and File2.tif.
Is this doable?
>Solution :
Try this simple script. It uses the find command to locate all empty .box files (-type f -name "*.box" -size 0) and deletes them with the -delete action. For each file it deleted, find then runs rm through -exec to remove the corresponding .gt.txt and .tif files:
#!/bin/bash
#specifying the directory where the files are located
directory="/path/to/files"
#changing to the specified directory
cd "$directory" || exit
#find and delete empty .box files along with their partners
find . -type f -name "*.box" -size 0 -delete -exec sh -c 'rm -f "${1%.box}.gt.txt" "${1%.box}.tif"' sh {} \;
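If you would like to preview what will be removed before deleting anything, the same idea can be wrapped in a small function. This is only a sketch: the function name remove_empty_box_triples and its --dry-run argument are made up for this example and are not part of find or Tesseract. It assumes the partner files sit next to the .box file and share its base name.

```shell
#!/bin/bash
# Hypothetical helper (name and --dry-run argument are assumptions for this sketch).
# Usage: remove_empty_box_triples DIRECTORY [--dry-run]
remove_empty_box_triples() {
    # Find empty .box files under $1 and hand them in batches to a small sh script.
    # The first argument passed to that script is the mode, the rest are file names.
    find "$1" -type f -name '*.box' -size 0 -exec sh -c '
        mode=$1
        shift
        for box in "$@"; do
            base=${box%.box}   # strip the .box suffix to get the shared base name
            if [ "$mode" = "--dry-run" ]; then
                # Only report what would be removed
                printf "would remove: %s\n" "$box" "$base.gt.txt" "$base.tif"
            else
                # Remove the empty .box file and its two partners
                rm -f -- "$box" "$base.gt.txt" "$base.tif"
            fi
        done
    ' sh "${2:-delete}" {} +
}
```

Call it as remove_empty_box_triples "/path/to/files" --dry-run first to check the list, then run it again without --dry-run to actually delete the files. Non-empty .box files and their partners are left untouched.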