CEG 333: Introduction to Unix
Prabhaker Mateti
Bash: Scripting Example
The purpose of this example script is to scan a directory for .doc files, and convert them to plain text. In reality, the binary structure of a Word document is complex and non-linear, but for demonstration purposes it suffices to simply extract ASCII strings in order. Redundant conversion should not be done if the .doc file has not been changed since it was last converted.
The converted file should have the same filename as the original, but with the extension changed to ".txt" (i.e., text from "document.doc" goes in "document.txt").
How to generate this new filename? ${file} means
the value of the variable named "file", and
${file%doc} means that value with the string
"doc" removed from the end, if it's there. So
${file%doc}txt strips the trailing "doc" and appends
"txt" after the dot that remains, turning the ".doc"
extension into ".txt".
(See the notes on Bash variable manipulation.)
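The expansion described above can be tried on its own. This is a minimal sketch; the filenames are illustrative and not taken from the course materials.

```shell
#!/bin/sh
# Illustrative filename, chosen only for this demonstration.
file="document.doc"

# ${file%doc} strips a trailing "doc" (the dot stays), then "txt" is appended.
txtfile="${file%doc}txt"
echo "$txtfile"      # document.txt

# If the value does not end in "doc", % removes nothing.
other="notes.pdf"
unchanged="${other%doc}txt"
echo "$unchanged"    # notes.pdftxt
```

The second case shows why the loop only feeds names matching *.doc into this expansion: any other name would simply have "txt" glued onto it.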
In English, the script should do the following:
#!/bin/sh: Tell the kernel this is a shell script
for each file matching the FNRE *.doc, set a variable to its filename and do:
txtfile=the filename with a .txt extension
if the file is readable and is newer than $txtfile, then
Put a header in txtfile saying where the text came from
Use strings(1) to dump the text into $txtfile
fi
done
And the script:
#!/bin/bash #1
for file in *.doc; do #2
txtfile=${file%doc}txt #3
if [ -r $file -a $file -nt $txtfile ]; then #4
echo Text found in $file > $txtfile #5
strings $file >> $txtfile #6
fi
done
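The test in line 4 and the two redirections in lines 5 and 6 can be tried in isolation. The sketch below is only an illustration: it works in a scratch directory, uses made-up file names, and sets modification times with touch -t so the outcome does not depend on timing. Note that -nt is not part of POSIX test, although bash and most sh implementations support it. In bash it is also true when the second file does not exist, which is what lets a document be converted the first time.

```shell
#!/bin/sh
# Scratch directory so no real files are touched (names are illustrative).
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Give the two files known modification times with touch -t.
touch -t 202001010000 report.txt    # older conversion
touch -t 202101010000 report.doc    # document edited later

if [ -r report.doc -a report.doc -nt report.txt ]; then
    stale=yes     # the .doc is newer, so it would be converted again
else
    stale=no
fi
echo "needs conversion: $stale"

# > truncates report.txt and writes the header; >> appends below it.
echo "Text found in report.doc" > report.txt
echo "converted text would go here" >> report.txt
lines=$(wc -l < report.txt)

# report.txt has just been rewritten, so the .doc is no longer newer.
if [ report.doc -nt report.txt ]; then again=yes; else again=no; fi
echo "needs conversion again: $again"

# Clean up the scratch directory.
cd / && rm -rf "$dir"
```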
Notes about this code:
-a is "and"; nothing else is necessary.
> is used to redirect output and overwrite the text file's contents with the header. To avoid overwriting that header, the converted text in line 6 is appended with >>.
Version 1 of this script accomplishes a lot in nine lines, but a few things could be improved: it should also search subdirectories, cope with spaces in filenames, and record when the document was last updated.
The script:
#!/bin/sh
for file in `find . -name '*.doc'`; do #1
txtfile="${file%doc}txt" #2
txtfile=`echo "$txtfile" | sed 's/ \+/_/g'` #3
if [ -r "$file" -a "$file" -nt $txtfile ]; then #4
echo Text found in "$file" > $txtfile #5
echo Last update: `ls -l "$file" | cut -d' ' -f 7-10` >> $txtfile #6
strings "$file" >> $txtfile #7
fi
done
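One caveat about line 1 of this script: the backquoted output of find is split on whitespace, so a path that contains a space reaches the loop as several separate words. The sketch below shows the effect and one possible alternative that reads the find output line by line. The scratch directory and file names are made up for the demonstration, and the alternative still assumes that no path contains a newline.

```shell
#!/bin/sh
# Scratch tree with one path that contains spaces (names are illustrative).
dir=$(mktemp -d) || exit 1
mkdir "$dir/sub dir"
: > "$dir/plain.doc"
: > "$dir/sub dir/two words.doc"

# Backquoted find: the output is split on whitespace before the loop sees it.
broken=0
for file in `find "$dir" -name '*.doc'`; do
    broken=$((broken + 1))
done
echo "for loop saw $broken words"

# Reading the list line by line keeps each path in one piece.
# A temporary file is used instead of a pipe so that the counter
# is updated in this shell rather than in a subshell.
safe=0
find "$dir" -name '*.doc' > "$dir/list"
while IFS= read -r file; do
    safe=$((safe + 1))
done < "$dir/list"
echo "while loop saw $safe paths"

rm -rf "$dir"
```

With two .doc files in the tree, the for loop iterates four times while the while loop iterates twice.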
Changed lines:
The output of find is substituted into the for command line just as the expansion of plain *.doc was.
\+ matches one or more spaces. The difference between * and + is essential here. * also matches 0 repetitions of a space: the empty string! This causes the replacement to be inserted between every pair of characters.
-d specifies the delimiter (a space).
-f selects the 7th through 10th fields, the date part of ls -l.
Version 2 is more useful, but it would be even better if the user could specify arbitrary directories and individual files. Version 3 accepts directory and file names as parameters.
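The difference between the two sed patterns mentioned in the notes on version 2 is easy to check from the command line. This is a small sketch with made-up input. It spells "one or more spaces" as two spaces followed by *, which any POSIX sed accepts, since \+ is a GNU extension.

```shell
#!/bin/sh
name="my  old report.txt"   # illustrative name containing a double space

# "  *" (space, space, star) means one or more spaces; GNU sed also
# accepts the \+ spelling used in the script.
good=$(printf '%s\n' "$name" | sed 's/  */_/g')
echo "$good"    # my_old_report.txt

# " *" also matches the empty string, so an underscore is inserted
# at every position, even where there is no space at all.
bad=$(printf '%s\n' "abc" | sed 's/ */_/g')
echo "$bad"     # _a_b_c_
```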
For each parameter, if it is a directory, the script should search
it as it previously did the current directory, converting each
file. However, for files it should only invoke the conversion code
inside the for loop. To prevent code duplication, the new script
will use two procedures, convert_dir and
convert_file.
#!/bin/sh
convert_dir() { #1
for file in `find "$1" -name '*.doc'`; do #2
convert_file "$file" #3
done #4
} #5
convert_file() { #6
file="$1" #7
txtfile="${file%doc}txt" #8
txtfile=`echo "$txtfile" | sed 's/ \+/_/g'` #9
if [ -r "$file" -a "$file" -nt $txtfile ]; then #10
echo Text found in "$file" > $txtfile #11
echo Last update: `ls -l "$file" | cut -d' ' -f 7-10` >> $txtfile #12
strings "$file" >> $txtfile #13
fi #14
} #15
if [ $# -eq 0 ]; then #16
convert_dir . #17
else #18
for i in "$@"; do #19
if [ -d "$i" ]; then #20
convert_dir "$i" #21
elif [ -r "$i" ]; then #22
convert_file "$i" #23
else #24
echo "There was a problem reading $i" #25
fi #26
done #27
fi #28
Notes about this code:
$1 is always the first parameter, $2 is the second, and so on. Inside a function, they refer to the function's own arguments.
If the number of parameters ($#) is 0, there are no user-supplied paths to search.
"$@" expands to all the parameters as separate tokens.
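The effect of quoting "$@" can be seen with a short experiment. In this sketch, count_args is a hypothetical helper written for the demonstration, and set -- stands in for a real command line.

```shell
#!/bin/sh
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() {
    echo "$#"
}

# Simulate a command line with two paths, one containing a space.
set -- "My Documents" notes.doc

quoted=$(count_args "$@")    # each parameter stays one token
unquoted=$(count_args $@)    # word splitting breaks "My Documents" in two

echo "first parameter: $1"
echo "quoted passes $quoted arguments, unquoted passes $unquoted"
```

Here the quoted form passes 2 arguments and the unquoted form passes 3, which is why the loop in line 19 iterates over "$@" rather than a bare $@.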