The objectives of this lab experiment are to make you familiar with regular expressions and with the tools that use them: sed, vi, emacs, grep, and find.
A regular expression (also called a regex or regexp) is a pattern that describes the characteristics of the text it matches. This is handy for tasks like searching and replacing (e.g., replacing the misspelling "helo" with "hello") or running commands on multiple files.
This section supplements the textbook. Sobell, Appendix A: Regular Expressions, is required reading.
A regular expression Hello matches that exact
substring in a line such as Hello Bob. The individual
characters of Hello as a regex stand for themselves and match only
themselves. The so-called meta characters do not match themselves but
describe other matching requirements, such as sequences of one or more
characters. E.g., the asterisk is a meta character and, in file-name
patterns (FNRE, described below), matches a sequence of any number of
characters. There are several meta characters, and each has its own semantics.
The typical meta characters are: . ? * () [] {} + | ^ $
Every once in a while, we wish to use a meta character as a regular (i.e., as a non-meta) char; in such situations, a meta character needs to be "escaped" or "quoted" by preceding it with a backslash. Thus, \* will match exactly one asterisk.
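As a small illustration of escaping, the sketch below counts matching lines with grep in its default (basic) mode. The helper names and the three sample lines are made up for this example.

```shell
# Hypothetical helper: three sample lines, one of which contains a real asterisk.
sample_lines() { printf '%s\n' 'a*b' 'aab' 'ab'; }

# Unescaped: a* means "zero or more a", so every line that contains a b matches.
count_unescaped() { sample_lines | grep -c 'a*b'; }

# Escaped: \* is a literal asterisk, so only the line a*b matches.
count_escaped() { sample_lines | grep -c 'a\*b'; }

count_unescaped
count_escaped
```

The first count covers all three lines and the second covers only the line that really contains an asterisk.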
There are several different kinds of regex, with syntaxes that differ
by task, tool, and platform (Windows, Linux, etc.). We will be
discussing two major kinds: FNRE (File Name Regular Expressions) and
SMRE (String Matching Regular Expressions). The meta characters and
other details differ between these two. Even within SMRE, different
programs interpret regular expressions slightly differently. For
example, grep and sed have two modes: basic and extended (enabled
with grep -E or sed -E). In basic mode, as implemented by the GNU
versions of these tools, ?, +, and | are interpreted as meta characters
only when prefixed with a \ (the opposite of a backslash's normal
meaning of escape).
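The difference between the two modes can be seen with a short sketch. The helper names and the four sample lines are assumptions made for illustration; only grep and grep -E come from the text.

```shell
# Hypothetical sample input: the last line contains a literal plus sign.
mode_sample() { printf '%s\n' 'ab' 'aab' 'b' 'a+b'; }

# Basic mode: an unescaped + is an ordinary character, so only the line a+b matches.
count_basic() { mode_sample | grep -c 'a+b'; }

# Extended mode: + means "one or more", so ab and aab match, but a+b does not.
count_extended() { mode_sample | grep -E -c 'a+b'; }

count_basic
count_extended
```

The same pattern text therefore selects different lines depending on the mode in which grep reads it.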
File-name regexes (FNRE) are used by shells in the context of file names.
| Meta | Meaning | Example | Matches |
| * | Match zero or more characters | a*txt | "atxt", "aa.txt", "abtxt", "aba.txt", etc. |
| ? | Match exactly one character | ?.txt | "a.txt", "b.txt", etc. NOT "aa.txt" or "ab.txt". |
| . | not a meta char | . | matches itself |
| ! | history meta char | !c | a command in your history that begins with c. |
| $ | value of a var | $HOME | not a file name thing, but ... |
| # | comment line | # this is just a comment | |
| % | cut out matching tail | ${fnm%.mp3} | $fnm but without the .mp3 tail |
| # | cut out matching head | ${fnm#*.} | $fnm but without the head substring matching *. |
| ' | use as-is | '*' | the quote-stripped single token * as-is |
| " | expand but protect | ls -i "$fnm" | value of $fnm; useful when $fnm has e.g. blanks |
| ` | execute cmd enclosed | ls -l `cat list.txt` | and subst resulting stdout |
| () | grouping of cmds | (echo start; ls -lisa; echo done) | wc -l | |
| [] | as in string regex | [d-h] | any one of d,e,f,g,h |
| {} | enumeration | echo {hello,hi}there | |
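A few rows of the table can be tried out without touching any files. The sketch below is only illustrative: the helper names and sample file names are made up, and it uses the shell's case statement, which accepts the same *, ?, and [] pattern syntax that the shell applies to file names.

```shell
# Hypothetical helper: does the string $1 match the FNRE pattern $2?
# The pattern is expanded unquoted so that the shell treats it as a pattern.
matches_glob() {
  case $1 in
    $2) echo yes ;;
    *)  echo no ;;
  esac
}

# Hypothetical helpers for the % and # rows of the table.
strip_tail() { fnm=$1; printf '%s\n' "${fnm%.mp3}"; }   # drop a trailing .mp3
strip_head() { fnm=$1; printf '%s\n' "${fnm#*.}"; }     # drop the shortest head matching *.

matches_glob a.txt '?.txt'      # exactly one character before .txt
matches_glob aa.txt '?.txt'     # two characters, so no match
matches_glob aba.txt 'a*txt'
strip_tail song.mp3
strip_head archive.tar.gz
```

Note that ${fnm#*.} removes the shortest matching head, so archive.tar.gz becomes tar.gz rather than gz.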
The SMRE flavor is used by grep, emacs, sed, and many other
utilities to search for text. It is more complicated than the FNRE syntax. See
the manual pages grep(1), regexp(n), and perlre(1) for more detail.
The meta characters of SMRE are: . * + ? | () [] {} ^ $
The dot (also called period) will match any one character no matter what it is (numbers, letters, punctuation, etc.). Of course, it will match the dot itself.
The asterisk, plus, question mark, and vertical bar can appear only after a sub-regex. The asterisk specifies that the preceding sub-regex is matched any number of times, including not at all (i.e., zero times). E.g., z* matches sequences of z of any length, and .* matches any arbitrary sequence (i.e., any number of matches of the dot). The parentheses enclose a sub-regex; e.g., (xy)* matches all of xyxyxyxy but not all of xyaxy.
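The * and + are said to be "greedy" in that they will match maximally
(i.e., as much as possible); thus, .*e will match all of
abcdefgeeeeee. In Perl-style flavors (e.g., perl or grep -P), appending
a question mark (?) after the + or * will make it match minimally; the
basic and extended modes of grep and sed do not offer this.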
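Greedy matching can be observed with sed. Because the basic and extended modes of sed have no minimal quantifier, the second helper below uses a negated character class as a common workaround; that substitution is an assumption of this sketch, not a feature named in the text, and the helper names are made up.

```shell
# Greedy: .*e extends to the last e on the line.
greedy() { printf '%s\n' "$1" | sed 's/.*e/123/'; }

# Workaround for a minimal match: [^e]*e cannot run past the first e.
minimal() { printf '%s\n' "$1" | sed 's/[^e]*e/123/'; }

greedy abcdefgeeeeee     # prints 123
minimal abcdefgeeeeee    # prints 123fgeeeeee
```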
The meaning of other meta characters is summarized in the table below.
| Syntax | Meaning | Example | Matches |
| . | any character | . | any single character |
| ^ | Beginning of line | ^Unix | occurrences of "Unix" at the beginning of lines |
| $ | End of line | CEG$ | occurrences of "CEG" at the end of lines |
| [range] | Any character in the range | [d-m] | any letter from d through m, including d and m |
| [chars...] | Any one of the characters inside the square brackets | [aeiou] | any one of a e i o u |
| [^chars...] | Any char other than those inside the square brackets | [^aeiou] | any character other than a e i o u |
| Quantifiers | | | |
| * | Match 0 or more times | .* | Any number of repetitions of any character |
| + | Match 1 or more times | [ab]+ | Any string consisting of only a's and b's ("a", "b", "aab", "bbbab", and so forth). Note: must be prefixed with \ in basic regex mode. |
| | | (cat)+ | Examples of matches: cat, catcat, catcatcat |
| ? | Match 1 or 0 times | b? | Either "b" or the empty string, but not further repetitions. Note: must be prefixed with \ in basic regex mode. |
| | | Match either of the expressions joined | a|b | Either "a" or "b". |
| {n} | Match n times | a{5} | aaaaa |
| {m,n} | Match between m and n times | a{2,4} | aa, aaa, aaaa |
| Other | | | |
| \ (quoting) | The quoted character without special syntactic meaning | \* | a literal "*" (as opposed to the quantifier) |
| \ (escape sequence) | As in C/C++/Java. | \n | a newline |
When it must match an entire string, the regex (abc|[a-f][g-j][m-p]|[1-9])
matches abc, agm, cjo, 9, and 3, but not ag3, ab, ag, ABC, AGM, or 12.
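Claims like these can be checked with grep in extended mode. In the sketch below, the helper name full_match is made up, and the -x option is used so that the pattern must cover the whole line rather than any part of it.

```shell
# Hypothetical helper: does extended regex $1 match all of the string $2?
full_match() {
  if printf '%s\n' "$2" | grep -E -x -q -- "$1"; then
    echo yes
  else
    echo no
  fi
}

re='(abc|[a-f][g-j][m-p]|[1-9])'
full_match "$re" cjo          # c, j, o fall in the three ranges
full_match "$re" ag3          # 3 is not in [m-p]
full_match "$re" 12           # two digits, but [1-9] matches only one
full_match 'a{2,4}' aaaaa     # five a's exceed the upper bound
full_match '(cat)+' catcat
```

Without -x, grep reports a line as soon as any part of it matches, so ag3 and 12 would both be selected because each contains a digit.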
The notation s/PATTERN/REPLACEMENT/ generally stands
for "search for the regular expression PATTERN and replace the
left-most match with the string REPLACEMENT." A trailing g, as in
s/PATTERN/REPLACEMENT/g, replaces every match rather than only the
first. Below are a few examples.
| Input | Regular Expression | Output |
| abccba | s/a/x/ | xbccba |
| abccba | s/a/x/g | xbccbx |
| a19b20c3d4e5 | s/[0-9]+//g | abcde |
| duckduckduck | s/duck$/goose/ | duckduckgoose |
| water fire air | s/^water/snow/ | snow fire air |
| abcdefgeeeeee | s/[^e]+e/123/ | 123fgeeeeee |
| abcdefgeeeeee | s/.*e/123/ | 123 |
| abcdefgeeeeee | s/.*?e/123/ | 123fgeeeeee |
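Most rows of this table can be reproduced with sed in extended mode. The helper name subst below is made up for illustration. The row using .*? is left out because that minimal quantifier belongs to Perl-style flavors rather than to sed.

```shell
# Hypothetical helper: apply the sed command $2 to the single input line $1.
subst() { printf '%s\n' "$1" | sed -E "$2"; }

subst abccba 's/a/x/'                  # only the left-most a is replaced
subst abccba 's/a/x/g'                 # g replaces every a
subst a19b20c3d4e5 's/[0-9]+//g'       # delete every run of digits
subst duckduckduck 's/duck$/goose/'    # $ anchors the match at the end of the line
subst 'water fire air' 's/^water/snow/'
```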
In the replacement itself, regular expression syntax is not interpreted. All characters (except for \1, \2, \3, etc.; see below) stand only for themselves. A period means a period, nothing else.
Regular expressions often contain sub-regular-expressions;
e.g., a*b+c has a*, b+,
and c subexpressions. Sub expressions can be enclosed in
parentheses to make this structure more obvious;
thus (a*)(b+)(c) matches the same strings as the
previous a*b+c. This idea is used in a more advanced way
in what is known as capturing.
Consider (a*)(b+)(a*). This pattern matches, among many others, the following: aabbbaaaa. What if you wish to match a target string so that the first a* and the second a* are exactly the same? The expression (a*)(b+)\1 will do that, whereas (a*)(b+)(a*) matches possibly different runs of a. The backslash-one refers to whatever was matched by subexpression number 1.
Subexpressions are numbered by the position of their opening
parenthesis, counting from left to right, so an enclosing subexpression
gets a lower number than the ones nested inside it. In
(A[BC](dE))([fF][Gg]),
(A[BC](dE)) would be 1, (dE) would be 2,
and ([fF][Gg]) would be 3. Here are a few examples.
| Regular Expression | Matches | Does not match |
| (ABC)def\1 | ABCdefABC | ABCdefAbc, ABCdefabc |
| ([aA][bB][cC])def\1 | abcdefabc, AbcdefAbc, AbCdefAbC | abcdefABC, abcdefAbc |
| (.....) ham \1 | Hello ham Hello, 12345 ham 12345 | Hello ham H...o |
| ([a-z][A-Z])([1-9])\2\1 | aA11aA, aF99aF | aA11Aa, aA12aA |
As an example, s/([a-z]+) in the ([a-z]+)/\2 in the \1/
uses captures to convert "hat in the cat" to "cat in the hat".
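Captures can be tried directly with sed in extended mode. The helper names in this sketch are made up. The second helper prints what each numbered group captured, which is a convenient way to see how sed numbers nested groups.

```shell
# Swap the two captured words around "in the".
swap_words() {
  printf '%s\n' "$1" | sed -E 's/([a-z]+) in the ([a-z]+)/\2 in the \1/'
}

# Show what groups 1, 2, and 3 captured for a pattern with a nested group.
show_groups() {
  printf '%s\n' "$1" | sed -E 's/(A[BC](dE))([fF][Gg])/1=\1 2=\2 3=\3/'
}

swap_words 'hat in the cat'    # prints: cat in the hat
show_groups ABdEfG             # prints: 1=ABdE 2=dE 3=fG
```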
The find utility finds files that meet given criteria. It can
also run a command on each of these files. Because this section is part of a
regular expressions article, we do not cover the many searches based on
timestamps, sizes, etc. that find can do.
The basic syntax is
find PATH... EXPRESSION...
where PATH describes the directories to examine, and the EXPRESSION can be an option, a test, or an action.
| Test | Finds Files... |
| -name PATTERN | with basename matching PATTERN in FNRE syntax. |
| -path PATTERN | with full paths matching PATTERN in FNRE syntax. |
| -regex REGEX | with full paths matching REGEX in SMRE syntax. |
| -type TYPE | which are of TYPE. Types include "f" for files, "d" for directories, and "l" for links. |
| -lname PATTERN | which are symbolic links pointing at a file matching PATTERN in FNRE syntax. |
| -user NAME | owned by the user NAME. |
| -group NAME | owned by the group NAME. |
Some tests also have a case-insensitive version (such as
-iname, -iregex, -ipath, and -ilname).
Most tests expect FNRE patterns. If these
tests are given an expression in SMRE syntax, they will not find the expected
files. This is the difference between "-path" and
"-regex" above.
| Action | Effect |
| -exec COMMAND ; | Run COMMAND. Any tokens between -exec and the semicolon are its arguments. Wherever the string {} is present, the currently matching filename will be substituted. |
| -print | Print the current filename. This is the default action if none is specified. |
| -fprint FILE | Write the current filename to FILE. |
Make sure you use single quotes surrounding the patterns in the
following. We want the pattern evaluated by find not the
shell. Remember to properly quote and/or escape as needed to avoid
unexpected results from the shell.
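Example:
find . /tmp -user "$USER" -name '*tmp[0-9]*' -print -exec rm \{\} \;
This command accomplishes the following:
- Searches the current directory and /tmp, including their subdirectories.
- Selects files owned by the current user ("$USER" is evaluated by the shell, not find).
- Keeps only those whose names contain tmp followed by a digit. (-name takes an FNRE pattern, so an SMRE quantifier such as + would be taken literally.)
- Prints the name of each such file (-print).
- Removes each such file (-exec rm \{\} \; Note how we had to escape the {} and the semicolon).
Another way of writing the above example is:
find . /tmp -user "$USER" -name '*tmp[0-9]*' -print -exec rm '{}' ';'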
List of archive formats http://en.wikipedia.org/wiki/List_of_archive_formats Required Reading.
All work is expected to be carried out in the Operating Systems and Internet Security (OSIS) Lab, 429 Russ, but you are welcome to work elsewhere. Note that you may need both Linux and Windows, along with other software that is not always installed in other facilities.
Record your observations in a plain text file named myLabJournal.txt
using your own words and/or copying appropriate lines. You may use any
editor you wish to edit this file. The editing of other files, etc. mentioned in
this Lab requires the use of specific editors/tools. Note that the choice
of specific tools in Windows/Linux is mentioned so that you appreciate cross
platform usage. As always, producing the specific files matters less
than learning the specific tools and techniques.
Visit the web pages listed below.
It is OK if you do not care to study the content of the above. We will be saving the content of these pages in different ways as three different files, and use them as input files in our manipulations below.
ConscientiousSoftwareCC.pdf as a new file named ConscientiousSoftware0.txt.
jobs-061505.html and save the resulting file as a new file named jobs-061505.txt.
twoVoweledWordList.txt.
Make the above into a PS script procedure
called twoVowelSearch() and include it in
answers.txt
windowsTxtFiles.tar.bzip2, bzip2-compressed, of
all the above files. [One well-known open-source program that can
do this is 7z (visit
http://www.7-zip.org/).] Append to your journal a listing of
all the files in this tar-ball.
Using sed, make the following changes in the file
named ConscientiousSoftware0.txt and save the result as a new
file named ConscientiousSoftware1.txt.
rmHTML() and include it in answers.txt
<img ... > ).
Using find, list the details of all files
in your home directory (and its subdirectories) that are older than
3 days and larger than 999 bytes. Save this listing as a
file named oldFilesList.txt.
RegEx.zip of all the files and folders used/created/saved in the above steps.
ls -l listing of all the files used/created above.
turnin.
Turn in the tar-ball, the zip file, answers.txt,
myLabJournal.txt files and the usual
ReadMe.txt as explained in Expectations.
Ben Murray, Taylor Killian.