The objectives of this lab experiment are to make you familiar with regular expressions as used in sed, vi, and emacs, and with grep and find.
[This article is a supplement to the text book.]
A regular expression (also called a regex or regexp) is a pattern that describes the characteristics of the chunks of text it matches. This is handy for tasks like searching and replacing (e.g., to replace the misspelling "helo" with "hello") or running commands on multiple files.
For details supporting the material of this article, see man regex.
A regular expression Helo matches that exact substring in a
line such as Helo Bob. The individual characters of
Helo as a regex stand for themselves and match only themselves. The
so-called meta characters do not match themselves but describe other matching
requirements, such as sequences of one or more characters. E.g., the
asterisk is a meta character and matches (in FNRE) a sequence of any number of
characters. There are several meta characters, and each has its own
semantics.
The typical meta characters are: . ? * () [] {} + | ^ $
Unfortunately, there are several different kinds of regex, with different syntaxes depending on the task. We will be discussing two kinds: FNRE (File Name Regular Expressions) and SMRE (String Matching Regular Expressions). The meta characters and other details differ between the two.
Every once in a while, we wish to use a meta character as a regular (i.e., as a non-meta) char; in such situations, a meta character needs to be "escaped" or "quoted" by preceding it with a backslash. Thus, \* will match exactly one asterisk.
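The effect of escaping can be sketched with Python's re module, used here only as a stand-in for an SMRE engine (an assumption; the article itself works with grep, sed, and find). The helper re.escape is Python-specific and simply puts a backslash in front of every meta character.

```python
import re

# Unescaped, the dot is a meta character and matches any one character.
assert re.search(r"3.5", "3x5") is not None

# Escaped with a backslash, it matches only a literal dot.
assert re.search(r"3\.5", "3x5") is None
assert re.search(r"3\.5", "3.5") is not None

# \* matches exactly one literal asterisk, so three are found here.
stars = re.findall(r"\*", "a*b**c")
assert stars == ["*", "*", "*"]

# re.escape quotes every meta character in a string.
escaped = re.escape("a*b")
assert re.fullmatch(escaped, "a*b") is not None
assert re.fullmatch(escaped, "aab") is None
```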
FNRE patterns are expanded to a list of matching file names. They are used by shells such as bash.
The meta characters of FNRE are: ? * () [] {} $
The dot (.) is not a meta character in FNRE. We describe FNRE further in a later section.
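A small sketch of how the FNRE wildcards * and ? behave, using Python's fnmatch module as an approximation of shell file name matching (an assumption: fnmatch follows the same wildcard rules but does not reproduce every shell detail, such as the treatment of hidden files). The file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only for illustration.
names = ["atxt", "aa.txt", "abtxt", "a.txt", "b.txt", "ab.txt", "notes.md"]

# "*" matches zero or more characters; the dot is an ordinary character.
star_matches = [n for n in names if fnmatchcase(n, "a*txt")]

# "?" matches exactly one character.
query_matches = [n for n in names if fnmatchcase(n, "?.txt")]

print(star_matches)   # ['atxt', 'aa.txt', 'abtxt', 'a.txt', 'ab.txt']
print(query_matches)  # ['a.txt', 'b.txt']
```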
| Wildcard | Meaning | Example | Matches |
| * | Zero or more characters | a*txt | "atxt", "aa.txt", "abtxt", "aba.txt", etc. |
| ? | Exactly one character | ?.txt | "a.txt", "b.txt", etc. NOT "aa.txt" or "ab.txt". |
The SMRE flavor is used by grep, emacs, sed, and many other utilities to search for text. It is more complicated than the FNRE syntax. See the man pages grep(1), regexp(n), and perlre(1) for more detail.
The meta characters of SMRE are: . * + ? | () [] {} ^ $
The dot (also called period) will match any one character no matter what it is (numbers, letters, punctuation, etc.). Of course, it will match the dot itself.
The asterisk, plus, query, and vertical bar can appear only following a sub-regex. The asterisk specifies that the preceding sub-regex may be matched any number of times, including not at all (i.e., zero times). E.g., z* matches sequences of z of any length, (xy)* matches xyxyxyxy but not xyaxy, and .* matches any arbitrary sequence (i.e., any number of matches of the dot). The meanings of the other meta characters are summarized in the table below.
| Syntax | Meaning | Example | Matches |
| Quantifiers | | | |
| * | Match 0 or more times | .* | Any number of repetitions of any character |
| + | Match 1 or more times | [ab]+ | Any nonempty string consisting of only a's and b's ("a", "b", "aab", "bbbab", and so forth). Note: must be prefixed with \ in basic regex mode. |
| | | (cat)+ | cat, catcat, catcatcat |
| ? | Match 1 or 0 times | b? | Either "b" or the empty string, but not further repetitions such as "bb". Note: must be prefixed with \ in basic regex mode. |
| | (vertical bar) | Match either of the expressions joined | a|b | Either "a" or "b". |
| {n} | Match n times | a{5} | aaaaa |
| {m,n} | Match between m and n times | a{2,4} | aa, aaa, aaaa |
| Other | | | |
| . | Any character | . | Any single character |
| ^ | Beginning of line | ^Unix | Occurrences of "Unix" at the beginning of lines |
| $ | End of line | CEG$ | Occurrences of "CEG" at the end of lines |
| [range] | Any character in the range | [d-m] | Any letter from d through m, including d and m |
| [chars...] | Any one of the characters inside the square brackets | [aeiou] | Any one of a e i o u |
| [^chars...] | Any character other than those inside the square brackets | [^aeiou] | Any character other than a e i o u |
| \ (quoting) | The quoted character without special syntactic meaning | \* | A literal "*" (as opposed to the quantifier) |
| \ (escape sequence) | As in C/C++/Java. | \n | A newline |
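The quantifiers, anchors, and character classes can be tried out with a short sketch in Python's re module, which is assumed here as a convenient SMRE-like engine (its syntax corresponds to the extended mode, so +, ?, and | need no backslash). The helper name matches and the sample lines are hypothetical.

```python
import re

def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Quantifiers
assert matches(r"[ab]+", "bbbab")
assert not matches(r"[ab]+", "")         # + needs at least one match
assert matches(r"(cat)+", "catcatcat")
assert matches(r"b?", "") and matches(r"b?", "b")
assert not matches(r"b?", "bb")          # ? allows at most one match
assert matches(r"a|b", "a") and matches(r"a|b", "b")
assert matches(r"a{5}", "aaaaa")
assert matches(r"a{2,4}", "aaa") and not matches(r"a{2,4}", "a")

# Character classes
assert matches(r"[d-m]", "d") and not matches(r"[d-m]", "n")
assert matches(r"[^aeiou]", "z") and not matches(r"[^aeiou]", "e")

# Anchors, applied line by line as grep would do
lines = ["Unix is old", "I like Unix", "course CEG"]
starts_with_unix = [ln for ln in lines if re.search(r"^Unix", ln)]
ends_with_ceg = [ln for ln in lines if re.search(r"CEG$", ln)]
print(starts_with_unix)  # ['Unix is old']
print(ends_with_ceg)     # ['course CEG']
```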
Note: Different programs interpret regular expressions slightly differently. For example, grep and sed have two modes: basic and extended (enabled with grep -E or sed -E). In basic mode, ?, +, and | are interpreted as meta characters only when prefixed with a \ (the opposite of the backslash's normal role as an escape).
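s/PATTERN/REPLACEMENT/ stands for "Search for the regular expression PATTERN and replace it with the string REPLACEMENT." A trailing g, as in s/PATTERN/REPLACEMENT/g, replaces every match on the line rather than only the first. Below are a few examples.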
| Input | Regular Expression | Output |
| abccba | s/a/x/ | xbccba |
| abccba | s/a/x/g | xbccbx |
| a19b20c3d4e5 | s/[0-9]+//g | abcde |
| duckduckduck | s/duck$/goose/ | duckduckgoose |
| water fire air | s/^water/snow/ | snow fire air |
| abcdefgeeeeee | s/[^e]+e/123/ | 123fgeeeeee |
| abcdefgeeeeee | s/.*e/123/ | 123 |
| abcdefgeeeeee | s/.*?e/123/ | 123fgeeeeee |
The * and + are said to be "greedy" in that they match maximally (i.e., as much as possible); thus, .*e will match all of abcdefgeeeeee. In Perl-style regex flavors, appending a question mark (?) after the + or * makes it match minimally. POSIX basic and extended regexes, as used by sed, do not support this minimal form.
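Greedy and minimal matching can be compared side by side. This sketch assumes Python's re module, which follows the Perl-style syntax for the minimal form.

```python
import re

text = "abcdefgeeeeee"

# Greedy: .* takes as much as possible before the final e.
greedy = re.search(r".*e", text).group()

# Minimal: .*? takes as little as possible before the first e.
minimal = re.search(r".*?e", text).group()

print(greedy)   # abcdefgeeeeee
print(minimal)  # abcde
```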
In the replacement itself, you may not use regular expressions at all. All characters (except for references such as \1, \2, \3, etc.) stand only for themselves. A period means a period, nothing else.
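The substitutions above can be reproduced with a short sketch. It assumes Python's re.sub as a stand-in for the s/PATTERN/REPLACEMENT/ command; the wrapper name substitute and its replace_all parameter are hypothetical and exist only to mimic the optional trailing g.

```python
import re

def substitute(pattern, replacement, text, replace_all=False):
    """Mimic s/pattern/replacement/ (first match only) or the /g form (all matches)."""
    # In re.sub, count=0 means "replace every match".
    return re.sub(pattern, replacement, text, count=0 if replace_all else 1)

assert substitute(r"a", "x", "abccba") == "xbccba"
assert substitute(r"a", "x", "abccba", replace_all=True) == "xbccbx"
assert substitute(r"[0-9]+", "", "a19b20c3d4e5", replace_all=True) == "abcde"
assert substitute(r"duck$", "goose", "duckduckduck") == "duckduckgoose"
assert substitute(r"^water", "snow", "water fire air") == "snow fire air"

# In the replacement, a period is just a period.
dotted = substitute(r"[0-9]+", ".", "a19b")
print(dotted)  # a.b
```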
Regular expressions often contain sub-regular-expressions; e.g., a*b+c has
a*, b+, and c subexpressions. Sub expressions can be enclosed in
parentheses to make this structure more obvious; thus (a*)(b+)(c) matches the
same strings as the previous. This idea is used in a more advanced
way in what is known as capturing.
Consider (a*)(b+)(a*). This pattern matches, among many others, the following: aabbbaaaa. What if you wish to match a target string so that the first a* and the second a* are exactly the same? The expression (a*)(b+)\1 will do that. The backslash-one refers to whatever was matched by the sub expression numbered 1. Sub expressions are numbered by the position of their opening parenthesis, counting from the left; a nested sub expression is therefore numbered after the one that encloses it. In the following,
(A[BC](dE))([fF][Gg])
(A[BC](dE)) would be 1, (dE) would be 2, and ([fF][Gg]) would be 3. Here are a few examples.
| Regular Expression | Matches | Does not match |
| (ABC)def\1 | ABCdefABC | ABCdef, ABCdefabc, ABC, def |
| ([aA][bB][cC])def\1 | abcdefabc, AbcdefAbc, AbCdefAbC | abcdefABC, abcdefAbc |
| (.....) ham \1 | Hello ham Hello, 12345 ham 12345 | Hello ham H...o |
| ([a-z][A-Z])([1-9])\2\1 | aA11aA, aF99aF | aA11Aa, aA12aA |
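Captured text can also be referenced in the replacement string of a substitution:
| Input | Regular Expression | Output |
| hat in the cat | s/([a-z]+) in the ([a-z]+)/\2 in the \1/ | cat in the hat |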
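Capturing can be explored with a brief sketch, again assuming Python's re module as the regex engine. It prints the groups captured by the nested example, checks a few back references, and performs the swap from the last row.

```python
import re

# Groups are numbered by the position of their opening parenthesis.
m = re.fullmatch(r"(A[BC](dE))([fF][Gg])", "ABdEfG")
groups = m.groups()

# \1 must repeat exactly what group 1 matched.
assert re.fullmatch(r"(a*)(b+)\1", "aabbbaa")
assert not re.fullmatch(r"(a*)(b+)\1", "aabbbaaaa")
assert re.fullmatch(r"([a-z][A-Z])([1-9])\2\1", "aF99aF")
assert not re.fullmatch(r"([a-z][A-Z])([1-9])\2\1", "aA12aA")

# Back references also work in the replacement string.
swapped = re.sub(r"([a-z]+) in the ([a-z]+)", r"\2 in the \1", "hat in the cat")

print(groups)   # ('ABdE', 'dE', 'fG')
print(swapped)  # cat in the hat
```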
The find utility finds files that meet given criteria. It can also run a command on each of these files. Because this section is part of an article on regular expressions, it does not cover the many searches based on timestamps, sizes, etc. that find can also do.
The basic syntax is
find PATH... EXPRESSION...
where PATH describes the directories to examine, and EXPRESSION can be an option, a test, or an action.
| Test | Finds Files... |
| -name PATTERN | with basename matching PATTERN in FNRE syntax. |
| -path PATTERN | with full paths matching PATTERN in FNRE syntax. |
| -regex REGEX | with full paths matching REGEX in SMRE syntax. |
| -type TYPE | which are of TYPE. Types include "f" for files, "d" for directories, and "l" for links. |
| -lname PATTERN | which are symbolic links pointing at a file matching PATTERN in FNRE syntax. |
| -user NAME | owned by the user NAME. |
| -group NAME | owned by the group NAME. |
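The difference between the FNRE-based -path test and the SMRE-based -regex test can be approximated in a small sketch. It assumes Python's fnmatch and re modules as stand-ins for the two syntaxes; find itself is not invoked, and the sample path is hypothetical.

```python
import re
from fnmatch import fnmatchcase

path = "./docs/notes.txt"  # hypothetical path, for illustration only

# FNRE pattern, as -path expects: "*" means any run of characters.
fnre_ok = fnmatchcase(path, "*.txt")

# SMRE pattern, as -regex expects: ".*" means any run of characters.
smre_ok = re.fullmatch(r".*\.txt", path) is not None

# The same SMRE pattern read as FNRE: "." and "\" become literal characters,
# so the path would have to start with a dot and end with a backslash plus ".txt".
smre_as_fnre = fnmatchcase(path, r".*\.txt")

print(fnre_ok, smre_ok, smre_as_fnre)  # True True False
```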
Some tests also have a case-insensitive version (such as -iname, -iregex, -ipath, and -ilname). Most tests expect FNRE patterns. If these tests are given an expression in SMRE syntax, they will not find the expected files. This is the difference between "-path" and "-regex" above.
| Action | Effect |
| -exec COMMAND ; | Run COMMAND. Any tokens between -exec and the semicolon are its arguments. Wherever the string {} is present, the currently matching filename will be substituted. |
| -print | Print the current filename. This is the default action if none is specified. |
| -fprint FILE | Write the current filename to FILE. |
Make sure you put single quotes around the patterns in the following. We want the pattern evaluated by find, not the shell. Remember to properly quote and/or escape as needed to avoid unexpected results from the shell.
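Example:
find . /tmp -user "$USER" -name '*tmp[0-9]' -print -exec rm \{\} \;
This command accomplishes the following:
searches the current directory (.) and /tmp
for files owned by the current user ("$USER" is evaluated by the shell, not find)
whose names end in tmp followed by a digit
and prints the name of each such file and then deletes it (-exec rm \{\} \; Note how we had to escape the {} and the semicolon)
Because -name takes an FNRE pattern, the SMRE quantifier + cannot be used here: in '*tmp[0-9]+' the + would stand for a literal plus sign at the end of the name.
Another way of writing the above example is:
find . /tmp -user "$USER" -name '*tmp[0-9]' -print -exec rm '{}' ';'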
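How a -name pattern selects files can be rehearsed safely, without deleting anything, in a throwaway directory. This is only a rough sketch: it assumes Python's fnmatch as an approximation of the FNRE matching that find performs, and the function find_by_name and the file names are hypothetical.

```python
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path

def find_by_name(root, pattern):
    """Rough stand-in for: find ROOT -name PATTERN -print (nothing is deleted)."""
    return sorted(p.name for p in Path(root).rglob("*") if fnmatchcase(p.name, pattern))

with tempfile.TemporaryDirectory() as root:
    # Create a few empty files to search through.
    for name in ["atmp1", "btmp22", "ctmp3+", "tmpfile"]:
        Path(root, name).touch()
    one_digit = find_by_name(root, "*tmp[0-9]")
    with_plus = find_by_name(root, "*tmp[0-9]+")

print(one_digit)  # ['atmp1']
print(with_plus)  # ['ctmp3+']
```

In an FNRE pattern the + is an ordinary character, which is why the second pattern selects only the name that literally ends in a plus sign.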
All work is expected to be carried out in the Operating Systems and Internet Security (OSIS) Lab, 429 Russ, but you are welcome to work elsewhere. Note that the lab may require both Linux and Windows, as well as other software that is not always installed in other facilities.
Record your observations in a plain text file named myLabJournal.txt
using your own words and/or copying appropriate lines. You may
use any editor you wish to edit this file. The editing of other files
mentioned in this Lab requires the use of specific editors.
Visit the web pages listed below.
It is OK if you do not care to study the content of the above. We will be saving the content of these pages in different ways as three different files, and use them as input files in our manipulations below.
[...] ConscientiousSoftwareCC.pdf as a new file named ConscientiousSoftware0.txt.
[...] jobs-061505.html and save the resulting file as a new file named jobs-061505.txt.
[...] sed, make the following changes in the file named ConscientiousSoftware0.txt and save the result as a new file named ConscientiousSoftware1.txt.
[...] vi and save just the abstract as a new file named abstract.txt. Make sure that in this file:
[...] RegEx.zip of all the files and folders used/created/saved in the above steps.
[...] RegEx.tar.bz of the same files and folders that you zipped in the above step.
[...] ls -l listing of all the files used/created above.
[...] turnin. Turn in the *.zip, *.tar.bz, myLabJournal.txt files and the usual ReadMe.txt as explained in Expectations.
Ben Murray, Taylor Killian.
| Copyright © 2009 Prabhaker Mateti | last edited: June 24, 2009 |