CEG 233: Linux and Windows 

Lab on Regular Expressions and File Manipulations

   

Table of Contents

  1. Educational Objectives
  2. Regular Expressions
  3. Find
  4. Lab Experiment
  5. Acknowledgements
  6. References

Educational Objectives

The objectives of this lab experiment are to have you:

  1. Search/replace based on regular expressions in file content and file names
  2. Learn the usage of
    1. three standard editors: sed, vi, and emacs
    2. a string pattern search program: grep
    3. a file locating program: find
    4. two distinct tools of compression and archive creation: zip and tar

[This article is a supplement to the textbook.]

Regular Expressions

A regular expression (also called a regex or regexp) is a pattern which describes the characteristics of a chunk of text that it matches. This is handy for tasks like searching and replacing (e.g., to fix a spelling mistake by replacing "helo" with "hello") or running commands on multiple files.

For details supporting the material of this article, run man regex.

Meta Characters

A regular expression Helo matches that exact substring in a line such as  Helo Bob.  The individual characters of Helo as a regex stand for themselves and match only themselves.  The so-called meta characters do not match themselves but describe other matching requirements, such as sequences of one or more characters.  E.g., the asterisk is a meta character and matches (in FNRE) a sequence of any number of characters.  There are several meta characters, and each has its own semantics. 

The typical meta characters are: . ? * () [] {} + | ^ $

Unfortunately, there are several different kinds of regex with different syntaxes depending on the task.  We will be discussing two kinds: FNRE (File Name Regular Expressions) and SMRE (String Matching Regular Expressions).  The meta characters and other details differ in these.

Every once in a while, we wish to use a meta character as a regular (i.e., as a non-meta) char;  in such situations, a meta character needs to be "escaped" or "quoted" by preceding it with a backslash.  Thus, \* will match exactly one asterisk.
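As a small illustration of quoting, the sketch below contrasts an asterisk used as a quantifier with an escaped asterisk. It assumes grep is installed; the three sample lines are made up for the example.

```shell
# Sample input: three lines, only the first contains a literal asterisk.
sample='a*b
ab
aab'

# Unescaped: * is a quantifier, so a*b means "zero or more a's, then b".
# Every sample line contains such a substring.
quantified=$(printf '%s\n' "$sample" | grep -c 'a*b')

# Escaped: \* stands for exactly one literal asterisk.
literal=$(printf '%s\n' "$sample" | grep -c 'a\*b')

printf 'lines matching %s: %s\n' 'a*b' "$quantified"
printf 'lines matching %s: %s\n' 'a\*b' "$literal"
```

The single quotes around each pattern keep the shell from interpreting the asterisk or the backslash before grep sees them.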

FNRE (File Name Regular Expressions)

These regexps are expanded to a list of files. These are used by shells such as bash.

The meta characters of FNRE are: ? * () [] {} $

The dot (.) is not a meta character in FNRE.  We describe FNRE further in a later section.

Wildcard  Meaning                  Example  Matches
*         Zero or more characters  a*txt    "atxt", "aa.txt", "abtxt", "aba.txt", etc.
?         Exactly one character    ?.txt    "a.txt", "b.txt", etc., but NOT "aa.txt" or "ab.txt"
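The two wildcards can be tried without creating any files, because the shell's case statement uses the same pattern syntax. The following is a minimal sketch; glob_match is a helper name made up for this example.

```shell
# glob_match PATTERN NAME: succeed if NAME matches the shell pattern PATTERN.
# (glob_match is a hypothetical helper, not a standard command.)
glob_match() {
    # $1 is left unquoted on purpose so that case treats it as a pattern.
    case "$2" in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

for name in atxt aa.txt aba.txt a.txt ab.txt; do
    if glob_match 'a*txt' "$name"; then star=yes; else star=no; fi
    if glob_match '?.txt' "$name"; then query=yes; else query=no; fi
    printf '%-8s a*txt: %-3s  ?.txt: %s\n' "$name" "$star" "$query"
done
```

Every name in the loop matches a*txt, while only a.txt matches ?.txt, since ? stands for exactly one character.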

SMRE (String Matching Regular Expressions)

This regex flavor is used by grep, emacs, sed, and many other utilities to search for text. It is more complicated than the FNRE syntax. See the man pages grep(1), regexp(n), and perlre(1) for more detail.

The meta characters of SMRE are: . * + ? | () [] {} ^ $

The dot (also called period) will match any one character no matter what it is (numbers, letters, punctuation, etc.).  Of course, it will match the dot itself.

The asterisk, plus, query, and vertical bar can appear only following a sub-regex.  The asterisk specifies that the preceding sub-regex may match any number of times, including not at all (i.e., zero times).  E.g., z* matches sequences of z of any length, (xy)* matches xyxyxyxy but not xyaxy, and .* matches any arbitrary sequence (i.e., any number of matches of the dot).  The meaning of the other meta characters is summarized in the table below.

Syntax       Meaning                                  Example  Matches

Quantifiers
*            Match 0 or more times                    .*       Any number of repetitions of any character
+            Match 1 or more times                    [ab]+    Any non-empty string consisting of only a's and b's ("a", "b", "aab", "bbbab", and so forth). Note: must be prefixed with \ in basic regex mode.
                                                      (cat)+   cat, catcat, catcatcat
?            Match 1 or 0 times                       b?       Either "b" or nothing at all, but not "bb" or further repetitions. Note: must be prefixed with \ in basic regex mode.
|            Match either of the expressions joined   a|b      Either "a" or "b"
{n}          Match n times                            a{5}     aaaaa
{m,n}        Match between m and n times              a{2,4}   aa, aaa, aaaa

Other
.            Any character                            .        Any single character
^            Beginning of line                        ^Unix    Occurrences of "Unix" at the beginning of lines
$            End of line                              CEG$     Occurrences of "CEG" at the end of lines
[range]      Any character in the range               [d-m]    Any letter between d and m, including d and m
[chars...]   Any one of the characters inside the     [aeiou]  Any one of a e i o u
             square brackets
[^chars...]  Any char other than those inside the     [^aeiou] Any character other than a e i o u
             square brackets
\            Quoting: the quoted character without    \*       A literal "*" (as opposed to the quantifier)
             special syntactic meaning
\            Escape sequence, as in C/C++/Java        \n       A newline

The regex (abc|[a-f][g-j][m-p]|[1-9]), when required to match an entire string, matches abc, agm, cjo, 9, 3 but not ag3, ab, ag, ABC, AGM, 12.
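The claim above can be checked with grep, as in the sketch below. It assumes grep supports -E, -x, and -q; whole_match is a helper name made up for this example. The -x option matters: without it grep looks for the pattern anywhere in the line, so ag3 and 12 would be reported as matching because each contains a single digit.

```shell
regex='(abc|[a-f][g-j][m-p]|[1-9])'

# whole_match STRING: succeed if the entire STRING matches $regex.
# grep -x requires the match to cover the whole line; LC_ALL=C keeps ranges
# such as [a-f] limited to lowercase ASCII letters.
# (whole_match is a hypothetical helper, not a standard command.)
whole_match() {
    printf '%s\n' "$1" | LC_ALL=C grep -E -x -q "$regex"
}

for s in abc agm cjo 9 3 ag3 ab ag ABC AGM 12; do
    if whole_match "$s"; then verdict='matches'; else verdict='does not match'; fi
    printf '%-4s %s\n' "$s" "$verdict"
done
```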

Note: Different programs interpret regular expressions slightly differently. For example, grep and sed have two modes: basic and extended (enabled with grep -E or sed -E). In basic mode, ?, +, and | are interpreted as meta characters only when prefixed with a \ (the opposite of a backslash's normal meaning of escape).
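The difference between the two modes can be seen with a single input line. This is a sketch assuming grep is installed; the sample word caaat is made up. The last command relies on the backslash form of + in basic mode, which GNU grep accepts as an extension, so its result may differ on other systems.

```shell
line='caaat'

# Basic mode: + is an ordinary character, so ca+t looks for a literal "+".
# (|| true keeps the pipeline's "no match" exit status from aborting a script.)
basic=$(printf '%s\n' "$line" | grep -c 'ca+t' || true)

# Extended mode: + means "one or more of the preceding item".
extended=$(printf '%s\n' "$line" | grep -E -c 'ca+t' || true)

# Basic mode with a backslash: \+ acts as the quantifier (GNU extension).
escaped=$(printf '%s\n' "$line" | grep -c 'ca\+t' || true)

printf 'basic: %s  extended: %s  basic with backslash: %s\n' \
    "$basic" "$extended" "$escaped"
```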

Replacements

The notation
s/PATTERN/REPLACEMENT/

stands for "Search for the regular expression PATTERN and replace it with the string REPLACEMENT."  Below are a few examples.

Input           Regular Expression  Output
abccba          s/a/x/              xbccba
abccba          s/a/x/g             xbccbx
a19b20c3d4e5    s/[0-9]+//g         abcde
duckduckduck    s/duck$/goose/      duckduckgoose
water fire air  s/^water/snow/      snow fire air
abcdefgeeeeee   s/[^e]+e/123/       123fgeeeeee
abcdefgeeeeee   s/.*e/123/          123
abcdefgeeeeee   s/.*?e/123/         123fgeeeeee

The * and + are said to be "greedy" in that they will match maximally (i.e., as much as possible); thus, .*e will match all of abcdefgeeeeee.  Appending a question mark (?) after the + or * will make it match minimally.  This minimal form comes from Perl-style regular expressions (e.g., perl, or grep -P where available); the POSIX regular expressions used by sed and grep -E do not support it.

In the replacement itself, you may not use regular expressions at all. All characters, except for back-references such as \1, \2, and \3, stand only for themselves. A period means a period, nothing else.
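Most of the substitutions in the table can be reproduced with sed in extended mode, as sketched below. It assumes a sed that accepts -E; subst is a helper name made up for this example. The .*? substitution is left out because sed does not implement minimal matching.

```shell
# subst INPUT COMMAND: apply one sed substitution in extended mode.
# (subst is a hypothetical helper, not a standard command.)
subst() {
    printf '%s\n' "$1" | sed -E "$2"
}

subst 'abccba'         's/a/x/'          # first match only
subst 'abccba'         's/a/x/g'         # g: every match on the line
subst 'a19b20c3d4e5'   's/[0-9]+//g'     # delete every run of digits
subst 'duckduckduck'   's/duck$/goose/'  # $ anchors at the end of the line
subst 'water fire air' 's/^water/snow/'  # ^ anchors at the beginning
subst 'abcdefgeeeeee'  's/[^e]+e/123/'   # stops at the first e
subst 'abcdefgeeeeee'  's/.*e/123/'      # greedy: consumes the whole line
```

Each call prints one line, in the same order as the Output column of the table.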

Capturing

Regular expressions often contain sub-regular-expressions; e.g., a*b+c has a*, b+, and c subexpressions.  Sub expressions can be enclosed in parentheses to make this structure more obvious; thus (a*)(b+)(c) matches the same strings as the previous.   This idea is used in a more advanced way in what is known as  capturing.

Consider (a*)(b+)(a*).  This pattern matches, among many others, the following: aabbbaaaa.  What if you wish to match a target string so that the first a* and the second a* are exactly the same?  The expression (a*)(b+)\1 will do that.  The backslash-one refers to whatever was matched by the sub expression numbered 1.  Sub expressions are numbered from left to right in the order in which their opening parentheses appear, so an enclosing sub expression gets a lower number than the ones nested inside it. In the following,

(A[BC](dE))([fF][Gg])

(A[BC](dE)) would be 1, (dE) would be 2, and ([fF][Gg]) would be 3.   Here are a few examples.

Regular Expression       Matches                          Does not match
(ABC)def\1               ABCdefABC                        ABCdef, ABCdefabc, ABC, def
([aA][bB][cC])def\1      abcdefabc, AbcdefAbc, AbCdefAbC  abcdefABC, abcdefAbc
(.....) ham \1           Hello ham Hello, 12345 ham 12345 Hello ham H...o
([a-z][A-Z])([1-9])\2\1  aA11aA, aF99aF                   aA11Aa, aA12aA

The following uses captures to convert "hat in the cat" to "cat in the hat".

Input           Regular Expression                        Output
hat in the cat  s/([a-z]+) in the ([a-z]+)/\2 in the \1/  cat in the hat
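Both uses of captures can be tried from the command line. The sketch below assumes sed and grep accept -E; same_ends is a helper name made up for this example. Back-references inside a grep -E pattern are supported by GNU grep but are not guaranteed by POSIX for extended mode, so treat that part as an assumption about the local grep.

```shell
# Swap the two captured words: \1 is the first group, \2 the second.
swapped=$(printf 'hat in the cat\n' |
    sed -E 's/([a-z]+) in the ([a-z]+)/\2 in the \1/')
printf '%s\n' "$swapped"

# same_ends STRING: succeed if the whole STRING is some a's, one or more b's,
# and then exactly the same a's again.
# (same_ends is a hypothetical helper, not a standard command.)
same_ends() {
    printf '%s\n' "$1" | grep -E -x -q '(a*)(b+)\1'
}

for s in aabbaa bb aabbbaaaa; do
    if same_ends "$s"; then verdict='matches'; else verdict='does not match'; fi
    printf '%-10s %s\n' "$s" "$verdict"
done
```

The string aabbbaaaa is rejected here because the a's after the b's differ in number from the a's before them, even though it matches (a*)(b+)(a*).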


Find

The find utility finds files that meet given criteria. It can also run a command on each of these files. Because this section is part of an article on regular expressions, we do not cover the many other searches that find can do, such as those based on timestamps or sizes.

The basic syntax is

find PATH... EXPRESSION...

where PATH describes the directories to examine, and the EXPRESSION can be an option, a test, or an action.

Test            Finds files...
-name PATTERN   with basename matching PATTERN in FNRE syntax.
-path PATTERN   with full paths matching PATTERN in FNRE syntax.
-regex REGEX    with full paths matching REGEX in SMRE syntax.
-type TYPE      which are of TYPE. Types include "f" for files, "d" for directories, and "l" for links.
-lname PATTERN  which are symbolic links pointing at a file matching PATTERN in FNRE syntax.
-user NAME      owned by the user NAME.
-group NAME     owned by the group NAME.

Some tests also have a case-insensitive version (such as -iname, -iregex, -ipath, and -ilname). Most tests expect FNRE patterns.  If these tests are given an expression in SMRE syntax, they will not find the expected files. This is the difference between "-path" and "-regex" above.
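The difference between matching on the basename and matching on the path can be seen in a throwaway directory. This is a sketch assuming mktemp -d and a find that supports -path; the directory and file names are made up for the example.

```shell
# Build a small throwaway tree.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/docs"
touch "$demo/src/main.c" "$demo/src/util.c" "$demo/docs/main.txt"

# -name sees only the basename of each file.
by_name=$(cd "$demo" && find . -type f -name 'main.*' | LC_ALL=C sort)

# -path sees the whole path as find prints it.
by_path=$(cd "$demo" && find . -type f -path './src/*' | LC_ALL=C sort)

printf 'by name:\n%s\n\nby path:\n%s\n' "$by_name" "$by_path"

# Clean up the throwaway tree.
rm -r "$demo"
```

The -name test picks up main.c and main.txt regardless of directory, while the -path test picks up everything under ./src regardless of name.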

Action           Effect
-exec COMMAND ;  Run COMMAND. Any tokens between -exec and the semicolon are its arguments. Wherever the string {} is present, the currently matching filename will be substituted.
-print           Print the current filename. This is the default action if none is specified.
-fprint FILE     Write the current filename to FILE.

Make sure you use single quotes surrounding the patterns in the following.  We want the pattern evaluated by find, not by the shell.  Remember to properly quote and/or escape as needed to avoid unexpected results from the shell.

Example:

find . /tmp -user "$USER" -name '*tmp[0-9]' -print -exec rm \{\} \;

This command accomplishes the following:

  1. Search
    1. the current directory, and also
    2. /tmp
  2. For files
    1. belonging to the current user (the "$USER" is evaluated by the shell, not find)
    2. with names ending with tmp followed by a single digit (-name takes an FNRE pattern, in which + is not a meta character, so a pattern such as '*tmp[0-9]+' would demand a literal + at the end of the name)
  3. And
    1. print their names
    2. delete them (-exec rm \{\} \;  Note how we had to escape the {} and the semicolon)

Another way of writing the above example is:

find . /tmp -user "$USER" -name '*tmp[0-9]' -print -exec rm '{}' ';'
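Because a command of this shape deletes files, it is safer to try it first in a scratch directory. The sketch below assumes mktemp -d is available; the file names are made up for the example.

```shell
# Create a scratch directory with two temporary-looking files and one keeper.
scratch=$(mktemp -d)
touch "$scratch/atmp1" "$scratch/btmp7" "$scratch/keep.txt"

# Print each matching file, then remove it. {} is replaced by the current
# path, and the quoted ';' marks the end of the rm command.
deleted=$(cd "$scratch" &&
    find . -type f -name '*tmp[0-9]' -print -exec rm '{}' ';' | LC_ALL=C sort)

# Whatever is left was not matched by the pattern.
remaining=$(ls "$scratch")

printf 'deleted:\n%s\nremaining:\n%s\n' "$deleted" "$remaining"

# Clean up the scratch directory.
rm -r "$scratch"
```

Dropping the -exec part and keeping only -print is a simple way to preview which files a command like this would remove.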


Lab Experiment

All work is expected to be carried out in the Operating Systems and Internet Security (OSIS) Lab, 429 Russ.  You are, however, welcome to work elsewhere.  Note that this lab needs both Linux and Windows, along with other software that may not be installed in other facilities.

Record your observations in a plain text file named myLabJournal.txt using your own words and/or copying appropriate lines.  You may use any editor you wish to edit this file.  The editing of other files mentioned in this Lab requires the use of specific editors.

Visit the web pages listed below. 

It is OK if you do not care to study the content of the above.  We will be saving the content of these pages in different ways as three different files, and use them as input files in our manipulations below.

  1. In Windows, surf to each of the URLs given above, and save each on your USB mass storage device (thumb drive).  The use of this USB drive is implied in the remaining steps.
  2. In Windows, using Adobe Reader, save the text content of ConscientiousSoftwareCC.pdf as a new file named ConscientiousSoftware0.txt.
  3. In Windows, using Emacs, make the following changes in the file named jobs-061505.html and save the resulting file as a new file named jobs-061505.txt.
    1. Remove all HTML markup.
    2. Make sure that every comma is followed by exactly one space.
    3. Make sure that every period is followed by exactly two spaces. Bonus points given if you make an exception for the period appearing at the end of a line, with nothing more on that line.
  4. In Linux, using sed, make the following changes in the file named ConscientiousSoftware0.txt and save the result as a new file named ConscientiousSoftware1.txt.
    1. Delete trailing white space in each line.
    2. Make sure that every comma is followed by exactly one space.
    3. Make sure that every period is followed by exactly two spaces. Bonus points given if you make an exception for the period appearing at the end of a line, with nothing more on that line.
  5. In Linux, locate the file containing the Abstract of the First Monday article.  Edit this file using vi and save just the abstract as a new file named abstract.txt.  Make sure that in this file:
    1. There are no ^M (ASCII CR, carriage-return) characters.
    2. Every period is followed by exactly two spaces.
  6. In Linux, make a zip archive named RegEx.zip of all the files and folders used/created/saved in the above steps.
  7. In Linux, make a bzipped tar ball named RegEx.tar.bz of the same files and folders that you zipped in the above step.
  8. Explain the causes for the differences in size between the ZIP and the tarball.
  9. Append to your journal an ls -l listing of all the files used/created above.

Turnin

  1. Note the number <n> of this Lab from the course home page and use L<n> as the first argument to turnin.  Turn in the *.zip, *.tar.bz,  myLabJournal.txt files and the usual ReadMe.txt as explained in Expectations.

Link to Grading Sheet


Acknowledgements

Ben Murray, Taylor Killian.


References

  1. Sobell, Chapters 6, 7, 13.  Required Reading.
  2. List of archive formats http://en.wikipedia.org/wiki/List_of_archive_formats Required Reading.
  3. (i) http://en.wikipedia.org/wiki/Bzip2 (ii) http://www.bzip.org/ "bzip2 is a freely available, patent free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), whilst being around twice as fast at compression and six times faster at decompression."  Recommended Visit.
Copyright © 2009 Prabhaker Mateti last edited: June 24, 2009