CEG 233: Linux and Windows

Regular Expressions, Editors, Archivers

Table of Contents

  1. Educational Objectives
  2. Regular Expressions
  3. Editors
  4. Find
  5. Archivers
  6. Lab Experiment
  7. Acknowledgements
  8. References

Educational Objectives

The objectives of this lab experiment are for you to:

  1. Search and replace, based on regular expressions, in file content and file names.
  2. Learn the usage of
    1. three standard editors: sed, vi, and emacs.
    2. a string pattern search program: grep.
    3. a file locating program: find.
    4. two distinct tools for compression and archive creation: zip and tar.
  3. Notice a few interoperability issues.

Regular Expressions

A regular expression (also called a regex or regexp) is a pattern which describes the characteristics of a chunk of text that it matches. This is handy for tasks like searching and replacing (e.g., to replace the misspelling "helo" with "hello") or running commands on multiple files.

This section is a supplement to the textbook. Sobell, Appendix A: Regular Expressions, is Required Reading.

Meta Characters

A regular expression Hello matches that exact substring in a line such as Hello Bob. The individual characters of Hello as a regex stand for themselves and match only themselves. The so-called meta characters do not match themselves but describe other matching requirements, such as sequences of one or more characters. For example, in FNRE (introduced below) the asterisk is a meta character that matches any sequence of zero or more characters. There are several meta characters, and each has its own semantics.

The typical meta characters are: . ? * () [] {} + | ^ $

Every once in a while, we wish to use a meta character as a regular (i.e., non-meta) character; in such situations, the meta character needs to be "escaped" or "quoted" by preceding it with a backslash. Thus, \* will match exactly one asterisk.
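As a small illustration of quoting a meta character, the sketch below runs grep over three made-up lines, once with the asterisk escaped and once without. The function names sample, literal_star, and meta_star are invented for this example.

```shell
# Three sample lines; only the first contains a literal asterisk.
sample() {
    printf '%s\n' 'a*b' 'aab' 'ab'
}

# a\*b : the backslash makes the asterisk literal, so only "a*b" matches.
literal_star() {
    sample | grep -E 'a\*b'
}

# a*b : unescaped, the asterisk means "zero or more a", so every line
# containing a "b" (preceded by any number of a's, even none) matches.
meta_star() {
    sample | grep -E 'a*b'
}

literal_star    # prints: a*b
meta_star       # prints all three lines
```

Note that the patterns are in single quotes so that the shell passes the backslash and asterisk to grep untouched.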

Different Syntaxes of RegEx

There are several kinds of regex, with syntaxes that differ by task, tool, and platform (Windows, Linux, etc.). We will be discussing two major kinds: FNRE (File Name Regular Expressions) and SMRE (String Matching Regular Expressions). The meta characters and other details differ between these two. Even within SMRE, different programs interpret regular expressions slightly differently. For example, grep and sed have two modes: basic and extended (enabled with grep -E or sed -E). In basic mode, the GNU versions of these tools treat ?, +, and | as meta characters only when prefixed with a \, which is the opposite of the backslash's usual role of removing special meaning.
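The difference between the two modes can be seen with a two-line sample, as in the sketch below. The function names are invented for this example; -x asks grep to match whole lines only.

```shell
# Two sample lines: a run of a's, and an "a" followed by a literal plus sign.
sample() {
    printf '%s\n' 'aaa' 'a+'
}

# Basic mode (the default): + is an ordinary character,
# so the pattern a+ matches only the line that is literally "a+".
basic_match() {
    sample | grep -x 'a+'
}

# Extended mode (-E): + is a quantifier meaning "one or more",
# so the pattern a+ matches the line consisting only of a's.
extended_match() {
    sample | grep -E -x 'a+'
}

basic_match       # prints: a+
extended_match    # prints: aaa
```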

FNRE (File Name Regular Expressions)

File-name-regex are used by shells in the context of file names.

Meta  Meaning                        Example                                    Matches
*     Match zero or more characters  a*txt                                      "atxt", "aa.txt", "abtxt", "aba.txt", etc.
?     Match exactly one character    ?.txt                                      "a.txt", "b.txt", etc. NOT "aa.txt" or "ab.txt".
.     not a meta char                .                                          matches itself
!     history meta char              !c                                         a command in your history that begins with c.
$     value of a var                 $HOME                                      not a file name thing, but ...
#     comment line                   # this is just a comment
%     cut out matching tail          ${fnm%.mp3}                                $fnm but without the .mp3 tail
#     cut out matching head          ${fnm#*.}                                  $fnm but without the head substring matching *.
'     use as-is                      '*'                                        the quote-stripped single token * as-is
"     expand but protect             ls -i "$fnm"                               value of $fnm; useful when $fnm has e.g. blanks
`     execute cmd enclosed           ls -l `cat list.txt`                       the resulting stdout is substituted
()    grouping of cmds               (echo start; ls -lisa; echo done) | wc -l
[]    as in string regex             [d-h]                                      any one of d,e,f,g,h
{}    enumeration                    echo {hello,hi}there
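A few of the table rows can be tried without creating any files, because the shell's case statement uses essentially the same pattern syntax as file name expansion. The sketch below assumes a POSIX shell; the helper names glob_match, strip_tail, and strip_head are invented for this example.

```shell
# Report whether the name $2 matches the FNRE pattern $1.
# $1 is left unquoted inside "case" so that * ? [] keep their meta meaning.
glob_match() {
    case "$2" in
        $1) echo yes ;;
        *)  echo no ;;
    esac
}

# ${var%pattern} removes the shortest matching tail of the value.
strip_tail() {
    fnm=$1
    echo "${fnm%.mp3}"
}

# ${var#pattern} removes the shortest matching head of the value.
strip_head() {
    fnm=$1
    echo "${fnm#*.}"
}

glob_match 'a*txt' 'aba.txt'    # yes
glob_match '?.txt' 'a.txt'      # yes
glob_match '?.txt' 'aa.txt'     # no: ? stands for exactly one character
glob_match '[d-h]' 'e'          # yes
strip_tail 'song.mp3'           # song
strip_head 'song.mp3'           # mp3
```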

SMRE (String Matching Regular Expressions)

This regex flavor is used by grep, emacs, sed, and many other utilities to search for text. It is more complicated than the FNRE syntax. See the man pages grep(1), regexp(n), and perlre(1) for more detail.

The meta characters of SMRE are: . * + ? | () [] {} ^ $

The dot (also called period) will match any one character no matter what it is (numbers, letters, punctuation, etc.). Of course, it will match the dot itself.

The asterisk, plus, query (question mark), and vertical bar can appear only following a sub-regex. The asterisk specifies that the preceding sub-regex may be matched any number of times, including not at all (i.e., zero times). E.g., z* matches sequences of z of any length, and .* matches any arbitrary sequence (i.e., any number of matches of the dot). The parentheses enclose a sub-regex; e.g., (xy)* matches all of xyxyxyxy but not all of xyaxy.

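The * and + are said to be "greedy" in that they will match maximally (i.e., as much as possible); thus, .*e will match all of abcdefgeeeeee. In Perl-style regex flavors (see perlre(1)), appending a question mark (?) after the + or * will make it match minimally; the basic and extended modes of grep and sed do not support this minimal form.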
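Greedy matching is easy to observe with sed. The sketch below uses the same input string as the text. Because sed has no minimal quantifier, the second function gets the "stop at the first e" effect with a negated character class instead, which is a substitute technique and not the Perl-style *? form. The function names are invented for this example.

```shell
# Greedy: .*e stretches to the LAST e on the line, so the whole line is replaced.
greedy() {
    printf '%s\n' 'abcdefgeeeeee' | sed 's/.*e/123/'
}

# [^e]*e can only run up to the FIRST e, because [^e] refuses to cross an e.
up_to_first_e() {
    printf '%s\n' 'abcdefgeeeeee' | sed 's/[^e]*e/123/'
}

greedy           # prints: 123
up_to_first_e    # prints: 123fgeeeeee
```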

The meaning of other meta characters is summarized in the table below.

Syntax       Meaning                                                  Example   Matches
.            Any single character                                     .         any single character
^            Beginning of line                                        ^Unix     occurrences of "Unix" at the beginning of lines
$            End of line                                              CEG$      occurrences of "CEG" at the end of lines
[range]      Any character in the range                               [d-m]     any letter from d through m, including d and m
[chars...]   Any one of the characters inside the square brackets     [aeiou]   any one of a e i o u
[^chars...]  Any char other than those inside the square brackets     [^aeiou]  any character other than a e i o u
Quantifiers
*            Match 0 or more times                                    .*        any number of repetitions of any character
+            Match 1 or more times                                    [ab]+     any string consisting of only a's and b's ("a", "b", "aab", "bbbab", and so forth). Note: must be prefixed with \ in basic regex mode.
                                                                      (cat)+    cat, catcat, catcatcat
?            Match 1 or 0 times                                       b?        either "b" or nothing at all (the empty string), but no further repetitions. Note: must be prefixed with \ in basic regex mode.
|            Match either of the expressions joined                   a|b       either "a" or "b"
{n}          Match n times                                            a{5}      aaaaa
{m,n}        Match between m and n times                              a{2,4}    aa, aaa, aaaa
Other
\            Quoting: the quoted character loses its special meaning  \*        a literal "*" (as opposed to the quantifier)
\            Escape sequence, as in C/C++/Java                        \n        a newline

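Taken as a match of the whole string, the regex (abc|[a-f][g-j][m-p]|[1-9]) matches abc, agm, cjo, 9, and 3, but not ag3, ab, ag, ABC, AGM, or 12. (As a substring search it would still find the 3 inside ag3 and the 1 inside 12.)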
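These claims can be checked with grep. In the sketch below, -E selects extended syntax, -x requires the whole line to match, and -q suppresses output so that only the exit status is used. LC_ALL=C is set as a precaution so that [a-f] means exactly the lowercase letters a through f. The function name whole_match is invented for this example.

```shell
# Print "yes" if the whole string $1 matches the regex, "no" otherwise.
whole_match() {
    if printf '%s\n' "$1" | LC_ALL=C grep -E -x -q '(abc|[a-f][g-j][m-p]|[1-9])'; then
        echo yes
    else
        echo no
    fi
}

# Try some of the strings discussed in the text.
for s in abc agm cjo 9 ag3 ab ABC 12; do
    echo "$s: $(whole_match "$s")"
done
```

Dropping the -x option turns this into a substring search, in which case ag3 and 12 are also reported as matching.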

Replacements

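The notation s/PATTERN/REPLACEMENT/ generally stands for "search for the regular expression PATTERN and replace the left-most match with the string REPLACEMENT." A trailing g, as in s/PATTERN/REPLACEMENT/g, replaces every match rather than only the left-most one. Below are a few examples.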

Input           Substitution      Output
abccba          s/a/x/            xbccba
abccba          s/a/x/g           xbccbx
a19b20c3d4e5    s/[0-9]+//g       abcde
duckduckduck    s/duck$/goose/    duckduckgoose
water fire air  s/^water/snow/    snow fire air
abcdefgeeeeee   s/[^e]+e/123/     123fgeeeeee
abcdefgeeeeee   s/.*e/123/        123
abcdefgeeeeee   s/.*?e/123/       123fgeeeeee

The rows that use + assume extended mode. The last row assumes a Perl-style flavor in which *? matches minimally; grep and sed do not support that form.

In the replacement itself, you may not use regular expressions at all. All characters (except for back-references such as \1, \2, and \3, described below, and, in sed, the & character, which stands for the whole match) will stand only for themselves. A period means a period, nothing else.
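Several of the substitutions above can be run directly with sed. The sketch below wraps sed in a small helper, subst, whose name is invented for this example; -E selects extended mode so that + needs no backslash.

```shell
# Apply the sed substitution command $2 to the single input line $1.
subst() {
    printf '%s\n' "$1" | sed -E "$2"
}

subst 'abccba'         's/a/x/'            # xbccba  (left-most match only)
subst 'abccba'         's/a/x/g'           # xbccbx  (g: every match)
subst 'a19b20c3d4e5'   's/[0-9]+//g'       # abcde
subst 'duckduckduck'   's/duck$/goose/'    # duckduckgoose
subst 'water fire air' 's/^water/snow/'    # snow fire air
subst 'a.b'            's/\./. /'          # a. b  (the period in the replacement is literal)
```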

Capturing

Regular expressions often contain sub-regular-expressions; e.g., a*b+c has a*, b+, and c as subexpressions. Subexpressions can be enclosed in parentheses to make this structure more obvious; thus (a*)(b+)(c) matches the same strings as the previous a*b+c. This idea is used in a more advanced way in what is known as capturing.

Consider (a*)(b+)(a*). This pattern matches, among many others, the following: aabbbaaaa. What if you wish to match a target string so that the first a* and the second a* are exactly the same? The expression (a*)(b+)\1 will do that; the expression (a*)(b+)(a*) matches possibly different a*'s. The backslash-one refers to whatever was matched by subexpression numbered 1.

Subexpressions are numbered by the position of their opening parenthesis, counting from left to right. In (A[BC](dE))([fF][Gg]), (A[BC](dE)) would be 1, (dE) would be 2, and ([fF][Gg]) would be 3. Here are a few examples.

Regular Expression        Matches                             Does not match
(ABC)def\1                ABCdefABC                           ABCdefAbc, ABCdefabc
([aA][bB][cC])def\1       abcdefabc, AbcdefAbc, AbCdefAbC     abcdefABC, abcdefAbc
(.....) ham \1            Hello ham Hello, 12345 ham 12345    Hello ham H...o
([a-z][A-Z])([1-9])\2\1   aA11aA, aF99aF                      aA11Aa, aA12aA
As an example, s/([a-z]+) in the ([a-z]+)/\2 in the \1/ uses captures to convert "hat in the cat" to "cat in the hat".
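The swap above, and the first row of the table, can be tried as follows. This is a sketch with invented function names. The second function uses basic mode, where grouping parentheses are written \( and \), because back-references inside the pattern are defined for basic mode.

```shell
# Back-references in the REPLACEMENT: exchange the two captured words.
swap_nouns() {
    printf '%s\n' "$1" | sed -E 's/([a-z]+) in the ([a-z]+)/\2 in the \1/'
}

# Back-reference in the PATTERN: \1 must repeat exactly what group 1 matched.
# -x requires the whole line to match; -q reports through the exit status only.
same_again() {
    if printf '%s\n' "$1" | grep -x -q '\(ABC\)def\1'; then
        echo yes
    else
        echo no
    fi
}

swap_nouns 'hat in the cat'    # cat in the hat
same_again 'ABCdefABC'         # yes
same_again 'ABCdefAbc'         # no
```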

Editors

Find

The find utility finds files that meet given criteria. It can also run a command on each of these files. Because this section is part of an article on regular expressions, it does not cover the many searches by timestamp, size, etc. that find can also do.

The basic syntax is

find PATH... EXPRESSION...
where PATH describes the directories to examine, and the EXPRESSION can be an option, a test, or an action.

Test             Finds Files...
-name PATTERN    with basename matching PATTERN in FNRE syntax.
-path PATTERN    with full paths matching PATTERN in FNRE syntax.
-regex REGEX     with full paths matching REGEX in SMRE syntax.
-type TYPE       which are of TYPE. Types include "f" for files, "d" for directories, and "l" for links.
-lname PATTERN   which are symbolic links pointing at a file matching PATTERN in FNRE syntax.
-user NAME       owned by the user NAME.
-group NAME      owned by the group NAME.

Some tests also have a case-insensitive version (such as -iname, -iregex, -ipath, and -ilname). Most tests expect FNRE patterns. If these tests are given an expression in SMRE syntax, they will not find the expected files. This is the difference between "-path" and "-regex" above.
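The difference between an FNRE test and an SMRE test can be seen in a scratch directory. The sketch below assumes a find that supports -regex (GNU find does); the function names and the sample file names are invented for this example. The regex spells "one or more digits" as [0-9][0-9]* so that it does not depend on which regex dialect a particular find uses by default.

```shell
# List the regular files under directory $1 whose BASENAME matches an FNRE pattern.
by_name() {
    find "$1" -type f -name '*tmp[0-9]' | sort
}

# List the regular files under directory $1 whose WHOLE PATH matches a regex.
by_regex() {
    find "$1" -type f -regex '.*tmp[0-9][0-9]*' | sort
}

demo=$(mktemp -d)    # throwaway directory for the demonstration
touch "$demo/atmp1" "$demo/btmp22" "$demo/notes.txt"

by_name "$demo"      # lists only atmp1: the pattern allows exactly one digit
by_regex "$demo"     # lists atmp1 and btmp22

rm -r "$demo"
```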

Action            Effect
-exec COMMAND ;   Run COMMAND. Any tokens between -exec and the semicolon are its arguments. Wherever the string {} is present, the currently matching filename will be substituted.
-print            Print the current filename. This is the default action if none is specified.
-fprint FILE      Write the current filename to FILE.

Make sure you surround the patterns in the following with single quotes: we want each pattern evaluated by find, not by the shell. Remember to quote and/or escape properly to avoid unexpected results from the shell.

Example:

find . /tmp -user "$USER" -name '*tmp[0-9]' -print -exec rm \{\} \;

This command accomplishes the following:

  1. Search
    1. the current directory, and also
    2. /tmp
  2. For files
    1. belonging to the current user (the "$USER" is evaluated by the shell, not find)
    2. with names ending with tmp followed by a single digit (-name takes an FNRE pattern, in which + is not a quantifier, so a pattern such as '*tmp[0-9]+' would require a literal + at the end of the name)
  3. And
    1. print their names
    2. delete them (-exec rm \{\} \; Note how we had to escape the {} and the semicolon)

Another way of writing the above example is:

find . /tmp -user "$USER" -name '*tmp[0-9]' -print -exec rm '{}' ';'
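Since a find command with -exec rm deletes files, it is prudent to try it in a scratch directory first. The sketch below does so; the function name clean_tmp and the sample file names are invented for this example. The pattern is '*tmp[0-9]', i.e., tmp followed by one digit, because -name takes an FNRE pattern, in which + has no special meaning.

```shell
# Print, then delete, every regular file under directory $1 whose name
# ends in "tmp" followed by one digit.
clean_tmp() {
    find "$1" -type f -name '*tmp[0-9]' -print -exec rm '{}' ';'
}

work=$(mktemp -d)    # throwaway directory, so nothing of value is at risk
touch "$work/atmp1" "$work/keep.txt"

clean_tmp "$work"    # prints the path of atmp1 and removes that file
ls "$work"           # prints: keep.txt

rm -r "$work"
```

Running the same find without the -exec part first, so that it only prints, is a simple way to preview what would be deleted.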

Archivers

List of archive formats http://en.wikipedia.org/wiki/List_of_archive_formats Required Reading.

Lab Experiment

All work is expected to be carried out in the Operating Systems and Internet Security (OSIS) Lab, 429 Russ, but you are welcome to work elsewhere. Note that you may need both Linux and Windows, as well as other software that is not always installed in other facilities.

Record your observations in a plain text file named myLabJournal.txt using your own words and/or copying appropriate lines. You may use any editor you wish to edit this file. The editing of other files, etc. mentioned in this Lab requires the use of specific editors/tools. Note that the choice of specific tools in Windows/Linux is mentioned so that you appreciate cross-platform usage. As always, producing the specific files matters less than learning the specific tools and techniques.

Visit the web pages listed below.

It is OK if you do not care to study the content of the above. We will be saving the content of these pages in different ways as three different files, and use them as input files in our manipulations below. 

  1. In Windows, surf to each of the URLs given above, and save each on your USB mass storage drive. Use any web browser you like. The use of this USB drive is implied in the remaining steps.
  2. In Windows, using Adobe Reader, save the text content of ConscientiousSoftwareCC.pdf as a new file named ConscientiousSoftware0.txt.
  3. In Windows, using Emacs, make the following changes in the file named jobs-061505.html and save the resulting file as a new file named jobs-061505.txt.
    1. Remove all HTML markup. JavaScript lines may be left as they are.
    2. Make sure that every comma is followed by exactly one space.
    3. Make sure that every period is followed by exactly two spaces. Bonus points given if you make an exception for the period appearing at the end of a line, with nothing more on that line.
  4. In Windows, using a tool of your choice, search the First Monday article for the occurrences of any word with more than two vowels. Save the list of these occurrences as a file named twoVoweledWordList.txt. Make the above into a PS script procedure called twoVowelSearch() and include it in answers.txt
  5. In Windows, make a tar archive named windowsTxtFiles.tar.bzip2, bzip2-compressed, of all the above files. [One well-known open-source program that can do this is 7z (visit http://www.7-zip.org/).] Append to your journal a listing of all the files in this tar-ball.
  6. In Linux, using sed, make the following changes in the file named ConscientiousSoftware0.txt and save the result as a new file named ConscientiousSoftware1.txt.
    1. Delete trailing white space in each line.
    2. Make sure that every comma is followed by exactly one space.
    3. Make sure that every period is followed by exactly two spaces. Bonus points given if you make an exception for the period appearing at the end of a line, with nothing more on that line.
    4. Make the above into a script procedure called rmHTML() and include it in answers.txt
  7. In Linux, using vi, edit the First Monday article as follows.
    1. Thinking of this HTML file as a plain text file, insert newlines appropriately (i.e., without any semantic changes) to make sure that no line is longer than 70 characters.
    2. Remove empty/blank paragraphs, if any.
    3. Remove all images (HTML <img ... > ).
    The resulting file should still be a valid HTML file, displayable in a standard web browser.
  8. In Linux, using find, list the details of all files in your home directory (and its subdirectories) that are older than 3 days, and of size bigger than 999 bytes. Save this listing as a file named oldFilesList.txt
  9. In Linux, make a zip archive named RegEx.zip of all the files and folders used/created/saved in the above steps.
  10. In Linux, append to your journal an ls -l listing of all the files used/created above.

Turnin

  1. Note the number <n> of this Lab from the course home page and use L<n> as the first argument to turnin. Turn in the tar-ball, the zip file, answers.txt, myLabJournal.txt files and the usual ReadMe.txt as explained in Expectations.

Link to Grading Sheet

Acknowledgements

Ben Murray, Taylor Killian.

References

  1. Sobell, Chapter 6: The vi Editor. Required Reading.
  2. Sobell, Chapter 7: Emacs Editor. Required Reading.
  3. Sobell, Chapter 13: The sed Editor. Required Reading.
  4. Sobell, Appendix A: Regular Expressions. Required Reading.
  5. List of archive formats http://en.wikipedia.org/wiki/List_of_archive_formats Required Reading.
  6. (i) http://en.wikipedia.org/wiki/Bzip2 (ii) http://www.bzip.org/ "bzip2 is an open source tool for both Linux and Windows, patent free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques, whilst being around twice as fast at compression and six times faster at decompression." Recommended Visit.

Copyright © 2011 Prabhaker Mateti