CS 400/600:
Data Structures and Software Design

Project 2: wordCount

Due: July 7, 1999 11:59PM    Summer Qtr 1999,   pmateti@cs.wright.edu

This project is about the use of Binary Search Trees, Skip Lists, and Hash Tables. This handout is a printout of the file p2.html. You will find several related files in our public directory.

I will answer questions relating to P2 in our newsgroup news:wright.cs.400. Keep an eye on this newsgroup.

Count the Words

We wish to count how many times each word of a text file appears in that file. The name of this file is given as an argument to your program. Name this program wordCount. The program is required to output a sequence of lines, one per word. Each output line consists of a word followed by the number of times it appeared in the input file.

Reading the Words In

Implement the "dictionary" of words appearing in the file as (1) a hash table with quadratic probing, (2) a binary search tree, and (3) a skip list. You need to keep track of not only the spelling of a word, but also how many times it has appeared in the file so far. The output files resulting from the three dictionaries should be equivalent: all the words appear, and their counts are equal.
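
As an illustration of dictionary (1), here is a minimal sketch of a word-count table that uses open addressing with quadratic probing. The class name WordHashTable, the multiply-by-31 hash function, and the fixed prime capacity are assumptions made for this sketch and are not taken from Sahni's code. The sketch never resizes, so it presumes the table stays less than half full, the usual condition under which quadratic probing into a prime-sized table is guaranteed to find a free slot. It omits print() and the step counter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical class for illustration: counts words in a fixed-size table.
class WordHashTable {
public:
    // primeCapacity is assumed to be a prime number (not checked here).
    explicit WordHashTable(std::size_t primeCapacity = 10007)
        : slots_(primeCapacity > 0 ? primeCapacity : 1) {}

    // Records one more occurrence of word. Returns false only if the
    // probe sequence found neither the word nor an empty slot.
    bool add(const std::string& word)
    {
        const std::size_t j = probe(word);
        if (j == slots_.size()) return false;
        if (slots_[j].count == 0) slots_[j].word = word;  // claim empty slot
        ++slots_[j].count;
        return true;
    }

    // Returns the number of occurrences recorded so far (0 if absent).
    int count(const std::string& word) const
    {
        const std::size_t j = probe(word);
        return j == slots_.size() ? 0 : slots_[j].count;
    }

private:
    struct Slot {
        std::string word;
        int count;                  // 0 marks an empty slot
        Slot() : count(0) {}
    };

    // Simple polynomial string hash (an assumption, any hash would do).
    static std::size_t hash(const std::string& w)
    {
        std::size_t h = 0;
        for (std::size_t i = 0; i < w.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(w[i]);
        return h;
    }

    // Quadratic probing: examine home, home+1, home+4, home+9, ... (mod m).
    // Returns the slot holding w, or the first empty slot on that sequence,
    // or slots_.size() if neither turns up within m probes.
    std::size_t probe(const std::string& w) const
    {
        const std::size_t m = slots_.size();
        const std::size_t home = hash(w) % m;
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t j = (home + i * i) % m;
            if (slots_[j].count == 0 || slots_[j].word == w) return j;
        }
        return m;
    }

    std::vector<Slot> slots_;
};
```

Because words are never deleted in this project, an empty slot can safely end a search; a table that supported deletion would need some marker for vacated slots.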

Instrumentation

For each data type (BST, Hash Table, or Skip List), compute the number of "computational steps" taken for each input file you used. The steps are computed as follows. Introduce a private counter, initialized to 0, into each of the BST, HashTable, and SkipList classes. Increment the counter by one for each loop iteration and for each recursive call, no matter where these occur, except in the print() functions. Whenever your print() method is invoked, it prints this cumulative counter in addition to its usual output.
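
The counting rule can be sketched on a small binary search tree as follows. The names WordBST, add(), steps(), and steps_, as well as the "steps:" output line, are hypothetical choices for this sketch; the handout does not prescribe them. The destructor's recursive helper increments the counter too, on the reading that only print() is exempt.

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Hypothetical instrumented BST holding (word, count) pairs.
class WordBST {
public:
    WordBST() : root_(NULL), steps_(0) {}
    ~WordBST() { destroy(root_); }

    // Inserts word, or bumps its count if it is already present.
    void add(const std::string& word)
    {
        Node** link = &root_;
        while (*link != NULL) {
            ++steps_;                       // one step per loop iteration
            Node* n = *link;
            if (word == n->word) { ++n->count; return; }
            link = (word < n->word) ? &n->left : &n->right;
        }
        *link = new Node(word);
    }

    // Prints "word count" lines in sorted order, then the step counter.
    void print(std::ostream& out) const
    {
        printInOrder(root_, out);           // print() is exempt from counting
        out << "steps: " << steps_ << '\n'; // assumed output format
    }

    long steps() const { return steps_; }

private:
    struct Node {
        std::string word;
        int count;
        Node* left;
        Node* right;
        explicit Node(const std::string& w)
            : word(w), count(1), left(NULL), right(NULL) {}
    };

    static void printInOrder(const Node* n, std::ostream& out)
    {
        if (n == NULL) return;
        printInOrder(n->left, out);
        out << n->word << ' ' << n->count << '\n';
        printInOrder(n->right, out);
    }

    void destroy(Node* n)
    {
        if (n == NULL) return;
        ++steps_;                           // recursive call outside print()
        destroy(n->left);
        destroy(n->right);
        delete n;
    }

    // Copying is disabled (declared, never defined) to avoid double deletes.
    WordBST(const WordBST&);
    WordBST& operator=(const WordBST&);

    Node* root_;
    long steps_;
};
```

For example, adding "m", "a", "z", and then "a" again takes 0, 1, 1, and 2 loop iterations respectively, so steps() reports 4 at that point.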

Design

  1. Our textbook by Sahni discusses Skip Lists in Section 7.3, Hash Tables in Section 7.4, and Binary Search Trees in Section 11.1. The source code from these chapters is located at http://www.mhhe.com/engcs/compsci/sahni/ and also in our public directory, /public/Sahni/. Make adaptations and further assumptions as necessary, but document all of them. Please do not obtain the code from other sources unless you have received prior consent from me.

    Follow the style guidelines given in c++Style.html.

  2. The Standard C library has a function called strtok() that helps in splitting a line of text into words. Run man strtok for further information. Use the delimiters specified in the man page for strtok(), basically white space and all punctuation marks.

    As each word is obtained, insert it into all the dictionaries. If the word is already present, only the count is incremented.

  3. If you are not comfortable with templates, do not use them. But if you do use them, here is what is required. Insert the corresponding line below at the top of each *.h file and each *.C file.
    #pragma interface           /* in  "bst.h" */
    #pragma implementation      /* in  "bst.C" */
    
    and compile with the flag -fexternal-templates, i.e.,
    CFLAGS	= -c -g -Wall -ansi -pedantic -fexternal-templates  ## in "Makefile"
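
The word-splitting step described in item 2 can be sketched as follows. The function name splitLine and the exact delimiter string DELIMS are assumptions made for this sketch (white space plus common ASCII punctuation); adjust the set to whatever you document in your ReadMe.txt. Note that strtok() overwrites delimiters in the buffer it is given and keeps internal state between calls, so the sketch tokenizes a private, writable copy of each line.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Assumed delimiter set: white space plus common ASCII punctuation.
static const char DELIMS[] = " \t\r\n.,;:!?\"'()[]{}<>-_/\\|*&^%$#@~`+=";

// Splits one line of text into words using strtok().
std::vector<std::string> splitLine(const std::string& line)
{
    // strtok() writes '\0' over delimiters, so work on a writable copy.
    std::vector<char> buf(line.begin(), line.end());
    buf.push_back('\0');

    std::vector<std::string> words;
    for (char* tok = std::strtok(&buf[0], DELIMS); tok != NULL;
         tok = std::strtok(NULL, DELIMS)) {
        words.push_back(tok);       // each token would go into every dictionary
    }
    return words;
}
```

With this delimiter set the apostrophe separates words, so "don't" yields the two words "don" and "t"; whether that is acceptable is a design decision to record among your documented assumptions.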
    

Turnin

One test input file, hehner.txt, is provided, but you should test your program on several (small, medium, and large) files. We also hope you will personally read and follow the advice of Hehner.

Submit your solution electronically using turnin P2 <files>. Your submission should include a Makefile, the source code files, test input files of your choice and their output files, a ReadMe.txt file, and a typescript file of a successful run. We may make and run your program on our own test files.


http://www.cs.wright.edu/people/faculty/pmateti/