CSE 101 -- Prog 3

Due before midnight, Sunday, November 14, 2021.
Late submissions will not be accepted/graded.

Changelog:

Nov 1, 2021: Updated List.c file list_get_index to return NULL on negative indexes. For details see: Piazza note 395
Nov 06, 2021: Updated input file for emma to remove whitespace from start of one of the lines. Updated formatting function. For details see: Piazza post 358

Objectives:

Implement a Dictionary ADT with Hash Tables and produce an application that uses your implementation.
The application will read a document and determine the 25 most common words within that document.
In addition, it will determine the 20 longest words in that document.
Ties in frequency or length should be decided by the ascending alphabetical order of words (hint: strcmp may be useful).
The output of the application will be the same as the input but the font size and colors of those words will be changed.
The output function is provided to you.

Description:

A Dictionary ADT manages data elements that are key:value pairs. Keys are unique and comparable, while values can contain more information associated with the key. In this program, you will have two goals:

implement a Dictionary ADT
implement an application which computes features of text input

The application is a text-parsing system that maintains a dictionary of words and their frequencies.
In addition you will need to be maintain a dictionary of the length of each word.
The dictionary of words should use each word as a key, the value is either a pointer to an int (Describing length or count).
You will also use the provided List ADT to help you sort the entries by word count, word length or alphabetically.

Details:

Input will be provided via stdin.
Input is a document containing multiple lines of words and ends when the input ends.
The input will define the stopwords for each document as a single line, with each stopword separated by a space.
The line following the stopwords will be a padding line consisting of "===="
HashTable:
You will implement your dictionary ADT using hash tables. The hash function will be provided to you in HashTable.c
Keys are the unique words extracted from the document. Each word is separated by a space or a new line. However, you will also be given a list of stop words which should be filtered out when calculating word lengths and frequencies.
All words from the stop word list should be ignored, i.e. they should not be added to your dictionary.
All the stop words are on the very first line of the input. Each word is separated by a space. The line after the stop words should be ignored. Then the full text starts
Your hash table should be sized to contain 100 slots.
Note: This is by no means an ideal size choice. This size is chosen to ensure that collisions occur.
Dictionary ADT:
Dictionary ADT provide efficient access of key:value pairs. You will be using the functions in HashTable.c to support the following operations:
- dictionary_create(n, dp, fd) -- returns an empty dictionary with a hash table of n entries.
- dictionary_size( D ) -- returns the number of elements currently in dictionary D.
- dictionary_find( D, k ) -- returns the pointer to the node whose key is k, or NULL if k is not found. Note that the value element is pointed to by void *.
- dictionary_insert( D, e ) -- inserts element e into the dictionary, if an element with the same key does not already exist. Update the size of D.
- dictionary_delete( D, k) -- deletes element with key k from the dictionary. Update the size of D.
- dictionary_print( D ) -- prints the dictionary. Specifically, it looks at the non-empty slots of the hash table, and prints the key and value fields of each node using the dataPrint (dp) function passed in the constructor.
where k is the key, e is a key:value pair object, and values are generic pointer type.
Document feature extraction application:
The application maintains two dictionaries: Both use words as keys. One stores word lengths, while the other stores word frequencies. The application then outputs a formatted version of the input. Simplifying Assumptions:
- For the purposes of this assignment you do not need to convert upper to lowercase; it is acceptable for your code to treat capitalized words as different words. E.g. "Word" does NOT need to be treated as though equal to "word"
- Words are separated by a space or a newline -- with each character, including punctuation -- being handled equally. E.g. mr. is a length 3 word, and is not equal to mr
- The hash function ht_string2int is provided for you. Provide words to this function to obtain an int associated with that string. Applying the modulo operator (%) to this int will provide you with the key for the string.
- YOu must implement chaining to resolve collisions.
- You can use the provided list implementation to handle collisions in your dictionary.
- The function for printing output is provided to you.
- The first section of input will be a set of stop words to ignore. Your application should not add these words to your dictionaries. This set will be separated from the rest of the input by "===="
Files for this program: The following files are provided to you, and should all be submitted with your work.
- Makefile: This should be able to compile your submission on the unix server. Your program should be able to compile using `make prog3`
- Dictionary.c: Where most of your work will be done. Implements the functions in Dictionary.h
- Dictionary.h: Function headers for your Dictionary implementation. Includes a definition for your key value pairs. You should not have to modify this file.
- HashTable.c: For implementing your hash table. You will have to complete one function within this file. The hash function is completed for you.
- HashTable.h A header for the provided hash table. You should not have to modify this file.
- List.c: A working implementation of a list. You should not have to implement this file. It should be used by Dictionary.c and for sorting in prog3.c
- List.h: A header file for the provided list. You should not modify this file.
- prog3.c: Your application file. This also contains the functions provided for formatting your output. You may add any additionaly functions you want here, however you should not modify printOutput, word_length and any_char.
Example input/output: The following files are provided to you and should NOT be submitted with your work. Note that the first part of the input file is a set of stop words which should be ignored (not added to your dictionary). This set is separated from the rest of the input by "===="
You should be able to test your program works by running `./prog3 < test1.in` and comparing it to the test1.out.html provided.
- ```
Input: test1.in Expected output: test1.out.html
```
- ```
Input: test2.in Expected output: test2.out.html
```
- ```
Input: test3.in Expected output: test3.out.html
```
Resources:
- An explanation of void*
- Sample unit and program tests can be found at the link: exampleTests/
- FAQ will be posted soon
Grading:
- Full unit and program tests can be found at the link: fullTests/
- Rubric:
```
You start off with 100 points.
You lose credit for missing functionality, incorrect results,
poorly documented or formatted code, or not following instructions.

Below is a partial list:


- up to 10 points off for not following instructions
- up to 10 points off for special handling to grade your homework (usually
  because you did not check that it runs on the computers on the timeshare first).
- 50 points - Unit tests of your Dictionary implementation (28 tests)
- 30 points - Program tests for your prog3 implementation (17 tests)
- 20 points - Code readability and commenting
- Sample in and out files are provided in Example input/output section for program tests

Make sure you:
a.  include the Makefile, .c files, and .h file
b.  generate an executable file called prog3 with your Makefile
c.  have confirmed that it compiles and runs properly on unix.ucsc.edu
d.  followed the general instructions described in overview.html
```
- Who graded your assignment based on your CRUZID first letter :
```
nick:   a*  - ju*
jay:    jw* - z*
```
Submission:

Make sure that you've compiled and tested your code on the campus unix timeshare before submitting. On the unix timeshare, put materials in a folder named prog3 and zip it up.
submit cse101-ap.f21 prog3 prog3.zip
You can submit as often as you want up until the deadline. We will only look at your most recent submission.
After each submission, confirm that you get a message like:
```
Submitting the files you specified...
copying XXX

1 File submitted: 
  XXX (revision 2)
```
where XXX is the material you submitted.
Read the for instructions on how to submit your work.

Last modified .

CSE 101 -- Prog 3

Due before midnight, Sunday, November 14, 2021. Late submissions will not be accepted/graded.

Changelog:

Objectives:

Description:

Details:

Grading:

Who graded your assignment based on your CRUZID first letter :

Submission:

Last modified .

Due before midnight, Sunday, November 14, 2021.
Late submissions will not be accepted/graded.