CMPS 180 - Winter 04

Extra Credit Assignment

Roogle: Relational Google

Description: The goal of this assignment is to design a relational database that can implement part of Google's functionality. Overall, the database consists of a collection of documents, where each document is considered to be a sequence of words (thus, words are ordered within a single document). A keyword query has the following format:

keyword_1 keyword_2 ... keyword_n

where each keyword_i is a word that is of interest to the user. The result of the query is the set of document ids which contain all the words. A more powerful version of the keyword query is the following:

keyword_1 p_1 keyword_2 p_2 ... keyword_(n-1) p_(n-1) keyword_n

where each p_i is a positive integer and denotes the distance, in the document, between keyword_(i-1) and keyword_i (in this case, we assume that keyword_i occurs after keyword_(i-1)). For instance, if p_i=1, then the keywords are adjacent, if p_i=2, then there exists a third word between them and so on and so forth. The result of such a query is the set of document ids that contain all the keywords, at the specified distances. (Note: one possible extension is to consider p_i to denote a symmetrical distance, i.e., keyword_(i-1) might occur before or after keyword_i).

Assignment: Describe the implementation of Roogle over a relational DBMS. More specifically, you need to provide:

Hint: Roogle only needs to return document ids, it does not need to reconstruct the documents (although that will be a big plus!).