Web Archeology (290G W03): Possible Projects

Last updated: January 13, 2003

1.0 Introduction

This document lists some project ideas. Ideas outside of this list are most welcome. Also, mixing-and-matching from this list is good too (especially for groups).

2.0 Applications

Counts and histograms

The data at the Archive has only recently been transferred to disk. Thus, there are a lot of basic measurements that need to be made. The document Early measurements from the Internet Archive provides the few measurements that have been done so far. There are many other interesting variables to be measured. Also, the measurements done so far use averages in inappropriate places; other summaries would be better.
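To illustrate why averages can mislead, here is a minimal sketch in Python. The page sizes are made-up numbers (not Archive data), and the function name summarize is my own. With a skewed distribution, the mean is dominated by a single huge page, while the median and a high percentile describe the collection better.

```python
import statistics

# Hypothetical page sizes in bytes (NOT real Archive measurements):
# most pages are small, one is very large.
PAGE_SIZES = [2_000, 3_000, 3_500, 4_000, 5_000, 6_000, 2_000_000]


def summarize(values):
    """Return the mean, median, and nearest-rank 90th percentile."""
    ordered = sorted(values)
    # Nearest-rank percentile: ceil(0.9 * n) using integer arithmetic.
    rank = (len(ordered) * 90 + 99) // 100 - 1
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[rank],
    }


summary = summarize(PAGE_SIZES)
print(summary)
```

Here the median is 4,000 bytes, but the mean is roughly 289,000 bytes because of the one outlier.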

This may not seem glamorous, but there's actually lots of interesting work to be done here (e.g., how many unique pieces of content are there? How many "near duplicates"?). Also, there exist many neat estimation algorithms that could be employed in this project.
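As a starting point for the "near duplicates" question, one common approach (my suggestion, not something prescribed above) is to compare the sets of word "shingles" in two documents. The sketch below is a toy version: the shingle width k=4 and the function names are assumptions, and a real run over the Archive would need sampling or sketching rather than exact sets.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (as tuples) in a text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts get a single shingle holding all of their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(text_a, text_b, k=4):
    """Jaccard similarity of the two shingle sets (1.0 means identical)."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


page_a = "the internet archive stores snapshots of web pages collected by many crawls over time"
page_b = "the internet archive stores snapshots of web pages collected by many crawls over years"
score = resemblance(page_a, page_b)
print(round(score, 3))
```

Two pages would be called near duplicates when the score exceeds some threshold; choosing that threshold is part of the project.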

Crawl catalog

There currently is no catalog of the "crawls" that make up the Archive's collection. For each (major) crawl, such a catalog would indicate the dates during which data was collected, the ARC files containing the data, and a brief description of the crawl's policies.

A major part of this activity is to track down and ask questions of the people who know about these crawls. (Although this is not a programming task per se, in industry this type of investigation is not unusual: tracking down the original authors of software to find out about the systems they have built.)

(This project should probably be combined with some measurements as suggested above.)

Page-change study

Although the Web is known to be changing, the nature of this change has been little studied, and no one has studied this change over very long periods. The Archive's data provides a unique opportunity to study this topic.

Some related papers:

Table detection

The following paper appeared in the most recent WWW conference:
A Machine Learning Based Approach for Table Detection on The Web
I have recently received a copy of the data set from this paper. Further, at Compaq we developed an (unpublished) set of features that should be very well suited to solving this problem. A nice project would be to combine the Compaq feature set with a suitable learning algorithm and to compare the results with those from the above paper.

Word-burst study

Jon Kleinberg has developed a technique for identifying interesting "bursts" in the appearance of terms (see his web page for this work). It might be interesting to apply this technique to documents in the Archive's collection.

Validating host counts

In class we presented work on estimations of host counts. The Archive data is similar to the Netcraft data, but more comprehensive. It would be interesting to compare host-count inferences from the Archive's data to both the Netcraft and the OCLC results (the Archive's data includes the IP address from which pages were downloaded, which puts us in a good position to understand the impact of virtual hosting).
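A first step might look like the following sketch, which groups host names by the IP address they were downloaded from. The record format (host name, IP address) is an assumption about what could be extracted from the Archive's data, and the hosts and addresses below are placeholders.

```python
from collections import defaultdict


def hosts_per_ip(records):
    """Map each IP address to the set of host names served from it.

    `records` is an iterable of (hostname, ip) pairs; this layout is an
    assumption, not the Archive's actual record format.
    """
    by_ip = defaultdict(set)
    for host, ip in records:
        by_ip[ip].add(host.lower())
    return by_ip


def summarize_hosting(records):
    """Count distinct host names, distinct IPs, and IPs shared by several hosts."""
    by_ip = hosts_per_ip(records)
    names = {host for hosts in by_ip.values() for host in hosts}
    shared = [ip for ip, hosts in by_ip.items() if len(hosts) > 1]
    return {
        "host_names": len(names),
        "ip_addresses": len(by_ip),
        "shared_ips": len(shared),
    }


# Placeholder data using documentation-only names and addresses.
sample = [
    ("www.example.com", "192.0.2.1"),
    ("shop.example.com", "192.0.2.1"),
    ("WWW.example.com", "192.0.2.1"),
    ("www.example.org", "192.0.2.2"),
]
hosting = summarize_hosting(sample)
print(hosting)
```

The gap between the host-name count and the IP count gives a rough picture of how much virtual hosting inflates name-based host counts.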

Mirror detection

A "mirror" of a site is a (near-)duplicate of that site under a different host name (and possibly path prefix). There are many mirrors on the Web, which is problematic for crawling, searching, and other applications.

A number of techniques have been identified for mirror detection (see Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content for some background). This project would be to apply one to the Archive's data. (It would be interesting not only to find duplicates but also to understand their dynamics, e.g., how long do they last, and how "synchronized" do they remain?)
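To make the idea concrete, here is a deliberately simple sketch that flags host pairs whose sets of URL paths overlap heavily. This is only one possible signal and is not necessarily the method of the paper cited above; the threshold, the function names, and the sample hosts are all assumptions. It also ignores path prefixes and page content, which a real detector would have to handle.

```python
from itertools import combinations


def path_overlap(paths_a, paths_b):
    """Jaccard overlap between two sets of URL paths."""
    if not paths_a or not paths_b:
        return 0.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)


def candidate_mirrors(site_paths, threshold=0.8):
    """Return (host_a, host_b, score) for host pairs at or above the threshold.

    `site_paths` maps a host name to the set of paths seen on that host.
    Comparing all pairs is quadratic, so this is only viable on small samples.
    """
    pairs = []
    for host_a, host_b in combinations(sorted(site_paths), 2):
        score = path_overlap(site_paths[host_a], site_paths[host_b])
        if score >= threshold:
            pairs.append((host_a, host_b, score))
    return pairs


# Placeholder hosts and paths, for illustration only.
sites = {
    "www.example.com": {"/", "/docs/a.html", "/docs/b.html", "/faq.html"},
    "mirror.example.org": {"/", "/docs/a.html", "/docs/b.html", "/faq.html"},
    "www.example.net": {"/", "/about.html"},
}
mirrors = candidate_mirrors(sites)
print(mirrors)
```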

3.0 Application infrastructure

Successful projects in this category would be useful to many future researchers.

Language identification

It is important to capture the natural language in which pages are written. There are well-understood techniques for doing this, but the Archive does not currently have such software in house. The goal of this project would be to build a language-identification library that could be used by other projects.
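One of the well-understood techniques is to compare character n-gram statistics of a page against per-language profiles. The sketch below is a toy version of that idea using trigrams and cosine similarity. The two training samples are tiny and made up, so it is only meant to show the shape of such a library; the names trigram_profile, cosine, and identify are my own.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count character trigrams in lower-cased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(profile_a, profile_b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * profile_b[gram] for gram, count in profile_a.items())
    norm_a = math.sqrt(sum(c * c for c in profile_a.values()))
    norm_b = math.sqrt(sum(c * c for c in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training samples; a real library would train on large corpora.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(identify("the dog and the cat"))
print(identify("der hund und die katze"))
```

A usable library would also need to handle character encodings and very short or mixed-language pages, which this sketch ignores.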

Identify soft 404/30x's

A "soft 404" is a "not found" page reported with a 200 response code (e.g., http://www.yahoo.com/blah). Soft 404s are a problem when studying the Web. One typically wants to ignore these pages when taking measurements, because their properties are atypical of regular pages and because there can be lots of them (they are generated by misspellings).

Thus, it would be nice to be able to identify soft 404s automatically, perhaps by using some machine-learning techniques. (Success here would be difficult but significant. I won't penalize a valiant but ultimately unsuccessful attempt!)
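Before reaching for machine learning, a simple baseline worth trying is a probe heuristic: request a URL on the same server that almost certainly does not exist, and see whether the server answers it the same way. The sketch below is that baseline only, not a learning approach. The fetch argument is a hypothetical callable returning (status_code, body), so the logic can be tried without touching the network, and exact body comparison is a crude stand-in for a real similarity test.

```python
import uuid
from urllib.parse import urljoin


def random_probe_url(url):
    """Build a sibling URL with a random name that should not exist."""
    return urljoin(url, uuid.uuid4().hex + ".html")


def looks_like_soft_404(url, fetch):
    """Guess whether `url` is a soft 404.

    `fetch(url)` is a hypothetical callable returning (status_code, body).
    """
    status, body = fetch(url)
    if status != 200:
        return False  # an honest error response, not a soft 404
    probe_status, probe_body = fetch(random_probe_url(url))
    if probe_status != 200:
        return False  # the server reports missing pages properly
    # The page matches what the server returns for a surely missing URL.
    return body == probe_body


def fake_fetch(url):
    """Stand-in server: one real page, a 200 'not found' page otherwise."""
    if url == "http://www.example.com/index.html":
        return 200, "Welcome"
    return 200, "Sorry, we could not find that page."


print(looks_like_soft_404("http://www.example.com/blah", fake_fetch))
print(looks_like_soft_404("http://www.example.com/index.html", fake_fetch))
```

Such a baseline could also supply labeled examples for training a classifier over Archive data, where live probing is not possible for old crawls.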

URL database

A URL database is an extremely important element of a larger set of "feature datamarts." A URL database for the Archive would be immediately useful. At the same time, a URL database for the Archive is particularly challenging because the collection is so large and because it continues to grow. This project would entail a large amount of "software engineering" and relatively little "mathematics."

For background, see:

Link extraction from Javascript

Modern browsers support the inclusion of Javascript code that dynamically modifies, on the client-side, the content of Web pages -- including the links on a page. The use of Javascript in this manner is increasingly prevalent, which is problematic for crawlers, which must be able to extract the links on pages. It is especially problematic for "narrow" crawls, for which missed links are a significant problem.

Most crawlers ignore Javascript; the rest use very simple heuristics to extract links from it. None have attempted to execute the Javascript in an attempt to perform more accurate link extraction.
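For reference, the "very simple heuristics" are roughly of the following kind: scan the script for quoted strings and keep the ones that look like URLs. This sketch is my own illustration of such a baseline, not any particular crawler's code; the patterns and the function name are assumptions. A project that actually executes the Javascript would be measured against something like this.

```python
import re
from urllib.parse import urljoin

# A single- or double-quoted Javascript string literal (escapes allowed).
STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1).)*)\1""")

# Strings that look like links: absolute URLs, rooted paths, or names
# ending in a common page extension. This list is a guess, not a standard.
URL_LIKE = re.compile(r"^(?:https?://\S+|/\S+|\S+\.(?:html?|php|asp|cgi))$", re.I)


def extract_links(script, base_url):
    """Return absolute URLs for URL-like string literals found in a script."""
    links = []
    for match in STRING_LITERAL.finditer(script):
        candidate = match.group(2)
        if URL_LIKE.match(candidate):
            absolute = urljoin(base_url, candidate)
            if absolute not in links:
                links.append(absolute)
    return links


script = (
    'window.location = "/news/index.html"; '
    "window.open('http://www.example.com/popup.html'); "
    'var greeting = "hello";'
)
found = extract_links(script, "http://www.example.com/home/")
print(found)
```

This misses any link assembled at run time (for example by string concatenation), which is exactly the gap that executing the script would close.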

4.0 Warehouse infrastructure

These projects deal with the Warehouse-level infrastructure at the Archive. Like the URL database project, these would entail a large amount of "software engineering" and relatively little "mathematics."

Implementation of a higher-level query processor

The paper Towards web-scale web archaeology proposes a query system for feature extraction meant to replace the "p2" infrastructure now in place at the Archive. A partial prototype has been sketched out but never completed.

The point of this project would be to finish the prototype of the query system and to measure how well it performs on the Archive's infrastructure.

Storage subsystem

The Archive is moving towards a system of file-level RAID-like redundancy. This project would entail prototyping a better system for distributing files and redundancy data within the Archive's storage cluster.
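As a reminder of what "RAID-like" redundancy means at the file level, here is a minimal sketch of single XOR parity across a group of files. This is the textbook scheme and only an assumption about the flavor of redundancy involved; I am not describing the Archive's actual design. The interesting part of the project is where to place the files and the parity data within the cluster, which this sketch does not address.

```python
def xor_parity(blocks):
    """XOR a group of byte strings together, padding shorter ones with zeros."""
    size = max((len(block) for block in blocks), default=0)
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover(surviving_blocks, parity, missing_length):
    """Rebuild the one missing block from the survivors and the parity block."""
    rebuilt = xor_parity(list(surviving_blocks) + [parity])
    return rebuilt[:missing_length]


# Placeholder "files"; real ARC files would be processed in chunks.
files = [b"alpha.arc", b"beta", b"gamma-file"]
parity = xor_parity(files)

# Pretend the second file was lost along with its disk.
restored = recover([files[0], files[2]], parity, len(files[1]))
print(restored)
```

Single parity tolerates the loss of only one file per group, and the length of each file must be recorded separately so that padding can be removed on recovery.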

5.0 Web logs/RSS

Web logs (distributed by "Really Simple Syndication", or RSS) are a fascinating medium of communication. Like the Web, Web logs are published widely, from big institutions to individuals. Indeed, Web logs are becoming easier to publish than Web sites, and thus they are potentially even more inclusive than the Web has been.

Any project in this area is potentially quite interesting. Examples include: