CMPS 290G: Topics in Software Engineering

Winter 2003: Software Engineering Aspects of Web Archeology

Last updated: January 27, 2003

Instructor: Raymie Stata  <raymie@cs.ucsc.edu>
Time: MWF 9:30-10:40
Location: Cowell 222
Office Hours: M 3-5 and by appointment
Syllabus
Project suggestions
Overview of Archive cluster
p2 tutorial

 

New info (Jan 27)

New info (Jan 22)

 

Course summary

Here is a short blurb:
This Winter, Topics in Software Engineering (CMPS 290G) will examine Software Engineering for Web Archeology. We will have guest lecturers from Alta Vista, Google, and elsewhere. There will also be project work based on data at the Internet Archive. We will be some of the first people to study this rich and extensive data set; new and interesting discoveries should be within everyone's reach. I believe students will find the quarter fun, fascinating, and challenging.
Here's a longer description:

Web Archeology is the study of the content (versus the mechanism) of the Web. In the short term, such study has led to Web search engines and related applications. Longer term, such study promises to yield deep insights into the nature of people and society.

The scale of the Web, its dynamic nature, and its uncoordinated organization conspire to make the Web a challenging artifact to study. Successful work in Web Archeology requires the cross-disciplinary cooperation of software engineering, algorithms, applied mathematics, and expertise in various domains (e.g., linguistics and even political science).

In the Winter quarter, 290G will explore multiple aspects of Web Archeology, focusing on principles and techniques for addressing the challenges presented by the Web. Topics will include:

In each of these areas we will consider especially the software-engineering challenges of Web Archeology.

The seminar will feature guest lecturers from Alta Vista, Google, the Internet Archive, and other organizations involved in Web Archeology.

In addition to lectures and weekly readings, class participants will have a unique opportunity to perform project work using the Web collection of the Internet Archive. This collection has only recently been moved from tape to disk and has not been much studied. Students will be expected to design a small experiment or application, perform or build it, and write up the results. Because this collection has not been much studied, strong projects from this course should form the basis for publishable results.

Update: Jan 3

Students have been asking about requirements and expectations.

There are no specific prerequisites; however, familiarity with Unix programming and the Internet (IP, DNS, etc.) is required.

There will be one to two mandatory papers to read each week (except next week). For each of those papers, I'll also recommend supplemental papers for those interested in digging deeper. By Jan 10th, the papers for every week will be posted on the course Web site.

Grades will be based on your project work, specifically a 15-20 page project report due at the end of the quarter. Team projects (2-3 people) are encouraged but not required. From teams, I'll expect more ambitious projects and longer reports. There will be no quizzes or exams. There will be intermediate project milestones to ensure progress is being made, but no grades will be associated with these milestones.

The overall intensity of the course will be dictated by the project you select. I'll require that your project meet a minimum threshold consistent with the number of hours expected for the course; the upper end will be controlled by your ambition.

In today's class we reviewed the material in "Towards web-scale web archaeology." Although a detailed syllabus for the course has not yet been prepared, this paper provides a reasonable summary of the content we'll be covering.