Syllabus for CMPS290G WQ03
Last updated: Feb 10, 2003
In the first two weeks, I plan on presenting background information,
getting you up and running on the Internet Archive's facilities, and
getting the projects identified.
Following that, we'll spend most of the course talking about what
I called "applications" in the first lecture. The first 2-3 weeks of
this will be Web Search. This is an important topic on which there
are many results to review; also, it is a large application within
which there are many interesting sub-applications.
After Web Search, we will look at a handful of other topics,
probably one per week. I'm currently planning on Communities,
Clustering, and Extraction. However, we can explore other topics
based on student interest.
As mentioned in the first class, "features" will be covered along
with the applications that use them. After we cover features and
applications, we will move on to discussing warehousing (storage of
our large page collection) and crawling. There will be a little bit
of time at the end to wrap up and present projects as well.
Summary of classes:
- Wed 3/5. Class in Baskin 318. Visitor: Marc
Najork (Microsoft Research). Topic: Web Crawlers.
- Mon 3/3. Class in Baskin 360. Visitor: Jan
Pedersen (Alta Vista). Topic: Sampling the Web.
- Fri 2/21. Class in Baskin 360. Visitor:
Krishna Bharat (Google). Topic: Mirror detection.
- Mon 2/10. Information
extraction I: manual extraction with the Marais/Kistler Web Language.
- Wed 2/5. Link ranking II:
Hubs, Authorities, and more.
- Mon 2/3. Link ranking I:
PageRank and Markov Processes.
- Fri 1/31. Class in Baskin 360. Visitor:
Mark Manasse (Microsoft Research). Topic: Shingleprinting.
- Wed 1/29. Web-change studies.
- Mon 1/27. Introduction to Web Search. I didn't prepare a
lecture summary, but you may want to look at lecture
eleven of the Manning, Raghavan, and Schutze class on information
retrieval.
- Wed 1/22. Introduction to
"classic" information retrieval.
- MANDATORY READING for Friday 1/24: Salton, Wong, Yang,
A vector space model for automatic indexing, CACM 18(11),
Nov. 1975. This is a (the?) classic in the IR world. Be
prepared to discuss on Friday. (Instructions for
ACM Portal access.)
- Optional reading: Kalt (and Croft?), A new probabilistic
model of text classification and retrieval, CIIR-TR-78 (1996).
- Some books on IR:
- I. Witten, A. Moffat, T. Bell. Managing Gigabytes (2nd ed).
Morgan Kaufmann, 1999. Very practical introduction to
"hands-on" IR.
- K. Sparck Jones and P. Willett, eds. Readings in Information
Retrieval. Morgan Kaufmann, 1997. Great collection of "classic"
papers.
- R. Baeza-Yates and B. Ribeiro-Neto, eds. Modern Information
Retrieval. Addison-Wesley/ACM Press, 1999.
- Fri 1/17. Lab. The following are useful:
- Mon 1/13. Continuation of
project discussion, plus p2 discussion.
- See the list of project suggestions (you may make up your own
if you wish).
- Fri 1/10.
Description of the Archive's infrastructure. (We started talking
about "p2" and projects, but postponed the discussion for lack of
time.)
- Wed 1/08. Discussion of
measurements on the Web. (Lots of references, slides have some, more
are coming.)
- Mon 1/06. Overview of the
basic technologies making up the Web (HTTP, URIs, and HTML). The
following readings are recommended (not required):
- Fri 1/03. Overview of Library Mining and Web Archeology
(summary of material to be presented throughout the course). Much of
this material can be found in Towards
web-scale web archaeology.