Sie sind auf Seite 1von 2

10/04/2018 Berthier Ribeiro-Neto

Information Retrieval---Search Engines in the Web


This is a graduate level course on Information Retrieval with emphasis on search engines in the
Web. The goal of the course is to guide the students in a journey---a journey aimed at the design,
implementation, and testing of a Web search engine. This search engine should be able to receive
and answer queries posed in the domain of the Brazilian Web.

While the design and conception of the various modules that compose the engine---crawler,
indexer, and query processor---will be discussed in class, the students will be responsible for
implementing them in C++ and validating their implementations during the course. Additionally, the
students will explore techniques of machine learning to improve the ranking of the results.
Evaluation will consist of three implementation assignments (crawler, indexer, and query
processor) and two exams.

This is a rigorous and intensive course with a non-trivial load of theoretical material and practical
tasks in the form of coding in C++. At the end of the course, the students should gain a technical
understanding of how a search engine works and the challenges at the frontier of the technology---
they should understand why search is hard.

Required Textbook

Ricardo Baeza-Yates & Berthier Ribeiro-Neto. Modern Information Retrieval. Pearson, 2011, 2nd
Edition.
Slides can be found here.

Other Useful References

1. Ian Witten, Alistair Moffat, Timothy Bell. Managing Gigabytes. Morgan Kauffman, 1999, 2nd
Edition.
2. Christopher Manning, Prabakhar Raghavan, Hinrich Schutze. Introduction to Information
Retrieval. Cambridge University Press, 2008.
3. Bruce Croft, Donald Metzer, Trevor Strohman. Search Engines: Information Retrieval in
Practice. Pearson, 2009.
4. Charles Clarke, Gordon Cormack, Stefan Buttcher. Information Retrieval: Implementing and
Evaluating Search Engines. MIT Press, 2016.
5. Gerard Salton. Introduction to Modern Information Retrieval. McGraw Hill, 1983.

Calendar: Classes, Assignments, Tests


Mar 5 Introduction: the e-publishing era, basic architecture of a search engine (Chapter 1)
Crawling the Web: history, taxonomy of crawlers, architecture and implementation,
Mar 7
scheduling of URLs (Chapter 12, slides 1-44)
Crawling the Web: page events, freshness, exclusion protocol, efficiency, evaluation
(Chapter 12, slides 45-91)
Mar 12
Assignment 1: Implementation of a Crawler (20 points)
Due Date: Apr 8th, 23:59pm, BR time
Penalty for late delivery: 20% by Apr 15 and 40% by Apr 22
Mar 14 Modeling: basic concepts, Boolean Model, TF-IDF weights (Chapter 3, slides 1-48)
Modeling: document normalization, Vector Model, Link Analysis, Probabilistic Model
Mar 19
(Chapter 3, slides 49-86 and Chapter 11, slides 70-85)
Modeling: Set-based Model, Generalized Vector Model, BM25 (Chapter 3, slides 89-107,
Mar 21
133-144, 164-176)
Mar 26 Modeling: Language models, Belief Network models (Chapter 3, slides 177-205, 219-247)
Mar 28 Indexing: basics, inverted indexes, compressed inverted indexes (slides 1-48)
Apr 2 RA assistantship for Assignment 1.
Apr 4 RA assistantship for Assignment 1.
http://homepages.dcc.ufmg.br/~berthier/ir-2018.htm 1/2
10/04/2018 Berthier Ribeiro-Neto

Apr 9 Indexing: implementation of compressed inverted indexes (slides: indexing, vocabulary,


compression)

Assignment 2: Implementation of a Compressed Index (20 points).


Due Date: May 6th, 23:59pm, BR time.
Penalty for late delivery: 20% by May 13 and 40% by May 20.
Apr 11 Test I (20 points)
Retrieval Evaluation: basics, precision-recall, single value summaries, user evaluation
Apr 16
(Chapter 4, slides 1-41)
Retrieval Evaluation: DCG, Bpref, rank correlation metrics, Spearman, Kendall-Tau
Apr 18
(Chapter 4, slides 42-93)
Retrieval Evaluation: reference collections, TREC, OSHUMED, INEX, user based
Apr 23
evaluation, side-by-side, A/B testing, clickthrough data (Chapter 4, slides 94-144)
Apr 25 Relevance Feedback (RF): basics, explicit RF, Rocchio, clicks (Chapter 5, slides 1-53)
Relevance Feedback (RF): implicit RF, local analysis, global analysis (Chapter 5, slides
Apr 30
54-104)
Text Classification: basics, machine learning, unsupervised learning, supervised learning,
May 2
decision trees (Chapter 8, slides 1-51)
May 7 RA assistantship for assignment 2.
May 9 RA assistantship for assignment 2.
Text Classification: KNN, Rocchio, Naive Bayes, SVM, Ensemble, Stacking (Chapter 8,
slides 52-111)
May 14
Assignment 3: Implementation of a Query Processor with Page Rank (20 points).
Due Date: Jun 17th, 23:59pm, BR time.
Hard deadline: assignments not accepted after due date.
Text Classification: feature selection, tf-idf, mutual information, information gain, chi-
May 16 square, evaluation metrics, precision-recall, F-measures, cross-validation, standard
collections (Chapter 8, slides 112-157)
Web Retrieval: characterizing the Web, the Web graph, modeling the Web, link analysis,
May 21
search engine architectures, caching (Chapter 11, slides 1-50)
Web Retrieval: multiple indices, multi-site architecture, search engine ranking signals,
May 23
learning to rank, clickthrough data, Web spam (Chapter 11, slides 51-69, 86-98)
Web Retrieval: managing Web data, document identifiers, metadata, compressing the
Web graph, duplicated data, user interaction, user interfaces, the retangle search box,
May 28
title-snippet-url entity, universal search, spelling assistance, query recommendations,
query refinements, actionable results (Chapter 11, slides 99-163)
May 30 Test II (20 points)
Jun 4 RA assistantship for assignment 3.
Jun 6 RA assistantship for assignment 3.
Jun 11 RA assistantship for assignment 3.
Jun 13 RA assistantship for assignment 3.
Jun 18 No class.
Jun 20 Discussion, evaluation of course, final results.

http://homepages.dcc.ufmg.br/~berthier/ir-2018.htm 2/2

Das könnte Ihnen auch gefallen