You are on page 1of 3

----------------------------------------------------------------------------------------------------------------

CS240B
Tuesday and Thursday 14:00--15:50
Fall 2018
Fall 2018

CS240B:ADVANCED DATA BASES and KNOWLEDGE BASES

Data Stream Management Systems:


Syllabus Supporting Data Stream Mining Applications
Notes
Instructor: Carlo Zaniolo
Office Hours: Tuesdays and Thursdays, 16:00--17:00.
Home

Talks First Assignement (Due on Tuesday Oct. 9) Some answers for 2.1, 2.2
Study the following papers:

B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom: Models and Issues in Data
Stream Systems. PODS 2002.
SQL-MR.
New SQL OLAP Functions for everyone, by K.M. Guion.

Complete the following tasks and turn them in via ccle:

Task 1.1: Write a four page (about) report on SQL-MR focusing on the things that you can
do with SQL-MR that cannot be done in SQL-TS,
Task 1.2: Some of the new OLAP functions support windows others do not. List those in the
two groups, and for those in the second group, suggest a general implementation for (i)
unlimited preceding windows, and (ii) physical windows.

-------------------------------------------------------------------------------------------------

Second Assignement (Due on Tuesday Oct. 16) Answers for 2.1, and 2.2.
Study the following papers:

Y. Bai, H. Thakkar, C. Luo, H. Wang and CZaniolo: A Data Stream Language and
System Designed for Power and Extensibility. (CIKM'06), Arlington, Virginia,
USA,
H. Thakkar, N. Laptev, H. Mousavi, B. Mozafari, V. Russo and C. Zaniolo: SMM: A
data stream management system for Knowledge Discovery. ICDE 2011.

Complete the following tasks and turn them in via ccle:

Task 2.1: Using a syntax based on that of notes and the two reference above, express a user-
defined aggregate d_count to perform the exact count of distinct values in a window on a data
stream. Your window aggregate could, e.g., be called as follows:
SELECT col_name1, d_count(col_name2) OVER (ROWS 99999 PRECEDING) FROM
my_stream
Task 2.2: Using the same syntax, write a UDA that implements the RANK function (without
window or with a unlimited-preceding window). Then extend this definition to support
windows.

------------------------------------------------------------------------------------------------

Third Assignement (Due on Tuesday, Oct. 23) Answers for 3.1, 3.2, and 3.3.

Study the following papers:

Yijian Bai and Carlo Zaniolo: Minimizing Latency and Memory in DSMS: A Unified
Approach to Quasi-Optimal Scheduling . The Second International Workshop on Scalable
Stream Processing Systems, March 29, 2008, Nantes, France.
Y. Bai, H. Thakkar, H. Wang, and C. Zaniolo. Time-stamp Management and
QueryExecution in Data Stream Management Systems. IEEE Internet Computing, 12(6):
2008.
Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining Data Streams - The Stanford
University InfoLab. You only need to study sections 4.3 and 4.4. But the rest is also
interesting. http://infolab.stanford.edu/~ullman/mmds/ch4.pdf

Complete the following Tasks:

Task 3.1: We showed that binary operators, such as union and joins, are subject to idle-
wating problem. But this problem can also occur in some unary operators. Say for instance
that an aggregate is called on a time-stamped tumble window (or on a timestamped window
with slides). Discuss this situation, its idle-waiting problem, and how it can be solved (using
concepts and notation from the paper by Bai and Zaniolo above ).

Task 3.2: Propose a simple generalization of the chain algorithm to minimize memory for
the following situations:
(A) Two or more simple paths without forks,

Task 3.3: Write the window UDA to compute the standard deviation on a data stream.
Explain the math formulas you use in the expire of the window UDA---you might want to
consult textbooks or the web to find the best formula.

-------------------------------------------------------------------------------------------------

Fourth Assignement (Due on Tuesday Oct. 30) Answers for 4.1, and 4.2.

Study the following papers:


Study: Chapter 3: A Survey of Classification Methods in Data Streams) from DATA
STREAMS: MODELS AND ALGORITHMS DATA STREAMS. CHARU C.
AGGARWAL editor, Kluwer Academic Publishers.
Chapter 3 is at pages 39-60 of the following pdf document.

N. Laptev et al. : Extendin Relational query Languages for Data Streams. pp. 361--386 of
Data StreamManagement--Processing High-Speed Data Streams. Springer, 2017.

Complete the following Tasks:


Task 4.1: Express the Flajolet-Martin's distinct_count sketch as a user-defined aggregate
mamed dcount_sketch, to be called in the same way as d_count. You can assume that you
have available a function LmostbitH(X) that return K, where the K position contains a 1,
whereas all the position to its right are zeros, for the value returned by a randomizing hash
function H(X).
Task 4.2: Assume that you have a stream of temperature readings temperature(Celsius
Integer) that start everyday at time 00:01 and end at time 23:59. At the end of each day, we
want to have 10,000 temperature samples stored into a table tenKsamples(Rowno integer,
Celsius Integer). We do not know how many temperature readings are going to arrive every day,
except that their number is significantly larger than 10,000. Please write a UDA that uses
the reservoiralgorithm to populate tenKsamples(Rowno , Celsius) with 10,000 random samples
taken fromtemperature(Celsius Integer), which is then processed and reset to empty at
midnight. You can assume that the system support a function random(K), which given a positive
integer K returns a random integer between 1 and K.

-----------------------------------------------------------------------------------------------------

Project Assigned on October 30--Due on November 20 (strict deadline).

-----------------------------------------------------------------------------------------------------

Midterm : Tuesday November 6 (Practice Midterm)

-----------------------------------------------------------------------------------------------------

Presentation: November 8--December 6. Students should pair-up.

-----------------------------------------------------------------------------------------------------

Research Report (as your takhom final) Due on or before Dec. 14:
Done individually or by pairs.

-----------------------------------------------------------------------------------------------------

Holidays