
UNIT- 5

Database System Concepts, 6th Ed.


Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use

Centralized Systems
Run on a single computer system and do not interact with other computer systems.

General-purpose computer system: one to a few CPUs and a number of device controllers that are connected through a common bus that provides access to shared memory.

Single-user system (e.g., personal computer or workstation): desktop unit, single user, usually has only one CPU and one or two hard disks; the OS may support only one user.

Multi-user system: more disks, more memory, multiple CPUs, and a multi-user OS. Serves a large number of users who are connected to the system via terminals. Often called server systems.


A Centralized Computer System


Client-Server Systems
Server systems satisfy requests generated at m client systems, whose general structure is shown below:


Client-Server Systems (Cont.)


Database functionality can be divided into:

Back-end: manages access structures, query evaluation and optimization, concurrency control and recovery.
Front-end: consists of tools such as forms, report-writers, and graphical user interface facilities.

The interface between the front-end and the back-end is through SQL or through an application program interface.


Client-Server Systems (Cont.)


Advantages of replacing mainframes with networks of workstations or personal computers connected to back-end server machines:

    better functionality for the cost
    flexibility in locating resources and expanding facilities
    better user interfaces
    easier maintenance


Server System Architecture


Server systems can be broadly categorized into two kinds:

    transaction servers, which are widely used in relational database systems, and
    data servers, used in object-oriented database systems


Transaction Servers
Also called query server systems or SQL server systems

    Clients send requests to the server
    Transactions are executed at the server
    Results are shipped back to the client.

Requests are specified in SQL, and communicated to the server through a remote procedure call (RPC) mechanism.

Transactional RPC allows many RPC calls to form a transaction.

Open Database Connectivity (ODBC) is a C language application program interface standard from Microsoft for connecting to a server, sending SQL requests, and receiving results.

JDBC standard is similar to ODBC, for Java


Transaction Server Process Structure


A typical transaction server consists of multiple processes accessing data in shared memory.

Server processes

    These receive user queries (transactions), execute them and send results back
    Processes may be multithreaded, allowing a single process to execute several user queries concurrently
    Typically multiple multithreaded server processes

Lock manager process

    More on this later

Database writer process

    Output modified buffer blocks to disks continually


Transaction Server Processes (Cont.)


Log writer process

    Server processes simply add log records to log record buffer
    Log writer process outputs log records to stable storage.

Checkpoint process

    Performs periodic checkpoints

Process monitor process

    Monitors other processes, and takes recovery actions if any of the other processes fail
        E.g., aborting any transactions being executed by a server process and restarting it


Transaction System Processes (Cont.)


Shared memory contains shared data

    Buffer pool
    Lock table
    Log buffer
    Cached query plans (reused if same query submitted again)

All database processes can access shared memory

To ensure that no two processes are accessing the same data structure at the same time, database systems implement mutual exclusion using either

    Operating system semaphores
    Atomic instructions such as test-and-set

To avoid overhead of interprocess communication for lock request/grant, each database process operates directly on the lock table instead of sending requests to lock manager process

    Lock manager process still used for deadlock detection
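
The idea of workers operating directly on a shared lock table under mutual exclusion can be sketched in Python. This is only an illustration under stated assumptions: threads stand in for database processes, `threading.Lock` stands in for an OS semaphore or test-and-set latch, and the `LockTable` class, the item name, and the transaction ids are hypothetical. Only exclusive locks are modeled, and there is no deadlock detection.

```python
import threading

class LockTable:
    """Toy shared lock table: maps a data item to the transaction holding it."""

    def __init__(self):
        # Latch protecting the table itself (stand-in for a semaphore
        # or a test-and-set based spinlock).
        self._latch = threading.Lock()
        self._owner = {}

    def try_lock(self, item, txn):
        # Each worker manipulates the table directly; no message is
        # sent to a separate lock manager.
        with self._latch:
            holder = self._owner.get(item)
            if holder is None or holder == txn:
                self._owner[item] = txn
                return True
            return False

    def unlock(self, item, txn):
        with self._latch:
            if self._owner.get(item) == txn:
                del self._owner[item]

table = LockTable()
granted = []

def worker(txn):
    if table.try_lock("account-42", txn):
        granted.append(txn)

threads = [threading.Thread(target=worker, args=(f"T{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight workers compete for the same item, and the latch ensures that exactly one of them is recorded as the holder.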

Data Servers
Used in high-speed LANs, in cases where

    The clients are comparable in processing power to the server
    The tasks to be executed are compute intensive.

Data are shipped to clients where processing is performed, and then results are shipped back to the server.

This architecture requires full back-end functionality at the clients.

Used in many object-oriented database systems

Issues:

    Page-Shipping versus Item-Shipping
    Locking
    Data Caching
    Lock Caching


Data Servers (Cont.)


Page-shipping versus item-shipping

    Smaller unit of shipping means more messages
    Worth prefetching related items along with requested item
    Page shipping can be thought of as a form of prefetching

Locking

    Overhead of requesting and getting locks from server is high due to message delays
    Can grant locks on requested and prefetched items; with page shipping, transaction is granted lock on whole page.
    Locks on a prefetched item can be called back by the server, and returned by client transaction if the prefetched item has not been used.
    Locks on the page can be de-escalated to locks on items in the page when there are lock conflicts. Locks on unused items can then be returned to server.


Data Servers (Cont.)


Data Caching

    Data can be cached at client even in between transactions
    But check that data is up-to-date before it is used (cache coherency)
    Check can be done when requesting lock on data item

Lock Caching

    Locks can be retained by client system even in between transactions
    Transactions can acquire cached locks locally, without contacting server
    Server calls back locks from clients when it receives conflicting lock request. Client returns lock once no local transaction is using it.
    Similar to de-escalation, but across transactions.


Parallel Systems
Parallel database systems consist of multiple processors and multiple disks connected by a fast interconnection network.

A coarse-grain parallel machine consists of a small number of powerful processors

A massively parallel or fine-grain parallel machine utilizes thousands of smaller processors.

Two main performance measures:

    throughput --- the number of tasks that can be completed in a given time interval
    response time --- the amount of time it takes to complete a single task from the time it is submitted


Speed-Up and Scale-Up


Speedup: a fixed-sized problem executing on a small system is given to a system which is N-times larger.

    Measured by:
    \[ \text{speedup} = \frac{\text{small system elapsed time}}{\text{large system elapsed time}} \]
    Speedup is linear if equation equals N.

Scaleup: increase the size of both the problem and the system

    N-times larger system used to perform N-times larger job
    Measured by:
    \[ \text{scaleup} = \frac{\text{small system small problem elapsed time}}{\text{big system big problem elapsed time}} \]
    Scaleup is linear if equation equals 1.
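
The two ratios are easy to compute from measured elapsed times. The sketch below is only an illustration: the function names and the timings are made up, not measurements from any real system.

```python
def speedup(small_system_time, large_system_time):
    """Same problem on both systems; linear speedup equals N."""
    return small_system_time / large_system_time

def scaleup(small_system_small_problem_time, big_system_big_problem_time):
    """Problem and system both grown N times; linear scaleup equals 1."""
    return small_system_small_problem_time / big_system_big_problem_time

N = 4                                   # the larger system has 4x the resources
linear_speedup = speedup(100.0, 25.0)   # the ideal case: equals N
sublinear_speedup = speedup(100.0, 40.0)
linear_scaleup = scaleup(100.0, 100.0)  # the ideal case: equals 1
sublinear_scaleup = scaleup(100.0, 125.0)
```

With these numbers the sublinear cases give a speedup of 2.5 on a system that is 4 times larger, and a scaleup of 0.8.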


Speedup


Scaleup


Batch and Transaction Scaleup


Batch scaleup:

    A single large job; typical of most decision support queries and scientific simulation.
    Use an N-times larger computer on N-times larger problem.

Transaction scaleup:

    Numerous small queries submitted by independent users to a shared database; typical transaction processing and timesharing systems.
    N-times as many users submitting requests (hence, N-times as many requests) to an N-times larger database, on an N-times larger computer.
    Well-suited to parallel execution.


Factors Limiting Speedup and Scaleup


Speedup and scaleup are often sublinear due to:

Startup costs: Cost of starting up multiple processes may dominate computation time, if the degree of parallelism is high.

Interference: Processes accessing shared resources (e.g., system bus, disks, or locks) compete with each other, thus spending time waiting on other processes, rather than performing useful work.

Skew: Increasing the degree of parallelism increases the variance in service times of tasks executing in parallel. Overall execution time is determined by the slowest of the tasks executing in parallel.
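
The effect of skew can be illustrated with a small calculation. This sketch assumes an idealized model with no startup cost or interference, where each task runs on its own processor; the task times are invented for illustration.

```python
def parallel_elapsed(task_times):
    # All tasks start together, so the job finishes when the slowest one does.
    return max(task_times)

balanced = [25.0, 25.0, 25.0, 25.0]   # 100 units of work split evenly
skewed = [10.0, 15.0, 20.0, 55.0]     # the same 100 units, split unevenly

serial_time = sum(balanced)           # time on a single processor
speedup_balanced = serial_time / parallel_elapsed(balanced)
speedup_skewed = serial_time / parallel_elapsed(skewed)
```

On four processors the balanced split reaches the linear speedup of 4, while the skewed split stays below 2 because one task takes 55 time units.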


Parallel Database Architectures


Shared memory -- processors share a common memory
Shared disk -- processors share a common disk
Shared nothing -- processors share neither a common memory nor common disk
Hierarchical -- hybrid of the above architectures


Shared Memory
Processors and disks have access to a common memory, typically via a bus or through an interconnection network.

Extremely efficient communication between processors: data in shared memory can be accessed by any processor without having to move it using software.

Downside: architecture is not scalable beyond 32 or 64 processors since the bus or the interconnection network becomes a bottleneck

Widely used for lower degrees of parallelism (4 to 8).


Shared Disk
All processors can directly access all disks via an interconnection network, but the processors have private memories.

    The memory bus is not a bottleneck
    Architecture provides a degree of fault-tolerance: if a processor fails, the other processors can take over its tasks since the database is resident on disks that are accessible from all processors.

Examples: IBM Sysplex and DEC clusters (now part of Compaq) running Rdb (now Oracle Rdb) were early commercial users

Downside: bottleneck now occurs at interconnection to the disk subsystem.

Shared-disk systems can scale to a somewhat larger number of processors, but communication between processors is slower.


Shared Nothing
Node consists of a processor, memory, and one or more disks. Processors at one node communicate with another processor at another node using an interconnection network. A node functions as the server for the data on the disk or disks the node owns.

Examples: Teradata, Tandem, Oracle-n CUBE

Data accessed from local disks (and local memory accesses) do not pass through interconnection network, thereby minimizing the interference of resource sharing.

Shared-nothing multiprocessors can be scaled up to thousands of processors without interference.

Main drawback: cost of communication and non-local disk access; sending data involves software interaction at both ends.


Hierarchical
Combines characteristics of shared-memory, shared-disk, and shared-nothing architectures.

Top level is a shared-nothing architecture: nodes connected by an interconnection network, and do not share disks or memory with each other.

Each node of the system could be a shared-memory system with a few processors.

Alternatively, each node could be a shared-disk system, and each of the systems sharing a set of disks could be a shared-memory system.

Reduce the complexity of programming such systems by distributed virtual-memory architectures

    Also called non-uniform memory architecture (NUMA)


Hybrid architecture
Hybrid architecture includes:

Non-Uniform Memory Architecture (NUMA), which involves Non-Uniform Memory Access.

Cluster (shared nothing + shared disk: SAN/NAS), which is formed by a group of connected computers.

Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than nonlocal memory (memory local to another processor or memory shared between processors).

NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures.


XML
Using an example, explain the distinction between an attribute and a subelement. Explain the purpose and use of namespaces.

Give the DTD for an XML representation of the following nested-relational schema.

    Emp = (ename, ChildrenSet setof(Children), SkillsSet setof(Skills))
    Children = (name, Birthday)
    Birthday = (day, month, year)
    Skills = (type, ExamsSet setof(Exams))
    Exams = (year, city)

Explain the limitations of DTD. Describe the alternative to overcome this limitation.


Introduction
XML: Extensible Markup Language

Defined by the WWW Consortium (W3C)

Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML

Documents have tags giving extra information about sections of the document

    E.g. <title> XML </title> <slide> Introduction </slide>

Extensible, unlike HTML

    Users can add new tags, and separately specify how the tag should be handled for display


Comparison with Relational Data


Inefficient: tags, which in effect represent schema information, are repeated

Better than relational tuples as a data-exchange format

    Unlike relational tuples, XML data is self-documenting due to presence of tags
    Non-rigid format: tags can be added
    Allows nested structures
    Wide acceptance, not only in database systems, but also in browsers, tools, and applications


Structure of XML Data


Tag: label for a section of data

Element: section of data beginning with <tagname> and ending with matching </tagname>

Elements must be properly nested

    Proper nesting
        <course> <title> . </title> </course>
    Improper nesting
        <course> <title> . </course> </title>
    Formally: every start tag must have a unique matching end tag, that is in the context of the same parent element.

Every document must have a single top-level element
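
These well-formedness rules can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative choices, not part of the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

proper = "<course> <title> Databases </title> </course>"
improper = "<course> <title> Databases </course> </title>"  # tags overlap
two_roots = "<course/> <course/>"                           # no single top-level element

results = {
    "proper": is_well_formed(proper),
    "improper": is_well_formed(improper),
    "two_roots": is_well_formed(two_roots),
}
```

Only the properly nested document with a single top-level element is accepted.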


Structure of XML Data (Cont.)


Mixture of text with sub-elements is legal in XML.

    Example:
        <course>
            This course is being offered for the first time in 2009.
            <course_id> BIO-399 </course_id>
            <title> Computational Biology </title>
            <dept_name> Biology </dept_name>
            <credits> 3 </credits>
        </course>
    Useful for document markup, but discouraged for data representation


Attributes
Elements can have attributes

    <course course_id="CS-101">
        <title> Intro. to Computer Science </title>
        <dept_name> Comp. Sci. </dept_name>
        <credits> 4 </credits>
    </course>

Attributes are specified by name=value pairs inside the starting tag of an element

An element may have several attributes, but each attribute name can only occur once

    <course course_id="CS-101" credits="4">


Attributes vs. Subelements


Distinction between subelement and attribute

    In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents
    In the context of data representation, the difference is unclear and may be confusing
        Same information can be represented in two ways
            <course course_id="CS-101"> </course>
            <course> <course_id>CS-101</course_id> </course>
    Suggestion: use attributes for identifiers of elements, and use subelements for contents
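
The two representations are read differently by application code. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample documents are small illustrative variants of the course example.

```python
import xml.etree.ElementTree as ET

# Identifier stored as an attribute of the element.
as_attribute = ET.fromstring(
    '<course course_id="CS-101"><title>Intro. to Computer Science</title></course>'
)
# The same identifier stored as a subelement.
as_subelement = ET.fromstring(
    '<course><course_id>CS-101</course_id>'
    '<title>Intro. to Computer Science</title></course>'
)

id_from_attribute = as_attribute.get("course_id")            # attribute lookup
id_from_subelement = as_subelement.findtext("course_id")     # child element text
title = as_attribute.findtext("title")
```

Both lookups yield the same identifier; only the access path differs.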


Namespaces
XML data has to be exchanged between organizations

Same tag name may have different meaning in different organizations, causing confusion on exchanged documents

Specifying a unique string as an element name avoids confusion

Better solution: use unique-name:element-name

Avoid using long unique names all over document by using XML Namespaces

    <university xmlns:yale="http://www.yale.edu">
        <yale:course>
            <yale:course_id> CS-101 </yale:course_id>
            <yale:title> Intro. to Computer Science </yale:title>
            <yale:dept_name> Comp. Sci. </yale:dept_name>
            <yale:credits> 4 </yale:credits>
        </yale:course>
    </university>
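
A parser resolves the short prefix to the full namespace name. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the example; the local prefix `y` in the lookup map is an arbitrary choice, which shows that the prefix itself carries no meaning.

```python
import xml.etree.ElementTree as ET

doc = """
<university xmlns:yale="http://www.yale.edu">
  <yale:course>
    <yale:course_id> CS-101 </yale:course_id>
    <yale:title> Intro. to Computer Science </yale:title>
  </yale:course>
</university>
"""

root = ET.fromstring(doc)

# Only the namespace URI has to match the document, not the prefix.
ns = {"y": "http://www.yale.edu"}
course = root.find("y:course", ns)
course_id = course.findtext("y:course_id", namespaces=ns).strip()

# ElementTree stores the expanded name: {namespace-URI}local-name
expanded_tag = course.tag
```

The element found through the prefix `y` is the one written as `yale:course` in the document.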


XML Document Schema


Database schemas constrain what information can be stored, and the data types of stored values

XML documents are not required to have an associated schema

However, schemas are very important for XML data exchange

    Otherwise, a site cannot automatically interpret data received from another site

Two mechanisms for specifying XML schema

    Document Type Definition (DTD)
        Widely used
    XML Schema
        Newer, increasing use


Document Type Definition (DTD)


The type of an XML document can be specified using a DTD

DTD constrains structure of XML data

    What elements can occur
    What attributes can/must an element have
    What subelements can/must occur inside each element, and how many times.

DTD does not constrain data types

    All values represented as strings in XML

DTD syntax

    <!ELEMENT element (subelements-specification) >
    <!ATTLIST element (attributes) >


Element Specification in DTD

Subelements can be specified as

    names of elements, or
    #PCDATA (parsed character data), i.e., character strings
    EMPTY (no subelements) or ANY (anything can be a subelement)

Example

    <!ELEMENT department (dept_name, building, budget)>
    <!ELEMENT dept_name (#PCDATA)>
    <!ELEMENT budget (#PCDATA)>

Subelement specification may have regular expressions

    <!ELEMENT university ( ( department | course | instructor | teaches )+)>

    Notation:
        "|" - alternatives
        "+" - 1 or more occurrences
        "*" - 0 or more occurrences
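
Because a subelement specification is a regular expression over child element names, it can be mimicked with an ordinary regex. The sketch below is a simplification, not a DTD validator: the `CONTENT_MODELS` table is hand-translated from the two declarations above, and `children_match` is a hypothetical helper that ignores text content and attributes.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regexes over space-terminated child tag names.
CONTENT_MODELS = {
    "university": r"((department|course|instructor|teaches) )+",
    "department": r"dept_name building budget ",
}

def children_match(element):
    """Check the sequence of child tags against the element's content model."""
    model = CONTENT_MODELS[element.tag]
    sequence = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, sequence) is not None

ok = ET.fromstring("<department><dept_name/><building/><budget/></department>")
wrong_order = ET.fromstring("<department><building/><dept_name/><budget/></department>")
mixed = ET.fromstring("<university><course/><department/><course/></university>")
empty = ET.fromstring("<university></university>")

checks = [children_match(e) for e in (ok, wrong_order, mixed, empty)]
```

The department model fixes the order of its children, while the university model accepts any non-empty mix of the four alternatives.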


University DTD
<!DOCTYPE university [
    <!ELEMENT university ( (department|course|instructor|teaches)+)>
    <!ELEMENT department ( dept_name, building, budget)>
    <!ELEMENT course ( course_id, title, dept_name, credits)>
    <!ELEMENT instructor (IID, name, dept_name, salary)>
    <!ELEMENT teaches (IID, course_id)>
    <!ELEMENT dept_name ( #PCDATA )>
    <!ELEMENT building ( #PCDATA )>
    <!ELEMENT budget ( #PCDATA )>
    <!ELEMENT course_id ( #PCDATA )>
    <!ELEMENT title ( #PCDATA )>
    <!ELEMENT credits ( #PCDATA )>
    <!ELEMENT IID ( #PCDATA )>
    <!ELEMENT name ( #PCDATA )>
    <!ELEMENT salary ( #PCDATA )>
]>


Attribute Specification in DTD


Attribute specification: for each attribute

    Name
    Type of attribute
        CDATA
        ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
            more on this later
    Whether
        mandatory (#REQUIRED)
        has a default value (value),
        or neither (#IMPLIED)

Examples

    <!ATTLIST course course_id CDATA #REQUIRED>, or
    <!ATTLIST course
        course_id ID #REQUIRED
        dept_name IDREF #REQUIRED
        instructors IDREFS #IMPLIED >


IDs and IDREFs


An element can have at most one attribute of type ID

The ID attribute value of each element in an XML document must be distinct

    Thus the ID attribute value is an object identifier

An attribute of type IDREF must contain the ID value of an element in the same document

An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must contain the ID value of an element in the same document
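
The two constraints (distinct IDs, and references that resolve inside the same document) can be checked with a short script. This is a sketch under assumptions: the attribute typing is hard-coded in three hypothetical tables instead of being read from a DTD, and the sample documents and ID values are invented.

```python
import xml.etree.ElementTree as ET

# Assumed attribute typing per element name (a real validator reads the DTD).
ID_ATTRS = {"department": "dept_name", "course": "course_id", "instructor": "IID"}
IDREF_ATTRS = {"course": ["dept_name"], "instructor": ["dept_name"]}
IDREFS_ATTRS = {"course": ["instructors"]}

def check_ids(root):
    errors, ids = [], set()
    # Pass 1: every ID value must be distinct within the document.
    for el in root.iter():
        attr = ID_ATTRS.get(el.tag)
        if attr and attr in el.attrib:
            value = el.attrib[attr]
            if value in ids:
                errors.append(f"duplicate ID {value}")
            ids.add(value)
    # Pass 2: every IDREF / IDREFS value must name an existing ID.
    for el in root.iter():
        refs = [el.attrib[a] for a in IDREF_ATTRS.get(el.tag, []) if a in el.attrib]
        for a in IDREFS_ATTRS.get(el.tag, []):
            refs.extend(el.attrib.get(a, "").split())  # space-separated list
        errors.extend(f"dangling reference {r}" for r in refs if r not in ids)
    return errors

good = ET.fromstring(
    '<university>'
    '<department dept_name="CompSci"/>'
    '<instructor IID="I10101" dept_name="CompSci"/>'
    '<course course_id="CS-101" dept_name="CompSci" instructors="I10101"/>'
    '</university>'
)
bad = ET.fromstring(
    '<university>'
    '<department dept_name="CompSci"/>'
    '<course course_id="CS-101" dept_name="CompSci" instructors="I99999"/>'
    '</university>'
)
good_errors = check_ids(good)
bad_errors = check_ids(bad)
```

The first document passes; the second is rejected because `I99999` does not identify any element.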


University DTD with Attributes


University DTD with ID and IDREF attribute types.

<!DOCTYPE university-3 [
    <!ELEMENT university ( (department|course|instructor)+)>
    <!ELEMENT department ( building, budget )>
    <!ATTLIST department
        dept_name ID #REQUIRED >
    <!ELEMENT course (title, credits )>
    <!ATTLIST course
        course_id ID #REQUIRED
        dept_name IDREF #REQUIRED
        instructors IDREFS #IMPLIED >
    <!ELEMENT instructor ( name, salary )>
    <!ATTLIST instructor
        IID ID #REQUIRED
        dept_name IDREF #REQUIRED >
    declarations for title, credits, building, budget, name and salary
]>


Limitations of DTDs
No typing of text elements and attributes

    All values are strings, no integers, reals, etc.

Difficult to specify unordered sets of subelements

    Order is usually irrelevant in databases (unlike in the document-layout environment from which XML evolved)
    (A | B)* allows specification of an unordered set, but
        Cannot ensure that each of A and B occurs only once

IDs and IDREFs are untyped

    The instructors attribute of a course may contain a reference to another course, which is meaningless
        instructors attribute should ideally be constrained to refer to instructor elements


XML Schema
XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports

    Typing of values
        E.g. integer, string, etc
        Also, constraints on min/max values
    User-defined, complex types
    Many more features, including
        uniqueness and foreign key constraints, inheritance

XML Schema is itself specified in XML syntax, unlike DTDs

    More-standard representation, but verbose

XML Schema is integrated with namespaces

BUT: XML Schema is significantly more complicated than DTDs.


Decision Support Systems


Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.

Examples of business decisions:

    What items to stock?
    What insurance premium to change?
    To whom to send advertisements?

Examples of data used for making decisions

    Retail sales transaction details
    Customer profiles (income, age, gender, etc.)



Decision-Support Systems: Overview


Data analysis tasks are simplified by specialized tools and SQL extensions

    Example tasks
        For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year
        As above, for each product category and each customer category

Statistical analysis packages (e.g., S++) can be interfaced with databases

    Statistical analysis is a large field, but not covered here

Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.

A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.

    Important for large businesses that generate data from multiple divisions, possibly at multiple sites
    Data may also be purchased externally

Data Warehousing
Data sources often store only current data, not historical data

Corporate decision making requires a unified view of all organizational data, including historical data

A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site

    Greatly simplifies querying, permits study of historical trends
    Shifts decision support query load away from transaction processing systems


Design Issues
When and how to gather data

    Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
    Destination driven architecture: warehouse periodically requests new information from data sources
    Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
        Usually OK to have slightly out-of-date data at warehouse
        Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use

    Schema integration


More Warehouse Design Issues


Data cleansing

    E.g., correct mistakes in addresses (misspellings, zip code errors)
    Merge address lists from different sources and purge duplicates

How to propagate updates

    Warehouse schema may be a (materialized) view of schema from data sources

What data to summarize

    Raw data may be too large to store on-line
    Aggregate values (totals/subtotals) often suffice
    Queries on raw data can often be transformed by query optimizer to use aggregate values


Why Data Mining?


The Explosive Growth of Data

    Data collection and data availability
        Automated data collection tools, database systems, Web, computerized society
    Major sources of abundant data
        Business: Web, e-commerce, transactions, stocks, ...
        Science: Remote sensing, bioinformatics, scientific simulation, ...
        Society and everyone: news, digital cameras, ...

We are drowning in data, but starving for knowledge!

"Necessity is the mother of invention": data mining, the automated analysis of massive data sets



Why Data Mining? Potential Applications


Data analysis and decision support

Market analysis and management

Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and detection of unusual patterns (outliers)

Other Applications

Text mining (news group, email, documents) and Web mining

Stream data mining

Bioinformatics and bio-data analysis

Data Mining: A KDD Process


Data mining: the core of the knowledge discovery process.

[Figure: stages of the KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Data Selection and Data Preprocessing; Task-relevant Data; Data Mining; Pattern Evaluation]

Steps of a KDD Process


Learning the application domain:

    relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation:

    Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining

    summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

    visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

Data Mining Functionalities


General functionality

    Descriptive data mining
    Predictive data mining

Different views lead to different classifications

    Data view: Kinds of data to be mined
    Knowledge view: Kinds of knowledge to be discovered
    Method view: Kinds of techniques utilized
    Application view: Kinds of applications adapted


Data Mining Functionalities


Multidimensional concept description: Characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Association analysis

Diaper -> Beer [0.5%, 75%] (support 0.5%, confidence 75%; correlation or causality?)
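
The two bracketed numbers of an association rule are its support and its confidence. The sketch below computes both under the usual definitions (support: fraction of transactions containing both sides; confidence: fraction of transactions containing the left side that also contain the right side). The basket data is invented for illustration and does not reproduce the 0.5% and 75% on the slide.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

support, confidence = support_and_confidence(baskets, {"diaper"}, {"beer"})
```

Here three of the five baskets contain both items, and three of the four baskets with diapers also contain beer.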

Classification and prediction

Construct models (functions) that describe and distinguish classes or concepts for future prediction

E.g., classify countries based on (climate), or classify cars based on (gas mileage)

Predict some unknown or missing numerical values



Data Mining Functionalities (2)


Cluster analysis

    Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
    Maximizing intra-class similarity & minimizing interclass similarity

Outlier analysis

    Outlier: Data object that does not comply with the general behavior of the data
    Noise or exception? Useful in fraud detection, rare events analysis

Trend and evolution analysis

    Trend and deviation: e.g., regression analysis
    Periodicity analysis
    Similarity-based analysis

Other pattern-directed or statistical analyses



Data Cleaning
Importance

    "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
    "Data cleaning is the number one problem in data warehousing" (DCI survey)

Data cleaning tasks

    Fill in missing values
    Identify outliers and smooth out noisy data
    Correct inconsistent data
    Resolve redundancy caused by data integration
Data Mining: Concepts and

December 5, 2013

Missing Data
Data is not always available

    E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to

    equipment malfunction
    data inconsistent with other recorded data and thus deleted
    data not entered due to misunderstanding
    certain data not considered important at the time of entry
    history or changes of the data not registered

Missing data may need to be inferred.



How to Handle Missing Data?


Ignore the tuple: usually done when class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.

Fill in the missing value manually: tedious + infeasible?

Fill it in automatically with

    a global constant: e.g., "unknown", a new class?!
    the attribute mean
    the attribute mean for all samples belonging to the same class: smarter
    the most probable value: inference-based such as Bayesian formula or decision tree
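
Two of the automatic strategies, the attribute mean and the per-class attribute mean, can be sketched in a few lines. The helper names, the `income` and `segment` fields, and the sample rows are hypothetical; missing values are represented as `None`, and the per-class version assumes every class has at least one known value.

```python
from statistics import mean

def fill_with_mean(rows, attr):
    """Replace missing values of attr by the mean over all known values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: overall if r[attr] is None else r[attr]}) for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr by the mean within the same class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(values) for c, values in by_class.items()}
    return [
        dict(r, **{attr: class_means[r[label]] if r[attr] is None else r[attr]})
        for r in rows
    ]

customers = [
    {"income": 30, "segment": "budget"},
    {"income": 50, "segment": "budget"},
    {"income": None, "segment": "budget"},
    {"income": 100, "segment": "premium"},
    {"income": None, "segment": "premium"},
]

global_fill = fill_with_mean(customers, "income")
class_fill = fill_with_class_mean(customers, "income", "segment")
```

The overall mean fills both gaps with 60, while the class means fill them with 40 and 100, which is closer to the other members of each segment.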

Noisy Data
Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

    faulty data collection instruments
    data entry problems
    data transmission problems
    technology limitation
    inconsistency in naming convention

Other data problems which require data cleaning

    duplicate records
    incomplete data
    inconsistent data

How to Handle Noisy Data?


Binning

    first sort data and partition into (equal-frequency) bins
    then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Regression

    smooth by fitting the data into regression functions

Clustering

    detect and remove outliers

Combined computer and human inspection

    detect suspicious values and check by human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning


Equal-width (distance) partitioning

    Divides the range into N intervals of equal size: uniform grid
    if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N.
    The most straightforward, but outliers may dominate presentation
    Skewed data is not handled well

Equal-depth (frequency) partitioning

    Divides the range into N intervals, each containing approximately same number of samples
    Good data scaling
    Managing categorical attributes can be tricky
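
The difference between the two partitioning schemes shows up clearly on a small data set. The sketch below is one possible implementation, with illustrative function names and price values; it assumes numeric values that are not all equal (otherwise the interval width would be zero).

```python
def equal_width_bins(values, n):
    """Split the value range [A, B] into n intervals of width W = (B - A) / n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value falls on the upper edge; keep it in the last bin.
        index = min(int((v - lo) / width), n - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split the sorted values into n bins of (almost) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
by_width = equal_width_bins(prices, 3)   # intervals of width (34 - 4) / 3 = 10
by_depth = equal_depth_bins(prices, 3)   # four values per bin
```

Equal-width bins end up with 3, 3, and 6 values here, while equal-depth bins hold 4 values each.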



Binning Methods for Data Smoothing


Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
    - Bin 1: 4, 8, 9, 15
    - Bin 2: 21, 21, 24, 25
    - Bin 3: 26, 28, 29, 34

* Smoothing by bin means:
    - Bin 1: 9, 9, 9, 9
    - Bin 2: 23, 23, 23, 23
    - Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:
    - Bin 1: 4, 4, 4, 15
    - Bin 2: 21, 21, 25, 25
    - Bin 3: 26, 26, 26, 34
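
The smoothing steps can be reproduced with a short script. This is a sketch with illustrative function names; it rounds bin means to the nearest integer, as the example does, and when a value is equally close to both boundaries it picks the lower one (an assumption, since the example contains no such tie).

```python
def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

price_bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

by_means = smooth_by_means(price_bins)
by_boundaries = smooth_by_boundaries(price_bins)
```

The bin means are 9, 22.75, and 29.25 before rounding, which gives the smoothed values 9, 23, and 29.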

Regression
[Figure: scatter plot with the fitted regression line y = x + 1; axes x and y, with the marked points X1 and Y1]
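
Smoothing by regression replaces each observed value with the value on a fitted line. The sketch below fits a straight line by ordinary least squares in plain Python; the function name and the data points are invented for illustration, chosen so that the fitted line comes out as y = x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [0, 1, 2, 3, 4]
ys = [1.2, 1.6, 3.2, 4.0, 5.0]   # noisy observations around y = x + 1

slope, intercept = fit_line(xs, ys)

# Smoothed data: each observation is replaced by its fitted value.
smoothed = [slope * x + intercept for x in xs]
```

The noisy value 1.6 at x = 1, for example, is replaced by a fitted value of about 2.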


Cluster Analysis
