Data Quality: High-impact Strategies - What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors

Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which they refer. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency within data becomes paramount, regardless of fitness for use for any external purpose, e.g. a person's age and birth date may conflict within different parts of a database.

These views can often be in disagreement, even about the same set of data used for the same purpose. This book discusses the concept as it relates to business data processing, although of course other data have various quality issues as well.

This book is your ultimate resource for Data Quality. Here you will find the most up-to-date information, analysis, background and everything you need to know.

In easy to read chapters, with extensive references and links to get you to know all there is to know about Data Quality right away, covering: Data quality, Bit rot, Cleansing and Conforming Data, Data auditing, Data cleansing, Data corruption, Data integrity, Data profiling, Data quality assessment, Data quality assurance, Data Quality Firewall, Data truncation, Data validation, Data verification, Database integrity, Database preservation, DataCleaner, Declarative Referential Integrity, Digital continuity, Digital preservation, Dirty data, Entity integrity, Information quality, Link rot, One-for-one checking, Referential integrity, Soft error, Two pass verification, Validation rule, Abstraction (computer science), ADO.NET, ADO.NET data provider, WCF Data Services, Age-Based Content Rating System, Aggregate (Data Warehouse), Data archaeology, Archive site, Association rule learning, Atomicity (database systems), Australian National Data Service, Automated Tiered Storage, Automatic data processing, Automatic data processing equipment, BBC Archives, Bitmap index, British Oceanographic Data Centre, Business intelligence, Business Intelligence Project Planning, Change data capture, Chunked transfer encoding, Client-side persistent data, Clone (database), Cognos Reportnet, Commit (data management), Commitment ordering, The History of Commitment Ordering, Comparison of ADO and ADO.NET, Comparison of OLAP Servers, Comparison of structured storage software, Computer-aided software engineering, Concurrency control, Conference on Innovative Data Systems Research, Consumer Relationship System, Content Engineering, Content format, Content inventory, Content management, Content Migration, Content re-appropriation, Content repository, Control break, Control flow diagram, Copyright, Core Data, Core data integration, Customer data management, DAMA, Dashboard (business), Data, Data access, Data aggregator, Data architect, Data architecture, Data bank, Data binding, Data center, Data classification (data management), Data conditioning, Data custodian, Data deduplication, Data dictionary, Data Domain (corporation), Data exchange, Data extraction, Data field, Data flow diagram, Data governance, Data independence, Data integration, Data library, Data maintenance, Data management, Data management plan, Data mapping, Data migration, Data processing system, Data proliferation, Data recovery, Data Reference Model, Data retention software, Data room, Data security, Data set (IBM mainframe), Data steward, Data storage device, Data Stream Management System, Data Transformation Services, Data Validation and Reconciliation, Data virtualization, Data visualization, Data warehouse, Database administration and automation...and much more.

This book explains in-depth the real drivers and workings of Data Quality. It reduces the risk of your technology, time and resources investment decisions by enabling you to compare your understanding of Data Quality with the objectivity of experienced professionals.

Published by: Emereo Publishing on Aug 01, 2011
Copyright: Traditional Copyright: All rights reserved
List price: $39.95



Data Quality

High-impact Strategies - What You Need to Know:
Definitions, Adoptions, Impact, Benefits, Maturity, Vendors
Kevin Roebuck
IN-DEPTH: THE REAL DRIVERS AND WORKINGS
REDUCES THE RISK OF YOUR TECHNOLOGY, TIME AND RESOURCES INVESTMENT DECISIONS
ENABLING YOU TO COMPARE YOUR UNDERSTANDING WITH THE OBJECTIVITY OF EXPERIENCED PROFESSIONALS
Topic relevant selected content from the highest rated entries, typeset, printed and
shipped.
Combine the advantages of up-to-date and in-depth knowledge with the convenience of
printed books.
A portion of the proceeds of each book will be donated to the Wikimedia Foundation
to support their mission: to empower and engage people around the world to collect
and develop educational content under a free license or in the public domain, and to
disseminate it effectively and globally.
The content within this book was generated collaboratively by volunteers. Please be advised that nothing found here has necessarily been reviewed by people with the expertise required to provide you with complete, accurate or reliable information. Some information in this book may be misleading or simply wrong. The publisher does not guarantee the validity of the information found here. If you need specific advice (for example, medical, legal, financial, or risk management), please seek a professional who is licensed or knowledgeable in that area.
Sources, licenses and contributors of the articles and images are listed in the section entitled “References”. Parts of the book may be licensed under the GNU Free Documentation License. A copy of this license is included in the section entitled “GNU Free Documentation License”.
All used third-party trademarks belong to their respective owners.
Contents
Articles
Data quality 1
Bit rot 4
Cleansing and Conforming Data 5
Data auditing 6
Data cleansing 7
Data corruption 9
Data integrity 10
Data profiling 12
Data quality assessment 13
Data quality assurance 14
Data Quality Firewall 14
Data truncation 14
Data validation 15
Data verification 17
Database integrity 18
Database preservation 18
DataCleaner 19
Declarative Referential Integrity 21
Digital continuity 22
Digital preservation 23
Dirty data 30
Entity integrity 31
Information quality 31
Link rot 33
One-for-one checking 37
Referential integrity 37
Soft error 38
Two pass verification 44
Validation rule 45
Abstraction (computer science) 46
ADO.NET 54
ADO.NET data provider 56
WCF Data Services 57
Age-Based Content Rating System 58
Aggregate (Data Warehouse) 60
Data archaeology 61
Archive site 62
Association rule learning 63
Atomicity (database systems) 70
Australian National Data Service 71
Automated Tiered Storage 72
Automatic data processing 73
Automatic data processing equipment 73
BBC Archives 74
Bitmap index 78
British Oceanographic Data Centre 82
Business intelligence 86
Business Intelligence Project Planning 95
Change data capture 97
Chunked transfer encoding 100
Client-side persistent data 102
Clone (database) 103
Cognos Reportnet 104
Commit (data management) 106
Commitment ordering 107
The History of Commitment Ordering 129
Comparison of ADO and ADO.NET 147
Comparison of OLAP Servers 148
Comparison of structured storage software 153
Computer-aided software engineering 155
Concurrency control 160
Conference on Innovative Data Systems Research 167
Consumer Relationship System 167
Content Engineering 169
Content format 169
Content inventory 171
Content management 173
Content Migration 175
Content re-appropriation 177
Content repository 178
Control break 179
Control flow diagram 179
Copyright 181
Core Data 204
Core data integration 207
Customer data management 208
DAMA 209
Dashboard (business) 210
Data 211
Data access 214
Data aggregator 214
Data architect 216
Data architecture 217
Data bank 220
Data binding 220
Data center 221
Data classification (data management) 230
Data conditioning 232
Data custodian 233
Data deduplication 234
Data dictionary 239
Data Domain (corporation) 240
Data exchange 241
Data extraction 241
Data field 242
Data flow diagram 242
Data governance 244
Data independence 247
Data integration 248
Data library 252
Data maintenance 254
Data management 254
Data management plan 256
Data mapping 259
Data migration 261
Data processing system 263
Data proliferation 264
Data recovery 265
Data Reference Model 269
Data retention software 270
Data room 270
Data security 271
Data set (IBM mainframe) 273
Data steward 275
Data storage device 276
Data Stream Management System 282
Data Transformation Services 283
Data Validation and Reconciliation 286
Data virtualization 294
Data visualization 295
Data warehouse 300
Database administration and automation 307
Database administrator 310
Database engine 311
Database schema 311
Database server 313
Database transaction 314
Database-centric architecture 316
Dimensional Fact Model 317
Disaster recovery 319
Distributed concurrency control 322
Distributed data store 323
Distributed database 324
Distributed file system 326
Distributed transaction 328
DMAPI 329
Document capture software 329
Document-oriented database 331
Durability (database systems) 334
Dynamic Knowledge Repository 334
Dynomite 335
Early-arriving fact 337
Edge data integration 337
Electronically stored information (Federal Rules of Civil Procedure) 338
Enterprise bus matrix 339
Enterprise data management 341
Enterprise Data Planning 343
Enterprise Information Integration 344
Enterprise Information System 346
Enterprise manufacturing intelligence 346
Enterprise Objects Framework 347
Explain Plan 350
Flat file database 350
Flow (software) 354
Free space bitmap 356
Geospatial metadata 358
Global Change Master Directory 361
Global concurrency control 362
Global serializability 363
Government Performance Management 370
Grid-oriented storage 372
Guardium, an IBM Company 375
Hierarchical classifier 380
HiT Software 382
Holistic Data Management 383
Holos 386
Hybrid array 388
IBM InfoSphere 389
IBM Lotus Domino 390
IMS VDEX 392
Information architecture 396
Information Engineering Facility 397
Information integration 398
Integration Competency Center 399
Integrity constraints 403
International Protein Index 404
Inverted index 405
ISO 8000 407
Isolation (database systems) 408
Jenks Natural Breaks Optimization 413
Junction table 415
Lean Integration 416
Learning object 418
Learning object metadata 422
Linear medium 427
Locks with ordered sharing 428
Log trigger 428
Long-lived transaction 436
Long-running transaction 436
Lookup 437
MANOC 437
Master data 438
Master Data 439
Master data management 440
Match report 444
Metadata 444
Metadata controller 453
Metadirectory 453
Microsoft Office PerformancePoint Server 454
Microsoft Query by Example 456
Microsoft SQL Server Master Data Services 457
Mirror (computing) 458
Mobile business intelligence 460
Mobile content management system 465
Modular concurrency control 466
Modular serializability 467
mum software 474
National Data Repository 475
Navigational database 492
Nested transaction 494
Network transparency 495
Network-neutral data center 496
Nonlinear medium 496
NoSQL 496
Novell File Management Suite 502
Novell File Reporter 503
Novell Storage Manager 504
Numerical data 505
ODBC driver 506
Online analytical processing 507
Online complex processing 512
Online transaction processing 512
Ontology merging 514
Ontology-based data integration 514
Open Database Connectivity 516
Operational data store 519
Operational database 520
Operational historian 521
Operational system 521
Paper data storage 522
Parchive 523
Parity file 526
Performance intelligence 527
Photo recovery 528
Physical schema 533
PL/Perl 534
Point-in-time recovery 535
Project workforce management 535
pureXML 536
Query language 538
QuickPar 540
Rainbow Storage 542
Reactive Business Intelligence 544
Read–write conflict 545
Record linkage 546
Recording format 549
Reference data 550
Reference table 551
Retention period 551
Rollback (data management) 552
ROOT 553
Sales intelligence 556
Savepoint 557
Schedule (computer science) 557
Schema crosswalk 563
Scientific data management system 565
Scriptella 566
Secure electronic delivery service 567
Semantic integration 568
Semantic translation 569
Semantic warehousing 570
Serializability 572
SIGMOD Edgar F. Codd Innovations Award 584
Signed overpunch 585
Single source publishing 586
Social Information Architecture 590
SQL injection 592
SQL programming tool 599
SQL/PSM 601
State transition network 601
Storage area network 602
Storage block 605
Storage model 606
Streamlizing algorithms 606
SWX Format 607
Synthetic data 608
Tagsistant 611
Technical data management system 612
Thomas write rule 613
Three-phase commit protocol 614
Transaction data 616
Tuple 616
Two-phase commit protocol 620
UI data binding 623
Uniform data access 624
Uniform information representation 625
Universal Data Element Framework 625
Vector-field consistency 628
Versomatic 629
Very large database 630
Virtual data room 631
Virtual directory 633
Virtual facility 635
VMDS 637
Vocabulary-based transformation 639
White pages schema 640
Wiping 641
Workflow engine 654
World Wide Molecular Matrix 655
Write–read conflict 656
Write–write conflict 657
XLDB 657
XML database 659
XSA 661
References
Article Sources and Contributors 662
Image Sources, Licenses and Contributors 676
Article Licenses
License 680
Data quality
Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M.
Juran). Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which
they refer. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency
within data becomes paramount, regardless of fitness for use for any external purpose, e.g. a person's age and birth
date may conflict within different parts of a database. These views can often be in disagreement, even about the same set of data used for the same purpose. This article discusses the concept as it relates to business data processing, although of course other data have various quality issues as well.
Definitions
1. Data Quality refers to the degree of excellence exhibited by the data in relation to the portrayal of the actual scenario.
2. The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. Government of British Columbia [1]
3. The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data. Glossary of Quality Assurance Terms [2]
4. Glossary of data quality terms [3] published by IAIDQ [4]
5. Data quality: The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria
6. Complete, standards based, consistent, accurate and time stamped GS1 [5]
History
Before the rise of the inexpensive server, massive mainframe computers were used to maintain name and address
data so that the mail could be properly routed to its destination. The mainframes used business rules to correct
common misspellings and typographical errors in name and address data, as well as to track customers who had
moved, died, gone to prison, married, divorced, or experienced other life-changing events. Government agencies
began to make postal data available to a few service companies to cross-reference customer data with the National
Change of Address registry (NCOA). This technology saved large companies millions of dollars compared to
manually correcting customer data. Large companies saved on postage, as bills and direct marketing materials made
their way to the intended customer more accurately. Initially sold as a service, data quality moved inside the walls of
corporations, as low-cost and powerful server technology became available.
Companies with an emphasis on marketing often focus their quality efforts on name and address information, but
data quality is recognized as an important property of all types of data. Principles of data quality can be applied to
supply chain data, transactional data, and nearly every other category of data found in the enterprise. For example,
making supply chain data conform to a certain standard has value to an organization by: 1) avoiding overstocking of
similar but slightly different stock; 2) improving the understanding of vendor purchases to negotiate volume
discounts; and 3) avoiding logistics costs in stocking and shipping parts across a large organization.
While name and address data has a clear standard as defined by local postal authorities, other types of data have few
recognized standards. There is a movement in the industry today to standardize certain non-address data. The
non-profit group GS1 is among the groups spearheading this movement.
For companies with significant research efforts, data quality can include developing protocols for research methods,
reducing measurement error, bounds checking of the data, cross tabulation, modeling and outlier detection, verifying
data integrity, etc.
Overview
There are a number of theoretical frameworks for understanding data quality. A systems-theoretical approach
influenced by American pragmatism expands the definition of data quality to include information quality, and
emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision on the basis of the theory of
science (Ivanov, 1972). One framework seeks to integrate the product perspective (conformance to specifications)
and the service perspective (meeting consumers' expectations) (Kahn et al. 2002). Another framework is based in
semiotics to evaluate the quality of the form, meaning and use of the data (Price and Shanks, 2004). One highly
theoretical approach analyzes the ontological nature of information systems to define data quality rigorously (Wand
and Wang, 1996).
A considerable amount of data quality research involves investigating and describing various categories of desirable
attributes (or dimensions) of data. These lists commonly include accuracy, correctness, currency, completeness and
relevance. Nearly 200 such terms have been identified and there is little agreement in their nature (are these
concepts, goals or criteria?), their definitions or measures (Wang et al., 1993). Software engineers may recognise this
as a similar problem to "ilities".
MIT has a Total Data Quality Management program, led by Professor Richard Wang, which produces a large
number of publications and hosts a significant international conference in this field (International Conference on
Information Quality, ICIQ).
In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging
from data warehousing and business intelligence to customer relationship management and supply chain
management. One industry study estimated the total cost to the US economy of data quality problems at over
US$600 billion per annum (Eckerson, 2002). Incorrect data – which includes invalid and outdated information – can
originate from different data sources – through data entry, or data migration and conversion projects. [6]
In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is
incorrectly addressed. [7]
One reason contact data becomes stale very quickly in the average database is that more than 45 million Americans change their address every year. [8]
In fact, the problem is such a concern that companies are beginning to set up a data governance team whose sole role
in the corporation is to be responsible for data quality. In some organizations, this data governance function has been
established as part of a larger Regulatory Compliance function - a recognition of the importance of Data/Information
Quality to organizations.
Problems with data quality don't only arise from incorrect data. Inconsistent data is a problem as well. Eliminating
data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data
consistency.
The market goes some way toward providing data quality assurance. A number of vendors make tools for analysing and repairing poor-quality data in situ, service providers can clean the data on a contract basis, and consultants can advise on fixing processes or systems to avoid data quality problems in the first place. Most data quality tools offer a set of capabilities for improving data, which may include some or all of the following (a brief matching sketch follows the list):
1. Data profiling - initially assessing the data to understand its quality challenges
2. Data standardization - a business rules engine that ensures that data conforms to quality rules
3. Geocoding - for name and address data. Corrects data to US and Worldwide postal standards
4. Matching or Linking - a way to compare data so that similar, but slightly different records can be aligned.
Matching may use "fuzzy logic" to find duplicates in the data. It often recognizes that 'Bob' and 'Robert' may be
the same individual. It might be able to manage 'householding', or finding links between husband and wife at the
same address, for example. Finally, it often can build a 'best of breed' record, taking the best components from
multiple data sources and building a single super-record.
5. Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can
also auto-correct the variations based on pre-defined business rules.
6. Batch and Real time - Once the data is initially cleansed (batch), companies often want to build the processes into
enterprise applications to keep it clean.
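The following Python sketch gives a rough, hedged illustration of the matching step described above: it normalizes names with a small (invented) nickname table and uses fuzzy string similarity from the standard library to decide whether two customer records probably describe the same person. Real matching tools apply far richer rules and reference data; the field names and the 0.85 threshold are assumptions made only for this example.

    import difflib

    # Hypothetical nickname table; real matching tools ship much larger dictionaries.
    NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

    def normalize(text):
        """Lowercase, trim, and expand common nicknames."""
        text = text.strip().lower()
        return NICKNAMES.get(text, text)

    def similarity(a, b):
        """Fuzzy similarity between 0.0 and 1.0 using difflib's SequenceMatcher."""
        return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

    def likely_same_person(rec1, rec2, threshold=0.85):
        """Treat two records as duplicates when name and address are both similar enough."""
        return (similarity(rec1["name"], rec2["name"]) >= threshold
                and similarity(rec1["address"], rec2["address"]) >= threshold)

    if __name__ == "__main__":
        a = {"name": "Bob", "address": "12 Main Street, Springfield"}
        b = {"name": "Robert", "address": "12 Main St., Springfield"}
        print(likely_same_person(a, b))  # expected to print True for these sample records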
There are several well-known authors and self-styled experts, with Larry English perhaps the most popular guru. In
addition, the International Association for Information and Data Quality (IAIDQ) [4] was established in 2004 to
provide a focal point for professionals and researchers in this field.
ISO 8000 is the international standard for data quality.
References
[1] http://www.cio.gov.bc.ca/other/daf/IRM_Glossary.htm
[2] http://www.hanford.gov/dqo/glossaries/Glossary_of_Quality_Assurance_Terms1.pdf
[3] http://iaidq.org/main/glossary.shtml
[4] http://iaidq.org/
[5] http://www.gs1.org/gdsn/dqf
[6] http://www.information-management.com/issues/20060801/1060128-1.html
[7] http://www.directionsmag.com/article.php?article_id=509
[8] http://ribbs.usps.gov/move_update/documents/tech_guides/PUB363.pdf
Further reading
• Eckerson, W. (2002) "Data Warehousing Special Report: Data quality and the bottom line", Article (http://www.adtmag.com/article.asp?id=6321)
• Ivanov, K. (1972) "Quality-control of information: On the concept of accuracy of information in data banks and in management information systems" (http://www.informatik.umu.se/~kivanov/diss-avh.html). The University of Stockholm and The Royal Institute of Technology. Doctoral dissertation.
• Kahn, B., Strong, D., Wang, R. (2002) "Information Quality Benchmarks: Product and Service Performance," Communications of the ACM, April 2002. pp. 184–192. Article (http://mitiq.mit.edu/Documents/Publications/TDQMpub/2002/IQ Benchmarks.pdf)
• Price, R. and Shanks, G. (2004) A Semiotic Information Quality Framework, Proc. IFIP International Conference on Decision Support Systems (DSS2004): Decision Support in an Uncertain and Complex World, Prato. Article (http://vishnu.sims.monash.edu.au/dss2004/proceedings/pdf/65_Price_Shanks.pdf)
• Redman, T. C. (2004) Data: An Unfolding Quality Disaster. Article (http://www.dmreview.com/article_sub.cfm?articleId=1007211)
• Wand, Y. and Wang, R. (1996) "Anchoring Data Quality Dimensions in Ontological Foundations," Communications of the ACM, November 1996. pp. 86–95. Article (http://web.mit.edu/tdqm/www/tdqmpub/WandWangCACMNov96.pdf)
• Wang, R., Kon, H. & Madnick, S. (1993), Data Quality Requirements Analysis and Modelling, Ninth International Conference of Data Engineering, Vienna, Austria. Article (http://web.mit.edu/tdqm/www/tdqmpub/IEEEDEApr93.pdf)
• Fournel, Michel, Accroitre la qualité et la valeur des données de vos clients, éditions Publibook, 2007. ISBN 978-2748338478.
• Daniel, F., Casati, F., Palpanas, T., Chayka, O., Cappiello, C. (2008) "Enabling Better Decisions through Quality-aware Reports", International Conference on Information Quality (ICIQ), MIT. Article (http://dit.unitn.it/~themis/publications/iciq08.pdf)
Bit rot
Bit rot, also known as bit decay, data rot, or data decay, is a colloquial computing term used to describe either a
gradual decay of storage media or (facetiously) the spontaneous degradation of a software program over time. The
latter use of the term implies that software can wear out or rust like a physical tool. More commonly, bit rot refers to
the decay of physical storage media.
Decay of storage media
Bit rot is often defined as the event in which the small electric charge of a bit in memory disperses, possibly altering
program code.
Bit rot can also be used to describe the phenomenon of data stored in EPROMs and flash memory gradually
decaying over the duration of many years, or in the decay of data stored on CD or DVD discs or other types of
consumer storage.
The cause of bit rot varies depending on the medium. EPROMs and flash memory store data using electrical charges,
which can slowly leak away due to imperfect insulation. The chip itself is not affected by this, so re-programming it
once per decade or so will prevent the bit rot.
Floppy disk and magnetic tape storage may experience bit rot as bits lose magnetic orientation, and in warm, humid
conditions these media are prone to literal rotting. In optical discs such as CDs and DVDs the breakdown of the
material onto which the data is stored may cause bit rot. This can be mitigated by storing disks in a dark, cool
location with low humidity. Archival quality disks are also available. Old punched cards and punched tape may also
experience literal rotting.
Bit rot is also used to describe the idea that semiconductor RAM may occasionally be altered by cosmic rays [1], a phenomenon known as soft error.
Problems with software
The term "bit rot" is often used to refer to dormant code rot, i.e. the fact that dormant (unused or little-used) code
gradually decays in correctness as a result of interface changes in active code that is called from the dormant code.
A program may run correctly for years with no problem, then malfunction for no apparent reason. Programmers
often jokingly attribute the failure to bit rot. Such an effect may be due to a memory leak or other non-obvious
software bug. Often, although there is no obvious change in the program's operating environment, a subtle difference
has occurred that is triggering a latent software error. The error in the software may also originate by human
operation which allows the construction or derivation of false-positive behavior to occur within the code. Some
operating systems tend to lose stability when left running for long periods, which is why they must be restarted
occasionally to remove resident errors that have built up due to software errors.
The term is also used to describe the slowing of performance of a PC over time from continued use. One cause of
this is installing software or software components that run when the user logs in, causing a noticeable delay in boot
time. Also, the addition of programs and data on the computer can make operations and searching slower, and
sometimes when programs are uninstalled they aren't removed completely. Additionally, fragmentation can slow
performance. Normally, unused data (such as a text file containing some notes) does not impede performance of a
PC (with the exception of software that, for example, indexes files on a disk to make file searching faster).
References
[1] http://www.research.ibm.com/journal/rd/401/ogorman.pdf "Field testing for cosmic ray soft errors in semiconductor memories"
External links
• "Bit rot" from the Jargon File (http:// www. catb. org/~esr/ jargon/html/ B/ bit-rot.html)
Cleansing and Conforming Data
The process of Cleansing and Conforming Data changes data on its way from the source system(s) to the data warehouse, and can also be used to identify and record errors about the data. The latter information can be used to fix how the source system(s) work.
Good-quality source data is a matter of “Data Quality Culture” and must be initiated at the top of the organization. It is not just a matter of implementing strong validation checks on input screens because, no matter how strong these checks are, users can often still circumvent them.
There is a nine-step guide for organizations that wish to improve data quality:
• Declare a high level commitment to a data quality culture
• Drive process reengineering at the executive level
• Spend money to improve the data entry environment
• Spend money to improve application integration
• Spend money to change how processes work
• Promote end-to-end team awareness
• Promote interdepartmental cooperation
• Publicly celebrate data quality excellence
• Continuously measure and improve data quality
Data Cleansing System
The essential job of this system is to find a suitable balance between fixing dirty data and maintaining the data as
close as possible to the original data from the source production system. This is a challenge for the ETL architect.
The system should offer an architecture that can cleanse data, record quality events and measure/control quality of
data in the data warehouse.
A good start is to perform a thorough data profiling analysis that will help define the required complexity of the data cleansing system and also give an idea of the current data quality in the source system(s).
Quality Screens
Part of the data cleansing system is a set of diagnostic filters known as quality screens. They each implement a test in
the data flow that, if it fails, records an error in the Error Event Schema. Quality screens are divided into three
categories:
• Column screens. Testing the individual column, e.g. for unexpected values like NULL values; non-numeric
values that should be numeric; out of range values; etc.
• Structure screens. These are used to test for the integrity of different relationships between columns (typically
foreign/primary keys) in the same or different tables. They are also used for testing that a group of columns is valid according to some structural definition to which it should adhere.
• Business rule screens. The most complex of the three. They test whether data, possibly across multiple tables, follow specific business rules. An example could be that, if a customer is marked as a certain type of customer, the business rules that define this kind of customer should be adhered to.
When a quality screen records an error, it can either stop the dataflow process, send the faulty data somewhere other than the target system, or tag the data. The latter option is considered the best solution, because the first requires that someone manually deal with the issue each time it occurs, and the second implies that data are missing from the target system (harming integrity) and it is often unclear what should happen to those data.
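As a minimal sketch of a column screen, the following Python fragment tests one column on every incoming row, tags failing rows rather than rejecting them, and records an error event for each failure. The in-memory error list, field names and validity rule are illustrative assumptions standing in for the Error Event Schema described below.

    from datetime import datetime

    # Simple stand-in for the Error Event Schema: each entry is one error event.
    error_events = []

    def column_screen(rows, column, is_valid, screen_name, severity="warning"):
        """Test one column on every row and record an error event for each failure.

        Failing rows are tagged rather than dropped, matching the 'tag the data'
        option described above."""
        for row in rows:
            value = row.get(column)
            if not is_valid(value):
                row.setdefault("quality_tags", []).append(screen_name)
                error_events.append({
                    "screen": screen_name,
                    "column": column,
                    "value": value,
                    "severity": severity,
                    "occurred_at": datetime.utcnow(),
                })
        return rows

    if __name__ == "__main__":
        customers = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": -5}]
        column_screen(customers, "age",
                      lambda v: v is not None and 0 <= v <= 130,
                      screen_name="age_in_valid_range")
        print(len(error_events))  # 2 error events recorded (missing age and negative age)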
Error Event Schema
This schema is where all error events thrown by quality screens are recorded. It consists of an Error Event Fact table with foreign keys to three dimension tables that represent date (when), batch job (where) and screen (who produced the error). It also holds information about exactly when the error occurred and the severity of the error. In addition, there is an Error Event Detail Fact table with a foreign key to the main table that contains detailed information about the table, record and field in which the error occurred, and the error condition.
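To make the structure concrete, here is a hedged sketch of the two fact tables as Python dataclasses. The column names follow the description above, while the types, key names and severity encoding are assumptions for illustration only.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class ErrorEventFact:
        """One row per error thrown by a quality screen."""
        error_event_id: int
        date_key: int          # foreign key to the date dimension (when)
        batch_key: int         # foreign key to the batch job dimension (where)
        screen_key: int        # foreign key to the screen dimension (which screen produced the error)
        occurred_at: datetime  # exactly when the error occurred
        severity: str

    @dataclass
    class ErrorEventDetailFact:
        """One row per affected table/record/field for a given error event."""
        error_event_id: int    # foreign key to ErrorEventFact
        table_name: str
        record_id: str
        field_name: str
        error_condition: str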
Sources
• Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., Becker, B. The Data Warehouse Lifecycle Toolkit, Wiley
Publishing, Inc., 2008. ISBN 978-0-470-14977-5.
• Olson, J. E. Data Quality: The Accuracy Dimension, Morgan Kaufmann, 2002. ISBN 1558608915.
Data auditing
Data Auditing is the process of conducting a data audit to assess how fit for purpose a company's data is. This
involves profiling the data and assessing the impact of poor quality data on the organization's performance and
profits.
Data cleansing
Data cleansing or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate
records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete,
incorrect, inaccurate, irrelevant etc. parts of the data and then replacing, modifying or deleting this dirty data.
After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected
or removed may have been originally caused by different data dictionary definitions of similar entities in different
stores, may have been caused by user entry errors, or may have been corrupted in transmission or storage.
Data cleansing differs from data validation in that validation almost invariably means data is rejected from the
system at entry and is performed at entry time, rather than on batches of data.
The actual process of data cleansing may involve removing typographical errors or validating and correcting values
against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid
postal code) or fuzzy (such as correcting records that partially match existing, known records).
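The distinction can be illustrated with a small Python sketch: the strict check below rejects a record outright when its postal code is malformed, while the fuzzy check corrects a city name that closely matches a known value. The five-digit ZIP pattern, the reference city list and the 0.8 cutoff are assumptions made for the example.

    import difflib
    import re

    KNOWN_CITIES = ["Springfield", "Shelbyville", "Capital City"]  # illustrative reference list

    def strict_validate_postal_code(code):
        """Strict validation: accept only a five-digit US ZIP code, reject anything else."""
        return re.fullmatch(r"\d{5}", code or "") is not None

    def fuzzy_correct_city(city):
        """Fuzzy correction: snap a misspelled city to the closest known city, if close enough."""
        matches = difflib.get_close_matches(city, KNOWN_CITIES, n=1, cutoff=0.8)
        return matches[0] if matches else city

    if __name__ == "__main__":
        print(strict_validate_postal_code("1234"))  # False: the record would be rejected
        print(fuzzy_correct_city("Springfeild"))    # "Springfield": the record is corrected instead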
Motivation
Administratively, incorrect or inconsistent data can lead to false conclusions and misdirected investments on both
public and private scales. For instance, the government may want to analyze population census figures to decide
which regions require further spending and investment on infrastructure and services. In this case, it will be
important to have access to reliable data to avoid erroneous fiscal decisions.
In the business world, incorrect data can be costly. Many companies use customer information databases that record
data like contact information, addresses, and preferences. If for instance the addresses are inconsistent, the company
will suffer the cost of resending mail or even losing customers.
Data quality
High quality data needs to pass a set of quality criteria. Those include:
• Accuracy: An aggregated value over the criteria of integrity, consistency and density
• Integrity: An aggregated value over the criteria of completeness and validity
• Completeness: Achieved by correcting data containing anomalies
• Validity: Approximated by the amount of data satisfying integrity constraints
• Consistency: Concerns contradictions and syntactical anomalies
• Uniformity: Directly related to irregularities
• Density: The quotient of missing values in the data to the total number of values that ought to be known
• Uniqueness: Related to the number of duplicates in the data
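Two of these criteria lend themselves to a simple, hedged measurement sketch in Python: the density criterion (quotient of missing values to values that ought to be known) and uniqueness (share of distinct key values). Records are assumed to be dictionaries and the column names are invented for the example.

    def missing_value_ratio(rows, column):
        """Density criterion as defined above: missing values divided by the values that ought to be known."""
        if not rows:
            return 0.0
        missing = sum(1 for row in rows if row.get(column) is None)
        return missing / len(rows)

    def uniqueness(rows, key):
        """Share of distinct key values among all rows (1.0 means no duplicates)."""
        if not rows:
            return 1.0
        return len({row.get(key) for row in rows}) / len(rows)

    if __name__ == "__main__":
        customers = [
            {"id": 1, "email": "a@example.com"},
            {"id": 2, "email": None},
            {"id": 2, "email": "b@example.com"},
        ]
        print(missing_value_ratio(customers, "email"))  # 0.33...: one of three values is missing
        print(uniqueness(customers, "id"))              # 0.66...: id 2 appears twice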
The process of data cleansing
• Data Auditing: The data is audited with the use of statistical methods to detect anomalies and contradictions.
This eventually gives an indication of the characteristics of the anomalies and their locations.
• Workflow specification: The detection and removal of anomalies is performed by a sequence of operations on
the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the
end product of high quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in
the data have to be closely considered. If for instance we find that an anomaly is a result of typing errors in data
input stages, the layout of the keyboard can help in manifesting possible solutions.
• Workflow execution: In this stage, the workflow is executed after its specification is complete and its correctness
is verified. The implementation of the workflow should be efficient even on large sets of data which inevitably
poses a trade-off because the execution of a data cleansing operation can be computationally expensive.
• Post-Processing and Controlling: After executing the cleansing workflow, the results are inspected to verify
correctness. Data that could not be corrected during execution of the workflow are manually corrected if possible.
The result is a new cycle in the data cleansing process where the data is audited again to allow the specification of
an additional workflow to further cleanse the data by automatic processing.
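A hedged sketch of how such a workflow might be wired together in Python: the workflow specification is simply an ordered list of cleansing operations applied to every record, followed by a post-processing check that surfaces records still needing manual correction. The step names and rules are invented for illustration.

    def trim_whitespace(row):
        """Remove stray leading/trailing whitespace from every string field."""
        return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

    def normalize_country(row):
        """Illustrative standardization rule for a country field."""
        aliases = {"u.s.": "US", "usa": "US", "united states": "US"}
        country = row.get("country", "")
        row["country"] = aliases.get(country.lower(), country)
        return row

    # Workflow specification: an ordered sequence of cleansing operations.
    WORKFLOW = [trim_whitespace, normalize_country]

    def execute_workflow(rows):
        """Workflow execution: apply each operation to every record in order."""
        for step in WORKFLOW:
            rows = [step(row) for row in rows]
        return rows

    def needs_manual_correction(rows):
        """Post-processing check: return records that still fail a simple audit rule."""
        return [row for row in rows if row.get("country") not in {"US", "CA"}]

    if __name__ == "__main__":
        data = [{"name": " Ada ", "country": "U.S."}, {"name": "Bo", "country": "Narnia"}]
        cleaned = execute_workflow(data)
        print(needs_manual_correction(cleaned))  # the "Narnia" record remains for manual review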
Popular methods used
• Parsing: Parsing in data cleansing is performed for the detection of syntax errors. A parser decides whether a
string of data is acceptable within the allowed data specification. This is similar to the way a parser works with
grammars and languages.
• Data Transformation: Data Transformation allows the mapping of the data from their given format into the
format expected by the appropriate application. This includes value conversions or translation functions as well as
normalizing numeric values to conform to minimum and maximum values.
• Duplicate Elimination: Duplicate detection requires an algorithm for determining whether data contains
duplicate representations of the same entity. Usually, data is sorted by a key that would bring duplicate entries
closer together for faster identification.
• Statistical Methods: By analyzing the data using the values of mean, standard deviation, range, or clustering
algorithms, it is possible for an expert to find values that are unexpected and thus erroneous. Although the
correction of such data is difficult since the true value is not known, it can be resolved by setting the values to an
average or other statistical value. Statistical methods can also be used to handle missing values which can be
replaced by one or more plausible values that are usually obtained by extensive data augmentation algorithms.
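Two of the methods above can be illustrated with a short Python sketch: sorting by a key so that candidate duplicates become adjacent, and flagging values that sit far from the mean as statistical outliers. The sort key, the three-standard-deviation cutoff and the sample data are assumptions for the example; real tools would compare adjacent records fuzzily rather than by exact key equality.

    import statistics

    def adjacent_duplicate_pairs(rows, key):
        """Sort by a key so likely duplicates end up next to each other, then pair adjacent equal keys."""
        ordered = sorted(rows, key=lambda row: row[key])
        return [(a, b) for a, b in zip(ordered, ordered[1:]) if a[key] == b[key]]

    def outliers(values, n_sigma=3.0):
        """Flag values more than n_sigma standard deviations from the mean as suspect."""
        mean = statistics.mean(values)
        spread = statistics.pstdev(values)
        if spread == 0:
            return []
        return [v for v in values if abs(v - mean) > n_sigma * spread]

    if __name__ == "__main__":
        rows = [{"email": "a@example.com"}, {"email": "b@example.com"}, {"email": "a@example.com"}]
        print(adjacent_duplicate_pairs(rows, "email"))  # one candidate pair: the two a@example.com rows

        readings = [30 + (i % 3) for i in range(20)] + [500]
        print(outliers(readings))  # [500]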
Existing tools
Before computer automation, data about individuals or organizations were maintained and secured as paper records, dispersed in separate business or organizational units. Information systems concentrate data in computer files that can potentially be accessed by large numbers of people and by groups outside the organization.
Criticism of Existing Tools and Processes
The value of, and current approaches to, data cleansing have come under criticism, with some parties claiming large costs and low return on investment from major data cleansing initiatives.
Challenges and problems
• Error Correction and loss of information: The most challenging problem within data cleansing remains the
correction of values to remove duplicates and invalid entries. In many cases, the available information on such
anomalies is limited and insufficient to determine the necessary transformations or corrections leaving the
deletion of such entries as the only plausible solution. The deletion of data though, leads to loss of information
which can be particularly costly if there is a large amount of deleted data.
• Maintenance of Cleansed Data: Data cleansing is an expensive and time consuming process. So after having
performed data cleansing and achieving a data collection free of errors, one would want to avoid the re-cleansing
of data in its entirety after some values in data collection change. The process should only be repeated on values
that have changed which means that a cleansing lineage would need to be kept which would require efficient data
collection and management techniques.
• Data Cleansing in Virtually Integrated Environments: In virtually integrated sources like IBM’s
DiscoveryLink, the cleansing of data has to be performed every time the data is accessed which considerably
decreases the response time and efficiency.
• Data Cleansing Framework: In many cases it will not be possible to derive a complete data cleansing graph to
guide the process in advance. This makes data cleansing an iterative process involving significant exploration and
interaction which may require a framework in the form of a collection of methods for error detection and
elimination in addition to data auditing. This can be integrated with other data processing stages like integration
and maintenance.
References
Sources
• Han, J., Kamber, M. Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001. ISBN 1-55860-489-8.
• Kimball, R., Caserta, J. The Data Warehouse ETL Toolkit, Wiley and Sons, 2004. ISBN 0-7645-6757-8.
• Müller H., Freytag J., Problems, Methods, and Challenges in Comprehensive Data Cleansing, Humboldt-Universität zu Berlin, Germany.
• Rahm, E., Do, H. H. Data Cleaning: Problems and Current Approaches, University of Leipzig, Germany.
External links
• Computerworld: Data Scrubbing (http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=78230) (February 10, 2003)
Data corruption
[Image: photo data corruption, in this case a result of a failed data recovery from a hard disk drive]
Data corruption refers to errors in computer data that occur during
transmission, retrieval, or processing, introducing unintended changes
to the original data. Computer storage and transmission systems use a
number of measures to provide data integrity, or lack of errors.
In general, when data corruption occurs, the file containing that data may become inaccessible, and the system or the related application will report an error. For example, if a Microsoft Word file is corrupted, attempting to open it with MS Word produces an error message and the file may fail to open. Some programs can offer to repair the file automatically after the error, while others cannot. Whether recovery succeeds depends on the level of corruption and on the application's built-in ability to handle the error. There are various causes of corruption.
Transmission
Data corruption during transmission has a variety of causes.
Interruption of data transmission causes information loss. Environmental conditions can interfere with data
transmission, especially when dealing with wireless transmission methods. Heavy clouds can block satellite
transmissions. Wireless networks are susceptible to interference from devices such as microwave ovens.
Storage
Data loss during storage has two broad causes: hardware and software failure. Head crashes and general wear and
tear of media fall into the former category, while software failure typically occurs due to bugs in the code.
Countermeasures
When data corruption behaves as a Poisson process, where each bit of data has an independently low probability of
being changed, data corruption can generally be detected by the use of checksums, and can often be corrected by the
use of error correcting codes.
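As a minimal sketch of checksum-based detection (not a full error-correcting code), the following compares a CRC-32 value computed when the data was written with one computed on read-back; the data and the flipped bit are illustrative only.

```python
import zlib

original = b"account=12345;balance=100.00"
stored_checksum = zlib.crc32(original)   # saved alongside the data when it is written

# Simulate a single bit flipped during storage or transmission.
corrupted = bytearray(original)
corrupted[10] ^= 0x01

if zlib.crc32(bytes(corrupted)) != stored_checksum:
    print("corruption detected: retransmit or restore from backup")
```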
If an uncorrectable data corruption is detected, procedures such as automatic retransmission or restoration from
backups can be applied. Certain levels of RAID disk arrays have the ability to store and evaluate parity bits for data
across a set of hard disks and can reconstruct corrupted data upon the failure of a single or multiple disks, depending
on the level of RAID implemented.
If appropriate mechanisms are employed to detect and remedy data corruption, data integrity can be maintained. This
is particularly important in banking applications, where an undetected error can drastically affect an account balance,
and in the use of encrypted or compressed data, where a small error can make an extensive dataset unusable.[1]
References
[1] Data Integrity by Cern April 2007 Cern.ch (http:// indico. cern.ch/ getFile. py/ access?contribId=3& sessionId=0& resId=1&
materialId=paper&confId=13797)
Data integrity
Data integrity refers to data that has a complete or whole structure. All characteristics of the data, including business rules, rules for how pieces of data relate, dates, definitions and lineage, must be correct for data to be complete.
Per the discipline of data architecture, when functions are performed on the data the functions must ensure integrity.
Examples of functions are transforming the data, storing the history, storing the definitions (Metadata) and storing
the lineage of the data as it moves from one place to another. The most important aspect of data integrity per the data
architecture discipline is to expose the data, the functions and the data's characteristics.
Data that has integrity is identically maintained during any operation (such as transfer, storage or retrieval). Put
simply in business terms, data integrity is the assurance that data is consistent, certified and can be reconciled.
In terms of a database, data integrity refers to the process of ensuring that a database remains an accurate reflection of the universe of discourse it is modelling or representing. In other words, there is a close correspondence between the facts stored in the database and the real world it models.[1]
Types of integrity constraints
Data integrity is normally enforced in a database system by a series of integrity constraints or rules. Three types of
integrity constraints are an inherent part of the relational data model: entity integrity, referential integrity and domain
integrity.
Entity integrity concerns the concept of a primary key. Entity integrity is an integrity rule which states that every
table must have a primary key and that the column or columns chosen to be the primary key should be unique and
not null.
Referential integrity concerns the concept of a foreign key. The referential integrity rule states that any foreign key
value can only be in one of two states. The usual state of affairs is that the foreign key value refers to a primary key
value of some table in the database. Occasionally, and this will depend on the rules of the business, a foreign key
value can be null. In this case we are explicitly saying that either there is no relationship between the objects
represented in the database or that this relationship is unknown.
Domain integrity specifies that all columns in a relational database must be declared upon a defined domain. The primary unit of data in the relational data model is the data item. Such data items are said to be non-decomposable or atomic. A domain is a set of values of the same type. Domains are therefore pools of values from which the actual values appearing in the columns of a table are drawn.
If a database supports these features, it is the responsibility of the database to ensure data integrity as well as the consistency model for the data storage and retrieval. If a database does not support these features, it is the responsibility of the applications to ensure data integrity while the database supports the consistency model for the data storage and retrieval.
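A minimal sketch of such an application-level check, assuming hypothetical in-memory customers and orders tables: it verifies entity integrity (unique, non-null primary keys) and referential integrity (every foreign key references an existing parent row).

```python
# Hypothetical in-memory tables; a real system would read these from its data store.
customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
orders    = [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 3}]  # 3 has no parent

# Entity integrity: primary keys must be present and unique.
customer_ids = [c["id"] for c in customers]
assert all(cid is not None for cid in customer_ids), "null primary key"
assert len(customer_ids) == len(set(customer_ids)), "duplicate primary key"

# Referential integrity: every non-null foreign key must reference an existing parent.
valid_ids = set(customer_ids)
violations = [o for o in orders
              if o["customer_id"] is not None and o["customer_id"] not in valid_ids]
print(violations)  # [{'id': 11, 'customer_id': 3}]
```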
Having a single, well controlled, and well defined data integrity system increases stability (one centralized system
performs all data integrity operations), performance (all data integrity operations are performed in the same tier as
the consistency model), re-usability (all applications benefit from a single centralized data integrity system), and
maintainability (one centralized system for all data integrity administration).
Today, since all modern databases support these features (see Comparison of relational database management systems), it has become the de facto responsibility of the database to ensure data integrity. Outdated and legacy systems that use file systems (text, spreadsheets, ISAM, flat files, etc.) for their consistency model lack any kind of data integrity model. This requires companies to invest a large amount of time, money, and personnel in the creation of data integrity systems on a per-application basis that effectively just duplicate the existing data integrity systems found in modern databases. Many companies, and indeed many database systems themselves, offer products and services to migrate outdated and legacy systems to modern databases to provide these data integrity features. This offers companies substantial savings in time, money, and resources because they do not have to develop per-application data integrity systems that must be re-factored each time business requirements change.
Examples
An example of a data integrity mechanism is the parent-and-child relationship of related records. If a parent record owns one or more related child records, all of the referential integrity processes are handled by the database itself, which automatically ensures the accuracy and integrity of the data so that no child record can exist without a parent (also called being orphaned) and that no parent loses its child records. It also ensures that no parent record can be deleted while the parent record owns any child records. All of this is handled at the database level and does not require coding integrity checks into each application.
References
[1] Beynon-Davies P. (2004). Database Systems 3rd Edition. Palgrave, Basingstoke, UK. ISBN 1-4039-1601-2
•  This article incorporates public domain material from websites or documents of the General Services
Administration (in support of MIL-STD-188).
• National Information Systems Security Glossary
• Xiaoyun Wang; Hongbo Yu (2005). "How to Break MD5 and Other Hash Functions" (http:// www.infosec. sdu.
edu.cn/ uploadfile/papers/ How to Break MD5 and Other Hash Functions. pdf). EUROCRYPT. ISBN
3-540-25910-4.
Data profiling
Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and
collecting statistics and information about that data. The purpose of these statistics may be to:
1. Find out whether existing data can easily be used for other purposes
2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category
3. Give metrics on data quality, including whether the data conforms to particular standards or patterns
4. Assess the risk involved in integrating data for new applications, including the challenges of joins
5. Assess whether metadata accurately describes the actual values in the source database
6. Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
7. Have an enterprise view of all data, for uses such as master data management where key data is needed, or data
governance for improving data quality.
Data Profiling in Relation to Data Warehouse/Business Intelligence
Development
Introduction
Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content,
relationships and derivation rules of the data.[1] Profiling helps to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata.[2] Thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.[3] The result of the analysis is used both strategically, to determine suitability of the candidate source systems and give the basis for an early go/no-go decision, and tactically, to identify problems for later solution design, and to level sponsors' expectations.[1]
How to do Data Profiling
Data profiling utilizes different kinds of descriptive statistics such as minimum, maximum, mean, mode, percentile,
standard deviation, frequency, and variation as well as other aggregates such as count and sum. Additional metadata
information obtained during data profiling could be data type, length, discrete values, uniqueness, occurrence of null
values, typical string patterns, and abstract type recognition.[2][4][5] The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representation, and duplicates. Different analyses are performed for different structural levels. For example, single columns could be profiled individually to get an understanding of the frequency distribution of values, the type, and the use of each column. Embedded value dependencies can be exposed in cross-column analysis. Finally, overlapping value sets possibly representing foreign key relationships between entities can be explored in inter-table analysis.[2] Normally, purpose-built tools are used for data profiling to ease the process.[1][2][4][5][6][7] The computational complexity increases when going from single-column, to single-table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.[3]
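A minimal single-column profiling sketch over a hypothetical list of values, computing a few of the statistics and metadata items mentioned above (count, missing values, distinct values, min/max/mean, and a simple character-class pattern); real profiling tools compute these per column and across tables.

```python
from collections import Counter
import statistics

values = ["42", "17", None, "42", "abc", ""]   # hypothetical column contents

def pattern(value):
    # Map each character to a class: 9 for digits, A for letters, others kept as-is.
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

non_null = [v for v in values if v not in (None, "")]
numeric  = [float(v) for v in non_null if v.replace(".", "", 1).isdigit()]

profile = {
    "count": len(values),
    "missing_or_blank": len(values) - len(non_null),
    "distinct": len(set(non_null)),
    "min": min(numeric) if numeric else None,
    "max": max(numeric) if numeric else None,
    "mean": statistics.mean(numeric) if numeric else None,
    "patterns": Counter(pattern(v) for v in non_null),
}
print(profile)
```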
When to Conduct Data Profiling
According to Kimball,[1] data profiling is performed several times and with varying intensity throughout the data warehouse development process. A light profiling assessment should be undertaken as soon as candidate source systems have been identified, right after the acquisition of the business requirements for the DW/BI system. The purpose is to clarify at an early stage whether the right data is available at the right level of detail and whether anomalies can be handled subsequently. If this is not the case the project might have to be canceled.[1] More detailed profiling is done prior to the dimensional modeling process in order to determine what it will require to convert data into the dimensional model, and it extends into the ETL system design process to establish what data to extract and which filters to apply.[1]
Benefits of Data Profiling
The benefits of data profiling are to improve data quality, shorten the implementation cycle of major projects, and improve users' understanding of data.[7] Discovering business knowledge embedded in data itself is one of the significant benefits derived from data profiling.[3] Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.[7] Although data profiling is effective, remember to find a suitable balance and not slip into "analysis paralysis".[3][7]
References
[1] [Ralph Kimball et al. (2008), “The Data Warehouse Lifecycle Toolkit”, Second Edition, Wiley Publishing, Inc.], (p. 297) (p. 376)
[2] [David Loshin (2009), “Master Data Management”, Morgan Kaufmann Publishers], (p. 94-96)
[3] [David Loshin (2003), “Business Intelligence: The Savvy Manager’s Guide, Getting Onboard with Emerging IT”, Morgan Kaufmann
Publishers], (p. 110-111)]
[4] [Erhard Rahm and Hong Hai Do (2000), “Data Cleaning: Problems and Current Approaches” in “Bulletin of the Technical Committee on Data
Engineering”, IEEE Computer Society, Vol. 23, No. 4, December 2000]
[5] [Ranjit Singh, Dr Kawaljeet Singh et al. (2010), “A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing”,
IJCSI International Journal of Computer Science Issue, Vol. 7, Issue 3, No. 2, May 2010]
[6] "[Ralph Kimball (2004), “Kimball Design Tip #59: Surprising Value of Data Profiling”, Kimball Group, Number 59, September 14, 2004,
(www.rkimball.com/html/designtipsPDF/ KimballDT59 SurprisingValue.pdf)]
[7] [Jack E. Olson (2003), “Data Quality: The Accuracy dimension”, Morgan Kaufmann Publishers], (p.140-142)
Data quality assessment
Data quality assessment is the process of exposing technical and business data issues in order to plan data cleansing
and data enrichment strategies. Technical quality issues are generally easy to discover and correct, such as
• Inconsistent standards in structure, format and values
• Missing data, default values
• Spelling errors, data in wrong fields
Business quality issues are more subjective and are associated with business processes, such as generating accurate reports and ensuring that data-driven processes are working correctly.
Business data quality measures like accuracy and correctness are subjective and need Subject Matter Expert (SME)
involvement to assess data quality.
Data quality assurance
Data quality assurance is the process of profiling the data to discover inconsistencies and other anomalies in the data, and performing data cleansing activities (e.g. removing outliers, missing data interpolation) to improve the data quality.
These activities can be undertaken as part of data warehousing or as part of the database administration of an existing piece of application software.
Data Quality Firewall
A Data Quality Firewall is the use of software to protect a computer system from the entry of erroneous, duplicated
or poor quality data. Gartner estimates that poor quality data causes failure in up to 50% of Customer relationship
management systems. Older technology required the tight integration of data quality software, whereas this can now
be accomplished by loosely coupling technology in a service-oriented architecture.
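A minimal sketch of the idea, with a hypothetical screening function placed in front of a CRM: incoming records that fail a basic quality rule or duplicate an existing key are rejected at the boundary instead of entering the system.

```python
import re

existing_emails = {"ann@example.com"}   # keys already present in the system (hypothetical)

def firewall(record):
    """Return (accepted, reason) for an incoming customer record."""
    email = (record.get("email") or "").strip().lower()
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        return False, "malformed email"
    if email in existing_emails:
        return False, "duplicate record"
    existing_emails.add(email)
    return True, "accepted"

print(firewall({"email": "bob@example.com"}))   # (True, 'accepted')
print(firewall({"email": "ann@example.com"}))   # (False, 'duplicate record')
print(firewall({"email": "not-an-email"}))      # (False, 'malformed email')
```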
Data truncation
Data truncation occurs when data or a data stream (such as a file) is stored in a location too short to hold its entire
length. Data truncation may occur automatically, such as when a long string is written to a smaller buffer, or
deliberately, when only a portion of the data is wanted.
Depending on what type of data validation a program or operating system has, the data may be truncated silently
(i.e., without informing the user), or the user may be given an error message.
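A small sketch of both behaviours, assuming a hypothetical 10-character field width: silent truncation simply slices the value, while a validating variant raises an error instead.

```python
FIELD_WIDTH = 10   # hypothetical column width

def store_silently(value):
    # Silent truncation: excess characters are dropped without warning.
    return value[:FIELD_WIDTH]

def store_with_validation(value):
    # Validating variant: reject data that would not fit rather than truncate it.
    if len(value) > FIELD_WIDTH:
        raise ValueError(f"value longer than {FIELD_WIDTH} characters")
    return value

print(store_silently("Liechtenstein"))    # 'Liechtenst' -- data silently lost
try:
    store_with_validation("Liechtenstein")
except ValueError as exc:
    print("rejected:", exc)
```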
Data validation
In computer science, data validation is the process of ensuring that a program operates on clean, correct and useful
data. It uses routines, often called "validation rules" or "check routines", that check for correctness, meaningfulness,
and security of data that are input to the system. The rules may be implemented through the automated facilities of a
data dictionary, or by the inclusion of explicit application program validation logic.
For business applications, data validation can be defined through declarative data integrity rules, or procedure-based business rules.[1] Data that does not conform to these rules will negatively affect business process execution. Therefore, data validation should start with the business process definition and the set of business rules within this process. Rules can be collected through the requirements capture exercise.[2]
The simplest data validation verifies that the characters provided come from a valid set. For example, telephone numbers should include the digits and possibly the characters +, -, (, and ) (plus, minus, and brackets). A more sophisticated data validation routine would check that the user had entered a valid country code, i.e., that the number of digits entered matched the convention for the country or area specified.
Incorrect data validation can lead to data corruption or a security vulnerability. Data validation checks that data are
valid, sensible, reasonable, and secure before they are processed.
Validation methods
Allowed character checks
Checks that ascertain that only expected characters are present in a field. For example a numeric field may
only allow the digits 0-9, the decimal point and perhaps a minus sign or commas. A text field such as a
personal name might disallow characters such as < and >, as they could be evidence of a markup-based
security attack. An e-mail address might require exactly one @ sign and various other structural details.
Regular expressions are effective ways of implementing such checks. (See also data type checks below)
Batch totals
Checks for missing records. Numerical fields may be added together for all records in a batch. The batch total
is entered and the computer checks that the total is correct, e.g., add the 'Total Cost' field of a number of
transactions together.
Cardinality check
Checks that a record has a valid number of related records. For example, if a contact record is classified as a customer, it must have at least one associated order (cardinality > 0). If no order exists for a "customer" record, then it must either be changed to "seed" or the order must be created. This type of rule can be complicated by additional conditions. For example, if a contact record in a payroll database is marked as "former employee", then this record must not have any associated salary payments after the date on which the employee left the organisation (cardinality = 0).
Check digits
Used for numerical data. An extra digit is added to a number and is calculated from the other digits. The computer checks this calculation when data are entered. For example, the last digit of an ISBN for a book is a check digit calculated modulo 10[3] (see the validation sketch after this list).
Consistency checks
Checks fields to ensure data in these fields corresponds, e.g., If Title = "Mr.", then Gender = "M".
Control totals
This is a total done on one or more numeric fields which appears in every record. This is a meaningful total,
e.g., add the total payment for a number of Customers.
Cross-system consistency checks
Compares data in different systems to ensure it is consistent, e.g., The address for the customer with the same
id is the same in both systems. The data may be represented differently in different systems and may need to
be transformed to a common format to be compared, e.g., one system may store customer name in a single
Name field as 'Doe, John Q', while another stores it in three different fields: First_Name (John), Last_Name (Doe) and Middle_Name (Quality); to compare the two, the validation engine would have to transform data from the second system to match the data from the first. For example, using SQL, Last_Name || ', ' || First_Name || ' ' || substr(Middle_Name, 1, 1) would convert the data from the second system to look like the data from the first: 'Doe, John Q'.
Data type checks
Checks the data type of the input and gives an error message if the input data does not match the chosen data type, e.g., in an input box accepting numeric data, if the letter 'O' was typed instead of the number zero, an error message would appear.
File existence check
Checks that a file with a specified name exists. This check is essential for programs that use file handling.
Format or picture check
Checks that the data is in a specified format (template), e.g., dates have to be in the format DD/MM/YYYY.
Regular expressions should be considered for this type of validation.
Hash totals
This is just a batch total done on one or more numeric fields which appears in every record. This is a
meaningless total, e.g., add the Telephone Numbers together for a number of Customers.
Limit check
Unlike range checks, data is checked for one limit only, upper OR lower, e.g., data should not be greater than
2 (<=2).
Logic check
Checks that an input does not yield a logical error, e.g., an input value should not be 0 if it will later be used as a divisor somewhere in the program.
Presence check
Checks that important data are actually present and have not been missed out, e.g., customers may be required
to have their telephone numbers listed.
Range check
Checks that the data lie within a specified range of values, e.g., the month of a person's date of birth should lie
between 1 and 12.
Referential integrity
In modern relational databases, values in two tables can be linked through a foreign key and a primary key. If values in the primary key field are not constrained by the database's internal mechanisms,[4] then they should be validated. Validation of the foreign key field checks that the referencing table always refers to a valid row in the referenced table.[5]
Spelling and grammar check
Looks for spelling and grammatical errors.
Uniqueness check
Checks that each value is unique. This can be applied to several fields (e.g. Address, First Name, Last Name).
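A few of the checks above expressed as a small, hedged sketch (the data values are hypothetical): an allowed-character check using a regular expression, a range check, and the modulo-10 check digit used for 13-digit ISBNs.

```python
import re

def allowed_characters(phone):
    # Allowed character check: digits plus +, -, parentheses and spaces.
    return re.fullmatch(r"[0-9+\-() ]+", phone) is not None

def range_check(month):
    # Range check: the month of a date of birth must lie between 1 and 12.
    return 1 <= month <= 12

def isbn13_check_digit_ok(isbn):
    # Check digit: weights alternate 1 and 3; the weighted sum must be divisible by 10.
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

print(allowed_characters("+44 (0)20 7946 0958"))   # True
print(range_check(13))                             # False
print(isbn13_check_digit_ok("978-0-7645-6757-5"))  # True for this example
```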
External links
• Data validation in Microsoft Excel[6]
• Flat File Checker - data validation tool[7]
• Unix Shell Script based input validation function[8]
References
[1] Data Validation, Data Integrity, Designing Distributed Applications with Visual Studio .NET (http:// msdn. microsoft.com/ en-us/ library/
aa291820(VS.71).aspx)
[2] Arkady Maydanchik (2007), "Data Quality Assessment", Technics Publications, LLC
[3] ISBN International ISBN Agency Frequently Asked Questions: What is the format of an ISBN? (http:// www.isbn-international.org/faqs/
view/5#q_5)
[4] Oracle Foreign Keys (http:// www. techonthenet. com/ oracle/foreign_keys/ foreign_keys.php)
[5] Referential Integrity, Designing Distributed Applications with Visual Studio .NET (http:/ / msdn. microsoft.com/ en-us/ library/
aa292166(VS.71).aspx)
[6] http:// www. contextures. com/ xlDataVal01. html
[7] http:/ / www. flat-file.net
[8] http:/ / blog.anantshri. info/2009/ 06/ 08/ input_validation_shell_script/
Data verification
Data verification is a process wherein the data is checked for accuracy and inconsistencies after data migration is done.[1]
It helps to determine whether data was accurately translated when data is transported from one source to another, is
complete, and supports processes in the new system. During verification, there may be a need for a parallel run of
both systems to identify areas of disparity and forestall erroneous data loss.
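A minimal post-migration verification sketch, assuming both the source and target systems can be exported as lists of records (the records shown are hypothetical): it compares record counts and per-record hashes to locate rows that were dropped or altered.

```python
import hashlib
import json

def fingerprint(record):
    # Stable per-record hash: serialize with sorted keys, then hash the bytes.
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

source = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]   # pre-migration export
target = [{"id": 1, "name": "Acme"}]                                # post-migration export

print("record counts match:", len(source) == len(target))

target_hashes = {fingerprint(r) for r in target}
missing_or_altered = [r for r in source if fingerprint(r) not in target_hashes]
print(missing_or_altered)   # [{'id': 2, 'name': 'Globex'}]
```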
References
[1] http:/ / www. datacap. com/ products/ features/verify/
External links
• PC Guide article (http:// www.pcguide. com/ care/ bu/ howVerification-c.html)
Database integrity
Database integrity ensures that data entered into the database is accurate, valid, and consistent. Any applicable
integrity constraints and data validation rules must be satisfied before permitting a change to the database.
Three basic types of database integrity constraints are:
• Entity integrity, not allowing multiple rows to have the same identity within a table.
• Domain integrity, restricting data to predefined data types, e.g.: dates.
• Referential integrity, requiring the existence of a related row in another table, e.g. a customer for a given customer
ID.
Database preservation
Database preservation usually involves converting the information stored in a database, without losing the characteristics (context, content, structure, appearance and behaviour) of the data, to a format which can be used in the long term, even as technology and everyday working knowledge change.
Database preservation projects
In the past, different research groups have contributed to solving the problems of database preservation. Research projects carried out in this regard include:
1. Software independent archival of relational databases (SIARD)[1]
2. Repository of Authentic Digital Objects (RODA)[2]
3. Digital Preservation Testbed[3]
4. Lots of Copies Keep Stuff Safe (LOCKSS)[4]
References
[1] http:/ / arxiv.org/ abs/ cs/ 0408054v1
[2] http:/ / repositorium.sdum. uminho. pt/ bitstream/ 1822/ 8226/ 1/ RodaAndCrib. pdf
[3] http:/ / www. digitaleduurzaamheid. nl/bibliotheek/ docs/ volatility-permanence-databases-en.pdf
[4] http:/ / www. lockss. org
DataCleaner
[Screenshot: DataCleaner 1.4]
Developer(s): eobjects.org[1]
Stable release: 2.0
Written in: Java
Operating system: Cross-platform
Type: Data profiling, data quality
License: Lesser General Public License
Website: [2]
DataCleaner is the flagship application of the eobjects.org[3] open source community. DataCleaner is a data quality application suite with functionality for data profiling, transformation and reporting. The project was founded in late 2007 by Danish student Kasper Sørensen, who wrote a term paper[4] on the process of establishing the project and on the ways of open source software development.
Supported datastores
DataCleaner supports read access to many different types of datastores:
• JDBC compliant databases (such as Oracle, MySQL, Microsoft SQL Server, Postgresql, Firebird, SQLite,
Hsqldb, Derby/JavaDB)
• Comma-separated values (.csv) files
• Microsoft Excel (.xls and .xlsx) spreadsheets
• XML files
• OpenDocument database (.odb) files
• Microsoft Access (.mdb) database files
• DBase (.dbf) database files
History
0.x: A school project
From early on, the DataCleaner 0.x versions were released as part of Kasper Sørensen's term paper project. The 0.x versions had a user concept similar to the later 1.x versions, but the underlying querying mechanism was based on a single data factory pattern, whereby the application could only retrieve data from the various datastores using a single method of retrieval (get all rows).
1.x: An independent OSS project
The 1.x versions of DataCleaner gained a lot of popularity among DQ professionals. The application was partitioned into three specific data quality function areas:
Profiler
The profiler in DataCleaner enables the user to gain insight into the content of the datastore. The profiler can calculate and present many interesting metrics that help the user become aware of and understand data quality issues. Examples of such metrics are the distribution of values, max/min/average values, patterns used in values, etc.
Validator
The validator assumes a higher degree of data insight, since it enables the user to create business rules for the data to honor. Rules for data can be defined in a variety of ways: through JavaScript, lookup dictionaries, regular expressions and more.
Comparator
The comparator enables a user to compare two separate datastores and look for values from one datastore within
another datastore and vice versa.
2.x: Acquisition by Human Inference
On 14 February 2011, it was announced that the data quality vendor Human Inference had acquired eobjects.org, hired Kasper Sørensen and sponsored the development of DataCleaner 2.0. DataCleaner 2.0 was released the same day. It introduces a new user experience, where all of the previous function areas have been unified into a single workbench.
License history
As of version 1.5, DataCleaner changed its license from the Apache License version 2.0 to the Lesser General Public License. According to the DataCleaner website, the change was made to "ensure that improvements are submitted back to the projects" and so that "we don't risk that anyone sell modified versions of our projects".[5]
References
[1] http:/ / eobjects. org
[2] http:/ / datacleaner.eobjects. org
[3] eobjects.org (http:// www. eobjects. org)
[4] Sørensen, Kasper (2008). Udvikling og styring af Open Source projekter (Danish). Cand.Merc.Dat, Copenhagen Business School,
Downloadable from http:// eobjects. org/ resources/ download/ afloesningsopgave. pdf
[5] eobjects.org news site, http://eobjects.org/trac/blog/change-in-preferred-license
External links
• the DataCleaner community (http:// datacleaner.eobjects. org)
• roadmap (http:/ / eobjects. org/ trac/roadmap) for the DataCleaner project
Declarative Referential Integrity
Declarative Referential Integrity (DRI) is one of the techniques in the SQL database programming language to
ensure data integrity.
Meaning in SQL
A table (called the child table) can refer to a column (or a group of columns) in another table (the parent table) by
using a foreign key. The referenced column(s) in the parent table must be under a unique constraint, such as a
primary key. Also, self-references are possible (though not fully implemented in MS SQL Server[1]). On inserting a new row into the child table, the relational database management system (RDBMS) checks whether the entered key value exists in the parent table. If not, no insert is possible. It is also possible to specify DRI actions on UPDATE and DELETE, such as CASCADE (forwards a change/delete in the parent table to the child tables), NO ACTION (if the specific row is referenced, changing the key is not allowed) or SET NULL / SET DEFAULT (a changed/deleted key in the parent table results in setting the child values to NULL or to the DEFAULT value if one is specified).
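A small runnable sketch of these rules using Python's built-in sqlite3 module (SQLite enforces foreign keys only when the pragma is switched on); the table and column names are illustrative. It shows a rejected insert with a missing parent key and an ON DELETE CASCADE action.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent(id) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (10, 1)")

# Inserting a child row that references a missing parent key is rejected.
try:
    conn.execute("INSERT INTO child VALUES (11, 99)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)

# Deleting the parent cascades the delete to its child rows.
conn.execute("DELETE FROM parent WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM child").fetchone())   # (0,)
```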
Product specific meaning
In Microsoft SQL Server the term DRI also applies to the assigning of permissions to users on a database object.
Giving DRI permission to a database user allows them to add foreign key constraints on a table.[2]
References
[1] Microsoft Support (2007-02-11). "Error message 1785 occurs when you create a FOREIGN KEY constraint that may cause multiple cascade
paths" (http:/ / support. microsoft. com/ kb/ 321843/ en-us). microsoft.com. . Retrieved 2009-01-24.
[2] Chigrik, Alexander (2003-08-13). "Managing Users Permissions on SQL Server" (http:/ / www. databasejournal.com/ features/mssql/
article.php/ 2246271). Database Journal. . Retrieved 2006-12-17.
External links
• DRI versus Triggers (http:// www. cvalde. net/ document/ declaRefIntegVsTrig.htm)
Digital continuity
Digital continuity refers to the ability of a business or organisation to maintain its digital information in such a way
that the information will continue to be available, as needed, despite changes in digital storage technology.[1] It focuses on making sure that information is complete, available and therefore usable for business or organisational needs. Activities involved with managing digital continuity include information management, information risk assessment and managing technical environments, including file format conversion. Digital continuity management is particularly important to organisations that have a duty to act transparently, accountably and legally, such as government and infrastructure companies,[2] or that are responsible for maintaining repositories of information in digital form, such as archives and libraries.[3]
Scope of digital continuity
Digital continuity is often confused with digital preservation and business continuity. While there is some overlap
with both these areas, they are separate issues. Digital preservation focuses on long term record storage, to allow
access to digital information in the distant future in an effort to prevent a Digital Dark Age. Business continuity
focuses on making sure that a business can keep operating, looking at ways to prevent problems from disrupting the business and at disaster recovery should there be an emergency. Digital continuity, on the other hand, is concerned with the ability to make information continuously usable. What constitutes usable will differ depending on the organisation's needs. Five key areas for consideration are whether the user can:[4]
• find it when needed
• open it as needed
• work with it in the way needed
• understand what it is and what it is about
• trust that it is what it says it is.
Major institutions that have begun digital continuity projects include The National Archives in the United Kingdom,[5] Archives New Zealand,[6] the Welsh Assembly Government (in conjunction with the University of Wales, Newport),[7] and the National Library of Australia.[8]
References
[1] MacLean, Margaret; Davis, Ben H (eds) (1999). Time & Bits: Managing Digital Continuity. Getty Publications. ISBN 0892365838.
[2] "Achieving digital continuity" (http:/ / futureproof.records.nsw. gov.au/ achieving-digital-continuity/). State of New South Wales. 4 June
2010. . Retrieved 18 December 2010.
[3] Dalbello, Marija (October 2002). "Is There a Text in This Library? History of the Book and Digital Continuity" (http:// hdl.handle.net/
10150/ 105488). Journal of Education for Library and Information Science 43 (3): 197–204. .
[4] "Understanding digital continuity?" (http:// www.nationalarchives.gov. uk/ documents/ understanding-digital-continuity.pdf). The National
Archives. . Retrieved 21 December 2010.
[5] "What is digital continuity?" (http:/ / www. nationalarchives. gov.uk/ information-management/projects-and-work/dc-what-is.htm). The
National Archives. . Retrieved 18 December 2010.
[6] "Digital Continuity Action Plan" (http:// archives. govt.nz/ advice/ digital-continuity-programme/digital-continuity-action-plan). Archives
New Zealand. . Retrieved 18 December 2010.
[7] "Digital Continuity" (http:/ / www.bcs. org/ server.php?show=conEvent.5639). Chartered Institute for IT. . Retrieved 18 December 2010.
[8] Gatenby, Pam (1 February 2002). "Digital continuity: the role of the National Library of Australia" (http:/ / www. docstoc. com/ docs/
67153296/Digital-continuity-the-role-of-the-National-Library-of-Australia). The Australian Library Jouurnal. . Retrieved 18 December 2010.
Digital preservation
Digital preservation is the active management of digital information over time to ensure its accessibility.
Preservation of digital information is widely considered to require more constant and ongoing attention than
preservation of other media.[1]
This constant input of effort, time, and money to handle rapid technological and
organizational advance is considered a major stumbling block for preserving digital information. Indeed, while we
are still able to read our written heritage from several thousand years ago, the digital information created merely a
decade ago is in serious danger of being lost, creating a digital Dark Age.
Digital preservation is the set of processes and activities that ensure continued access to information and all kinds of
records, scientific and cultural heritage existing in digital formats. This includes the preservation of materials
resulting from digital reformatting, but particularly information that is born-digital and has no analog counterpart. In
the language of digital imaging and electronic resources, preservation is no longer just the product of a program but
an ongoing process. In this regard the way digital information is stored is important in ensuring its longevity. The
long-term storage of digital information is assisted by the inclusion of preservation metadata.
Digital preservation is defined as long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span for which the information is required. Long-term is defined as "long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely".[2] "Retrieval" means obtaining needed digital files from the long-term, error-free digital storage, without the possibility of corrupting the continued error-free storage of the digital files. "Interpretation" means that the retrieved digital files (for example, files of texts, charts, images or sounds) are decoded and transformed into usable representations. This is often interpreted as "rendering", i.e. making the content available for a human to access. However, in many cases it will mean that the files can be processed by computational means.
Why active preservation is necessary
Society's heritage has been presented on many different materials, including stone, vellum, bamboo, silk, and paper.
Now a large quantity of information exists in digital forms, including emails, blogs, social networking websites,
national elections websites, web photo albums, and sites which change their content over time. According to "Preserving the Internet", a 1997 Scientific American article by Brewster Kahle, who founded the Internet Archive in 1996, the average life of a URL at the time was 44 days.[3]
The unique characteristic of digital forms makes it easy to create content and keep it up-to-date, but at the same time
brings many difficulties in the preservation of this content. Margaret Hedstrom points out that "...digital preservation
raises challenges of a fundamentally different nature which are added to the problems of preserving traditional
format materials."[4]
Physical deterioration
The media on which digital contents are stored are more vulnerable to deterioration and catastrophic loss than some
analog media such as paper. While acid paper is prone to deterioration, becoming brittle and yellowing with age, the
deterioration may not become apparent for some decades and progresses slowly. It remains possible to retrieve
information without loss once deterioration is noticed. Digital data recording media may deteriorate more rapidly
and once the deterioration starts, in most cases there may already be data loss. This characteristic of digital forms
leaves a very short time frame for preservation decisions and actions.
Digital obsolescence
Another challenge is the issue of long-term access to data. Digital technology is developing quickly and retrieval and
playback technologies can become obsolete in a matter of years. When faster, more capable and less expensive
storage and processing devices are developed, older versions may be quickly replaced. When a software or decoding
technology is abandoned, or a hardware device is no longer in production, records created with such technologies are
at great risk of loss, simply because they are no longer accessible. This process is known as digital obsolescence.
This challenge is exacerbated by a lack of established standards, protocols and proven methods for preserving digital
information.[5] We used to save copies of data on tapes, but media standards for tapes have changed considerably over the last five to ten years, and there is no guarantee that tapes will be readable in the future.[6] Recovering these materials may require special tools.[7] Hedstrom further explained that almost all digital library research has been focused on "...architectures and systems for information organization and retrieval, presentation and visualization, and administration of intellectual property rights" and that "...digital preservation remains largely experimental and replete with the risks associated with untested methods".
Strategies
In 2006, the Online Computer Library Center developed a four-point strategy for the long-term preservation of
digital objects that consisted of:
• Assessing the risks for loss of content posed by technology variables such as commonly used proprietary file
formats and software applications.
• Evaluating the digital content objects to determine what type and degree of format conversion or other
preservation actions should be applied.
• Determining the appropriate metadata needed for each object type and how it is associated with the objects.
• Providing access to the content.[8]
There are several additional strategies that individuals and organizations may use to actively combat the loss
of digital information.
Refreshing
Refreshing is the transfer of data between two carriers of the same type of storage medium, so that there are no bit-level changes or alteration of the data.[9] An example is transferring census data from an old preservation CD to a new one. This strategy may need to be combined with migration when the software or hardware required to read the data is no longer available or is unable to understand the format of the data. Refreshing will likely always be necessary due to the deterioration of physical media.
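A hedged sketch of verifying a refresh: after copying an object to new media, compare cryptographic digests of the original and the copy to confirm that no bits changed. The file paths are hypothetical.

```python
import hashlib

def sha256_of(path):
    # Hash the file in chunks so that large preservation objects fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: the object on the old carrier and its refreshed copy.
if sha256_of("/old_media/census_1990.dat") == sha256_of("/new_media/census_1990.dat"):
    print("refresh verified: the copies are bit-identical")
else:
    print("mismatch: re-copy or restore from another replica")
```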
Migration
Migration is the transferring of data to newer system environments (Garrett et al., 1996). This may include
conversion of resources from one file format to another (e.g., conversion of Microsoft Word to PDF or
OpenDocument), from one operating system to another (e.g., Windows to Linux) or from one programming
language to another (e.g., C to Java) so the resource remains fully accessible and functional. Resources that are
migrated run the risk of losing some type of functionality since newer formats may be incapable of capturing all the
functionality of the original format, or the converter itself may be unable to interpret all the nuances of the original
format. The latter is often a concern with proprietary data formats.
The US National Archives Electronic Records Archives and Lockheed Martin are jointly developing a migration
system that will preserve any type of document, created on any application or platform, and delivered to the archives
on any type of digital media.[10] In the system, files are translated into flexible formats, such as XML; they will therefore be accessible by technologies in the future.[10] Lockheed Martin argues that it would be impossible to develop an emulation system for the National Archives ERA because the volume of records and cost would be prohibitive.[10]
Replication
Creating duplicate copies of data on one or more systems is called replication. Data that exists as a single copy in
only one location is highly vulnerable to software or hardware failure, intentional or accidental alteration, and
environmental catastrophes like fire, flooding, etc. Digital data is more likely to survive if it is replicated in several
locations. Replicated data may introduce difficulties in refreshing, migration, versioning, and access control since the
data is located in multiple places.
Emulation
Emulation is the replicating of functionality of an obsolete system.[11] Examples include emulating an Atari 2600 on
a Windows system or emulating WordPerfect 1.0 on a Macintosh. Emulators may be built for applications, operating
systems, or hardware platforms. Emulation has been a popular strategy for retaining the functionality of old video
game systems, such as with the MAME project. The feasibility of emulation as a catch-all solution has been debated
in the academic community. (Granger, 2000)
Raymond A. Lorie has suggested that a Universal Virtual Computer (UVC) could be used to run any software in the future on a yet unknown platform.[12] The UVC strategy uses a combination of emulation and migration. The UVC strategy has not yet been widely adopted by the digital preservation community.
Jeff Rothenberg, a major proponent of emulation for digital preservation in libraries, working in partnership with the Koninklijke Bibliotheek and National Archief of the Netherlands, has recently helped launch Dioscuri, a modular emulator that succeeds in running MS-DOS, WordPerfect 5.1, DOS games, and more.[13]
Metadata attachment
Metadata is data on a digital file that includes information on creation, access rights, restrictions, preservation history, and rights management.[14] Metadata attached to digital files may be affected by file format obsolescence. ASCII is considered to be the most durable format for metadata[15] because it is widespread, backwards compatible when used with Unicode, and utilizes human-readable characters rather than numeric codes. It retains the information itself, but not the structure in which the information is presented. For higher functionality, SGML or XML should be used. Both markup languages are stored in ASCII format, but contain tags that denote structure and format.
Trustworthy digital objects
Digital objects that can speak to their own authenticity are called trustworthy digital objects (TDOs). TDOs were
proposed by Henry M. Gladney to enable digital objects to maintain a record of their change history so future users
can know with certainty that the contents of the object are authentic.[16] Other preservation strategies like replication and migration are necessary for the long-term preservation of TDOs.
Digital sustainability
Digital sustainability encompasses a range of issues and concerns that contribute to the longevity of digital information.[17] Unlike traditional, temporary strategies and more permanent solutions, digital sustainability implies a more active and continuous process. Digital sustainability concentrates less on the solution and technology and more on building an infrastructure and approach that is flexible, with an emphasis on interoperability, continued maintenance and continuous development.[18] Digital sustainability incorporates activities in the present that will facilitate access and availability in the future.
Digital preservation standards
To standardize digital preservation practice and provide a set of recommendations for preservation program
implementation, the Reference Model for an Open Archival Information System (OAIS) was developed. The
reference model (ISO 14721:2003) includes the following responsibilities that an OAIS archive must abide by:
• Negotiate for and accept appropriate information from information Producers.
• Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation.
• Determine, either by itself or in conjunction with other parties, which communities should become the
Designated Community and, therefore, should be able to understand the information provided.
• Ensure that the information to be preserved is Independently Understandable to the Designated Community. In
other words, the community should be able to understand the information without needing the assistance of the
experts who produced the information.
• Follow documented policies and procedures which ensure that the information is preserved against all
reasonable contingencies, and which enable the information to be disseminated as authenticated copies of the
original, or as traceable to the original.
• Make the preserved information available to the Designated Community.[19]
OAIS is concerned with all technical aspects of a digital object's life cycle: ingest into and storage in a preservation infrastructure, data management, accessibility, and distribution. The model also addresses metadata issues and recommends that five types of metadata be attached to a digital object: reference (identification) information, provenance (including preservation history), context, fixity (authenticity indicators), and representation (formatting, file structure, and what "imparts meaning to an object's bitstream").[20]
Prior to Gladney's proposal of TDOs was the Research Libraries Group's (RLG) development of "attributes and responsibilities" that denote the practices of a "Trusted Digital Repository" (TDR). The seven attributes of a TDR are: "compliance with the Reference Model for an Open Archival Information System (OAIS), Administrative responsibility, Organizational viability, Financial sustainability, Technological and procedural suitability, System security, Procedural accountability." Among RLG's attributes and responsibilities were recommendations calling for the collaborative development of digital repository certifications, models for cooperative networks, and sharing of research and information on digital preservation with regard to intellectual property rights.[21]
Digital sound preservation standards
In January 2004, the Council on Library and Information Resources (CLIR) hosted a roundtable meeting of audio
experts discussing best practices, which culminated in a report delivered March 2006. This report investigated
procedures for reformatting sound from analog to digital, summarizing discussions and recommendations for best
practices for digital preservation. Participants made a series of 15 recommendations for improving the practice of
analog audio transfer for archiving:
• Develop core competencies in audio preservation engineering. Participants noted with concern that the number of
experts qualified to transfer older recordings is shrinking and emphasized the need to find a way to ensure that the
technical knowledge of these experts can be passed on.
• Develop arrangements among smaller institutions that allow for cooperative buying of esoteric materials and
supplies.
• Pursue a research agenda for magnetic-tape problems that focuses on a less destructive solution for hydrolysis
than baking, relubrication of acetate tapes, and curing of cupping.
• Develop guidelines for the use of automated transfer of analog audio to digital preservation copies.
• Develop a web-based clearinghouse for sharing information on how archives can develop digital preservation
transfer programs.
• Carry out further research into nondestructive playback of broken audio discs.
• Develop a flowchart for identifying the composition of various types of audio discs and tapes.
• Develop a reference chart of problematic media issues.
• Collate relevant audio engineering standards from organizations.
• Research safe and effective methods for cleaning analog tapes and discs.
• Develop a list of music experts who could be consulted for advice on transfer of specific types of musical content
(e.g., determining the proper key so that correct playback speed can be established).
• Research the life expectancy of various audio formats.
• Establish regional digital audio repositories.
• Cooperate to develop a common vocabulary within the field of audio preservation.
• Investigate the transfer of technology from such fields as chemistry and materials science to various problems in
audio preservation.[22]
Updated technical guidelines on the creation and preservation of digital audio have been prepared by the International Association of Sound and Audiovisual Archives (IASA).[23]
Examples of digital preservation initiatives
• Xena is a free Java-based open source archiving solution that can be installed on any desktop PC. It converts
proprietary document, graphics and audio file formats to open formats, and normalizes other binary files to ASCII
with an XML file wrapper.
• ArchivalWare (www.ArchivalWare.net), built by PTFS, Inc., is a digital library solution created specifically to house, disseminate, preserve and allow discovery of digital assets. The product supports archival versions and dissemination versions of ingested digital objects, creates PDF/A files upon ingestion for long-term digital preservation, and includes XMP metadata support, which allows rich metadata to live in and move with the digital object itself.
• DSpace is open source software that is available to anyone who has the World Wide Web. DSpace takes data in
multiple formats (text, video, audio, or data), distributes it over the web, indexes the data (for easy retrieval), and
preserves the data over time.
• The British Library is responsible for several programmes in the area of digital preservation. The National
Archives of the United Kingdom have also pioneered various initiatives in the field of digital preservation.
• PADI is a comprehensive archive of information on the topic of digital preservation from the National Library of
Australia.
• SimpleDL can store multiple formats, including text, images, video, audio, and data. SimpleDL uses Amazon S3
to provide 99.999999999% durability for the files stored in its preservation system.
Large-scale digital preservation initiatives (LSDIs)
Many research libraries and archives have begun or are about to begin large-scale digital preservation initiatives (LSDIs). The main players in LSDIs are cultural institutions, commercial companies such as Google and Microsoft, and non-profit groups including the Open Content Alliance (OCA), the Million Book Project (MBP), and HathiTrust. The primary motivation of these groups is to expand access to scholarly resources.
LSDIs: library perspective
Approximately 30 cultural entities, including the 12-member Committee on Institutional Cooperation (CIC), have
signed digitization agreements with either Google or Microsoft. Several of these cultural entities are participating in
the Open Content Alliance (OCA) and the Million Book Project (MBP). Some libraries are involved in only one
initiative and others have diversified their digitization strategies through participation in multiple initiatives. The
three main reasons for library participation in LSDIs are: Access, Preservation and Research and Development. It is
hoped that digital preservation will ensure that library materials remain accessible for future generations. Libraries
have a perpetual responsibility for their materials and a commitment to archive their digital materials. Libraries plan
to use digitized copies as backups for works in case they go out of print, deteriorate, or are lost or damaged.
Footnotes
[1] "Lifecycle Information for E-literature" (http:// eprints. ucl.ac.uk/ archive/00001854/ ). LIFE (http:// www. ucl. ac.uk/ ls/ life/). .
Retrieved 2007-06-14.
[2] Consultative Committee for Space Data Systems. (2002). Reference Model for an Open Archival Information System (OAIS). Washington,
DC: CCSDS Secretariat, p. 1-1
[3] Brewster Kahle Preserving the Internet. «Scientific American», 276 (1997), n. 3, p. 72-74. (http:/ / web.archive.org/web/ 19980627072808/
http:// www. sciam. com/ 0397issue/ 0397kahle. html/ ) retrieved on 2011-02-06
[4] Hedstrom, M. (1997). Digital preservation: a time bomb for Digital Libraries. Retrieved on December 4th, 2007, from http:// www.uky. edu/
~kiernan/ DL/hedstrom. html.
[5] Levy, D. M. & Marshall, C. C. (1995). Going digital: a look at assumptions underlying digital libraries," Communications of the ACM, 58,
No. 4: 77-84.
[6] Flugstad, Myron. (2007). Website Archiving: the Long-Term Preservation of Local Born Digital Resources. Arkansas Libraries v. 64 no. 3
(Fall 2007) p. 5-7
[7] Ross, Seamus; Gow, Ann (1999). Digital archaeology? Rescuing Neglected or Damaged Data Resources (http:/ / www. ukoln. ac. uk/
services/elib/ papers/ supporting/ pdf/p2. pdf). Bristol & London: British Library and Joint Information Systems Committee.
ISBN 1-900508-51-6
[8] Online Computer Library Center, Inc. (2006). OCLC Digital Archive Preservation Policy and Supporting Documentation (http:/ / www. oclc.
org/ support/ documentation/ digitalarchive/ preservationpolicy.pdf), p. 5
[9] Cornell University Library. (2005) Digital Preservation Management: Implementing Short-term Strategies for Long-term Problems. (http://
www.library.cornell.edu/ iris/ tutorial/ dpm/ eng_index. html/ )
[10] Reagan, Brad (2006). "The Digital Ice Age" (http:// www. popularmechanics.com/ technology/ industry/4201645. html). Popular
Mechanics. .
[11] Rothenberg, Jeff (1998). Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation (http:// www.
clir.org/ PUBS/ reports/ rothenberg/contents. html). Washington, DC, USA: Council on Library and Information Resources.
ISBN 1-887334-63-7.
[12] Lorie, Raymond A. (2001). " Long Term Preservation of Digital Information (http:// doi.acm.org/ 10. 1145/ 379437. 379726)".
Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '01). Roanoke, Virginia, USA. pp. 346–352.
[13] Hoeven, J. (2007). "Dioscuri: emulator for digital preservation" (http:/ / www.dlib. org/ dlib/ november07/11inbrief.html). D-Lib
Magazine 13 (11/12). doi:10.1045/november2007-inbrief. .
[14] NISO Framework Advisory Group. (2007). A Framework of Guidance for Building Good Digital Collections, 3rd edition (http:// www.
niso.org/publications/ rp/framework3.pdf), p. 57,
[15] National Initiative for a Networked Cultural Heritage. (2002). NINCH Guide to Good Practice in the Digital Representation and
Management of Cultural Heritage Materials (http:// www.nyu. edu/ its/ humanities/ ninchguide/ )
[16] Gladney, H. M. (2004). "Trustworthy 100-year digital objects: Evidence after every witness is dead" (http:// doi. acm.org/ 10.1145/
1010614.1010617). ACM Transactions on Information Systems 22 (3): 406–436. doi:10.1145/1010614.1010617. .
[17] Bradley, K. (Summer 2007). Defining digital sustainability. Library Trends v. 56 no 1 p. 148-163.
[18] Sustainability of Digital Resources. (2008). TASI: Technical Advisory Service for Images. (http://www.tasi.ac.uk/advice/managing/sust.html)
[19] Consultative Committee for Space Data Systems. (2002). Reference Model for an Open Archival Information System (OAIS). Washington, DC: CCSDS Secretariat, p. 3-1
[20] Cornell University Library. (2005). Digital Preservation Management: Implementing Short-term Strategies for Long-term Problems (http://www.library.cornell.edu/iris/tutorial/dpm/eng_index.html)
[21] Research Libraries Group. (2002). Trusted Digital Repositories: Attributes and Responsibilities (http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf)
[22] Council on Library and Information Resources. Publication 137: Capturing Analog Sound for Digital Preservation: Report of a Roundtable Discussion of Best Practices for Transferring Analog Discs and Tapes, March 2006 (http://www.clir.org/pubs/abstract/pub137abst.html)
[23] IASA (2009). Guidelines on the Production and Preservation of Digital Audio Objects (http://www.iasa-web.org/tc04/audio-preservation)
References
• Garrett, J., D. Waters, H. Gladney, P. Andre, H. Besser, N. Elkington, H. Gladney, M. Hedstrom, P. Hirtle, K. Hunter, R. Kelly, D. Kresh, M. Lesk, M. Levering, W. Lougee, C. Lynch, C. Mandel, S. Mooney, A. Okerson, J. Neal, S. Rosenblatt, and S. Weibe (1996). "Preserving digital information: Report of the task force on archiving of digital information" (http://www.rlg.org/legacy/ftpd/pub/archtf/final-report.pdf) (PDF). Commission on Preservation and Access and the Research Libraries Group. Retrieved 2009-06-23.
• Gladney, H. M.; Lorie, R. A. (2005). "Trustworthy 100-year digital objects: durable encoding for when it's too late to ask" (http://portal.acm.org/citation.cfm?id=1080343.1080346). ACM Transactions on Information Systems 23 (3): 299–324. doi:10.1145/1080343.1080346.
• Gladney, H. M. (2006). "Principles for digital preservation" (http://portal.acm.org/citation.cfm?id=1113034.1113038). Communications of the ACM 49 (2): 111–116. doi:10.1145/1113034.1113038.
• Granger, Stewart (2000). "Emulation as a Digital Preservation Strategy" (http://www.dlib.org/dlib/october00/granger/10granger.html). D-Lib Magazine 6 (10). doi:10.1045/october2000-granger.
• Edwards, Eli (2004). "Ephemeral to Enduring: The Internet Archive and Its Role in Preserving Digital Media". Information Technology & Libraries 23 (1).
• Hedstrom, M., Ross, S., Ashley, K., Christensen-Dalsgaard, B., Duff, W., Gladney, H., Huc, C., Kenney, A.R., Moore, R., Neuhold, E. (2003). "Invest to Save: Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation" (http://delos-noe.iei.pi.cnr.it/activities/internationalforum/Joint-WGs/digitalarchiving/Digitalarchiving.pdf). NSF/DELOS (Pisa & Washington DC, USA).
• Jantz, R. & Giarlo, M.J. (2005). "Digital preservation: Architecture and technology for trusted digital repositories". D-Lib Magazine 11 (6). doi:10.1045/june2005-jantz.
• Ross, S (2000). Changing Trains at Wigan: Digital Preservation and the Future of Scholarship (http://www.bl.uk/blpac/pdf/wigan.pdf). London, UK: National Preservation Office (British Library). ISBN 0-7123-4717-8.
• Ross, S. and Gow, A. (1999). Digital archaeology? Rescuing Neglected or Damaged Data Resources (http://www.ukoln.ac.uk/services/elib/papers/supporting/pdf/p2.pdf). Bristol & London: British Library and Joint Information Systems Committee. ISBN 1-900508-51-6.
• Rothenberg, Jeff (1995). "Ensuring the Longevity of Digital Documents". Scientific American 272 (1).
• Rothenberg, Jeff (1999). Ensuring the Longevity of Digital Information (http://www.clir.org/pubs/archives/ensuring.pdf). Expanded version of "Ensuring the Longevity of Digital Documents".
• Milne, Ronald (moderator): Webcast panel discussion, "Economics" (http://www.lib.umich.edu/mdp/symposium/economics.html), Scholarship and Libraries in Transition: A Dialogue about the Impacts of Mass Digitization Projects (2006), symposium sponsored by the University of Michigan Library and the National Commission on Libraries and Information Science (US).
External links
• National Digital Information Infrastructure and Preservation Program (http://www.digitalpreservation.gov) at the Library of Congress
• Digital Preservation page (http://www.diglib.org/preserve.htm) from the Digital Library Federation
• "Thirteen Ways of Looking at...Digital Preservation" (http://dlib.org/dlib/july04/lavoie/07lavoie.html)
• Cornell University Library's Digital Imaging Tutorial (http://www.library.cornell.edu/preservation/tutorial/contents.html)
• What is Digital Preservation? (http://www.digitalpreservationeurope.eu/what-is-digital-preservation/) - an introduction to digital preservation by Digital Preservation Europe
• Macroscopic 10-Terabit–per–Square-Inch Arrays from Block Copolymers with Lateral Order (http://www.sciencemag.org/cgi/content/abstract/323/5917/1030) - Science magazine article about prospective usage of sapphire in digital storage media technology
• Animations introducing digital preservation and curation (http://www.youtube.com/watch?v=pbBa6Oam7-w)
• Capture Your Collections: Planning and Implementing Digitization Projects (http://www.pro.rcip-chin.gc.ca/sommaire-summary/planification_numerisation-digitization_planning-eng.jsp) - a CHIN (Canadian Heritage Information Network) resource
Dirty data
Dirty data is a term used by Information technology (IT) professionals when referring to inaccurate information
(data) collected from data capture forms. It is also used to refer to data which has not yet been committed to the
database, and is currently held in memory.
Dirty data can be misleading, incorrect, inconsistently formatted, incorrectly spelled or punctuated, entered
into the wrong field, or duplicated. Dirty data can be prevented using input masks or validation rules, but completely
removing such data from a source can be impossible or impractical.
There are several causes of dirty data. In some cases, the information is deliberately distorted. A person may insert
misleading or fictional personal information which appears real. Such dirty data may not be picked up by an
administrator or a validation routine because it appears legitimate. Duplicate data can be caused by repeat
submissions, user error or incorrect data joining. There can also be formatting issues or typographical errors. A
common formatting issue is caused by variations in a user's preference for entering phone numbers.
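As an illustration of how such formatting variation can be caught or normalized at the point of capture, here is a minimal Python sketch; the 10-digit North American phone format and the helper name are assumptions made for this example, not taken from any particular product.
```python
import re
from typing import Optional

def normalize_us_phone(raw: str) -> Optional[str]:
    """Strip formatting variations; return a canonical 10-digit string, or None if the value looks dirty."""
    digits = re.sub(r"\D", "", raw)              # drop spaces, dashes, dots, parentheses
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                      # tolerate a leading country code
    return digits if len(digits) == 10 else None

for raw in ["(555) 123-4567", "555.123.4567", "1-555-123-4567", "12345"]:
    print(raw, "->", normalize_us_phone(raw) or "flag for review")
```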
References
• Webopedia - dirty data (http://www.webopedia.com/TERM/D/dirty_data.html)
• Whatis.com - dirty data (http://searchcrm.techtarget.com/sDefinition/0,290660,sid11_gci1007725,00.html)
• We Like Bad Data - an alternative take on what might be considered Dirty Data (http://www.modelfutures.com/ramblings/6/we-like-bad-data-quality)
Entity integrity
In the relational data model, entity integrity is one of the three inherent integrity rules. It states that every table must have a primary key and that the column or columns chosen to be the primary key must be unique and not null.[1] A direct consequence of this rule is that duplicate rows are forbidden in a table: if each value of a primary key must be unique, no duplicate rows can logically appear. The NOT NULL characteristic of a primary key ensures that its value can be used to identify all rows in a table.
Within relational databases using SQL, entity integrity is enforced by adding a primary key clause to a schema definition. The system enforces entity integrity by not allowing operations (INSERT, UPDATE) to produce an invalid primary key: any operation that would create a duplicate primary key, or one containing nulls, is rejected. Entity integrity thus ensures that the data you store remains in the proper form as well as comprehensible.
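As a minimal sketch of this enforcement, the following uses Python's built-in sqlite3 module; the table and column names are invented for the example.
```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY clause in the schema definition is what enforces entity integrity.
# NOT NULL is stated explicitly because SQLite, unlike most engines, would otherwise
# tolerate NULLs in a non-INTEGER primary key column.
conn.execute("CREATE TABLE employee (emp_id TEXT PRIMARY KEY NOT NULL, name TEXT)")
conn.execute("INSERT INTO employee VALUES ('E001', 'Alice')")

for bad_row in [("E001", "Bob"), (None, "Carol")]:   # duplicate key, then NULL key
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?)", bad_row)
    except sqlite3.IntegrityError as exc:
        print("rejected:", bad_row, "-", exc)
```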
References
[1] Beynon-Davies P. (2004). Database Systems 3rd Edition. Palgrave, Basingstoke, UK. ISBN 1-4039-1601-2
Information quality
Information quality (IQ) is a term to describe the quality of the content of information systems. It is often
pragmatically defined as: "The fitness for use of the information provided."
Conceptual problems
Although this pragmatic definition is usable for most everyday purposes, specialists often use more complex models
for information quality. Most information system practitioners use the term synonymously with data quality.
However, as many academics make a distinction between data and information,[1] some will insist on a distinction
between data quality and information quality. This distinction is akin to that between syntax and semantics: the
semantic value of "one", for example, could be expressed in different syntaxes such as 00001, 1.0000, 01.0, or 1.
Thus a data difference does not necessarily represent poor information quality.
Information quality assurance is the process of establishing confidence that particular information meets some
context-specific quality requirements. It has been suggested, however, that the higher the quality, the greater the
confidence that the information will also meet more general, less specific contexts.[2]
Dimensions and metrics of Information Quality
"Information quality" is a measure of the value which the information provides to the user of that information.
"Quality" is often perceived as subjective and the quality of information can then vary among users and among uses
of the information. Nevertheless, a high degree of quality increases its objectivity or at least the intersubjectivity.
Accuracy can be seen as just one element of IQ but, depending upon how it is defined, can also be seen as
encompassing many other dimensions of quality. If not, there is often perceived to be a trade-off between accuracy
and the other dimensions, aspects or elements of the information that determine its suitability for a given task. One
list of dimensions or elements used in assessing subjective information quality is:[3]
• Intrinsic IQ: Accuracy, Objectivity, Believability, Reputation
• Contextual IQ: Relevancy, Value-Added, Timeliness, Completeness, Amount of information
• Representational IQ: Interpretability, Ease of understanding, Concise representation, Consistent representation
• Accessibility IQ: Accessibility, Access security
While "information" as a distinct term has various and often ambiguous definitions, one of the more general ones is
"a description of events". The occurrences being described cannot themselves be evaluated for quality, since they
are autonomous events in space and time, but their description can, because it carries attributes that are unavoidably
attached by the medium which conveyed the information from the moment the occurrences were described.
In an attempt to deal with this, researchers and practitioners have identified particular metrics for information
quality. These could also be described as quality traits of information, since they are not easily quantified but are
instead identified subjectively on a case-by-case basis.
Proposed quality metrics
• Authority/Verifiability
Authority refers to the expertise or recognized official status of a source. Consider the reputation of the author and
publisher. When working with legal or government information, consider whether the source is the official provider
of the information. Verifiability refers to the ability of a reader to verify the validity of the information irrespective
of how authoritative the source is. Verifying the facts, and where possible providing the sources of information so
that they can be checked, is part of the journalist's duty of care.
• Scope of coverage
Scope of coverage refers to the extent to which a source explores a topic. Consider time periods, geography or
jurisdiction and coverage of related or narrower topics.
• Composition and Organization
Composition and Organization has to do with the ability of the information source to present its particular message
in a coherent, logically sequential manner.
• Objectivity
Objectivity is the bias or opinion expressed when a writer interprets or analyzes facts. Consider the use of persuasive
language, the source’s presentation of other viewpoints, its reason for providing the information, and advertising.
• Integrity
1. Adherence to moral and ethical principles; soundness of moral character
2. The state of being whole, entire, or undiminished
• Comprehensiveness
1. Of large scope; covering or involving much; inclusive: a comprehensive study.
2. Comprehending mentally; having an extensive mental grasp.
3. Insurance. covering or providing broad protection against loss.
• Validity
Validity of some information has to do with the degree of obvious truthfulness which the information carries.
• Uniqueness
As much as ‘uniqueness’ of a given piece of information is intuitive in meaning, it also implies not only the
originating point of the information but also the manner in which it is presented, and thus the perception which it
conjures. The essence of any piece of information we process consists to a large extent of those two elements.
• Timeliness
Timeliness refers to information that is current at the time of publication. Consider publication, creation and revision
dates. Beware of Web site scripting that automatically reflects the current day’s date on a page.
• Reproducibility (utilized primarily when referring to instructive information)
Reproducibility means that documented methods are capable of being used on the same data set to achieve a consistent result.
Professional associations
International Association for Information and Data Quality (IAIDQ) [4]
References
[1] For a scientific and philosophical unraveling of these concepts see Churchman, C.W. (1971) The design of inquiring systems, New York: Basic Books.
[2] See Ivanov, K. (1972) "Quality-control of information: On the concept of accuracy of information in data banks and in management information systems" (http://www.informatik.umu.se/~kivanov/diss-avh.html). The University of Stockholm and The Royal Institute of Technology. Doctoral dissertation. Further details are found in Ivanov, K. (1995). A subsystem in the design of informatics: Recalling an archetypal engineer. In B. Dahlbom (Ed.), The infological equation: Essays in honor of Börje Langefors (http://www.informatik.umu.se/~kivanov/BLang80.html), (pp. 287-301). Gothenburg: Gothenburg University, Dept. of Informatics (ISSN 1101-7422).
[3] Wang, R. & Strong, D. (1996) "Beyond Accuracy: What Data Quality Means to Data Consumers". Journal of Management Information Systems, 12(4), p. 5-34.
Link rot
Link rot (or linkrot) is an informal term for the process by which, either on individual websites or the Internet in
general, increasing numbers of links point to web pages, servers or other resources that have become permanently
unavailable. The phrase also describes the effects of failing to update out-of-date web pages that clutter search
engine results. A link that does not work any more is called a broken link, dead link or dangling link.
Causes
A link may become broken for several reasons. The most common result of following a dead link is a 404 error,
which indicates that the web server responded but the specific page could not be found.
Some news sites contribute to the link rot problem by keeping only recent news articles online where they are freely
accessible at their original URLs, then removing them or moving them to a paid subscription area. This causes a
heavy loss of supporting links in sites discussing newsworthy events and using news sites as references.
Another type of dead link occurs when the server that hosts the target page stops working or relocates to a new
domain name. In this case the browser may return a DNS error, or it may display a site unrelated to the content
sought. The latter can occur when a domain name is allowed to lapse, and is subsequently reregistered by another
party. Domain names acquired in this manner are attractive to those who wish to take advantage of the stream of
unsuspecting surfers who will inflate hit counters and PageRank.
A link might also be broken because of some form of blocking such as content filters or firewalls. Dead links can
also arise on the authoring side, when website content is assembled, copied, or deployed without properly verifying
the targets, or is simply not kept up to date.
Prevalence
The 404 "not found" response is familiar to even the occasional Web user. A number of studies have examined the
prevalence of link rot on the Web, in academic literature, and in digital libraries. In a 2003 experiment, Fetterly et al.
discovered that about one link out of every 200 disappeared each week from the internet. McCown et al. (2005)
discovered that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after
publication, and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003, Lawrence
et al., 2001). Nelson and Allen (2002) examined link rot in digital libraries and found that about 3% of the objects
were no longer accessible after one year.
Discovering
Detecting link rot for a given URL is difficult using automated methods. If a URL is accessed and returns an HTTP
200 (OK) response, it may be considered accessible, but the contents of the page may have changed and may no
longer be relevant. Some web servers also return a soft 404, a page returned with a 200 (OK) response (instead of a
404 that indicates the URL is no longer accessible). Bar-Yossef et al. (2004) developed a heuristic for automatically
discovering soft 404s.
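As a rough illustration of the limits of automated detection, the Python sketch below (an invented example, not drawn from the studies cited here) classifies a URL only by its HTTP outcome; a genuine soft 404 would still be reported as "OK" unless the returned content were also inspected.
```python
import urllib.request
import urllib.error

def check_link(url: str, timeout: float = 10.0) -> str:
    """Classify a URL as OK, broken, or unreachable based on the HTTP outcome alone."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK ({resp.status})"         # may still be a soft 404
    except urllib.error.HTTPError as e:          # server answered with an error code, e.g. 404
        return f"broken ({e.code})"
    except OSError as e:                         # DNS failure, refused connection, timeout
        return f"unreachable ({e})"

print(check_link("http://example.com/"))
```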
Combating
Because dead links reflect poorly on both the linking site and the site linked to, multiple solutions are available to
tackle them: some work to prevent broken links in the first place, while others try to resolve them after they have
occurred.
Server side
• Avoiding unmanaged hyperlink collections
• Avoiding links to pages deep in a website ("deep linking")
• Using redirection mechanisms (e.g. "301: Moved Permanently") to automatically refer browsers and crawlers to
the new location of a URL
• Content Management Systems may offer inbuilt solutions to the management of links, e.g. links are updated when
content is changed or moved on the site.
• WordPress guards against link rot by replacing non-canonical URLs with their canonical versions.[1]
• IBM's Peridot attempts to automatically fix broken links.
• Permalinking stops broken links by guaranteeing that the content will never move. Another form of permalinking
is linking to a permalink that then redirects to the actual content, ensuring that even though the real content may
be moved etc., links pointing to the resources stay intact.
User side
• The Linkgraph widget gets the URL of the correct page based upon the old broken URL by using historical
location information.
• The Google 404 Widget employs Google technology to 'guess' the correct URL, and also provides the user a
Google search box to find the correct page.
• When a user receives a 404 response, the Google Toolbar attempts to assist the user in finding the missing page.[2]
• DeadURL.com [3] gathers and ranks alternate URLs for a broken link using Google Cache, the Internet Archive, and user submissions.[4] Typing deadurl.com/ to the left of a broken link in the browser's address bar and pressing Enter loads a ranked list of alternate URLs, or (depending on user preference) immediately forwards to the best one.[5]
Web archiving
To combat link rot, web archivists are actively engaged in collecting the Web or particular portions of the Web and
ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the
public. The largest web archiving organization is the Internet Archive, which strives to maintain an archive of the
entire Web, taking periodic snapshots of pages that can then be accessed for free via the Wayback Machine, without
registration, many years later simply by typing in the URL, or automatically by using browser extensions.[6]
National libraries, national archives and various consortia of organizations are also involved in archiving culturally
important Web content.
Individuals may also use a number of tools that allow them to archive web resources that may go missing in the
future:
• WebCite, a tool specifically for scholarly authors, journal editors and publishers to permanently archive
"on-demand" and retrieve cited Internet references (Eysenbach and Trudel, 2005).
• Archive-It, a subscription service that allows institutions to build, manage and search their own web archive
• Some social bookmarking websites, such as Furl, make private copies of web pages bookmarked by their users.
• Google keeps a text-based cache (temporary copy) of the pages it has crawled, which can be used to read the
information of recently removed pages. However, unlike in archiving services, cached pages are not stored
permanently.
• The Wayback Machine, at the Internet Archive (link [7]), is a free website that archives old web pages. It does not archive websites whose owners have stated they do not want their website archived.
Authors citing URLs
A number of studies have shown how widespread link rot is in academic literature (see below). Authors of scholarly
publications have also developed best-practices for combating link rot in their work:
• Avoiding URL citations that point to resources on a researcher's personal home page (McCown et al., 2005)
• Using Persistent Uniform Resource Locators (PURLs) and digital object identifiers (DOIs) whenever possible
• Using web archiving services (e.g. WebCite) to permanently archive and retrieve cited Internet references
(Eysenbach and Trudel, 2005).
Further reading
Link rot on the Web
• Ziv Bar-Yossef, Andrei Z. Broder, Ravi Kumar, and Andrew Tomkins (2004). "Sic transit gloria telae: towards an understanding of the Web’s decay". Proceedings of the 13th international conference on World Wide Web. pp. 328–337. doi:10.1145/988672.988716.
• Tim Berners-Lee (1998). Cool URIs Don’t Change [8]. Retrieved 2010-09-14.
• Gunther Eysenbach and Mathieu Trudel (2005). "Going, going, still there: using the WebCite service to permanently archive cited web pages". Journal of Medical Internet Research 7 (5): e60. doi:10.2196/jmir.7.5.e60. PMC 1550686. PMID 16403724.
• Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener (2003). "A large-scale study of the evolution of web pages" [9]. Proceedings of the 12th international conference on World Wide Web. Retrieved 2010-09-14.
• Wallace Koehler (2004). "A longitudinal study of web pages continued: A consideration of document persistence" [10]. Information Research 9 (2).
• John Markwell and David W. Brooks (2002). "Broken Links: The Ephemeral Nature of Educational WWW Hyperlinks". Journal of Science Education and Technology 11 (2): 105–108. doi:10.1023/A:1014627511641.
In academic literature
• Daniel Gomes [11], Mário J. Silva (2006). "Modelling Information Persistence on the Web" [12]. Proceedings of The 6th International Conference on Web Engineering (ICWE'06). Retrieved 2010-09-14.
• Robert P. Dellavalle, Eric J. Hester, Lauren F. Heilig, Amanda L. Drake, Jeff W. Kuntzman, Marla Graber, Lisa M. Schilling (2003). "Going, Going, Gone: Lost Internet References". Science 302 (5646): 787–788. doi:10.1126/science.1088234. PMID 14593153.
• Steve Lawrence, David M. Pennock, Gary William Flake, Robert Krovetz, Frans M. Coetzee, Eric Glover, Finn Arup Nielsen, Andries Kruger, C. Lee Giles (2001). "Persistence of Web References in Scientific Research" [13]. Computer 34 (2): 26–31. doi:10.1109/2.901164.
• Wallace Koehler (1999). "An Analysis of Web Page and Web Site Constancy and Permanence". Journal of the American Society for Information Science 50 (2): 162–180. doi:10.1002/(SICI)1097-4571(1999)50:2<162::AID-ASI7>3.0.CO;2-B.
• Frank McCown, Sheffan Chan, Michael L. Nelson, and Johan Bollen (2005). "The Availability and Persistence of Web References in D-Lib Magazine" [14]. Proceedings of the 5th International Web Archiving Workshop and Digital Preservation (IWAW'05).
• Carmine Sellitto (2005). "The impact of impermanent Web-located citations: A study of 123 scholarly conference publications" [15]. Journal of the American Society for Information Science and Technology 56 (7): 695–703. doi:10.1002/asi.20159.
• Diomidis Spinellis (2003). "The Decay and Failures of Web References" [16]. Communications of the ACM 46 (1): 71–77. doi:10.1145/602421.602422.
In digital libraries
• Michael L. Nelson and B. Danette Allen (2002). "Object Persistence and Availability in Digital Libraries". D-Lib
Magazine 8 (1). doi:10.1045/january2002-nelson.
References
[1] Rønn-Jensen, Jesper (2007-10-05). "Software Eliminates User Errors And Linkrot" (http://justaddwater.dk/2007/10/05/blog-software-eliminates-user-errors-and-linkrot/). Justaddwater.dk. Retrieved 2007-10-05.
[2] Mueller, John (2007-12-14). "FYI on Google Toolbar's Latest Features" (http://googlewebmastercentral.blogspot.com/2007/12/fyi-on-google-toolbars-latest-features.html). Google Webmaster Central Blog. Retrieved 2008-07-09.
[3] http://DeadURL.com
[4] "DeadURL.com" (http://deadurl.com/). Retrieved 2011-03-17. "DeadURL.com gathers as many backup links as possible for each dead url, via Google cache, Archive.org, and user submissions."
[5] "DeadURL.com" (http://deadurl.com/). Retrieved 2011-03-17. "Just type deadurl.com/ in front of a link that doesn't work, and hit Enter."
[6] 404-Error? :: Add-ons for Firefox (https://addons.mozilla.org/en-US/firefox/addon/4693/)
[7] http://www.archive.org/
[8] http://www.w3.org/Provider/Style/URI.html
[9] http://www2003.org/cdrom/papers/refereed/p097/P97%20sources/p97-fetterly.html
[10] http://informationr.net/ir/9-2/paper174.html
[11] http://xldb.fc.ul.pt/daniel/
[12] http://xldb.di.fc.ul.pt/daniel/docs/papers/gomes06urlPersistence.pdf
[13] http://doi.ieeecomputersociety.org/10.1109/2.901164
[14] http://www.iwaw.net/05/papers/iwaw05-mccown1.pdf
[15] http://doi.wiley.com/10.1002/asi.20159
[16] http://www.spinellis.gr/pubs/jrnl/2003-CACM-URLcite/html/urlcite.html
External links
• Future-Proofing Your URIs (http://www.wrox.com/WileyCDA/Section/id-301495.html)
• Jakob Nielsen, "Fighting Linkrot" (http://www.useit.com/alertbox/980614.html), Jakob Nielsen's Alertbox, June 14, 1998.
• Warrick (http://warrick.cs.odu.edu/) - a tool for recovering lost websites from the Internet Archive and search engine caches
• Pagefactor (http://www.pagefactor.com/) and UndeadLinks.com (http://www.undeadlinks.com/) - user-contributed databases of moved URLs
• W3C Link Checker (http://validator.w3.org/checklink)
• mod_brokenlink (http://code.google.com/p/modbrokenlink/) - Apache module that reports broken links.
One-for-one checking
In systems auditing, one-for-one checking is a control process that is frequently used to ensure that specific
elements between two or more sources of data are consistent. The control process can also reduce the chances of
human error such as typos and miskeyed entries.
An operations manager might use one-for-one checking of checks and receivables to verify that cash collected is
properly reflected in the receivable accounts (i.e., that each check is associated with an invoice).
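A minimal sketch of such a one-for-one match in Python; the invoice numbers and amounts are invented for illustration.
```python
# Hypothetical data: each check received and each open invoice carries an invoice number.
checks   = {"INV-1001": 250.00, "INV-1002": 75.50, "INV-1004": 10.00}
invoices = {"INV-1001": 250.00, "INV-1002": 80.00, "INV-1003": 120.00}

for inv_no, amount in checks.items():
    if inv_no not in invoices:
        print(f"{inv_no}: check received but no matching invoice")
    elif invoices[inv_no] != amount:
        print(f"{inv_no}: amount mismatch (check {amount}, invoice {invoices[inv_no]})")

for inv_no in invoices.keys() - checks.keys():
    print(f"{inv_no}: invoice has no matching check")
```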
References
• Accounting Information Systems. Gelinas, Dull. 7th ed. 2008 Thomson Higher Education. ISBN 0-324-37882-3.
Referential integrity
An example of a database that has not enforced referential integrity. In this example,
there is a foreign key (artist_id) value in the album table that references a non-existent
artist — in other words there is a foreign key value with no corresponding primary key
value in the referenced table. What happened here was that there was an artist called
"Aerosmith", with an artist_id of 4, which was deleted from the artist table. However, the
album "Eat the Rich" referred to this artist. With referential integrity enforced, this would
not have been possible.
Referential integrity is a property of data which, when satisfied, requires every value of one attribute (column) of a
relation (table) to exist as a value of another attribute in a different (or the same) relation (table).[1]
Less formally, and in relational
databases: For referential integrity to
hold, any field in a table that is
declared a foreign key can contain only
values from a parent table's primary
key or a candidate key. For instance,
deleting a record that contains a value
referred to by a foreign key in another
table would break referential integrity.
Some relational database management
systems (RDBMS) can enforce
referential integrity, normally either by
deleting the foreign key rows as well to
maintain integrity, or by returning an
error and not performing the delete.
Which method is used may be
determined by a referential integrity
constraint defined in a data dictionary.
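The artist/album example in the figure can be reproduced with any RDBMS that enforces foreign keys. The following is a minimal sketch using Python's built-in sqlite3 module, which enforces them only when the foreign_keys pragma is enabled.
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")      # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE album (
                  album_id  INTEGER PRIMARY KEY,
                  title     TEXT,
                  artist_id INTEGER REFERENCES artist(artist_id))""")
conn.execute("INSERT INTO artist VALUES (4, 'Aerosmith')")
conn.execute("INSERT INTO album  VALUES (1, 'Eat the Rich', 4)")

try:
    # Deleting the parent row would orphan the album, so the RDBMS rejects the delete.
    conn.execute("DELETE FROM artist WHERE artist_id = 4")
except sqlite3.IntegrityError as exc:
    print("delete rejected:", exc)
```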
References
[1] Mike Chapple. "Referential Integrity" (http://databases.about.com/cs/administration/g/refintegrity.htm). About.com. Retrieved 2011-03-17. "Definition: Referential integrity is a database concept that ensures that relationships between tables remain consistent. When one table has a foreign key to another table, the concept of referential integrity states that you may not add a record to the table that contains the foreign key unless there is a corresponding record in the linked table."
Soft error
In electronics and computing, an error is a signal or datum which is wrong. Errors may be caused by a defect,
usually understood to be either a mistake in design or construction, or a broken component. A soft error is also a
signal or datum which is wrong, but one that is not assumed to imply such a mistake or breakage; after observing a
soft error, there is no implication that the system is any less reliable than before.
If detected, a soft error may be corrected by rewriting correct data in place of erroneous data. Highly reliable systems
use error correction to correct soft errors on the fly. However, in many systems, it may be impossible to determine
the correct data, or even to discover that an error is present at all. In addition, before the correction can occur, the
system may have crashed, in which case the recovery procedure must include a reboot.
Soft errors involve changes to data — the electrons in a storage circuit, for example — but not changes to the
physical circuit itself, the atoms. If the data is rewritten, the circuit will work perfectly again.
Soft errors can occur on transmission lines, in digital logic, analog circuits, magnetic storage, and elsewhere, but are
most commonly known in semiconductor storage.
Critical charge
Whether a circuit experiences a soft error depends on the energy of the incoming particle, the geometry of the
impact, the location of the strike, and the design of the logic circuit. Logic circuits with higher capacitance and
higher logic voltages are less likely to suffer an error. This combination of capacitance and voltage is described by
the critical charge parameter, Qcrit, the minimum electron charge disturbance needed to change the logic level. A
higher Qcrit means fewer soft errors. Unfortunately, a higher Qcrit also means a slower logic gate and a higher power
dissipation. Reduction in chip feature size and supply voltage, desirable for many reasons, decreases Qcrit. Thus, the
importance of soft errors increases as chip technology advances.
In a logic circuit, Qcrit is defined as the minimum amount of induced charge required at a circuit node to cause a
voltage pulse to propagate from that node to the output and be of sufficient duration and magnitude to be reliably
latched. Since a logic circuit contains many nodes that may be struck, and each node may be of unique capacitance
and distance from output, Qcrit is typically characterized on a per-node basis.
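As a rough first-order illustration only (an approximation for intuition, not a formula from the JESD-89 standard), Qcrit is often estimated as the node capacitance multiplied by the supply voltage, which makes the scaling trend explicit:
```python
# Rough first-order estimate: Qcrit ~ C_node * V_dd (illustrative numbers only).
for c_fF, vdd in [(10.0, 1.2), (2.0, 0.8)]:   # a larger, older node vs. a smaller, newer one
    print(f"C = {c_fF} fF, Vdd = {vdd} V -> Qcrit ~ {c_fF * vdd:.1f} fC")
```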
Causes of soft errors
Alpha particles from package decay
Soft errors became widely known with the introduction of dynamic RAM in the 1970s. In these early devices, chip
packaging materials contained small amounts of radioactive contaminants. Very low decay rates are needed to avoid
excess soft errors, and chip companies have occasionally suffered problems with contamination ever since. It is
extremely hard to maintain the material purity needed. Controlling alpha particle emission rates for critical
packaging materials to less than a level of 0.001 counts per hour per cm² (cph/cm²) is required for reliable
performance of most circuits. For comparison, the count rate of a typical shoe's sole is between 0.1 and 10 cph/cm².
Package radioactive decay usually causes a soft error by alpha particle emission. The positively charged alpha
particle travels through the semiconductor and disturbs the distribution of electrons there. If the disturbance is large
enough, a digital signal can change from a 0 to a 1 or vice versa. In combinational logic, this effect is transient,
perhaps lasting a fraction of a nanosecond, and this has led to the challenge of soft errors in combinational logic
mostly going unnoticed. In sequential logic such as latches and RAM, even this transient upset can become stored
for an indefinite time, to be read out later. Thus, designers are usually much more aware of the problem in storage
circuits.
Cosmic rays creating energetic neutrons and protons
Once the electronics industry had determined how to control package contaminants, it became clear that other causes
were also at work. James F. Ziegler led a program of work at IBM which culminated in the publication of a number
of papers (Ziegler and Lanford, 1979) demonstrating that cosmic rays also could cause soft errors. Indeed, in modern
devices, cosmic rays may be the predominant cause. Although the primary particle of the cosmic ray does not
generally reach the Earth's surface, it creates a shower of energetic secondary particles. At the Earth's surface
approximately 95% of the particles capable of causing soft errors are energetic neutrons with the remainder
composed of protons and pions (Ziegler, 1996).[1] This flux of energetic neutrons is typically referred to as "cosmic
rays" in the soft error literature. Neutrons are uncharged and cannot disturb a circuit on their own, but undergo
neutron capture by the nucleus of an atom in a chip. This process may result in the production of charged
secondaries, such as alpha particles and oxygen nuclei, which can then cause soft errors.
Cosmic ray flux depends on altitude. For the common reference location of 40.7N, 74W at 0 meters (sea level in
New York City, NY, USA) the flux is approximately 14 neutrons/cm²/hour. Burying a system in a cave reduces the
rate of cosmic-ray induced soft errors to a negligible level. In the lower levels of the atmosphere, the flux increases
by a factor of about 2.2 for every 1000 m (1.3 for every 1000 ft) increase in altitude above sea level. Computers
operated on top of mountains experience an order of magnitude higher rate of soft errors compared to sea level. The
rate of upsets in aircraft may be more than 300 times the sea level upset rate. This is in contrast to package decay
induced soft errors, which do not change with location. A model of the energetic neutron flux is presented in
(Gordon & Goldhagen, 2004).[2] An online calculator for this model is available at www.seutest.com.[3]
The average rate of cosmic-ray soft errors is inversely proportional to sunspot activity. That is, the average number
of cosmic-ray soft errors decreases during the active portion of the sunspot cycle and increases during the quiet
portion. This counterintuitive result occurs for two reasons. The sun does not generally produce cosmic ray particles
with energy above 1 GeV that are capable of penetrating to the Earth's upper atmosphere and creating particle
showers, so the changes in the solar flux do not directly influence the number of errors. Further, the increase in the
solar flux during an active sun period does have the effect of reshaping the Earth's magnetic field providing some
additional shielding against higher energy cosmic rays, resulting in a decrease in the number of particles creating
showers. The effect is fairly small in any case resulting in a +/- 7% modulation of the energetic neutron flux in New
York City. Other locations are similarly affected.
Energetic neutrons produced by cosmic rays may lose most of their kinetic energy and reach thermal equilibrium
with their surroundings as they are scattered by materials. The resulting neutrons are simply referred to as thermal
neutrons and have an average kinetic energy of about 25 millielectron-volts at 25°C. Thermal neutrons are also
produced by environmental radiation sources such as the decay of naturally occurring uranium or thorium. The
thermal neutron flux from sources other than cosmic-ray showers may still be noticeable in an underground location
and an important contributor to soft errors for some circuits.
Thermal neutrons
Neutrons that have lost kinetic energy until they are in thermal equilibrium with their surroundings are an important
cause of soft errors for some circuits. At low energies many neutron capture reactions become much more probable
and result in fission of certain materials creating charged secondaries as fission byproducts. For some circuits the
capture of a thermal neutron by the nucleus of the B-10 isotope of boron is particularly important. This nuclear
reaction is an efficient producer of an alpha particle, Li-7 nucleus and gamma ray. Either of the charged particles
(alpha or Li-7) may cause a soft error if produced in very close proximity, approximately 5 micrometers, to a critical
circuit node. The capture cross section for B-11 is 6 orders of magnitude smaller and does not contribute to soft
errors (Baumann et al., 1995).[4]
Boron has been used in BPSG, the insulator in the interconnection layers of integrated circuits, particularly in the
lowest one. The inclusion of boron lowers the melt temperature of the glass providing better reflow and planarization
characteristics. In this application the glass is formulated with a boron content of 4% to 5% by weight. Naturally
occurring boron is 20% B-10 with the remainder the B-11 isotope. Soft errors are caused by the high level of B-10 in
this critical lower layer of some older integrated circuit processes. Boron-11, used at low concentrations as a p-type
dopant, does not contribute to soft errors. Integrated circuit manufacturers eliminated borated dielectrics by the
150 nm process node, largely due to this problem.
In critical designs, depleted boron, consisting almost entirely of boron-11, is used to avoid this effect and therefore
to reduce the soft error rate. Boron-11 is a by-product of the nuclear industry.
For applications in medical electronic devices this soft error mechanism may be extremely important. Neutrons are
produced during high energy cancer radiation therapy using photon beam energies above 10 MV. These neutrons are
moderated as they are scattered from the equipment and walls in the treatment room resulting in a thermal neutron
flux that is about 40×10⁶ times higher than the normal environmental neutron flux. This high thermal neutron flux
will generally result in a very high rate of soft errors and consequent circuit upset (Wilkinson et al., 2005),[5]
(Franco et al., 2005).[6]
Other causes
Soft errors can also be caused by random noise or signal integrity problems, such as inductive or capacitive
crosstalk. However, in general, these sources represent a small contribution to the overall soft error rate when
compared to radiation effects.
Designing around soft errors
Soft error mitigation
A designer can attempt to minimize the rate of soft errors by judicious device design, choosing the right
semiconductor, package and substrate materials, and the right device geometry. Often, however, this is limited by the
need to reduce device size and voltage, to increase operating speed and to reduce power dissipation. The
susceptibility of devices to upsets is described in the industry using the JEDEC JESD-89 standard.
One technique that can be used to reduce the soft error rate in digital circuits is called radiation hardening. This
involves increasing the capacitance at selected circuit nodes in order to increase their effective Qcrit value. This reduces
the range of particle energies to which the logic value of the node can be upset. Radiation hardening is often
accomplished by increasing the size of transistors that share a drain/source region at the node. Since the area and
power overhead of radiation hardening can be restrictive to design, the technique is often applied selectively to nodes
which are predicted to have the highest probability of resulting in soft errors if struck. Tools and models that can
predict which nodes are most vulnerable are the subject of past and current research in the area of soft errors.
Correcting soft errors
Designers can choose to accept that soft errors will occur, and design systems with appropriate error detection and
correction to recover gracefully. Typically, a semiconductor memory design might use forward error correction,
incorporating redundant data into each word to create an error correcting code. Alternatively, roll-back error
correction can be used, detecting the soft error with an error-detecting code such as parity, and rewriting correct data
from another source. This technique is often used for write-through cache memories.
Soft errors in logic circuits are sometimes detected and corrected using the techniques of fault tolerant design. These
often include the use of redundant circuitry or computation of data, and typically come at the cost of circuit area,
decreased performance, and/or higher power consumption. The concept of triple modular redundancy (TMR) can be
employed to ensure very high soft-error reliability in logic circuits. In this technique, three identical copies of a
circuit compute on the same data in parallel and outputs are fed into majority voting logic, returning the value that
occurred in at least two of three cases. In this way, the failure of one circuit due to soft error is discarded assuming
the other two circuits operated correctly. In practice, however, few designers can afford the greater than 200% circuit
area and power overhead required, so it is usually only selectively applied. Another common concept to correct soft
errors in logic circuits is temporal (or time) redundancy, in which one circuit operates on the same data multiple
times and compares subsequent evaluations for consistency. This approach, however, often incurs performance
overhead, area overhead (if copies of latches are used to store data), and power overhead, though is considerably
more area-efficient than modular redundancy.
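A minimal sketch of the TMR voting step in Python (in hardware the vote is, of course, implemented as combinational logic rather than software):
```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a word: an output bit is 1
    only if it is 1 in at least two of the three copies."""
    return (a & b) | (a & c) | (b & c)

good = 0b1011_0101
upset = good ^ 0b0000_1000        # one copy suffers a single-bit soft error
assert tmr_vote(good, good, upset) == good
print("voted value matches the uncorrupted word")
```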
Traditionally, DRAM has received the most attention in the quest to reduce or work around soft errors, because
DRAM has comprised the majority share of susceptible device surface area in desktop and server computer
systems (hence the prevalence of ECC RAM in server computers). Hard figures for DRAM susceptibility are hard to
come by, and vary considerably across designs, fabrication processes, and manufacturers. 1980s technology 256
kilobit DRAMS could have clusters of five or six bits flip from a single alpha particle. Modern DRAMs have much
smaller feature sizes, so the deposition of a similar amount of charge could easily cause many more bits to flip.
The design of error detection and correction circuits is helped by the fact that soft errors usually are localised to a
very small area of a chip. Usually, only one cell of a memory is affected, although high energy events can cause a
multi-cell upset. Conventional memory layout usually places one bit of many different correction words adjacent on
a chip. So, even a multi-cell upset leads to only a number of separate single-bit upsets in multiple correction words,
rather than a multi-bit upset in a single correction word. So, an error correcting code needs only to cope with a single
bit in error in each correction word in order to cope with all likely soft errors. The term 'multi-cell' is used for upsets
affecting multiple cells of a memory, whatever correction words those cells happen to fall in. 'Multi-bit' is used when
multiple bits in a single correction word are in error.
Soft errors in combinational logic
The three natural masking effects in combinational logic that determine whether a single event upset (SEU) will
propagate to become a soft error are electrical masking, logical masking, and temporal (or timing-window) masking.
An SEU is logically masked if its propagation is blocked from reaching an output latch because off-path gate inputs
prevent a logical transition of that gate's output. An SEU is electrically masked if the signal is attenuated by the
electrical properties of gates on its propagation path such that the resulting pulse is of insufficient magnitude to be
reliably latched. An SEU is temporally masked if the erroneous pulse reaches an output latch, but it does occur close
enough to when the latch is actually triggered to hold.
If all three masking effects fail to occur, the propagated pulse becomes latched and the output of the logic circuit will
be an erroneous value. In the context of circuit operation, this erroneous output value may be considered a soft error
event. However, from a microarchitectural-level standpoint, the affected result may not change the output of the
currently-executing program. For instance, the erroneous data could be overwritten before use, masked in subsequent
logic operations, or simply never be used. If erroneous data does not affect the output of a program, it is considered
to be an example of microarchitectural masking.
Soft error rate
Soft error rate (SER) is the rate at which a device or system encounters or is predicted to encounter soft errors. It is
typically expressed as either number of failures-in-time (FIT), or mean time between failures (MTBF). The unit
adopted for quantifying failures in time is called FIT, equivalent to 1 error per billion hours of device operation.
MTBF is usually given in years of device operation. To put it in perspective, 1 year MTBF is equal to approximately
114,077 FIT.
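The conversion quoted above follows directly from the definitions (1 FIT = 1 failure per 10^9 device-hours); a small sketch:
```python
HOURS_PER_YEAR = 24 * 365.25              # approximately 8766 hours

def mtbf_years_to_fit(mtbf_years: float) -> float:
    """FIT = failures per 1e9 device-hours, so FIT = 1e9 / MTBF(in hours)."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)

print(round(mtbf_years_to_fit(1.0)))      # ~114077, matching the figure quoted above
```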
While many electronic systems have an MTBF that exceeds the expected lifetime of the circuit, the SER may still be
unacceptable to the manufacturer or customer. For instance, many failures per million circuits due to soft errors can
be expected in the field if the system does not have adequate soft error protection. The failure of even a few products
in the field, particularly if catastrophic, can tarnish the reputation of the product and company that designed it. Also,
in safety- or cost-critical applications where the cost of system failure far outweighs the cost of the system itself, a
1% chance of soft error failure per lifetime may be too high to be acceptable to the customer. Therefore, it is
advantageous to design for low SER when manufacturing a system in high-volume or requiring extremely high
reliability.
External links
• Book on "Architecture Design for Soft Errors" by Shubu Mukherjee, published by Elsevier, Inc.
[7]
Book review
by Max Baron of Microprocessor Report (May 27, 2008), “Dr. Shubu Mukherjee’s book is a welcome surprise:
books by architecture leaders in major companies are few and far between. Written from the viewpoint of a
working engineer, the book describes sources of soft errors and solutions involving device, logic, and architecture
design to reduce the effects of soft errors”
• Ionizing Radiation Effects in MOS Devices and Circuits by Tso Ping Ma and PAUL V. Dressendorfer
[8]
, The
first comprehensive overview describing the effects of ionizing radiation on MOS devices, as well as how to
design, fabricate, and test integrated circuits intended for use in a radiation environment.
• Radiation Effects And Soft Errors In Integrated Circuits And Electronic Devices by Dan Fleetwood and Ron D
Schrimpf
[9]
, Vanderbilt University, Nashville, Tennessee, USA A collection of the most important concepts in
Radiation Effects by two pioneers in this field.
• Soft Errors in Electronic Memory - A White Paper
[10]
- A good summary paper with many references - Tezzaron
Jan 2004. Concludes that 1000–5000 FIT per Mbit (0.2–1 error per day per Gbyte) is a typical DRAM soft error
rate.
• Benefits of Chipkill-Correct ECC for PC Server Main Memory
[11]
- A 1997 discussion of SDRAM reliability -
some interesting information on "soft errors" from cosmic rays, especially with respect to Error-correcting code
schemes
• Soft errors' impact on system reliability
[12]
- Ritesh Mastipuram and Edwin C Wee, Cypress Semiconductor,
2004
• Scaling and Technology Issues for Soft Error Rates
[13]
- A Johnston - 4th Annual Research Conference on
Reliability Stanford University, October 2000
• Evaluation of LSI Soft Errors Induced by Terrestrial Cosmic rays and Alpha Particles
[14]
- H. Kobayashi, K.
Shiraishi, H. Tsuchiya, H. Usuki (all of Sony), and Y. Nagai, K. Takahisa (Osaka University), 2001.
• SELSE Workshop Website
[15]
- Website for the workshop on the System Effects of Logic Soft Errors
• TRAD Tests & Radiations
[16]
- A company dedicated to Single events and soft error Test, solutions and products
• iRoC Technologies
[17]
- A company dedicated to Soft Error solutions and products
• EADS Nucletudes
[18]
- A company dedicated to hardening system in harsh elctromagnetic and radiative
environments
References
[1] J.F. Ziegler, "Terrestrial cosmic rays", IBM Journal of Research and Development, Vol. 40, no. 1, pp. 19-40, Jan 1996.
[2] Gordon, Goldhagen, "Measurement of the Flux and Energy Spectrum of Cosmic-Ray Induced Neutrons on the Ground", IEEE Trans on Nuclear Science, vol. 51, no. 6, pp. 3427-34, Dec. 2004.
[3] http://www.seutest.com
[4] R. Baumann, T. Hossain, S. Murata, H. Kitagawa, "Boron compounds as a dominant source of alpha particles in semiconductor devices", IRPS Proceedings, pp. 297-302, 1995.
[5] J. Wilkinson, C. Bounds, T. Brown, B.J. Gerbi, J. Peltier, "Cancer radiotherapy equipment as a cause of soft errors in electronic equipment", IEEE Trans Device and Materials Reliability, Vol. 5, No. 3, pp. 449-51, Apr. 2005
[6] Franco, L., Gómez, F., Iglesias, A., Pardo, J., Pazos, A., Pena, J., Zapata, M., "SEUs on commercial SRAM induced by low energy neutrons produced at a clinical linac facility", RADECS Proceedings, Sept. 2005
[7] http://www.amazon.com/dp/0123695295
[8] http://www.borders.com.au/book/ionizing-radiation-effects-in-mos-devices-and-circuits/2733970/
[9] http://www.amazon.com/dp/9812389407
[10] http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf
[11] http://www-1.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf
[12] http://www.edn.com/article/CA454636.html
[13] http://www.nepp.nasa.gov/DocUploads/40D7D6C9-D5AA-40FC-829DC2F6A71B02E9/Scal-00.pdf
[14] http://www.rcnp.osaka-u.ac.jp/~annurep/2001/genkou/sec3/kobayashi.pdf
[15] http://www.selse.org/
[16] http://www.trad.fr
[17] http://www.iroctech.com
[18] http://www.nucletudes.com
• Ziegler, J. F. and W. A. Lanford, "Effect of Cosmic Rays on Computer Memories", Science, 206, 776 (1979).
• Mukherjee, S, "Architecture Design for Soft Errors," Elsevier, Inc., Feb. 2008.
• Mukherjee, S, "Computer Glitches from Soft Errors: A Problem with Multiple Solutions," Microprocessor Report,
May 19, 2008.
Two pass verification
Two pass verification, also called double data entry, is a data entry quality control method that was originally
employed when data records were entered onto sequential 80 column Hollerith cards with a keypunch. In the first
pass through a set of records, the data keystrokes were entered onto each card as the data entry operator typed them.
On the second pass through the batch, an operator at a separate machine, called a verifier, entered the same data. The
verifier compared the second operator's keystrokes with the contents of the original card. If there were no
differences, a verification notch was punched on the right edge of the card. [1]
The later IBM 129 keypunch also could operate as a verifier. In that mode, it read a completed card (record) and
loaded the 80 keystrokes into a buffer. A data entry operator reentered the record and the keypunch compared the
new keystrokes with those loaded into the buffer. If a discrepancy occurred the operator was given a chance to
reenter that keystroke and ultimately overwrite the entry in the buffer. If all keystrokes matched the original card, it
was passed through and received a verification punch. If corrections were required then the operator was prompted to
discard the original card and insert a fresh card on which corrected keystrokes were typed. The corrected record
(card) was passed through and received a corrected verification punch. [2]
Modern use
While this method of quality control clearly is not proof against systematic errors or operator misreading of entries
from a source document, it is very useful in catching and correcting the random miskeyed strokes that occur even
with experienced data entry operators; it nevertheless proved a tragically inadequate safeguard in the Therac-25
incident. The method has survived the keypunch and is available in some currently available data entry programs
(e.g. PSPP/SPSS Data Entry). At least one study suggests that single-pass data entry with range checks and skip
rules approaches the reliability of two-pass data entry (Control Clin Trials. 1998 Feb;19(1):15-24); however, it may
be desirable to implement both in a data entry application.
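A minimal sketch of the comparison step behind double data entry, with a hypothetical record keyed twice by different operators:
```python
def compare_passes(first_pass: dict, second_pass: dict) -> list:
    """Return the fields whose values differ between the two data-entry passes."""
    return [field for field in first_pass
            if first_pass[field] != second_pass.get(field)]

pass1 = {"name": "SMITH", "dob": "1961-03-07", "grade": "C"}
pass2 = {"name": "SMITH", "dob": "1961-03-01", "grade": "C"}
for field in compare_passes(pass1, pass2):
    print(f"mismatch in '{field}': re-key this field")
```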
References
[1] http://www.museumwaalsdorp.nl/computer/en/punchcards.html
[2] http://ed-thelen.org/comp-hist/IBM-ProdAnn/129.pdf
Validation rule
A validation rule is a criterion used in the process of data validation, which is carried out after the data has been
encoded onto an input medium and involves a data vet or validation program. This is distinct from formal
verification, in which the operation of a program is shown to be that which was intended and to meet its purpose.
The method is to check that data falls within the appropriate parameters defined by the systems analyst. The
validation program makes a judgement as to whether data is valid, but it cannot ensure complete accuracy. That can
only be achieved through the use of all the clerical and computer controls built into the system at the design stage.
The difference between data validity and accuracy can be illustrated with a trivial example. A company has
established a Personnel file and each record contains a field for the Job Grade. The permitted values are A, B, C, or
D. An entry in a record may be valid and accepted by the system if it is one of these characters, but it may not be the
correct grade for the individual worker concerned. Whether a grade is correct can only be established by clerical
checks or by reference to other files. During systems design, therefore, data definitions are established which place
limits on what constitutes valid data. Using these data definitions, a range of software validation checks can be
carried out.
Criteria
An example of a validation check is the procedure used to verify an ISBN. [1]
Size. The number of characters in a data item value is checked; for example, an ISBN must consist of 10 characters
only (in the previous version; the standard for 2007 and later has been changed to 13 characters).
Format checks. Data must conform to a specified format. Thus, the first 9 characters must be the digits 0 through 9;
the 10th must be either one of those digits or an X.
Consistency. Codes in data items which are related in some way can be checked for the consistency of their
relationship. The first number of the ISBN designates the language of publication; for example, books published in
French-speaking countries carry the digit "2". This must match the address of the publisher, as given elsewhere in the
record.
Range. Does not apply to ISBN, but typically data must lie within maximum and minimum preset values. For
example, customer account numbers may be restricted within the values 10000 to 20000, if this is the arbitrary range
of the numbers used for the system.
Check digit. An extra digit calculated on, for example, an account number, can be used as a self-checking device.
When the number is input to the computer, the validation program carries out a calculation similar to that used to
generate the check digit originally and thus checks its validity. This kind of check will highlight transcription errors
where two or more digits have been transposed or put in the wrong order. The 10th character of the 10-character
ISBN is the check digit.
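The ISBN example combines several of these checks. The following Java sketch is illustrative only; it applies the size,
format, and check-digit rules for the 10-character ISBN, while the consistency and range checks would require external
reference data:

public class Isbn10Validator {

    /** Returns true if the value passes the size, format and check-digit rules for ISBN-10. */
    public static boolean isValid(String isbn) {
        // Size check: exactly 10 characters
        if (isbn == null || isbn.length() != 10) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = isbn.charAt(i);
            int value;
            if (Character.isDigit(c)) {
                value = c - '0';
            } else if (c == 'X' && i == 9) {
                value = 10;                  // format check: 'X' allowed only as the 10th character
            } else {
                return false;                // format check failed
            }
            sum += value * (10 - i);         // weights 10, 9, ..., 1
        }
        return sum % 11 == 0;                // check-digit rule: weighted sum divisible by 11
    }

    public static void main(String[] args) {
        System.out.println(isValid("0070041911"));   // the McGraw-Hill ISBN cited below -> true
        System.out.println(isValid("0070041912"));   // single-digit transcription error -> false
    }
}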
References
[1] Frequently Asked Questions about the new ISBN standard (http://www.lac-bac.gc.ca/iso/tc46sc9/isbn.htm), ISO.
• Information integrity: a structure for its definition and management. Becker, Hal B. New York: McGraw-Hill, 1983. ISBN 0070041911.
• ValidationRule Class (http://msdn2.microsoft.com/en-us/library/microsoft.visualstudio.testtools.webtesting.validationrule(VS.80).aspx) at Microsoft
• Validation Rule Property (http://docs.codecharge.com/studio31/html/index.html?http://docs.codecharge.com/studio31/html/Components/Properties/ValidationRule.html) at CodeCharge Studio
• Create a validation rule to validate data in a field (http://office.microsoft.com/en-us/access/HA100963121033.aspx) at Microsoft
Abstraction (computer science)
In computer science, abstraction is the process by which data and programs are defined with a representation
similar in form to their meaning (semantics), while hiding away the implementation details. Abstraction tries to reduce and
factor out details so that the programmer can focus on a few concepts at a time. A system can have several
abstraction layers whereby different meanings and amounts of detail are exposed to the programmer. For example,
low-level abstraction layers expose details of the hardware where the program is run, while high-level layers deal
with the business logic of the program.
The following English definition of abstraction helps to understand how this term applies to computer science, IT
and objects:
abstraction - a concept or idea not associated with any specific instance [1]
Abstraction captures only those details about an object that are relevant to the current perspective. The concept
originated by analogy with abstraction in mathematics. The mathematical technique of abstraction begins with
mathematical definitions, making it a more technical approach than the general concept of abstraction in philosophy.
For example, in both computing and in mathematics, numbers are concepts in programming languages that are
founded on concepts in mathematics. Implementation details depend on the hardware and software, but this is not a restriction
because the computing concept of number is still based on the mathematical concept.
In computer programming, abstraction can apply to control or to data: Control abstraction is the abstraction of
actions while data abstraction is that of data structures.
• Control abstraction involves the use of subprograms and related control-flow concepts.
• Data abstraction allows handling pieces of data in meaningful ways. For example, it is the basic motivation behind
the concept of a datatype.
One can regard the notion of an object (from object-oriented programming) as an attempt to combine abstractions of
data and code.
The same abstract definition can be used as a common interface for a family of objects with different
implementations and behaviors but which share the same meaning. The inheritance mechanism in object-oriented
programming can be used to define an abstract class as the common interface.
The recommendation that programmers use abstractions whenever suitable in order to avoid duplication (usually of
code) is known as the abstraction principle. The requirement that a programming language provide suitable
abstractions is also called the abstraction principle.
Rationale
Computing mostly operates independently of the concrete world: The hardware implements a model of computation
that is interchangeable with others. The software is structured in architectures to enable humans to create
enormous systems by concentrating on a few issues at a time. These architectures are made of specific choices of
abstractions. Greenspun's Tenth Rule is an aphorism on how such an architecture is both inevitable and complex.
A central form of abstraction in computing is language abstraction: new artificial languages are developed to express
specific aspects of a system. Modeling languages help in planning. Computer languages can be processed with a
computer. An example of this abstraction process is the generational development of programming languages from
the machine language to the assembly language and the high-level language. Each stage can be used as a stepping
stone for the next stage. The language abstraction continues for example in scripting languages and domain-specific
programming languages.
Within a programming language, some features let the programmer create new abstractions. These include the
subroutine, the module, and the software component. Some other abstractions such as software design patterns and
architectural styles remain invisible to a programming language and operate only in the design of a system.
Some abstractions try to limit the breadth of concepts a programmer needs by completely hiding the abstractions
they in turn are built on. Joel Spolsky has criticised these efforts by claiming that all abstractions are leaky — that
they can never completely hide the details below; however this does not negate the usefulness of abstraction. Some
abstractions are designed to interoperate with others, for example a programming language may contain a foreign
function interface for making calls to the lower-level language.
Language features
Programming languages
Different programming languages provide different types of abstraction, depending on the intended applications for
the language. For example:
• In object-oriented programming languages such as C++, Object Pascal, or Java, the concept of abstraction has
itself become a declarative statement - using the keywords virtual (in C++) or abstract (in Java). After such a
declaration, it is the responsibility of the programmer to implement a class to instantiate the object of the
declaration.
• Functional programming languages commonly exhibit abstractions related to functions, such as lambda
abstractions (making a term into a function of some variable), higher-order functions (parameters are functions),
bracket abstraction (making a term into a function of a variable).
Specification methods
Analysts have developed various methods to formally specify software systems. Some known methods include:
• Abstract-model based method (VDM, Z);
• Algebraic techniques (Larch, CLEAR, OBJ, ACT ONE, CASL);
• Process-based techniques (LOTOS, SDL, Estelle);
• Trace-based techniques (SPECIAL, TAM);
• Knowledge-based techniques (Refine, Gist).
Specification languages
Specification languages generally rely on abstractions of one kind or another, since specifications are typically
defined earlier in a project (and at a more abstract level) than an eventual implementation. The UML specification
language, for example, allows the definition of abstract classes, which remain abstract during the architecture and
specification phase of the project.
Control abstraction
Programming languages offer control abstraction as one of the main purposes of their use. Computer machines
understand operations at a very low level, such as moving some bits from one location of the memory to another
and producing the sum of two sequences of bits. Programming languages allow this to be expressed at a higher
level. For example, consider this statement written in a Pascal-like fashion:
a := (1 + 2) * 5
To a human, this seems a fairly simple and obvious calculation ("one plus two is three, times five is fifteen").
However, the low-level steps necessary to carry out this evaluation, and return the value "15", and then assign that
value to the variable "a", are actually quite subtle and complex. The values need to be converted to binary
representation (often a much more complicated task than one would think) and the calculations decomposed (by the
compiler or interpreter) into assembly instructions (again, which are much less intuitive to the programmer:
operations such as shifting a binary register left, or adding the binary complement of the contents of one register to
another, are simply not how humans think about the abstract arithmetical operations of addition or multiplication).
Finally, assigning the resulting value of "15" to the variable labeled "a", so that "a" can be used later, involves
additional 'behind-the-scenes' steps of looking up a variable's label and the resultant location in physical or virtual
memory, storing the binary representation of "15" to that memory location, etc.
Without control abstraction, a programmer would need to specify all the register/binary-level steps each time he
simply wanted to add or multiply a couple of numbers and assign the result to a variable. Such duplication of effort
has two serious negative consequences:
1. it forces the programmer to constantly repeat fairly common tasks every time a similar operation is needed
2. it forces the programmer to program for the particular hardware and instruction set
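As a trivial Java illustration (the method name is invented for the example), control abstraction lets the evaluate-and-assign
sequence be wrapped once in a subprogram, so callers repeat neither the arithmetic decomposition nor the storage details:

public class ControlAbstractionDemo {

    // The method name abstracts away register-level arithmetic and memory layout;
    // the compiler and runtime handle those details on every call.
    static int scaledSum(int x, int y, int factor) {
        return (x + y) * factor;
    }

    public static void main(String[] args) {
        int a = scaledSum(1, 2, 5);   // corresponds to  a := (1 + 2) * 5
        System.out.println(a);        // prints 15
    }
}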
Structured programming
Structured programming involves the splitting of complex program tasks into smaller pieces with clear flow-control
and interfaces between components, with a reduction of the potential for side-effects.
In a simple program, this may aim to ensure that loops have single or obvious exit points and (where possible) to
have single exit points from functions and procedures.
In a larger system, it may involve breaking down complex tasks into many different modules. Consider a system
which handles payroll on ships and at shore offices:
• The uppermost level may feature a menu of typical end-user operations.
• Within that could be standalone executables or libraries for tasks such as signing on and off employees or printing
checks.
• Within each of those standalone components there could be many different source files, each containing the
program code to handle a part of the problem, with only selected interfaces available to other parts of the
program. A sign on program could have source files for each data entry screen and the database interface (which
may itself be a standalone third party library or a statically linked set of library routines).
• Either the database or the payroll application also has to initiate the process of exchanging data between ship
and shore, and that data transfer task will often contain many other components.
These layers produce the effect of isolating the implementation details of one component and its assorted internal
methods from the others. Object-oriented programming embraced and extended this concept.
Data abstraction
Data abstraction enforces a clear separation between the abstract properties of a data type and the concrete details of
its implementation. The abstract properties are those that are visible to client code that makes use of the data
type—the interface to the data type—while the concrete implementation is kept entirely private, and indeed can
change, for example to incorporate efficiency improvements over time. The idea is that such changes are not
supposed to have any impact on client code, since they involve no difference in the abstract behaviour.
For example, one could define an abstract data type called lookup table which uniquely associates keys with values,
and in which values may be retrieved by specifying their corresponding keys. Such a lookup table may be
implemented in various ways: as a hash table, a binary search tree, or even a simple linear list of (key:value) pairs.
As far as client code is concerned, the abstract properties of the type are the same in each case.
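As a minimal sketch in Java (with invented names, following the lookup-table example rather than any particular library),
the abstract properties can be captured in an interface while a concrete representation stays hidden behind it:

import java.util.HashMap;
import java.util.Map;

// The interface captures only the abstract properties of the type.
interface LookupTable<K, V> {
    void put(K key, V value);
    V get(K key);
}

// One possible concrete implementation: a hash table.
class HashLookupTable<K, V> implements LookupTable<K, V> {
    private final Map<K, V> map = new HashMap<>();
    public void put(K key, V value) { map.put(key, value); }
    public V get(K key) { return map.get(key); }
}

public class LookupDemo {
    public static void main(String[] args) {
        // Client code depends only on the interface; swapping in a tree- or
        // list-based implementation would not change this code at all.
        LookupTable<String, Integer> table = new HashLookupTable<>();
        table.put("milk", 3);
        System.out.println(table.get("milk"));   // prints 3
    }
}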
Of course, this all relies on getting the details of the interface right in the first place, since any changes there can
have major impacts on client code. As one way to look at this: the interface forms a contract on agreed behaviour
between the data type and client code; anything not spelled out in the contract is subject to change without notice.
Languages that implement data abstraction include Ada and Modula-2. Object-oriented languages are commonly
claimed to offer data abstraction; however, their inheritance concept tends to put information in the interface that
more properly belongs in the implementation; thus, changes to such information end up impacting client code,
leading directly to the Fragile binary interface problem.
Abstraction in object oriented programming
In object-oriented programming theory, abstraction involves the facility to define objects that represent abstract
"actors" that can perform work, report on and change their state, and "communicate" with other objects in the
system. The term encapsulation refers to the hiding of state details, but extending the concept of data type from
earlier programming languages to associate behavior most strongly with the data, and standardizing the way that
different data types interact, is the beginning of abstraction. When abstraction proceeds into the operations defined,
enabling objects of different types to be substituted, it is called polymorphism. When it proceeds in the opposite
direction, inside the types or classes, structuring them to simplify a complex set of relationships, it is called
delegation or inheritance.
Various object-oriented programming languages offer similar facilities for abstraction, all to support a general
strategy of polymorphism in object-oriented programming, which includes the substitution of one type for another in
the same or similar role. Although not as generally supported, a configuration or image or package may predetermine
a great many of these bindings at compile-time, link-time, or load-time. This would leave only a minimum of such
bindings to change at run-time.
Common Lisp Object System or Self, for example, feature less of a class-instance distinction and more use of
delegation for polymorphism. Individual objects and functions are abstracted more flexibly to better fit with a shared
functional heritage from Lisp.
C++ exemplifies another extreme: it relies heavily on templates and overloading and other static bindings at
compile-time, which in turn has certain flexibility problems.
Although these examples offer alternate strategies for achieving the same abstraction, they do not fundamentally
alter the need to support abstract nouns in code - all programming relies on an ability to abstract verbs as functions,
nouns as data structures, and either as processes.
Consider for example a sample Java fragment to represent some common farm "animals" to a level of abstraction
suitable to model simple aspects of their hunger and feeding. It defines an Animal class to represent both the state of
the animal and its functions:
public class Animal extends LivingThing
{
     private Location loc;
     private double energyReserves;

     public boolean isHungry() {
          return energyReserves < 2.5;
     }
     public void eat(Food f) {
          // Consume food
          energyReserves += f.getCalories();
     }
     public void moveTo(Location l) {
          // Move to new location
          loc = l;
     }
}
With the above definition, one could create objects of type Animal and call their methods like this:
Animal thePig = new Animal();
Animal theCow = new Animal();
if (thePig.isHungry()) {
     thePig.eat(tableScraps);
}
if (theCow.isHungry()) {
     theCow.eat(grass);
}
theCow.moveTo(theBarn);
In the above example, the class Animal is an abstraction used in place of an actual animal, and LivingThing is a further
abstraction (in this case a generalisation) of Animal.
If one requires a more differentiated hierarchy of animals — to differentiate, say, those who provide milk from those
who provide nothing except meat at the end of their lives — that is an intermediary level of abstraction, probably
DairyAnimal (cows, goats) who would eat foods suitable to giving good milk, and Animal (pigs, steers) who would
eat foods to give the best meat-quality.
Such an abstraction could remove the need for the application coder to specify the type of food, so s/he could
concentrate instead on the feeding schedule. The two classes could be related using inheritance or stand alone, and
the programmer could define varying degrees of polymorphism between the two types. These facilities tend to vary
drastically between languages, but in general each can achieve anything that is possible with any of the others. A
great many operation overloads, data type by data type, can have the same effect at compile-time as any degree of
inheritance or other means to achieve polymorphism. The class notation is simply a coder's convenience.
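As a purely illustrative continuation of the Java example above (the members shown here are invented for the sketch),
the intermediate DairyAnimal abstraction could be introduced as a subclass:

// A further abstraction layered on the Animal class from the example above.
public class DairyAnimal extends Animal {
    private double milkYield;   // litres per day, a hypothetical attribute

    public void milk() {
        // Producing milk draws on the energy reserves managed by the base class.
        milkYield = 0.0;        // reset after milking
    }

    @Override
    public void eat(Food f) {
        // A DairyAnimal could restrict or specialise feeding to foods suited to giving good milk.
        super.eat(f);
    }
}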
Object-oriented design
Decisions regarding what to abstract and what to keep under the control of the coder become the major concern of
object-oriented design and domain analysis—actually determining the relevant relationships in the real world is the
concern of object-oriented analysis or legacy analysis.
In general, to determine appropriate abstraction, one must make many small decisions about scope (domain
analysis), determine what other systems one must cooperate with (legacy analysis), then perform a detailed
object-oriented analysis which is expressed within project time and budget constraints as an object-oriented design.
In our simple example, the domain is the barnyard, the live pigs and cows and their eating habits are the legacy
constraints, the detailed analysis is that coders must have the flexibility to feed the animals what is available and thus
there is no reason to code the type of food into the class itself, and the design is a single simple Animal class of
which pigs and cows are instances with the same functions. A decision to differentiate DairyAnimal would change
the detailed analysis but the domain and legacy analysis would be unchanged—thus it is entirely under the control of
the programmer, and we refer to abstraction in object-oriented programming as distinct from abstraction in domain
or legacy analysis.
Considerations
When discussing formal semantics of programming languages, formal methods or abstract interpretation,
abstraction refers to the act of considering a less detailed, but safe, definition of the observed program behaviors.
For instance, one may observe only the final result of program executions instead of considering all the intermediate
steps of executions. Abstraction is defined relative to a concrete (more precise) model of execution.
Abstraction may be exact or faithful with respect to a property if one can answer a question about the property
equally well on the concrete or abstract model. For instance, if we wish to know what the result of the evaluation of a
mathematical expression involving only integers and the operations +, −, and × is worth modulo n, we need only perform all
operations modulo n (a familiar form of this abstraction is casting out nines).
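A small worked illustration in Java: evaluating an arbitrary integer expression both concretely and entirely modulo 9
answers the question "what is the result worth modulo 9?" identically, which is all this abstraction promises (the numbers
are arbitrary):

public class ModuloAbstraction {
    public static void main(String[] args) {
        int exact = (123 + 456) * 789;                                 // full, concrete evaluation
        int abstracted = ((123 % 9 + 456 % 9) % 9) * (789 % 9) % 9;    // every step done modulo 9

        // Both answer the abstract question identically.
        System.out.println(exact % 9);     // 0
        System.out.println(abstracted);    // 0
    }
}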
Abstractions, however, though not necessarily exact, should be sound. That is, it should be possible to get sound
answers from them—even though the abstraction may simply yield a result of undecidability. For instance, we may
abstract the students in a class by their minimal and maximal ages; if one asks whether a certain person belongs to
that class, one may simply compare that person's age with the minimal and maximal ages; if his age lies outside the
range, one may safely answer that the person does not belong to the class; if it does not, one may only answer "I
don't know".
The level of abstraction included in a programming language can influence its overall usability. The Cognitive
dimensions framework includes the concept of abstraction gradient in a formalism. This framework allows the
designer of a programming language to study the trade-offs between abstraction and other characteristics of the
design, and how changes in abstraction influence the language usability.
Abstractions can prove useful when dealing with computer programs, because non-trivial properties of computer
programs are essentially undecidable (see Rice's theorem). As a consequence, automatic methods for deriving
information on the behavior of computer programs either have to drop termination (on some occasions, they may
fail, crash or never yield a result), soundness (they may provide false information), or precision (they may
answer "I don't know" to some questions).
Abstraction is the core concept of abstract interpretation. Model checking generally takes place on abstract versions
of the studied systems.
Levels of abstraction
Computer science commonly presents levels (or, less commonly, layers) of abstraction, wherein each level
represents a different model of the same information and processes, but uses a system of expression involving a
unique set of objects and compositions that apply only to a particular domain. [2] Each relatively abstract, "higher"
level builds on a relatively concrete, "lower" level, which tends to provide an increasingly "granular" representation.
For example, gates build on electronic circuits, binary on gates, machine language on binary, programming language
on machine language, applications and operating systems on programming languages. Each level is embodied, but
not determined, by the level beneath it, making it a language of description that is somewhat self-contained.
Database systems
Since many users of database systems lack in-depth familiarity with computer data-structures, database developers
often hide complexity through the following levels:
Data abstraction levels of a database system
Physical level: The lowest level of abstraction describes how a system
actually stores data. The physical level describes complex low-level
data structures in detail.
Logical level: The next higher level of abstraction describes what data
the database stores, and what relationships exist among those data. The
logical level thus describes an entire database in terms of a small
number of relatively simple structures. Although implementation of the
simple structures at the logical level may involve complex physical
level structures, the user of the logical level does not need to be aware
of this complexity. This is referred to as physical data independence.
Database administrators, who must decide what information to keep in a database, use the logical level of
abstraction.
View level: The highest level of abstraction describes only part of the entire database. Even though the logical level
uses simpler structures, complexity remains because of the variety of information stored in a large database. Many
users of a database system do not need all this information; instead, they need to access only a part of the database.
The view level of abstraction exists to simplify their interaction with the system. The system may provide many
views for the same database.
Layered architecture
The ability to provide a design of different levels of abstraction can
• simplify the design considerably
• enable different role players to effectively work at various levels of abstraction
Systems design and business process design can both use this. Some design processes specifically generate designs
that contain various levels of abstraction.
Layered architecture partitions the concerns of the application into stacked groups (layers). It is a technique used in
designing computer software, hardware, and communications in which system or network components are isolated in
layers so that changes can be made in one layer without affecting the others.
Notes
This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed
under the GFDL.
[1] Thefreedictionary.com (http://www.thefreedictionary.com/abstraction)
[2] Luciano Floridi, Levellism and the Method of Abstraction (http://www.philosophyofinformation.net/pdf/latmoa.pdf), IEG Research Report 22.11.04
Further reading
• Abelson, Harold, Gerald Jay Sussman with Julie Sussman. (1996) Structure and Interpretation of Computer Programs (Second edition). The MIT Press. ISBN 0-262-01153-0. (See http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-10.html)
• Joel Spolsky. The Law of Leaky Abstractions. 2002-11-11. (http://www.joelonsoftware.com/articles/LeakyAbstractions.html)
ADO.NET
ADO.NET
Operating system: Microsoft Windows
Type: Software framework
License: MS-EULA; BCL under Microsoft Reference License
Website: ADO.NET Overview on MSDN [1]
ADO.NET (ActiveX Data Object for .NET) is a set of computer software components that programmers can use to
access data and data services. It is a part of the base class library that is included with the Microsoft .NET
Framework. It is commonly used by programmers to access and modify data stored in relational database systems,
though it can also access data in non-relational sources. ADO.NET is sometimes considered an evolution of ActiveX
Data Objects (ADO) technology, but was changed so extensively that it can be considered an entirely new product.
Architecture
This technology forms a part of .NET Framework 3.0 (having been part of the framework since version 1.0).
ADO.NET is conceptually divided into consumers and
data providers. The consumers are the applications that
need access to the data, and the providers are the software
components that implement the interface and thereby
provide the data to the consumer.
ADO.NET and Visual Studio
Functionality exists in the Visual Studio IDE to create
specialized subclasses of the DataSet classes for a
particular database schema, allowing convenient access to
each field through strongly typed properties. This helps
catch more programming errors at compile-time and
makes the IDE's Intellisense feature more beneficial.
ADO.NET and O/R Mapping
Entity Framework
The ADO.NET Entity Framework is a set of
data-access APIs for the Microsoft .NET Framework,
similar to the Java Persistence API, targeting the version of ADO.NET that ships with .NET Framework 4.0.
ADO.NET Entity Framework is included with .NET Framework 4.0 and Visual Studio 2010, released in April 2010.
An Entity Framework Entity is an object which has a key representing the primary key of a logical datastore entity.
A conceptual Entity Data Model (Entity-relationship model) is mapped to a datastore schema model. Using the
Entity Data Model, the Entity Framework allows data to be treated as entities independently of their underlying
datastore representations.
Entity SQL, a SQL-like language, serves for querying the Entity Data Model (instead of the underlying datastore).
Similarly, LINQ extension LINQ to Entities provides typed querying on the Entity Data Model. Entity SQL and
LINQ to Entities queries are converted internally into a Canonical Query Tree which is then converted into a query
understandable to the underlying datastore (e.g. into SQL in the case of a relational database). The entities can use
their relationships, with their changes committed back to the datastore.
External links
ADO.NET
• ADO.NET Overview on MSDN [1]
• ADO.NET for the ADO Programmer [2]
• ADO.NET Connection Strings [3]
• ADO.NET Team Blog [4]
Incubation Projects
• Data Access Incubation Projects [5]
• Jasper [6], download [7]
References
[1] http://msdn2.microsoft.com/en-us/library/aa286484.aspx
[2] http://msdn2.microsoft.com/en-us/library/ms973217.aspx
[3] http://www.devlist.com/ConnectionStringsPage.aspx
[4] http://blogs.msdn.com/adonet/
[5] http://msdn2.microsoft.com/en-us/data/bb419139.aspx
[6] http://blogs.msdn.com/adonet/archive/2007/04/30/project-codename-jasper-announced-at-mix-07.aspx
[7] http://www.microsoft.com/downloads/details.aspx?FamilyId=471BB3AC-B31A-49CD-A567-F2E286715C8F&displaylang=en
ADO.NET data provider
An ADO.NET data provider is a software component enabling an ADO.NET consumer to interact with a data
source. ADO.NET data providers are analogous to ODBC drivers, JDBC drivers, and OLE DB providers.
ADO.NET providers can be created to access such simple data stores as a text file and spreadsheet, through to such
complex databases as Oracle, Microsoft SQL Server, MySQL, PostgreSQL, SQLite, DB2, Sybase ASE, and many
others. They can also provide access to hierarchical data stores such as email systems.
However, because different data store technologies can have different capabilities, not every ADO.NET provider can
implement every possible interface available in the ADO.NET standard. Microsoft describes the availability of an
interface as "provider-specific," as it may not be applicable depending on the data store technology involved. Note
also that providers may augment the capabilities of a data store; these capabilities are known as services in Microsoft
parlance.
ADO.NET data providers
• Universal Data Access [1] and Virtuoso [2] components from OpenLink Software provide full Entity Framework functionality as well as simple ADO.NET-based access to Oracle, Microsoft SQL Server, IBM DB2, Progress/OpenEdge, PostgreSQL, MySQL, Ingres, Informix, and others
• Connector/Net [3]: native data provider for MySQL database server (free)
• DataDirect Connect for ADO.NET [4]: data providers for Oracle, DB2, Microsoft SQL Server, and Sybase database servers from DataDirect (commercial)
• DB2 .NET [5]: data provider for DB2 database server from IBM (free)
• dotConnect [6]: data providers for Oracle, MySQL, PostgreSQL, SQL Server, and SQLite database servers from Devart (free and commercial)
• Empress ADO.NET Data Provider [7]: data provider for Empress Embedded Database from Empress Software, with local access and client-server operation
• Npgsql [8]: open source data provider for PostgreSQL database server (free)
• Oracle Data Provider for .NET (ODP.NET) [9]: data provider for Oracle database server from Oracle (free)
• VistaDB [10]: 100% managed ADO.NET provider with SQL Server-like syntax
• EffiProz [11]: open source ADO.NET provider for the EffiProz pure C# database
• RDM Server [12]: data provider for the RDM Server database system from Birdstep Technology, Inc. (free)
• System.Data.SQLite [13]: open source ADO.NET provider for SQLite databases (free)
External links
• "ADO.NET Data Providers"
[14]
. Microsoft. MSDN: Data Developer Center. Retrieved 22 March 2011.
References
[1] http://uda.openlinksw.com/dotnet/
[2] http://wikis.openlinksw.com/dataspace/owiki/wiki/VirtuosoWikiWeb/VirtAdoNet35Provider
[3] http://dev.mysql.com/downloads/connector/net/
[4] http://www.datadirect.com/products/net/index.ssp
[5] http://www-01.ibm.com/software/data/db2/ad/dotnet.html
[6] http://www.devart.com/products-adonet.html
[7] http://www.empress.com/whatsnew/techNews/Jul2010EmpressADO.NETProvider.html
[8] http://npgsql.projects.postgresql.org/
[9] http://www.oracle.com/technology/tech/windows/odpnet/index.html
[10] http://www.vistadb.net/vistadb/default.aspx
[11] http://www.effiproz.com/
[12] http://www.raima.com/
[13] http://sqlite.phxsoftware.com/
[14] http://msdn.microsoft.com/en-us/data/dd363565
WCF Data Services
WCF Data Services (formerly ADO.NET Data Services, [1] codename "Astoria") [2] is a platform for what
Microsoft calls Data Services. It is actually a combination of the runtime and a web service through which the
services are exposed. In addition, it also includes the Data Services Toolkit which lets Astoria Data Services be
created from within ASP.NET itself. The Astoria project was announced at MIX 2007, and the first developer
preview was made available on April 30, 2007. The first CTP was made available as a part of the ASP.NET 3.5
Extensions Preview. The final version was released as part of Service Pack 1 of the .NET Framework 3.5 on August
11, 2008. The name change from ADO.NET Data Services to WCF Data Services was announced at the 2009 PDC.
Overview
ADO.NET Data Services exposes data, represented as Entity Data Model (EDM) objects, via web services accessed
over HTTP. The data can be addressed using a REST-like URI. The Astoria service, when accessed via the HTTP
GET method with such a URI, will return the data. The web service can be configured to return the data in either
plain XML, JSON or RDF+XML. In the initial release, formats like RSS and ATOM are not supported, though they
may be in the future. In addition, using other HTTP methods like PUT, POST or DELETE, the data can be updated
as well. POST can be used to create new entities, PUT for updating an entity, and DELETE for deleting an entity.
The URIs representing the data will contain the physical location of the service, as well as the service name. In
addition, it will also need to specify an EDM Entity-Set or a specific entity instance, as in respectively
http://dataserver/service.svc/MusicCollection
or
http://dataserver/service.svc/MusicCollection[SomeArtist]
The former will list all entities in the Collection set, whereas the latter will list only the entity that is indexed by
SomeArtist.
In addition, the URIs can also specify a traversal of a relationship in the Entity Data Model. For example,
http://dataserver/service.svc/MusicCollection[SomeSong]/Genre
traverses the relationship Genre (in SQL parlance, joins with the Genre table) and retrieves all instances of Genre
that are associated with the entity SomeSong. Simple predicates can also be specified in the URI, like
http://dataserver/service.svc/MusicCollection[SomeArtist]/ReleaseDate[Year eq 2006]
will fetch the items that are indexed by SomeArtist and had their release in 2006. Filtering and partition information
can also be encoded in the URL as
http://dataserver/service.svc/MusicCollection?$orderby=ReleaseDate&$skip=100&$top=50
It is important to note that although the presence of the skip and top keywords indicates paging support, in Data Services
version 1 there is no method of determining the number of records available, and it is thus impossible to determine how
many pages there may be. The OData 2.0 spec adds support for the $count path segment (to return just a count of
entities) and $inlineCount (to retrieve a page worth of entities and a total count without a separate round-trip). [3]
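Since the service is exposed over plain HTTP, any HTTP client can consume it. As a rough sketch (not taken from the
article's sources), the following Java fragment issues a GET against the hypothetical example URI used above and asks for
the JSON representation:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class DataServiceClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical service and entity set from the examples above,
        // with $orderby/$top options encoded in the query string.
        URL url = new URL("http://dataserver/service.svc/MusicCollection?$orderby=ReleaseDate&$top=50");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");                           // GET retrieves entities
        conn.setRequestProperty("Accept", "application/json");  // ask for the JSON representation

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);                       // raw response body
            }
        }
    }
}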
References
[1] "Simplifying our n-tier development platform: making 3 things 1 thing" (http:// blogs. msdn.com/ astoriateam/ archive/ 2009/ 11/ 17/
simplifying-our-n-tier-development-platform-making-3-things-1-thing.aspx). ADO.NET Data Services Team Blog. 2009-11-17. . Retrieved
2009-12-17.
[2] "ADO.NET Data Services CTP Released!" (http:// blogs. msdn. com/ data/ archive/2007/ 12/ 10/ ado-net-data-services-ctp-released.aspx). .
Retrieved 2007-11-12.
[3] http:/ / msdn. microsoft.com/ en-us/ library/ee373845. aspx
• "Codename "Astoria": Data Services for the Web" (http:/ / blogs. msdn. com/ pablo/ archive/2007/ 04/ 30/
codename-astoria-data-services-for-the-web.aspx). Retrieved 2007-04-30.
• ADO.NET Data Services Framework (formerly "Project Astoria") (http:// astoria. mslivelabs. com/ )
External links
• Using Microsoft ADO.NET Data Services (http://msdn.microsoft.com/en-us/library/cc907912.aspx)
• ASP.NET 3.5 Extensions Preview (http://www.asp.net/downloads/3.5-extensions/)
• ADO.NET Data Services (Project Astoria) Team Blog (http://blogs.msdn.com/astoriateam/)
• Access Cloud Data with Astoria: ENT News Online (http://entmag.com/news/article.asp?EditorialsID=9105)
Age-Based Content Rating System
The Age-Based Content Rating System (ABC Rating System) is a proprietary classification of web addresses
based on age appropriate content stored in the metadata of websites. The system was developed by Covenant Eyes,
Inc., which also pioneered the concept of Accountability software in 2000, as a means to accurately report on and
block websites for their members and subscribers.
Rating levels
The ABC Rating System categorizes all URLs into one of six age-based ratings. [1]
E (Everyone): Generally appropriate for all ages.
Y (Youth): Generally appropriate for all ages, but parents might object for younger children.
T (Teen): Generally appropriate for adults and teenagers, but parents might object to these sites for children. This content may include social networking sites like Facebook, chat rooms, and games with violence.
MT (Mature Teen): Generally appropriate for adults and mature teenagers. This content may include mild profanity or contain material inappropriate for younger teens.
M (Mature): May be considered appropriate by many mature adults, but is generally inappropriate for teenagers. This content may include dating sites, lingerie, crude humor, intense violence, and material of a sexual nature.
HM (Highly Mature): Likely to be inappropriate for everyone. This content may include anonymizers, nudity, erotica, and pornography.
Concept and history
In March 2011, Covenant Eyes' President Ron DeHaas stated the purpose of rating Internet content: "Our mission is to
make it easy for families to talk about how the Internet is used in their home. Our reports allow parents to know how
each of their kids use the Internet, and the age-based ratings for every web page visited helps parents tailor
conversations to each child.”
Covenant Eyes released its Internet accountability service for computers using the Windows operating system in the
summer of 2000 and the same service for Macintosh computers in the summer of 2006. [2] The service listed all
websites visited by a subscriber in an accountability report, placing questionable or pornographic websites at the top
of this log. The program "scored" each web address based on objectionable terms or phrases and "flagged" or
highlighted these websites on the accountability logs. [3] [4] [5]
Between 2000 and 2010, Covenant Eyes used a numerical scoring system for web addresses. [6] In 2010 the
numerical score range was zero (0) to sixty-seven (67). [7]
In December 2010, the numerical scoring system was replaced with a letter rating system. This was done to more
closely match entertainment industry standards, such as the content ratings used by the Entertainment Software
Rating Board or the film ratings used by the Motion Picture Association of America. DeHaas is quoted as saying, "Just
as movie ratings are highly important to every movie viewer, ratings of websites will certainly become a household
concept for the Internet." [8]
References
[1] "What do the ratings mean?" (http://www.covenanteyes.com/support-articles/what-do-the-ratings-mean). Covenanteyes.com. Retrieved 2011-04-04.
[2] "Internet Accountability Available for Mac Computers Includes Intel Versions" (http://www.prweb.com/releases/2006/6/prweb397683.htm). PRWeb.com. 2006-06-14. Retrieved 2011-04-05.
[3] "Parents can monitor Internet use" (http://news.google.com/newspapers?id=-AgoAAAAIBAJ&sjid=LQUGAAAAIBAJ&pg=5851,5366085&dq=covenant-eyes&hl=en). The Argus-Press. 2001-12-31. Retrieved 2011-04-05.
[4] "Covenant Eyes welcomes newcomer to it accountability software staff" (http://news.google.com/newspapers?id=KHAiAAAAIBAJ&sjid=K60FAAAAIBAJ&pg=2759,4525775&dq=covenant-eyes&hl=en). The Argus-Press. 2003-06-20. Retrieved 2011-04-05.
[5] "Covenant Eyes Version 3 Now Reports System Restore Usage as Part of its Patent-Pending Tampering Notification System" (http://news.google.com/newspapers?id=KHAiAAAAIBAJ&sjid=K60FAAAAIBAJ&pg=2759,4525775&dq=covenant-eyes&hl=en). PRWeb.com. 2004-10-07. Retrieved 2011-04-05.
[6] "Covenant Eyes Provides Effective Monitoring and Filtering for Sites Like Myspace and Google Images" (http://www.prweb.com/releases/2006/2/prweb350631.htm). PRWeb.com. 2006-02-25. Retrieved 2011-04-05.
[7] "How do the numerical scores correspond to the new letter ratings?" (http://www.covenanteyes.com/support-articles/how-do-the-numerical-scores-correspond-to-the-new-letter-ratings/). Covenanteyes.com. Retrieved 2011-04-05.
[8] "Internet Revolution: How Rating the Web Changes Everything" (http://www.prweb.com/releases/2010/12/prweb4922264.htm). PRWeb.com. 2010-12-27. Retrieved 2011-04-05.
External links
• Covenant Eyes official web site (http://www.covenanteyes.com)
Aggregate (Data Warehouse)
Aggregates are used in dimensional models of the data warehouse to produce dramatic positive effects on the time it
takes to query large sets of data. In its simplest form, an aggregate is a summary table that can be derived by
performing a GROUP BY SQL query. A more common use of aggregates is to take a dimension and change the
granularity of this dimension. When changing the granularity of the dimension, the fact table has to be partially
summarized to fit the new grain of the new dimension, thus creating new dimensional and fact tables that fit this new
level of grain. Aggregates are sometimes referred to as pre-calculated summary data, since aggregations are usually
precomputed, partially summarized data that are stored in new aggregated tables. When facts are aggregated, it is
either done by eliminating dimensionality or by associating the facts with a rolled-up dimension. Rolled-up
dimensions should be shrunken versions of the dimensions associated with the granular base facts. This way, the
aggregated dimension tables should conform to the base dimension tables. [1] The reason aggregates can make
such a dramatic improvement in the performance of the data warehouse is the reduction in the number of rows to be
accessed when responding to a query. [2]
Kimball, who is widely regarded as one of the original architects of data warehousing, says: [3]
The single most dramatic way to affect performance in a large data warehouse is to provide a proper set
of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very
significant effect on performance, in some cases speeding queries by a factor of one hundred or even
one thousand. No other means exist to harvest such spectacular gains.
Having aggregates and atomic data increases the complexity of the dimensional model. This complexity should be
transparent to the users of the data warehouse, thus when a request is made, the data warehouse should return data
from the table with the correct grain. So when requests to the data warehouse are made, aggregate navigator
functionality should be implemented, to help determine the correct table with the correct grain. The number of
possible aggregations is determined by every possible combination of dimension granularities. Since it would
produce a lot of overhead to build all possible aggregations, it is a good idea to choose a subset of tables on which to
make aggregations. The best way to choose this subset and decide which aggregations to build is to monitor queries
and design aggregations to match query patterns. [4]
Aggregate navigator
Having aggregate data in the dimensional model makes the environment more complex. To make this extra
complexity transparent to the user, functionality known as aggregate navigation is used to query the dimensional and
fact tables with the correct grain level. The aggregate navigation essentially examines the query to see if it can be
answered using a smaller, aggregate table. [5]
Implementations of aggregate navigators can be found in a range of technologies:
• OLAP engines
• Materialized views
• Relational OLAP (ROLAP) services
• BI application servers or query tools
It is generally recommended to use one of the first three technologies, since the benefits in the latter case are
restricted to a single front-end BI tool. [6]
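The decision an aggregate navigator makes can be pictured very simply: among the tables whose grain covers every
dimension the query requests, choose the one with the fewest rows. The Java fragment below is a conceptual sketch with
invented table names, not a description of any particular product:

import java.util.List;
import java.util.Set;

class FactTable {
    final String name;
    final Set<String> grain;   // dimensions available at this table's grain
    final long rowCount;
    FactTable(String name, Set<String> grain, long rowCount) {
        this.name = name; this.grain = grain; this.rowCount = rowCount;
    }
}

public class AggregateNavigator {

    /** Picks the smallest table whose grain can answer a query over the requested dimensions. */
    static FactTable choose(List<FactTable> tables, Set<String> requestedDimensions) {
        FactTable best = null;
        for (FactTable t : tables) {
            if (t.grain.containsAll(requestedDimensions)
                    && (best == null || t.rowCount < best.rowCount)) {
                best = t;   // fewer rows to scan means a faster query
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<FactTable> tables = List.of(
            new FactTable("sales_base",           Set.of("day", "product", "store"), 1_000_000_000L),
            new FactTable("sales_by_month_store", Set.of("month", "store"),          2_000_000L));

        // A monthly-by-store query is answered from the aggregate, not the base facts.
        System.out.println(choose(tables, Set.of("month", "store")).name);
    }
}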
Problems/challenges
• Since dimensional models only gain from aggregates on large data sets, at what size of data set should one
start considering using aggregates?
• Similarly, does a data warehouse always handle data sets that are too large for direct queries, or is it sometimes a
good idea to omit the aggregate tables when starting a new data warehouse project? In other words, will omitting
aggregates in the first iteration of building a new data warehouse make the structure of the dimensional model
simpler?
References
[1] Ralph Kimball, Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, Second Edition, Wiley Computer
Publishing, 2002. ISBN 0-471-20024-7, Page 356
[2] Christopher Adamson, Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance, Wiley Publishing, Inc., 2006 ISBN
978-0-471-77709-0, Page 23
[3] "Aggregate Navigation With (Almost) No Metadata" (http:/ / www. rkimball.com/ html/ articles_search/ articles1996/ 9608d54.html).
1995-08-15. . Retrieved 2010-11-22.
[4] Ralph Kimball et al., The Data Warehouse Toolkit, Second Edition, Wiley Publishing, Inc., 2008. ISBN 978-0-470-14977-5, Page 355
[5] Ralph Kimball et al., The Data Warehouse Toolkit, Second Edition, Wiley Publishing, Inc., 2008. ISBN 978-0-470-14977-5, Page 137
[6] Ralph Kimball et al., The Data Warehouse Toolkit, Second Edition, Wiley Publishing, Inc., 2008. ISBN 978-0-470-14977-5, Page 354
Data archaeology
Data archaeology refers to the art and science of recovering computer data encoded in now-obsolete media or
formats.
The term originally appeared in 1993 as part of the Global Oceanographic Data Archaeology and Rescue Project
(GODAR). The original impetus for data archaeology came from the need to recover computerized records of
climatic conditions stored on old computer tape, which can provide valuable evidence for testing theories of climate
change. These approaches allowed the reconstruction of an image of the Arctic that had been captured by the
Nimbus 2 satellite on September 23, 1966, in higher resolution than ever seen before from this type of data. [1]
NASA also utilizes the services of data archaeologists to recover information stored on 1960s-era vintage computer
tape, as exemplified by the Lunar Orbiter Image Recovery Project (LOIRP). [2]
To prevent the need for data archaeology, creators and holders of digital documents should take care to ensure digital
preservation.
References
[1] Techno-archaeology rescues climate data from early satellites (http://nsidc.org/monthlyhighlights/january2010.html) U.S. National Snow and Ice Data Center (NSIDC), January 2010. Archived (http://www.webcitation.org/5xN1sNyDp)
[2] LOIRP Overview (http://www.nasa.gov/topics/moonmars/features/LOIRP/) NASA website, November 14, 2008. Archived (http://www.webcitation.org/5xN1DjLG4)
• World Wide Words: Data Archaeology (http://www.worldwidewords.org/turnsofphrase/tp-dat1.htm)
• O'Donnell, James Joseph. Avatars of the Word: From Papyrus to Cyberspace. Harvard University Press, 1998.
Archive site
In web archiving, an archive site is a website that stores information on, or actual copies of, webpages from the past for
anyone to view.
Common techniques
Two common techniques are (1) using a web crawler and (2) relying on user submissions.
1. By using a web crawler the service will not depend on an active community for its content, and can thereby build a
larger database faster, which usually results in the community growing larger as well. However, web site
developers and system administrators do have the ability to block these robots from accessing certain web pages
(using a robots.txt file; see the example below).
2. While it can be difficult to start such services due to potentially low rates of user submission, this system can
yield some of the best results. By crawling web pages one is only able to obtain the information the public has
bothered to post to the Internet. They may have not bothered to post it due to not thinking anyone would be
interested in it, lack of a proper medium, etc. However, if they see someone wants their information then they
may be more apt to submit it.
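For instance, a site owner who wanted to keep all crawlers out of a particular directory could publish a robots.txt such as
the following (the path is purely illustrative):

User-agent: *
Disallow: /private-archive/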
Examples
Google Groups
On February 12, 2001, Google acquired the Usenet discussion group archives from Deja.com and turned it into their
Google Groups service [1]. They allow users to search old discussions with Google's search technology, while still
allowing users to post to the mailing lists.
Internet Archive
The Internet Archive (official website [2]) is building a compendium of websites and digital media. Starting in 1996,
the Archive has been employing a web crawler to build up its database. It is one of the best-known archive sites.
TextFiles.com
TextFiles.com [3] is a large library of old text files maintained by Jason Scott Sadofsky. Its mission is to archive the
old documents that had floated around the bulletin board systems (BBS) of his youth and to document other people's
experiences on the BBSes.
PANDORA Archive
PANDORA (Pandora Archive), founded in 1996 by the National Library of Australia, stands for Preserving and
Accessing Networked Documentary Resources of Australia, which encapsulates their mission. They provide a
long-term catalog of select online publications and web sites authored by Australians or that are of an Australian
topic. They employ their PANDAS (PANDORA Digital Archiving System) when building their catalog.
Nextpoint
Nextpoint [4] offers automated, cloud-based SaaS for marketing, compliance and litigation-related needs, including
electronic discovery.
References
[1] http://www.google.com/press/pressrel/pressrelease48.html
[2] http://www.archive.org
[3] http://www.textfiles.com
[4] http://www.nextpoint.com/preservation.html
Association rule learning
In data mining, association rule learning is a popular and well researched method for discovering interesting
relations between variables in large databases. Piatetsky-Shapiro [1] describes analyzing and presenting strong rules
discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal et
al. [2] introduced association rules for discovering regularities between products in large scale transaction data
recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} ⇒ {burger}
found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or
she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing
activities such as, e.g., promotional pricing or product placements. In addition to the above example from market
basket analysis, association rules are employed today in many application areas including Web usage mining,
intrusion detection and bioinformatics.
Definition
Example database with 4 items and 5 transactions
transaction ID milk bread butter beer
1 1 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 1 1 0
5 0 1 0 0
Following the original definition by Agrawal et al. [2] the problem of association rule mining is defined as: Let
I = {i_1, i_2, …, i_n} be a set of n binary attributes called items. Let D = {t_1, t_2, …, t_m} be a set of
transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the
items in I. A rule is defined as an implication of the form X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅. The
sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and consequent
(right-hand side or RHS) of the rule, respectively.
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is
I = {milk, bread, butter, beer}, and a small database containing the items (1 codes presence and 0 absence of
an item in a transaction) is shown in the table above. An example rule for the supermarket could be
{butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk.
Note: this example is extremely small. In practical applications, a rule needs a support of several hundred
transactions before it can be considered statistically significant, and datasets often contain thousands or millions of
transactions.
Useful Concepts
To select interesting rules from the set of all possible rules, constraints on various measures of significance and
interest can be used. The best-known constraints are minimum thresholds on support and confidence.
• The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain
the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2 since it
occurs in 20% of all transactions (1 out of 5 transactions).
• The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). For example, the rule
{milk, bread} ⇒ {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for
50% of the transactions containing milk and bread the rule is correct.
• Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS
of the rule in transactions under the condition that these transactions also contain the LHS. [3]
• The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y)), or the ratio of the observed support to
that expected if X and Y were independent. The rule {milk, bread} ⇒ {butter} has a lift of
0.2 / (0.4 × 0.4) = 1.25.
• The conviction of a rule is defined as conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)). The rule
{milk, bread} ⇒ {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2, and can be interpreted as the ratio of the
expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect
prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. In this
example, the conviction value of 1.2 shows that the rule {milk, bread} ⇒ {butter} would be incorrect
20% more often (1.2 times as often) if the association between X and Y was purely random chance.
• The property of succinctness (characterized by clear, precise expression in few words) of a constraint. A
constraint is succinct if we are able to explicitly write down all itemsets that satisfy the constraint.
Example: Constraint C = S.Type = {NonFood}. Products that would satisfy this constraint are, for example,
{Headphones, Shoes, Toilet paper}.
Usage example: Instead of using the Apriori algorithm to obtain the frequent itemsets, we can instead create all the
itemsets and run support counting only once.
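These measures are straightforward to compute directly. The following Java sketch counts support and confidence over the
example database used in this section (the transaction sets mirror the table above; the class and method names are
illustrative):

import java.util.List;
import java.util.Set;

public class RuleMeasures {

    static double support(List<Set<String>> db, Set<String> itemset) {
        long count = db.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) count / db.size();
    }

    static double confidence(List<Set<String>> db, Set<String> lhs, Set<String> rhs) {
        Set<String> union = new java.util.HashSet<>(lhs);
        union.addAll(rhs);
        return support(db, union) / support(db, lhs);
    }

    public static void main(String[] args) {
        // The five transactions from the example table (an item is listed when its column is 1).
        List<Set<String>> db = List.of(
            Set.of("milk", "bread"),
            Set.of("butter"),
            Set.of("beer"),
            Set.of("milk", "bread", "butter"),
            Set.of("bread"));

        System.out.println(support(db, Set.of("milk", "bread", "butter")));           // 0.2
        System.out.println(confidence(db, Set.of("milk", "bread"), Set.of("butter"))); // 0.5
    }
}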
Process
Frequent itemset lattice, where the color of the box indicates how many transactions contain the combination of items.
Note that lower levels of the lattice can contain at most the minimum number of their parents' items; e.g. {a, c} can
occur in at most min(supp(a), supp(c)) transactions. This is called the downward-closure property. [2]
Association rules are usually required to satisfy a user-specified
minimum support and a user-specified minimum confidence at the
same time. Association rule generation is usually split up into two
separate steps:
1. First, minimum support is applied to find all frequent itemsets in a
database.
2. Second, these frequent itemsets and the minimum confidence
constraint are used to form rules.
While the second step is straightforward, the first step needs more attention.
Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets (item
combinations). The set of possible itemsets is the power set over I and has size 2^n − 1 (excluding the empty set,
which is not a valid itemset). Although the size of the power set grows exponentially in the number of items n in I,
efficient search is possible using the downward-closure property of support[2][4] (also called anti-monotonicity[5]),
which guarantees that for a frequent itemset all its subsets are also frequent, and thus that for an infrequent
itemset all its supersets must also be infrequent. Exploiting this property, efficient algorithms (e.g., Apriori[6]
and Eclat[7]) can find all frequent itemsets.
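For a handful of items the two steps can be carried out by brute force, as in the Python sketch below (illustrative only; names are our own). It enumerates all 2^n − 1 non-empty itemsets explicitly, which is exactly the exponential search that the downward-closure pruning in Apriori and Eclat avoids:

from itertools import chain, combinations

transactions = [
    {"milk", "bread"}, {"butter"}, {"beer"},
    {"milk", "bread", "butter"}, {"bread"},
]
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def frequent_itemsets(min_support):
    # Step 1: brute force over the 2^n - 1 non-empty subsets of the item set.
    candidates = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    return {frozenset(c): support(c)
            for c in candidates if support(c) >= min_support}

def rules(min_support, min_confidence):
    # Step 2: split each frequent itemset into LHS => RHS and keep confident rules.
    frequent = frequent_itemsets(min_support)
    result = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, k)):
                conf = supp / frequent[lhs]     # supp(X ∪ Y) / supp(X)
                if conf >= min_confidence:
                    result.append((set(lhs), set(itemset - lhs), supp, conf))
    return result

for lhs, rhs, supp, conf in rules(min_support=0.2, min_confidence=0.75):
    print(lhs, "=>", rhs, supp, conf)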
History
The concept of association rules was popularised particularly due to the 1993 article of Agrawal,
[2]
which has
acquired more than 6000 citations according to Google Scholar, as of March 2008, and is thus one of the most cited
papers in the Data Mining field. However, it is possible that what is now called "association rules" is similar to what
appears in the 1966 paper
[8]
on GUHA, a general data mining method developed by Petr Hájek et al.
[9]
Alternative measures of interestingness
Next to confidence also other measures of interestingness for rules were proposed. Some popular measures are:
• All-confidence
[10]
• Collective strength
[11]
• Conviction
[12]
• Leverage
[13]
• Lift (originally called interest)
[12]
A definition of these measures can be found here.[14] Several more measures are presented and compared by Tan et
al.[15] Looking for techniques that can model what the user already knows (and using these models as interestingness
measures) is currently an active research trend under the name of "subjective interestingness".
Statistically sound associations
One limitation of the standard approach to discovering associations is that by searching massive numbers of possible
associations to look for collections of items that appear to be associated, there is a large risk of finding many
spurious associations. These are collections of items that co-occur with unexpected frequency in the data, but only do
so by chance. For example, suppose we are considering a collection of 10,000 items and looking for rules containing
two items in the left-hand-side and 1 item in the right-hand-side. There are approximately 1,000,000,000,000 such
rules. If we apply a statistical test for independence with a significance level of 0.05 it means there is only a 5%
chance of accepting a rule if there is no association. If we assume there are no associations, we should nonetheless
expect to find 50,000,000,000 rules. Statistically sound association discovery[16][17] controls this risk, in most cases
reducing the risk of finding any spurious associations to a user-specified significance level.
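The arithmetic behind this example can be checked in a few lines of Python (a back-of-the-envelope sketch; it treats the two left-hand-side items as an ordered pair, which reproduces the rough figures quoted above, while counting unordered pairs would give about half as many candidate rules):

n_items = 10_000
alpha = 0.05  # significance level of each independence test

# Rules with two items on the LHS and one on the RHS, all items distinct.
n_rules = n_items * (n_items - 1) * (n_items - 2)   # roughly 1e12 candidates

# With no real associations, each test still "accepts" with probability alpha.
expected_spurious = alpha * n_rules                 # roughly 5e10

print(f"{n_rules:.2e} candidate rules")
print(f"{expected_spurious:.2e} expected spurious rules")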
Algorithms
Many algorithms for generating association rules have been presented over time.
Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are
algorithms for mining frequent itemsets. A further step is needed afterwards to generate rules from the frequent
itemsets found in a database.
Apriori algorithm
Apriori[6] is the best-known algorithm to mine association rules. It uses a breadth-first search strategy to count the
support of itemsets and uses a candidate generation function which exploits the downward-closure property of
support.
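A minimal level-wise, Apriori-style sketch in Python might look as follows (illustrative, not the published algorithm's exact pseudocode): candidates of size k are joined from frequent itemsets of size k − 1 and pruned unless all of their (k − 1)-subsets are frequent, which is the downward-closure property described above:

from itertools import combinations

def apriori(transactions, min_support):
    # Return {frozenset(itemset): support} for all frequent itemsets.
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    singletons = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: support(c) for c in singletons if support(c) >= min_support}
    level, k = set(frequent), 1
    while level:
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Support counting over the surviving candidates only.
        level = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in level})
    return frequent

transactions = [{"milk", "bread"}, {"butter"}, {"beer"},
                {"milk", "bread", "butter"}, {"bread"}]
print(apriori(transactions, min_support=0.4))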
Eclat algorithm
Eclat
[7]
is a depth-first search algorithm using set intersection.
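The depth-first, set-intersection idea behind Eclat can be sketched as follows (a simplified illustration with our own naming, not Zaki's original code): each item is mapped to the set of transaction ids that contain it (its tidset), and a prefix is extended by intersecting tidsets, recursing only while the intersection is still frequent:

def eclat(transactions, min_count):
    # Vertical layout: item -> set of transaction ids containing it.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def recurse(prefix, prefix_tids, candidates):
        for i, (item, tids) in enumerate(candidates):
            new_tids = prefix_tids & tids if prefix else tids
            if len(new_tids) >= min_count:
                itemset = prefix | {item}
                frequent[frozenset(itemset)] = len(new_tids)
                # Depth first: only later items may extend the current prefix.
                recurse(itemset, new_tids, candidates[i + 1:])

    recurse(set(), set(), sorted(tidsets.items()))
    return frequent

transactions = [{"milk", "bread"}, {"butter"}, {"beer"},
                {"milk", "bread", "butter"}, {"bread"}]
print(eclat(transactions, min_count=2))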
FP-growth algorithm
FP-growth (frequent pattern growth)
[18]
uses an extended prefix-tree (FP-tree) structure to store the database in a
compressed form. FP-growth adopts a divide-and-conquer approach to decompose both the mining tasks and the
databases. It uses a pattern fragment growth method to avoid the costly process of candidate generation and testing
used by Apriori.
GUHA procedure ASSOC
GUHA is a general method for exploratory data analysis that has theoretical foundations in observational calculi.
[19]
The ASSOC procedure[20] is a GUHA method which mines for generalized association rules using fast bitstring
operations. The association rules mined by this method are more general than those output by Apriori; for example,
"items" can be connected with both conjunctions and disjunctions, and the relation between the antecedent and
consequent of the rule is not restricted to minimum support and confidence as in Apriori: an arbitrary combination of
supported interest measures can be used.
One-attribute rule
The one-attribute rule, one-attribute-rule algorithm, or OneR, is an algorithm for finding association rules.
According to Ross, very simple association rules, involving just one attribute in the condition part, often work well
in practice with real-world data.
[21]
The idea of the OneR (one-attribute-rule) algorithm is to find the one attribute to
use to classify a novel datapoint that makes the fewest prediction errors.
For example, to classify a car you haven't seen before, you might apply the following rule: If Fast Then Sportscar, as
opposed to a rule with multiple attributes in the condition: If Fast And Softtop And Red Then Sportscar.
The algorithm is as follows:
For each attribute A:
For each value V of that attribute, create a rule:
1. count how often each class appears
2. find the most frequent class, c
3. make a rule "if A=V then C=c"
Calculate the error rate of this rule
Pick the attribute whose rules produce the lowest error rate
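A compact Python rendering of this pseudocode is given below (a sketch only; the toy car data and the helper name one_r are invented for illustration, and attribute values are assumed to be discrete):

from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    # Pick the attribute whose value -> majority-class rules make the fewest errors.
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in by_value.items()}
        errors = sum(row[target] != rule[row[attr]] for row in rows)
        if best is None or errors < best[1]:
            best = (attr, errors, rule)
    return best  # (attribute, error count, {value: predicted class})

cars = [
    {"fast": "yes", "softtop": "yes", "red": "yes", "class": "sportscar"},
    {"fast": "yes", "softtop": "no",  "red": "no",  "class": "sportscar"},
    {"fast": "no",  "softtop": "no",  "red": "yes", "class": "family"},
    {"fast": "no",  "softtop": "yes", "red": "no",  "class": "family"},
]
print(one_r(cars, attributes=["fast", "softtop", "red"], target="class"))
# ('fast', 0, {'yes': 'sportscar', 'no': 'family'})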
OPUS search
OPUS is an efficient algorithm for rule discovery that, in contrast to most alternatives, does not require either
monotone or anti-monotone constraints such as minimum support.
[22]
Initially used to find rules for a fixed
consequent
[22]

[23]
it has subsequently been extended to find rules with any item as a consequent.
[24]
OPUS search is
the core technology in the popular Magnum Opus
[25]
association discovery system.
Zero-attribute rule
The zero-attribute rule, or ZeroR, does not involve any attribute in the condition part, and always returns the most
frequent class in the training set. This algorithm is frequently used to measure the classification success of other
algorithms.
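In code, ZeroR amounts to a single decision (a toy sketch with invented data): predict the majority class of the training labels and use the resulting accuracy as the baseline against which other classifiers are judged:

from collections import Counter

def zero_r(labels):
    # Always predict the most frequent class seen in training.
    return Counter(labels).most_common(1)[0][0]

train_labels = ["spam", "ham", "ham", "ham", "spam"]
majority = zero_r(train_labels)
print(majority, train_labels.count(majority) / len(train_labels))  # ham 0.6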
Lore
A famous story about association rule mining is the "beer and diaper" story. A purported survey of behavior of
supermarket shoppers discovered that customers (presumably young men) who buy diapers tend also to buy beer.
This anecdote became popular as an example of how unexpected association rules might be found from everyday
data. There are varying opinions as to how much of the story is true.
[26]
Daniel Powers says
[26]
In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an
analysis of 1.2 million market baskets from about 25 Osco Drug stores. Database queries were
developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. that
consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship
by moving the products closer together on the shelves.
Other types of association mining
Contrast set learning is a form of associative learning. Contrast set learners use rules that differ meaningfully in
their distribution across subsets.
[27]
Weighted class learning is another form of associative learning in which weight may be assigned to classes to give
focus to a particular issue of concern for the consumer of the data mining results.
K-optimal pattern discovery provides an alternative to the standard approach to association rule learning that
requires that each pattern appear frequently in the data.
Mining frequent sequences uses support to find sequences in temporal data.
[28]
• Generalized Association Rules: hierarchical taxonomy (concept hierarchy)
• Quantitative Association Rules: categorical and quantitative data
• Interval Data Association Rules: e.g. partition ages into 5-year-increment ranges
• Maximal Association Rules
• Sequential Association Rules: temporal data, e.g. first buy a computer, then CD-ROMs, then a webcam.
References
[1] Piatetsky-Shapiro, G. (1991), Discovery, analysis, and presentation of strong rules, in G. Piatetsky-Shapiro & W. J. Frawley, eds, ‘Knowledge
Discovery in Databases’, AAAI/MIT Press, Cambridge, MA.
[2] R. Agrawal; T. Imielinski; A. Swami: Mining Association Rules Between Sets of Items in Large Databases", SIGMOD Conference 1993:
207-216
[3] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. Algorithms for association rule mining - A general survey and comparison.
SIGKDD Explorations, 2(2):1-58, 2000.
[4] Tan, Pang-Ning; Michael, Steinbach; Kumar, Vipin (2005). "Chapter 6. Association Analysis: Basic Concepts and Algorithms" (http:/ /
www-users.cs. umn. edu/ ~kumar/dmbook/ ch6. pdf). Introduction to Data Mining. Addison-Wesley. ISBN 0321321367. .
[5] Jian Pei, Jiawei Han, and Laks V.S. Lakshmanan. Mining frequent itemsets with convertible constraints. In Proceedings of the 17th
International Conference on Data Engineering, April 2–6, 2001, Heidelberg, Germany, pages 433-442, 2001.
[6] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Jorge B. Bocca, Matthias
Jarke, and Carlo Zaniolo, editors, Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499,
Santiago, Chile, September 1994.
[7] Mohammed J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390,
May/June 2000.
[8] Hajek P., Havel I., Chytil M.: The GUHA method of automatic hypotheses determination, Computing 1(1966) 293-308.
[9] Petr Hajek, Tomas Feglar, Jan Rauch, David Coufal. The GUHA method, data preprocessing and mining. Database Support for Data Mining
Applications, ISBN 978-3-540-22479-2, Springer, 2004
[10] Edward R. Omiecinski. Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data
Engineering, 15(1):57-69, Jan/Feb 2003.
[11] C. C. Aggarwal and P. S. Yu. A new framework for itemset generation. In PODS 98, Symposium on Principles of Database Systems, pages
18-24, Seattle, WA, USA, 1998.
[12] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data.
In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pages 255-264, Tucson, Arizona, USA,
May 1997.
[13] Piatetsky-Shapiro, G., Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 1991: p. 229-248.
[14] http:/ / michael.hahsler. net/ research/association_rules/ measures.html
[15] Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava. Selecting the right objective measure for association analysis. Information Systems,
29(4):293-313, 2004.
[16] Webb, G.I. (2007). Discovering Significant Patterns. Machine Learning 68(1). Netherlands: Springer, pages 1-33. online access (http://
springerlink.metapress. com/ content/ 4r35537x6vxg0523/?p=9291269dbfed4750a6e1d6e9bf6f3c13&pi=0)
[17] A. Gionis, H. Mannila, T. Mielikainen, and P. Tsaparas, Assessing Data Mining Results via Swap Randomization, ACM Transactions on
Knowledge Discovery from Data (TKDD), Volume 1 , Issue 3 (December 2007) Article No. 14.
[18] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns without candidate generation. Data Mining and Knowledge
Discovery 8:53-87, 2004.
[19] J. Rauch, Logical calculi for knowledge discovery in databases. Proceedings of the First European Symposium on Principles of Data Mining
and Knowledge Discovery, Springer, 1997, pgs. 47-57.
[20] Hájek, P.; Havránek P (1978). Mechanising Hypothesis Formation – Mathematical Foundations for a General Theory (http:/ / www.cs. cas.
cz/hajek/ guhabook/ ). Springer-Verlag. ISBN 0-7869-1850-8. .
[21] Ross, Peter. "OneR: the simplest method" (http:/ / www.dcs. napier.ac. uk/~peter/ vldb/ dm/ node8. html). .
[22] Webb, G. I. (1995). OPUS: An Efficient Admissible Algorithm For Unordered Search. Journal of Artificial Intelligence Research 3. Menlo
Park, CA: AAAI Press, pages 431-465 online access (http:/ / www.cs.washington. edu/ research/ jair/abstracts/ webb95a.html).
[23] Bayardo, R.J.; Agrawal, R.; Gunopulos, D. (2000). "Constraint-based rule mining in large, dense databases". Data Mining and Knowledge
Discovery 4 (2): 217–240. doi:10.1023/A:1009895914772.
[24] Webb, G. I. (2000). Efficient Search for Association Rules. In R. Ramakrishnan and S. Stolfo (Eds.), Proceedings of the Sixth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000) Boston, MA. New York: The Association for
Computing Machinery, pages 99-107. online access (http:/ / www.csse. monash.edu/ ~webb/ Files/ Webb00b. pdf)
[25] http:// www. giwebb. com
[26] http:/ / www. dssresources. com/ newsletters/ 66. php
[27] T. Menzies, Y. Hu, "Data Mining For Very Busy People." IEEE Computer, October 2003, pgs. 18-25.
[28] M. J. Zaki. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42, 31–60.
External links
Bibliographies
• Annotated Bibliography on Association Rules (http:/ / michael. hahsler. net/ research/ bib/ association_rules/ ) by
M. Hahsler
• Statsoft Electronic Statistics Textbook: Association Rules (http://www. statsoft. com/ textbook/
association-rules/)
Implementations
• KXEN, a commercial Data Mining software (http:// www.KXEN.com)
• Silverlight widget for live demonstration of association rule mining using Apriori algorithm (http:// codeding.
com/ ?article=13)
• RapidMiner, a free Java data mining software suite (Community Edition: GNU)
• Orange, a free data mining software suite, module orngAssoc (http:// www.ailab. si/ orange/doc/ modules/
orngAssoc. htm)
• Ruby implementation (AI4R) (http:// ai4r. rubyforge.org)
• arules (http:/ / cran.r-project.org/ package=arules), a package for mining association rules and frequent itemsets
with R.
• C. Borgelt's implementation of Apriori and Eclat (http:/ / www.borgelt.net/ fpm. html)
• Frequent Itemset Mining Implementations Repository (FIMI) (http:/ / fimi.cs. helsinki. fi/)
• Frequent pattern mining implementations from Bart Goethals (http:/ / adrem.ua. ac.be/ ~goethals/ software/)
• Weka (http:/ / www. cs. waikato. ac. nz/ ml/ weka/ ), a collection of machine learning algorithms for data mining
tasks written in Java.
• KNIME an open source workflow oriented data preprocessing and analysis platform.
• Data Mining Software by Mohammed J. Zaki (http:// www.cs. rpi.edu/ ~zaki/ software/ )
• Magnum Opus (http:/ / www. giwebb. com), a system for statistically sound association discovery.
• LISp Miner (http:// lispminer.vse. cz), Mines for generalized (GUHA) association rules. Uses bitstrings not
apriori algorithm.
• Ferda Dataminer (http:/ / ferda.sourceforge.net), An extensible visual data mining platform, implements GUHA
procedures ASSOC. Features multirelational data mining.
• STATISTICA (http:/ / www. statsoft. com), commercial statistics software with an Association Rules module.
• SPMF (http:/ / www. philippe-fournier-viger.com/ spmf/ ), Java implementations of more than twenty frequent
itemset, sequential pattern and association rule mining algorithms
Open Standards
• Association Rules in PMML (http:/ / www. dmg. org/v4-0/AssociationRules. html)
Atomicity (database systems)
In database systems, atomicity (or atomicness; from Gr. a-tomos, undividable) is one of the ACID transaction
properties. In an atomic transaction, a series of database operations either all occur, or nothing occurs. A guarantee
of atomicity prevents updates to the database occurring only partially, which can cause greater problems than
rejecting the whole series outright. In other words, atomicity means indivisibility and irreducibility.
[1]
The etymology of the phrase originates in the Classical Greek concept of a fundamental and indivisible component;
see atom.
An example of atomicity is ordering an airline ticket where two actions are required: payment, and a seat reservation.
The potential passenger must either:
1. both pay for and reserve a seat; OR
2. neither pay for nor reserve a seat.
The booking system does not consider it acceptable for a customer to pay for a ticket without securing the seat, nor
to reserve the seat without payment succeeding.
Orthogonality
Atomicity does not behave completely orthogonally with regard to the other ACID properties of the transactions. For
example, isolation relies on atomicity to roll back changes in the event of isolation failures such as deadlock;
consistency also relies on rollback in the event of a consistency-violation by an illegal transaction. Finally, atomicity
itself relies on durability to ensure the atomicity of transactions even in the face of external failures.
As a result of this, failure to detect errors and manually roll back the enclosing transaction may cause failures of
isolation and consistency.
Implementation
Typically, systems implement atomicity by providing some mechanism to indicate which transactions have started
and which finished; or by keeping a copy of the data before any changes occurred. Several filesystems have
developed methods for avoiding the need to keep multiple copies of data, using journaling (see journaling file
system). Databases usually implement this using some form of logging/journaling to track changes. The system
synchronizes the logs (often the metadata) as necessary once the actual changes have successfully taken place.
Afterwards, crash recovery simply ignores incomplete entries. Although implementations vary depending on factors
such as concurrency issues, the principle of atomicity, i.e. complete success or complete failure, remains.
Ultimately, any application-level implementation relies on operating-system functionality, which in turn makes use
of specialized hardware to guarantee that an operation remains non-interruptible: either by software attempting to
re-divert system resources (see pre-emptive multitasking) or by resource-unavailability (such as power-outages). For
example, POSIX-compliant systems provide the open(2) system call that allows applications to atomically open a
file. Other popular system-calls that may assist in achieving atomic operations from userspace include fcntl(2),
fdatasync(2), flock(2), fsync(2), mkdir(2), rasctl(2) (NetBSD re-startable sequences), rename(2), semop(2),
sem_post(2), and sem_wait(2).
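As a user-space illustration of the all-or-nothing idea (a sketch only, not how a database engine implements transactions), the common pattern of writing to a temporary file, calling fsync, and then renaming relies on rename(2) being atomic within a filesystem, so readers see either the old contents or the complete new contents, never a partial write:

import os
import tempfile

def atomic_write(path, data):
    # Replace `path` with `data` so readers never observe a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)   # same filesystem as target
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())    # push the new contents to stable storage
        os.replace(tmp_path, path)    # atomic rename(2) on POSIX systems
    except BaseException:
        os.unlink(tmp_path)           # "roll back" by discarding the temp copy
        raise

atomic_write("settings.json", b'{"theme": "dark"}')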
The hardware level requires atomic operations such as test-and-set (TAS), or atomic increment/decrement
operations. In their absence, or when necessary, raising the interrupt level to disable all possible interrupts (of
hardware and software origin) may serve to implement the atomic synchronization function primitives. Systems
often implement these low-level operations in machine language or in assembly language.
In NoSQL data stores with eventual consistency, atomicity is also more weakly specified than in relational database
systems, and typically holds only within a single row (i.e. within a column family).
[2]
References
[1] "atomic operation" (http:/ / www. webopedia. com/ TERM/ A/ atomic_operation.html). http:// www.webopedia.com/ : Webopedia. .
Retrieved 2011-03-23. "An operation during which a processor can simultaneously read a location and write it in the same bus operation. This
prevents any other processor or I/O device from writing or reading memory until the operation is complete."
[2] Olivier Mallassi (2010-06-09). "Let’s play with Cassandra… (Part 1/3)" (http:// blog.octo. com/ en/ nosql-lets-play-with-cassandra-part-13/
). http:// blog. octo.com/ en/ : OCTO Talks!. . Retrieved 2011-03-23. "Atomicity is also weaker than what we are used to in the relational
world. Cassandra guarantees atomicity within a ColumnFamily so for all the columns of a row."
Australian National Data Service
The Australian National Data Service (ANDS) was established in 2008 to help address the challenges of storing
and managing Australia's research data, and making it discoverable and accessible for validation and reuse.
Background
ANDS is funded by the Australian Government's National Collaborative Research Infrastructure Strategy (NCRIS)
as part of the Platforms for Collaboration Investment Plan
[1]
.
External links
• Official website
[2]
References
[1] "Platforms for Collaboration" (http:// ncris. innovation. gov. au/ Capabilities/ Pages/ PfC. aspx#ANDS). . Retrieved 2011-06-02.
[2] http:// www. ands. org.au
Automated Tiered Storage
Automated Tiered Storage is a product from Compellent Technologies. It stores information in a database in such
a way that information that is searched for more often is moved to a faster tier, producing faster search results.
Overview
The collection and sorting of data have become very important business tools for many companies; information about
customers, products, sales and research is needed to help a company stay competitive. In conjunction with the
growing amounts of information, data management systems have become more popular and in some cases a
necessity. One of these growing data management technologies is called Automated Tiered Storage.
The concept behind Automated Tiered Storage is that the information is classified using more efficient and
cheaper methods than a simple database. What makes Automated Tiered Storage cheaper and faster is the way the
information is stored: information that is part of more searches, or is used more frequently, is automatically
moved to a different "tier" than information that is seldom searched. This helps produce faster search results and can
reduce the costs of storing data.
The costs of Automated Tiered Storage can be lower than those of a regular database because of where the data is stored. The
tier that holds the most searched information and recent queries has the fastest rotation speed and a smaller
amount of disk space; as you move down the tiers, the rotation speed becomes lower and the storage space
becomes greater. This avoids having to keep a massive database entirely on the fastest storage available.
Costs of storing data can be reduced using Automated Tiered Storage because less time is needed to manage the
information, search results return more quickly and, most importantly, the information does not need to be organised into tiers manually;
it is done automatically. Tiers can range from mainframes and servers to hard copies of data (disks or data tapes)
depending on the importance and usage rate of the information.
Data Progression from Compellent
An example of Automated Tiered Storage is a feature called Data Progression from Compellent Technologies. Data
Progression has the capability to transparently move blocks of data between different drive types and RAID groups
such as RAID 10 and RAID 5. The blocks are part of the "same virtual volume even as they span different RAID
groups and drive types. Compellent can do this because they keep metadata about every block -- which allows them
to keep track of each block and its associations."
[1]
References
• Russ Taddiken – Senior Storage Architect (2006). Automating Data Movement Between Storage Tiers. Retrieved
from the UW Records Management Web site: http:/ / www.compellent. com/
[1] Tony Asaro, Computerworld. "Compellent-Intelligent Tiered Storage." (http:// blogs. computerworld.com/ compellent_ilm) January 19,
2009.
External links
• http:/ / www. compellent. com/ Products/ Software/Automated-Tiered-Storage.aspx
Automatic data processing
In telecommunication, the term automatic data processing (ADP) has the following meanings:
1. An interacting assembly of procedures, processes, methods, personnel, and equipment to perform automatically a
series of data processing operations on data. (The data processing operations may result in a change in the
semantic content of the data.)
2. Data processing by means of one or more devices that use common storage for all or part of a computer program,
and also for all or part of the data necessary for execution of the program; that execute user-written or
user-designated programs; that perform user-designated symbol manipulation, such as arithmetic operations, logic
operations, or character string manipulations; and that can execute programs that modify themselves during their
execution. Automatic data processing may be performed by a stand-alone unit or by several connected units.
3. Data processing largely performed by automatic means.
4. That branch of science and technology concerned with methods and techniques relating to data processing largely
performed by automatic means.
References
•  This article incorporates public domain material from websites or documents of the General Services
Administration (in support of MIL-STD-188).
•  This article incorporates public domain material from the United States Department of Defense document
"Dictionary of Military and Associated Terms".
Automatic data processing equipment
Automatic data processing equipment, legally defined, is any equipment or interconnected system or subsystems
of equipment that is used in the
• automatic acquisition,
• storage,
• manipulation,
• management,
• movement,
• control,
• display,
• switching,
• interchange,
• transmission, or
• reception,
of data or information
• by a Federal agency, or
• under a contract with a Federal agency
which
• requires the use of such equipment, or
• requires the performance of a service or the furnishing of a product which is performed or produced making
significant use of such equipment.
The term includes
• computer,
• ancillary equipment,
• software,
• firmware, and similar procedures,
• services, including support services, and
• related resources as defined by regulations issued by the Administrator for General Services.
References
• Public Law 99-500, Title VII, Sec. 822 (a) Section 111(a) of the Federal Property and Administrative Services
Act of 1949 (40 U.S.C. § 759
[1]
(a)) revised.
 This article incorporates public domain material from websites or documents of the General Services
Administration (in support of MIL-STD-188).
References
[1] http:/ / www. law. cornell.edu/ uscode/ 40/ 759. html
BBC Archives
The BBC Archives are collections documenting the BBC's broadcasting history.
Overview
The archives contain 1 million hours of media (audio and audio/visual) material dating back to the 1890s, with early
material on wax cylinder.
[1]
With other materials such as photos and written documents the archive contains 11
million items.
[1]
The BBC is in the process of digitising the entire archive; as of summer 2010 they have spent
approximately ten years digitising half of the media content and due to improving work practices expect to complete
the other half in five years.
[1]
The BBC estimates that the 11 million items will comprise approximately 52 petabytes
of information.
[1]
Typically, one programme minute for video requires 1.4 gigabytes of storage.
[1]
The BBC uses the Material eXchange Format (MXF) which is an uncompressed, non-proprietary format which the
BBC has been publicising in order to mitigate the threat of the format becoming obsolete (as digital formats can and
do).
[1]
Undigitised, the archive takes up 66 miles of shelving on which are held at least 15 video formats, two different
gauges of film and 11 formats on which radio recordings are stored.
[1]
The stock is managed using bar codes which
help to locate material on the shelves and also track material that has been lent out.
[1]
The storage environment is
controlled for temperature and humidity, different for audio than for video.
[1]
The BBC says that the budget for managing, protecting and digitising the archive accounts for only a small part of
the BBC's overall spend.
[1]
The archives were relaunched online in 2008 and have released new historical material regularly since then. The
BBC works in partnership with the British Film Institute (BFI), The National Archives and other partners in working
with and using the materials.
[1]
A related project called "Genome" is expected to complete in 2011 and will make programme listings (not the media
itself) dating back to 1923, sourced from The Radio Times, available to search online.
[1]
Written Archives Centre
The BBC Written Archives Centre is part of the BBC Archives situated in Caversham, Berkshire, a suburb of
Reading in England.
The Centre holds the written records of the British Broadcasting Corporation, dating from 1922 to the present day.
The current guidelines restrict access to post-1980 production files, although some later documents (such as scripts
and Programme as Broadcast records) may be released. It is open to writers and academic researchers in higher
education by appointment only. The Centre has also contributed documents for many major documentaries on radio
and television.
Creative Archive Licence
The BBC together with the British Film Institute, the Open University, Channel 4 and Teachers' TV formed a
collaboration, the Creative Archive Licence Group, to create a copyright licence for the re-release of archived
material.
The Licence was a trial launched in 2005 and was notable for the re-release of part of the BBC's news and natural
history archives for creative use by the public. While artists and teachers are encouraged to use the
content to create works of their own, the terms of the licence are restrictive compared to other copyleft licences. Use
of Creative Archive content for commercial, "endorsement, campaigning, defamatory or derogatory purposes"
[2]
is
forbidden, any derivative works must be released under the same licence, and content may only be used within the
UK.
Works released by the BBC under the licence were a part of a trial service that has now been withdrawn for review
by the BBC Trust under the Public Value testing process. The Creative Archive trial ended in 2006.
Archives Treasure Hunt
The BBC launched the BBC Archive Treasure Hunt as a public appeal to recover pre-1980s lost BBC radio and
television productions.
[3]
Material was lost due to wiping, copyright issues and technological reasons.
[4]

[5]
Productions recovered
As of September, 2009, more than one hundred productions have been recovered including:
[6]
Television
• The Men from the Ministry
• Something To Shout About
• Man and Superman
• The Doctor's Dilemma
• I'm Sorry, I'll Read That Again
• Hancock's Half Hour
• I'm Sorry, I Haven't A Clue
• The Ronnie Corbett Thing
Audio recording sessions
• Elton John
• Ringo Starr
• Paul Simon
[7]

[8]
Various
• Peter Sellers received from the Peter Sellers Estate Collection.
List of BBC TV Series Affected by Wiping
• Abigail and Roger - All 9 episodes missing
• All Gas and Gaiters - 22 episodes missing
• The Artful Dodger
• B-And-B
• BBC-3
• Bachelor Father
• The Bed-Sit Girl - all 12 episodes missing
• Beggar My Neighbour - 17 episodes missing
• Broaden Your Mind
• Citizen James - 24 episodes missing
• Comedy Playhouse
• Compact - 369 episodes missing
• Dad's Army - 3 episodes missing, plus 2 sketches
• Dixon of Dock Green - 381 episodes missing
• Doctor Who - 108 episodes missing
• Faces of Jim
• The Frost Report
• The Gnomes of Dulwich - all 6 episodes missing
• The Goodies (TV series) - 1 episode missing
• Hancock's Half Hour
• His Lordship Entertains - 6 episodes missing
• Hugh and I - 62 episodes missing
• Hugh and I Spy - all 6 episodes missing
• It Ain't Half Hot Mum - 2 episodes "lost at 1 stage after first broadcast & rerun"
• The Likely Lads - 12 episodes missing
• The Liver Birds - 4 episodes missing
• Marriage Lines
• Me Mammy - 13 episodes missing
• Meet the Wife - 22 episodes missing
• Not Only... But Also
• Not in Front of the Children - 30 episodes missing
• Now Take My Wife - 2 episodes missing
• Oh, Brother - 11 episodes missing
• Pinwright's Progress - all 10 episodes missing
• Play for Today - 13 episodes
• Q
• The Rag Trade - 15 episodes missing
• Softly, Softly
• Son of the Bride - all 6 episodes missing
• Sykes and A... - 34 episodes missing
• Sykes and a Big, Big Show - 4 episodes missing
• Till Death Us Do Part
• United! - all 147 episodes missing
• The Wednesday Play - 119 episodes missing
• Whack-O!
• Wild, Wild Women - 6 episodes missing
• Z-Cars
Voices from the archives
Voices from the Archives is a BBC website providing free access to audio interviews with authors, artists, actors,
architects, broadcasters, cartoonists, composers, dancers, filmmakers, musicians, painters, philosophers,
photographers, playwrights, poets, political activists, religious thinkers, scientists, sculptors, sportspeople and writers.
References
[1] Kiss, Jemima (2010-08-18). "In The BBC Archive" (http:/ / www. guardian.co. uk/ technology/ blog/ audio/ 2010/ aug/ 18/
bbc-archive-roly-keating-windmill-road). Tech Weekly (London: Guardian News & Media Ltd). . Retrieved 21 August 2010.
[2] BBC Creative Archive Licence pilot (http:/ / www.bbc.co.uk/ calc/ news/ rules. shtml) BBC Online
[3] "BBC Online - Cult - Treasure Hunt - About the Campaign" (http:// www.bbc.co.uk/ cult/ treasurehunt/ about/ about. shtml). Bbc.co.uk. .
Retrieved 2010-07-30.
[4] "BBC Online - Cult - Treasure Hunt - About the Campaign" (http:// www.bbc.co.uk/ cult/ treasurehunt/ about/ lost. shtml). Bbc.co.uk. .
Retrieved 2010-07-30.
[5] Stuart Douglas - www.thiswaydown.org (1965-07-07). "missing episodes articles" (http:/ / www. btinternet.com/ ~m. brown1/bbchunt.htm).
Btinternet.com. . Retrieved 2010-07-30.
[6] "No 4 2001 - Missing Believed Wiped" (http:/ / fiatifta.org/aboutfiat/ news/ old/ 2001/ 2001-04/ 03. light.html). Fiat/Ifta. . Retrieved
2010-07-30.
[7] "BBC Online - Cult - Treasure Hunt - List of Finds" (http:/ / www.bbc.co.uk/ cult/ treasurehunt/about/ listoffinds.shtml). Bbc.co.uk. .
Retrieved 2010-07-30.
[8] "'hunt' Unearths Bbc Treasures From Radio, Tv | Business solutions from" (http:// www.allbusiness. com/ services/ motion-pictures/
4848337-1. html). AllBusiness.com. 2001-11-09. . Retrieved 2010-07-30.
External links
• BBC Archive (http:/ / www. bbc. co. uk/ archive)
• Treasure Hunt website (http:// www. bbc. co. uk/ cult/ treasurehunt/index.shtml)
• List of recovered productions (http:// www.bbc. co.uk/ cult/ treasurehunt/about/ listoffinds. shtml)
• British TV Missing Episodes Index (http:/ / www. missing-episodes. com/ )
• Wiped News.Com - A news and features website devoted to missing TV, Film & Radio (http:// www.
wipednews.com/ )
• BBC archiving policy (http:/ / www. bbc. co. uk/ guidelines/ dq/ contents/ archives. shtml)
• Creative Archive Licence Group (http:// creativearchive.bbc.co. uk/ )
• Open University - Creative Archive (http:/ / www.open2.net/ creativearchive/index.html)
• BBC Four - Audio Interviews (http:/ / www. bbc. co.uk/ bbcfour/audiointerviews) at BBC Online
• BBC Written Archives Centre site (http:/ / www.bbc.co. uk/ archive/written.shtml)
• Tech Weekly podcast: In the BBC archives (http:/ / www.guardian.co. uk/ technology/ blog/ audio/ 2010/ aug/
18/ bbc-archive-roly-keating-windmill-road) from The Guardian website.
• This podcast is entirely devoted to the BBC archive and includes interviews with archive staff.
Bitmap index
A bitmap index is a special kind of database index that uses bitmaps.
Bitmap indexes have traditionally been considered to work well for data such as gender, which has a small number
of distinct values, for example male and female, but many occurrences of those values. This would happen if, for
example, you had gender data for each resident in a city. Bitmap indexes have a significant space and performance
advantage over other structures for such data. Some researchers argue that bitmap indexes are also useful for
unique-valued data which is not updated frequently.
[1]
Bitmap indexes use bit arrays (commonly called bitmaps) and answer
queries by performing bitwise logical operations on these bitmaps.
Bitmap indexes are also useful in data warehousing applications for joining a large fact table to smaller dimension
tables
[2]
such as those arranged in a star schema.
Example
Continuing the gender example, a bitmap index may be logically viewed as follows:
Identifier   Gender        Bitmaps
                           F    M
1            Female        1    0
2            Male          0    1
3            Male          0    1
4            Unspecified   0    0
5            Female        1    0
On the left, identifier refers to the unique number assigned to each customer, gender is the data to be indexed, the
content of the bitmap index is shown as two columns under the heading bitmaps. Each column in the above
illustration is a bitmap in the bitmap index. In this case, there are two such bitmaps, one for gender Female and one
for gender Male. It is easy to see that each bit in bitmap M shows whether a particular row refers to a male. This is
the simplest form of bitmap index. Most columns will have more distinct values. For example, the sales amount is
likely to have a much larger number of distinct values. Variations on the bitmap index can effectively index this data
as well. We briefly review three such variations.
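The example can be expressed directly with Python integers used as bit arrays (an illustrative sketch, not any particular database's implementation): one bitmap per distinct value, with queries answered by bitwise AND, OR and NOT:

rows = ["Female", "Male", "Male", "Unspecified", "Female"]

# One bitmap per distinct value; bit i is set if row i+1 has that value.
bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def row_ids(bitmap):
    # Decode a bitmap back into 1-based row identifiers.
    return [i + 1 for i in range(len(rows)) if bitmap >> i & 1]

everything = (1 << len(rows)) - 1
print(row_ids(bitmaps["Female"]))                                   # [1, 5]
print(row_ids(bitmaps["Female"] | bitmaps["Male"]))                 # [1, 2, 3, 5]
print(row_ids(everything & ~(bitmaps["Female"] | bitmaps["Male"]))) # [4]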
Note: many of the references cited here are reviewed at [3]. For those who might be interested in experimenting with
some of the ideas mentioned here, many of them are implemented in open source software such as FastBit,[4] the
Lemur Bitmap Index C++ Library,[5] the Apache Hive Data Warehouse system and LucidDB.
Compression
Software can compress each bitmap in a bitmap index to save space. There has been a considerable amount of work on
this subject.[6][7] Bitmap compression algorithms typically employ run-length encoding, such as the Byte-aligned
Bitmap Code,[8] the Word-Aligned Hybrid code,[9] the Position List Word Aligned Hybrid,[10] the Compressed
Adaptive Index (COMPAX)[11] and the COmpressed 'N' Composable Integer SEt.[12][13] These compression
methods require very little effort to compress and decompress. More importantly, bitmaps compressed with BBC,
WAH, COMPAX, PLWAH and CONCISE can directly participate in bitwise operations without decompression.
This gives them considerable advantages over generic compression techniques such as LZ77. BBC compression and
its derivatives are used in a commercial database management system. BBC is effective in both reducing index sizes
and maintaining query performance. BBC encodes the bitmaps in bytes, while WAH encodes in words, better
matching current CPUs. "On both synthetic data and real application data, the new word aligned schemes use only
50% more space, but perform logical operations on compressed data 12 times faster than BBC."[14] PLWAH bitmaps
were reported to take 50% of the storage space consumed by WAH bitmaps and offer up to 20% faster performance
on logical operations.[10] Similar considerations apply to CONCISE.[13]
The performance of schemes such as BBC, WAH, PLWAH, COMPAX and CONCISE is dependent on the order of
the rows. A simple lexicographical sort can divide the index size by 9 and make indexes several times faster.
[15]
The
larger the table, the more important it is to sort the rows. Reshuffling techniques have also been proposed to achieve
the same results as sorting when indexing streaming data.[16]
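A toy run-length encoder gives the flavour of these schemes (purely illustrative; real codes such as BBC and WAH align runs to bytes or machine words and can perform logical operations directly on the compressed form):

from itertools import groupby

def rle_encode(bits):
    # Compress a bit string into (bit, run_length) pairs.
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(bits)]

def rle_decode(pairs):
    return "".join(bit * length for bit, length in pairs)

bitmap = "0" * 60 + "1" * 3 + "0" * 30
encoded = rle_encode(bitmap)
print(encoded)                          # [('0', 60), ('1', 3), ('0', 30)]
assert rle_decode(encoded) == bitmap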
Encoding
Basic bitmap indexes use one bitmap for each distinct value. It is possible to reduce the number of bitmaps used by
using a different encoding method.[17][18] For example, it is possible to encode C distinct values using ⌈log₂ C⌉
bitmaps with binary encoding.[19]
This reduces the number of bitmaps, further saving space, but to answer any query, most of the bitmaps have to be
accessed. This makes it potentially not as effective as scanning a vertical projection of the base data, also known as a
materialized view or projection index. Finding the optimal encoding method that balances (arbitrary) query
performance, index size and index maintenance remains a challenge.
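Binary (bit-sliced) encoding can be sketched as follows (illustrative only): each of the C distinct values gets an integer code, bitmap j records bit j of that code, so about log2(C) bitmaps suffice, but an equality query has to combine all of them:

from math import ceil, log2

rows = ["red", "green", "blue", "green", "red", "blue", "blue"]
codes = {v: i for i, v in enumerate(sorted(set(rows)))}   # C distinct values
n_slices = max(1, ceil(log2(len(codes))))                 # ceil(log2 C) bitmaps

# slice_bitmaps[j] has bit i set iff bit j of row i's code is 1.
slice_bitmaps = [0] * n_slices
for i, value in enumerate(rows):
    for j in range(n_slices):
        if codes[value] >> j & 1:
            slice_bitmaps[j] |= 1 << i

def equality_query(value):
    # Every bit slice must agree with the corresponding bit of the value's code.
    result = (1 << len(rows)) - 1
    for j in range(n_slices):
        if codes[value] >> j & 1:
            result &= slice_bitmaps[j]
        else:
            result &= ~slice_bitmaps[j]
    return [i for i in range(len(rows)) if result >> i & 1]

print(equality_query("green"))   # [1, 3]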
Without considering compression, Chan and Ioannidis analyzed a class of multi-component encoding methods and
came to the conclusion that two-component encoding sits at the kink of the performance vs. index size curve and
therefore represents the best trade-off between index size and query performance.
Binning
For high-cardinality columns, it is useful to bin the values, where each bin covers multiple values and build the
bitmaps to represent the values in each bin. This approach reduces the number of bitmaps used regardless of
encoding method.
[20]
However, binned indexes can only answer some queries without examining the base data. For
example, if a bin covers the range from 0.1 to 0.2, then when the user asks for all values less than 0.15, all rows that
fall in the bin are possible hits and have to be checked to verify whether they are actually less than 0.15. The process
of checking the base data is known as the candidate check. In most cases, the time used by the candidate check is
significantly longer than the time needed to work with the bitmap index. Therefore, binned indexes exhibit irregular
performance. They can be very fast for some queries, but much slower if the query does not exactly match a bin.
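The candidate check can be sketched as follows (an illustrative toy, not any engine's actual implementation): bins that lie entirely inside the query range contribute hits directly, while rows in the partially overlapping edge bin must be re-checked against the base data:

values = [0.02, 0.11, 0.13, 0.17, 0.19, 0.26, 0.31]   # the base data column
bin_width = 0.1                                        # bin b covers [b*0.1, (b+1)*0.1)

bins = {}
for row_id, v in enumerate(values):
    bins.setdefault(int(v // bin_width), []).append(row_id)

def less_than(threshold):
    edge_bin = int(threshold // bin_width)
    hits = [r for b, members in bins.items() if b < edge_bin for r in members]
    candidates = bins.get(edge_bin, [])
    # Candidate check: only the edge bin's rows need a look at the base data.
    hits += [r for r in candidates if values[r] < threshold]
    return sorted(hits), candidates

hits, checked = less_than(0.15)
print(hits)     # [0, 1, 2]    -> rows with value < 0.15
print(checked)  # [1, 2, 3, 4] -> edge-bin rows that required the candidate check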
History
The concept of bitmap index was first introduced by Professor Israel Spiegler and Rafi Maayan in their research
"Storage and Retrieval Considerations of Binary Data Bases", published in 1985.
[21]
The first commercial database
product to implement a bitmap index was Computer Corporation of America's Model 204. Patrick O'Neil
implemented the bitmap index around 1987.
[22]
This implementation is a hybrid between the basic bitmap index
(without compression) and the list of Row Identifiers (RID-list). Overall, the index is organized as a B+tree. When
the column cardinality is low, each leaf node of the B-tree would contain a long list of RIDs. In this case, it requires
less space to represent the RID-lists as bitmaps. Since each bitmap represents one distinct value, this is the basic
bitmap index. As the column cardinality increases, each bitmap becomes sparse and it may take more disk space to
store the bitmaps than to store the same content as RID-lists. In this case, it switches to use the RID-lists, which
makes it a B+tree index.
[23]

[24]
In-memory bitmaps
One of the strongest reasons for using bitmap indexes is that the intermediate results produced from them are also
bitmaps and can be efficiently reused in further operations to answer more complex queries. Many programming
languages support this as a bit array data structure. For example, Java has the BitSet class.
Some database systems that do not offer persistent bitmap indexes use bitmaps internally to speed up query
processing. For example, PostgreSQL versions 8.1 and later implement a "bitmap index scan" optimization to speed
up arbitrarily complex logical operations between available indexes on a single table.
For tables with many columns, the total number of distinct indexes needed to satisfy all possible queries (with equality
filtering conditions on any subset of the fields) grows combinatorially with the number of columns.[25][26]
A bitmap index scan combines expressions on different indexes, thus requiring only one index per column to support
all possible queries on a table.
Applying this access strategy to B-tree indexes can also combine range queries on multiple columns. In this
approach, a temporary in-memory bitmap is created with one bit for each row in the table (1 MiB can thus store over
8 million entries). Next, the results from each index are combined into the bitmap using bitwise operations. After all
conditions are evaluated, the bitmap contains a "1" for rows that matched the expression. Finally, the bitmap is
traversed and matching rows are retrieved. In addition to efficiently combining indexes, this also improves locality
of reference of table accesses, because all rows are fetched sequentially from the main table.
[27]
The internal bitmap
is discarded after the query. If there are too many rows in the table to use 1 bit per row, a "lossy" bitmap is created
instead, with a single bit per disk page. In this case, the bitmap is just used to determine which pages to fetch; the
filter criteria are then applied to all rows in matching pages.
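The combining step can be sketched like this (a simplified illustration of the general idea, not PostgreSQL's executor code): each single-column index probe yields a bitmap of matching row numbers, the per-condition bitmaps are ANDed together, and the surviving rows are then fetched in physical order:

table = [
    {"colour": "red",  "size": "L"},
    {"colour": "blue", "size": "S"},
    {"colour": "red",  "size": "S"},
    {"colour": "red",  "size": "L"},
    {"colour": "blue", "size": "L"},
]

def index_scan(column, value):
    # Stand-in for a per-column index probe: a bitmap of matching row numbers.
    bitmap = 0
    for i, row in enumerate(table):
        if row[column] == value:
            bitmap |= 1 << i
    return bitmap

# WHERE colour = 'red' AND size = 'L': AND the two per-index bitmaps ...
combined = index_scan("colour", "red") & index_scan("size", "L")

# ... then fetch matching rows sequentially, in physical (heap) order.
for i in range(len(table)):
    if combined >> i & 1:
        print(i, table[i])    # rows 0 and 3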
References
Notes
[1] Bitmap Index vs. B-tree Index: Which and When? (http:// www.oracle. com/technetwork/ articles/ sharma-indexes-093638.html), Vivek
Sharma, Oracle Technical Network.
[2] http:/ / www. dwoptimize. com/ 2007/ 06/ 101010-answer-to-life-universe-and.html
[3] John Wu (2007). "Annotated References on Bitmap Index" (http:/ / www.cs. umn. edu/ ~kewu/ annotated.html). .
[4] FastBit (http:/ / codeforge.lbl. gov/ projects/ fastbit/ )
[5] Lemur Bitmap Index C++ Library (http:/ / code. google. com/ p/ lemurbitmapindex/ )
[6] T. Johnson (1999). "Performance Measurements of Compressed Bitmap Indices" (http:/ / www. vldb.org/ conf/1999/ P29. pdf). In Malcolm
P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie. VLDB'99, Proceedings of 25th International
Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann. pp. 278–89.
ISBN 1-55860-615-7. .
[7] Wu K, Otoo E, Shoshani A (March 5, 2004). "On the performance of bitmap indices for high cardinality attributes" (http:// www. osti. gov/
energycitations/ servlets/ purl/822860-LOzkmz/native/ 822860. pdf). .
[8] Byte aligned data compression (http:// www.google. com/ patents?vid=5363098)
[9] Word aligned bitmap compression method, data structure, and apparatus (http:// www.google.com/ patents?vid=6831575)
[10] Deliège F, Pedersen TB (2010). "Position list word aligned hybrid: optimizing space and performance for compressed bitmaps" (http://
alpha.uhasselt. be/ icdt/ edbticdt2010proc/ edbt/ papers/ p0228-Deliege.pdf). In Ioana Manolescu, Stefano Spaccapietra, Jens Teubner,
Masaru Kitsuregawa, Alain Leger, Felix Naumann, Anastasia Ailamaki, and Fatma Ozcan. EDBT '10, Proceedings of the 13th International
Conference on Extending Database Technology. New York, NY, USA: ACM. pp. 228–39. doi:10.1145/1739041.1739071.
ISBN 978-1-60558-945-9. .
[11] F. Fusco, M. Stoecklin, M. Vlachos (September 2010). "NET-FLi: on-the-fly compression, archiving and indexing of streaming network
traffic" (http:// www.comp. nus. edu. sg/ ~vldb2010/ proceedings/ files/ papers/ I01. pdf). Proc. VLDB Endow 3 (1–2): 1382–93. .
[12] Concise: Compressed 'n' Composable Integer Set (http:/ / ricerca.mat. uniroma3.it/ users/ colanton/ concise.html)
[13] Colantonio A, Di Pietro R (31 July 2010). "Concise: Compressed 'n' Composable Integer Set" (http:// ricerca.mat. uniroma3.it/ users/
colanton/docs/ concise. pdf). Information Processing Letters 110 (16): 644–50. doi:10.1016/j.ipl.2010.05.018. .
[14] Wu K, Otoo EJ, Shoshani A (2001). "A Performance comparison of bitmap indexes" (http:// crd.lbl.gov/ ~kewu/ ps/ LBNL-48975.pdf). In
Henrique Paques, Ling Liu, and David Grossman. CIKM '01 Proceedings of the tenth international conference on Information and knowledge
management. New York, NY, USA: ACM. pp. 559–61. doi:10.1145/502585.502689. ISBN 1-58113-436-3. .
[15] D. Lemire, O. Kaser, K. Aouiche (January 2010). "Sorting improves word-aligned bitmap indexes". Data & Knowledge Engineering 69 (1):
3–28. arXiv:0901.3751. doi:10.1016/j.datak.2009.08.006.
[16] F. Fusco, M. Stoecklin, M. Vlachos (September 2010). "NET-FLi: on-the-fly compression, archiving and indexing of streaming network
traffic" (http:/ / www.comp. nus. edu. sg/ ~vldb2010/ proceedings/ files/ papers/ I01. pdf). Proc. VLDB Endow 3 (1–2): 1382–93. .
[17] C.-Y. Chan and Y. E. Ioannidis (1998). "Bitmap index design and evaluation" (http:/ / www.comp. nus. edu.sg/ ~chancy/ sigmod98. pdf).
In Ashutosh Tiwary, Michael Franklin. Proceedings of the 1998 ACM SIGMOD international conference on Management of data (SIGMOD
'98). New York, NY, USA: ACM. pp. 355–6. doi:10.1145/276304.276336. .
[18] C.-Y. Chan and Y. E. Ioannidis (1999). "An efficient bitmap encoding scheme for selection queries" (http:// www.ist. temple.edu/
~vucetic/cis616spring2005/ papers/ P4 p215-chan.pdf). Proceedings of the 1999 ACM SIGMOD international conference on Management of
data (SIGMOD '99). New York, NY, USA: ACM. pp. 215–26. doi:10.1145/304182.304201. .
[19] P. E. O'Neil and D. Quass (1997). "Improved Query Performance with Variant Indexes". In Joan M. Peckman, Sudha Ram, Michael
Franklin. Proceedings of the 1997 ACM SIGMOD international conference on Management of data (SIGMOD '97). New York, NY, USA:
ACM. pp. 38–49. doi:10.1145/253260.253268.
[20] N. Koudas (2000). "Space efficient bitmap indexing". Proceedings of the ninth international conference on Information and knowledge
management (CIKM '00). New York, NY, USA: ACM. pp. 194–201. doi:10.1145/354756.354819.
[21] Spiegler I; Maayan R (1985). "Storage and retrieval considerations of binary data bases". Information Processing and Management: an
International Journal 21 (3): 233–54. doi:10.1016/0306-4573(85)90108-6.
[22] O'Neil, Patrick (1987). "Model 204 Architecture and Performance". In Dieter Gawlick, Mark N. Haynie, and Andreas Reuter (Eds.).
Proceedings of the 2nd International Workshop on High Performance Transaction Systems. London, UK: Springer-Verlag. pp. 40–59.
[23] D. Rinfret, P. O'Neil and E. O'Neil (2001). "Bit-sliced index arithmetic". In Timos Sellis (Ed.). New York, NY, USA: ACM. pp. 47–57.
doi:10.1145/375663.375669.
[24] E. O'Neil, P. O'Neil, K. Wu (2007). "Bitmap Index Design Choices and Their Performance Implications" (http:// crd.lbl.gov/ ~kewu/ ps/
LBNL-62756.pdf). 11th International Database Engineering and Applications Symposium (IDEAS 2007). pp. 72–84.
doi:10.1109/IDEAS.2007.19. ISBN 0-7695-2947-X. .
[25] Alex Bolenok (2009-05-09). "Creating indexes" (http:// explainextended.com/ 2009/ 05/ 09/ creating-indexes/). .
[26] Egor Timoshenko. "On minimal collections of indexes" (http:/ / explainextended.com/ files/ index-en.pdf). .
[27] Tom Lane (2005-12-26). "Re: Bitmap indexes etc." (http:// archives.postgresql. org/pgsql-performance/2005-12/msg00623. php).
PostgreSQL mailing lists. . Retrieved 2007-04-06.
Bibliography
• O'Connell, S. (2005). Advanced Databases Course Notes. Southampton: University of Southampton
• O'Neil, P.; O'Neil, E. (2001). Database Principles, Programming, and Performance. San Francisco: Morgan
Kaufmann Publishers
• Zaker, M.; Phon-Amnuaisuk, S.; Haw, S.C. (2008). "An Adequate Design for Large Data Warehouse Systems:
Bitmap index versus B-tree index" (http:/ / www.universitypress. org.uk/ journals/ cc/ cc-21.pdf). International
Journal of Computers and Communications 2 (2). Retrieved 2010-01-07
British Oceanographic Data Centre
British Oceanographic Data Centre
Formation: 1969
Location: Liverpool, UK, L3 5DA
Director: Dr Juan Brown
Parent organization: Natural Environment Research Council (NERC)
Website: www.bodc.ac.uk [1]
The British Oceanographic Data Centre (BODC) is a national facility for looking after and distributing data about
the marine environment. BODC deal with a range of physical, chemical and biological data, which help scientists
provide answers to both local questions (such as the likelihood of coastal flooding) and global issues (such as the
impact of climate change). BODC is the designated marine science data centre for the UK’s Natural Environment
Research Council (NERC). The centre provides a resource for science, education and industry, as well as the general
public. BODC is hosted by the National Oceanography Centre (NOC), Liverpool.
Approach and goals
BODC's approach to marine data management involves:
• Working alongside scientists during marine research projects
• Distributing data to scientists, education, industry and the public and improving online access to marine data
• Careful storage, quality control and archiving of data so they are unaffected by changes in technology and will be
available into the future
• Producing innovative marine data products and digital atlases
Bidston Observatory, home of BODC from 1975 to 2004.
History
The origins of BODC go back to 1969 when NERC
created the British Oceanographic Data Service
(BODS). Located at the National Institute of
Oceanography, Wormley in Surrey, its purpose was to:
• Act as the UK’s National Oceanographic Data
Centre
• Participate in the international exchange of data as
part of the Intergovernmental Oceanographic
Commission (IOC) network of national data centres
Joseph Proudman Building, Liverpool.
In 1975 BODS was transferred to Bidston Observatory
on the Wirral, near Liverpool, as part of the newly
formed Institute of Oceanographic Sciences. The
following year BODS became the Marine Information
and Advisory Service (MIAS)[2]. Its primary activity
was to manage the data collected from weather ships,
oil rigs and data buoys. The data banking component of
MIAS was restructured to form BODC in April 1989.
Its mission was to 'operate as a world-class data centre
in support of UK marine science'. BODC pioneered a
start to finish approach to marine data management.
This involved:
• Assisting in the collection of data at sea
• Quality control of data
• Assembling the data for use by the scientists
• The publication of data sets on CDROM
In December 2004, BODC moved to the purpose built Joseph Proudman Building on the campus of the University of
Liverpool.
National role
BODC current meter data holdings from around the UK.
BODC has a range of national roles and
responsibilities:
• BODC performs data management for a variety of
UK marine projects
• BODC maintains and develops the National
Oceanographic Database (NODB), a collection of
marine data sets originating mainly from UK
research establishments
• BODC manages, checks and archives data from tide
gauges around the UK coast for the National Tide
Gauge Network, which aims to obtain high quality
tidal information and to provide warning of possible
flooding of coastal areas around the British Isles.
This is part of the National Tidal & Sea Level
Facility (NTSLF)
• BODC hosts the Marine Environmental Data and
Information Network (MEDIN)[3] Project Manager
• BODC is one of six designated data centres that manage NERC's environmental data
• BODC has forged active partnerships with the following NERC marine research centres:
• British Antarctic Survey (BAS)
• National Oceanography Centre (NOC), Liverpool, formerly Proudman Oceanographic Laboratory (POL)
• National Oceanography Centre (NOC), Southampton
• Plymouth Marine Laboratory (PML)
• Scottish Association for Marine Science (SAMS)
• Sea Mammal Research Unit (SMRU)
International role
BODC also has a range of international roles and responsibilities, for example:
• BODC is one of over 60 national oceanographic data centres that form part of the IOC’s International
Oceanographic Data and Information Exchange (IODE), helping to ensure that marine data are used efficiently
and are distributed to the widest possible audience
• BODC plays an active role in the International Council for the Exploration of the Sea (ICES) Marine Data
Management, allowing the exchange of expertise and ideas for those in ICES member countries
• BODC is the international centre for creating, maintaining and publishing the General Bathymetric Chart of the
Oceans (GEBCO) Digital Atlas
Projects and initiatives
BODC is and has been involved with a wide range of national and international projects, including:
Servicing of a RAPID mooring.
• Atlantic Meridional Transect (AMT)
The AMT programme [4] undertook a twice
yearly transect between the UK and the Falkland
Islands to study the factors determining the
ecological and biogeochemical variability in the
planktonic ecosystems.
• Autosub Under Ice (AUI)
The AUI programme [5] investigated the role of
sub-ice shelf processes in the climate system.
The marine environment beneath floating ice
shelves will be explored using Autosub
• Enabling Parameter Discovery (EnParDis)
EnParDis [6] was a project to develop BODC's
Parameter Dictionary. Parameter dictionaries are used to label data with a standard description and are crucial
when searching out and exchanging data
• Liverpool - East Anglia Coastal Study 2 (LEACOAST2)
The LEACOAST2 project [7] follows the LEACOAST project (2002 - 2005) in studying sediment transport
around the sea defences at Sea Palling, Norfolk, UK
• Marine Productivity (MarProd)
MarProd [8] helped to develop coupled models and observation systems for the pelagic ecosystem, with
emphasis on the physical factors affecting zooplankton dynamics
• NERC DataGrid
The challenge for NERC DataGrid [9] was to build an e-grid which makes data discovery, delivery and use
much easier than previously possible
• Rapid Climate Change (RAPID)
The RAPID programme [10] aimed to improve understanding of the causes of sudden changes in the Earth's
climate
• Ocean Margin Exchange (OMEX)
The OMEX project [11] studied, measured and modelled the physical, chemical and biological processes and
fluxes at the ocean margin - the interface between the open Atlantic ocean and the European continental shelf
• SeaDataNet
SeaDataNet [12] aims to develop a standardised, distributed system which provides transparent access to
marine data sets and data products from countries in and around Europe. It will build on the work already
completed by the SEA-SEARCH [13] and EDIOS [14] projects
• System of Industry Metocean data for the Offshore and Research Communities (SIMORC)
SIMORC [15] aimed to create a central index and database of metocean data sets collected globally by the oil
and gas industry
External links
• BODC homepage [1]
• BODC News and events [16]
• Natural Environment Research Council (NERC) homepage [17]
• NERC Data centres [18]
• NERC Data Discovery Service [19]
• National Tidal and Sea Level Facility (NTSLF) [20]
External partner pages
• British Antarctic Survey (BAS) [21]
• National Oceanography Centre, Southampton (NOCS) [22]
• Plymouth Marine Laboratory (PML) [23]
• Proudman Oceanographic Laboratory (POL) [24]
• Scottish Association for Marine Science (SAMS) [25]
• Sea Mammal Research Unit (SMRU) [26]
External project pages
• Atlantic Meridional Transect (AMT) homepage [27]
• Autosub Under Ice (AUI) homepage [28]
• Liverpool - East Anglia Coastal Study 2 (LEACOAST2) homepage [29]
• NERC DataGrid (NDG) homepage [30]
• RAPID Climate Change homepage [31]
• SeaDataNet homepage [32]
• System of Industry Metocean data for the Offshore and Research Communities (SIMORC) homepage [33]
References
[1] http://www.bodc.ac.uk
[2] http://www.soton.ac.uk/library/about/nol/mias.html
[3] http://www.oceannet.org/
[4] http://www.bodc.ac.uk/projects/uk/amt/
[5] http://www.bodc.ac.uk/projects/uk/aui/
[6] http://www.bodc.ac.uk/projects/uk/enpardis/
[7] http://www.bodc.ac.uk/projects/uk/leacoast2/
[8] http://www.bodc.ac.uk/projects/uk/marprod/
[9] http://www.bodc.ac.uk/projects/uk/ndg/
[10] http://www.bodc.ac.uk/projects/uk/rapid/
[11] http://www.bodc.ac.uk/projects/european/omex/
[12] http://www.bodc.ac.uk/projects/european/seadatanet/
[13] http://www.bodc.ac.uk/projects/european/seasearch/
[14] http://www.bodc.ac.uk/projects/european/edios/
[15] http://www.bodc.ac.uk/projects/european/simorc/
[16] http://www.bodc.ac.uk/about/news_and_events/
[17] http://www.nerc.ac.uk
[18] http://www.nerc.ac.uk/research/sites/data/
[19] http://ndg.nerc.ac.uk/discovery
[20] http://www.pol.ac.uk/ntslf/
[21] http://www.antarctica.ac.uk/
[22] http://www.noc.soton.ac.uk/
[23] http://www.pml.ac.uk/
[24] http://www.pol.ac.uk
[25] http://www.sams.ac.uk/
[26] http://www.smru.st-and.ac.uk/
[27] http://web.pml.ac.uk/amt/
[28] http://www.noc.soton.ac.uk/aui/
[29] http://pcwww.liv.ac.uk/civilCRG/leacoast2/
[30] http://ndg.nerc.ac.uk/
[31] http://www.noc.soton.ac.uk/rapid/rapid.php
[32] http://www.seadatanet.org/
[33] http://www.simorc.com/
Business intelligence
Business intelligence (BI) refers to computer-based techniques used in identifying, extracting, and analyzing
business data, such as sales revenue by products and/or departments, or by associated costs and incomes.[1]
BI technologies provide historical, current and predictive views of business operations. Common functions of
business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining,
business performance management, benchmarking, text mining and predictive analytics.
Business intelligence aims to support better business decision-making. Thus a BI system can be called a decision
support system (DSS).[2] Though the term business intelligence is sometimes used as a synonym for competitive
intelligence, because they both support decision making, BI uses technologies, processes, and applications to analyze
mostly internal, structured data and business processes, while competitive intelligence gathers, analyzes and
disseminates information with a topical focus on company competitors. Business intelligence understood broadly can
include the subset of competitive intelligence.[3]
History
In a 1958 article, IBM researcher Hans Peter Luhn used the term business intelligence. He defined intelligence as:
"the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired
goal."[4]
Business intelligence as it is understood today is said to have evolved from the decision support systems which
began in the 1960s and developed throughout the mid-80s. DSS originated in the computer-aided models created to
assist with decision making and planning. From DSS, data warehouses, Executive Information Systems, OLAP and
business intelligence came into focus beginning in the late 80s.
In 1989 Howard Dresner (later a Gartner Group analyst) proposed "business intelligence" as an umbrella term to
describe "concepts and methods to improve business decision making by using fact-based support systems."[2] It was
not until the late 1990s that this usage was widespread.[5]
Business intelligence and data warehousing
Often BI applications use data gathered from a data warehouse or a data mart. However, not all data warehouses are
used for business intelligence, nor do all business intelligence applications require a data warehouse.
In order to distinguish between concepts of business intelligence and data warehouses, Forrester Research often
defines business intelligence in one of two ways:
Typically, Forrester uses the following broad definition: "Business Intelligence is a set of methodologies, processes,
architectures, and technologies that transform raw data into meaningful and useful information used to enable more
effective strategic, tactical, and operational insights and decision-making."[6] When using this definition, business
intelligence also includes technologies such as data integration, data quality, data warehousing, master data
management, text and content analytics, and many others that the market sometimes lumps into the Information
Management segment. Therefore, Forrester refers to data preparation and data usage as two separate, but closely
linked segments of the business intelligence architectural stack.
Forrester defines the latter, narrower business intelligence market as "referring to just the top layers of the BI
architectural stack such as reporting, analytics and dashboards."[7]
Business intelligence and business analytics
Thomas Davenport has argued that business intelligence should be divided into querying, reporting, OLAP, an
"alerts" tool, and business analytics. In this definition, business analytics is the subset of BI based on statistics,
prediction, and optimization.[8]
Applications in an enterprise
Business Intelligence can be applied to the following business purposes (MARCKM), in order to drive business
value:
1. Measurement – program that creates a hierarchy of Performance metrics (see also Metrics Reference Model) and
Benchmarking that informs business leaders about progress towards business goals (AKA Business process
management).
2. Analytics – program that builds quantitative processes for a business to arrive at optimal decisions and to perform
Business Knowledge Discovery. Frequently involves: data mining, process mining, statistical analysis, Predictive
analytics, Predictive modeling, Business process modeling
3. Reporting/Enterprise Reporting – program that builds infrastructure for Strategic Reporting to serve the Strategic
management of a business, NOT Operational Reporting. Frequently involves: Data visualization, Executive
information system, OLAP
4. Collaboration/Collaboration platform – program that gets different areas (both inside and outside the business) to
work together through Data sharing and Electronic Data Interchange.
5. Knowledge Management – program to make the company data driven through strategies and practices to identify,
create, represent, distribute, and enable adoption of insights and experiences that are true business knowledge.
Knowledge Management leads to Learning Management and Regulatory compliance/Compliance
Requirements gathering
According to Kimball,[9] business users and their requirements impact nearly every decision made throughout the
design and implementation of a DW/BI system. The business requirements sit at the center of the business core and
are related to all aspects of the daily business processes. They are therefore extremely critical to successful data
warehousing. Business requirements analysis occurs at two distinct levels:
• Macro level: understand the business’s needs and priorities relative to a program perspective
• Micro level: understand the users’ needs and desires in the context of a single, relatively narrowly defined project.
Approach
There are two basic interactive techniques for gathering requirements:
1. Conduct interviews: You need to talk to the users about their jobs, their objectives, and their challenges. This is
either done with individuals or small groups.
2. Facilitated sessions and seminars: Can be used to encourage creative brainstorming.
Identify the interview team
• Lead interviewer – directs the questioning
• Scribe – takes copious notes during the interview; a tape recorder may be used to supplement the scribe, since it is
useful as a backup
• Observers – an optional part of the team, and a good opportunity for other team members to gain knowledge about
interviewing techniques. It is advisable that no more than two observers be present.
Research the organization
Review reports, business operations and parts of the annual report to gain insight into the organizational structure. If
applicable, obtain a copy of the documentation from the latest internal business/IT strategy and planning meeting.
Select the interviewees
Select a cross section of representatives. Study the organization to get a good idea of all the stakeholders in the
project. These include:
• Business interviewees (to understand the key business processes)
• IT and Compliance/Security Interviewees (to assess preliminary feasibility of the underlying operational source
systems to support the requirements emerging from the business side of the house)
Develop the interview questionnaires
Multiple questionnaires should be developed because the questioning will vary by job function and level.
• The questionnaires for the data audit sessions will differ from business requirements questionnaires
• Be structured. This will help the interview flow and help organize your thoughts before the interview.
Schedule and sequence the interviews
Scheduling and rescheduling takes time; prepare these well in advance. Sequence your interviews by
beginning with the business driver, followed by the business sponsor. This is to understand the playing field from
their perspective. The optimal sequence would be:
• Business driver
• Business sponsor
• An interviewee from the middle of the organizational hierarchy
• Bottom of the organizational hierarchy
The bottom is a disastrous place to begin because you have no idea where you are headed. The top is great for
overall vision, but you need the business background, confidence, and credibility to converse at those levels. If you
are not adequately prepared with in-depth business familiarity, the safest route is to begin in the middle of the
organization.
Prepare the interviewees
Make sure the interviewees are appropriately briefed and prepared to participate. As a minimum, a letter should be
emailed to all interview participants to inform them about the process and the importance of their participation and
contribution. The letter should explain that the goal is to understand their job responsibilities and business objectives,
which then translate into the information and analyses required to get their job done. In addition they should be
asked to bring copies of frequently used reports or spreadsheet analyses.
The letter should be signed by a high ranking sponsor, someone well respected by the interviewees. It is advisable
not to attach a list of the fifty questions you might ask in hopes that the interviewees will come prepared with
answers. The odds are that they won't take the time to prepare responses and may even be intimidated by the volume of
your questions.
Issues with requirements gathering and interviews
The process of conducting an interview may seem exhaustive at first, but the ground rule is to be well prepared in all
steps. It may be a good idea to investigate questioning techniques before conducting the interview. Ask
open-ended questions such as why, how, what-if, and what-then questions. Ask unbiased questions.
Poorly asked questions can lead to wrong answers and, in the worst case, to the wrong requirements being gathered. The
whole process is costly in time and resources, and the wrong data can slow down the development of the whole BI
installation. Be sure that everyone on the interview team is aware of their role, so that everything goes as
planned. The next step is to synthesize the findings around the business processes.
Prioritization of business intelligence projects
It is often difficult to provide a positive business case for business intelligence (BI) initiatives and often the projects
will need to be prioritized through strategic initiatives. Here are some hints to increase the benefits for a BI project.
• As described by Kimball,[10] you must determine the tangible benefits, such as the eliminated cost of producing legacy
reports.
• Enforce access to data for the entire organization. In this way even a small benefit, such as a few minutes saved,
will make a difference when it is multiplied by the number of employees in the entire organization.
• As described by Ross, Weil & Robertson for Enterprise Architecture,[11] consider letting the BI project be driven
by other business initiatives with excellent business cases. To support this approach, the organization must have
Enterprise Architects who will be able to detect suitable business projects.
Success factors of implementation
Before implementing a BI solution, it is worth taking different factors into consideration. According to Kimball et al.,
these are the three critical areas that you need to assess within your organization before
getting ready to do a BI project:[12]
1. The level of commitment and sponsorship of the project from senior management
2. The level of business need for creating a BI implementation
3. The amount and quality of business data available.
Business Sponsorship
The commitment and sponsorship of senior management is, according to Kimball et al., the most important criterion
for assessment.[13] This is because strong management backing helps overcome shortcomings elsewhere
in the project. But as Kimball et al. state: "even the most elegantly designed DW/BI system cannot overcome a lack
of business [management] sponsorship".[14] It is very important that the management personnel who participate in
the project have a vision and an idea of the benefits and drawbacks of implementing a BI system. The best business
sponsor should have organizational clout and should be well connected within the organization. It is ideal that the
business sponsor is demanding but also able to be realistic and supportive if the implementation runs into delays or
drawbacks. The management sponsor also needs to be able to assume accountability and to take responsibility for
failures and setbacks on the project. It is imperative that there is support from multiple members of the management,
so the project will not fail if one person leaves the steering group. However, having many managers working
together on the project can also mean that there are several different interests that attempt to pull the project in
different directions, for instance if different departments want to put more emphasis on their own usage of the
implementation. This issue can be countered by an early and specific analysis of the business areas that will
benefit the most from the implementation. All stakeholders in the project should participate in this analysis in order for
them to feel ownership of the project and to find common ground. Another management problem that
should be addressed before the start of implementation is an overly aggressive business sponsor: a manager who
gets carried away by the possibilities of BI and wants the DW or BI implementation to include several sets of data
that were not part of the original planning phase. Since such additions will most likely add many months to the
original plan, it is important to make sure that the sponsor understands the consequences of expanding the scope.
Implementation should be driven by clear business needs.
Because of the close relationship with senior management, another critical thing that needs to be assessed before the
project is implemented is whether or not there actually is a business need and whether there is a clear business
benefit from doing the implementation.[15] The needs and benefits of the implementation are sometimes driven by
competition and the need to gain an advantage in the market. Another business-driven reason for implementing BI is
the acquisition of other organizations that enlarge the original organization: it can sometimes
be beneficial to implement DW or BI in order to create more oversight.
The amount and quality of the available data.
This ought to be the most important factor, since without good data it does not really matter how good your
management sponsorship or your business-driven motivation is. If you do not have the data, or the data does not
have sufficient quality, any BI implementation will fail. Before implementation it is a very good idea to do data
profiling; this analysis will be able to describe the "content, consistency and structure [..]"[15] of the data (a minimal
sketch of such a profiling pass is shown below). This should be done as early as possible in the process, and if the
analysis shows that your data is lacking, it is a good idea to put the project on the shelf temporarily while the IT
department figures out how to do proper data collection.
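As an illustration of what such a profiling pass can look like in practice, the following sketch in plain Python reports
row counts, empty values and distinct values per column – the kind of "content, consistency and structure" summary
referred to above. The input file name and its columns are assumptions for the example, not part of the original text.

import csv
from collections import defaultdict

def profile(path):
    # Count rows, empty values and distinct values for every column.
    rows = 0
    nulls = defaultdict(int)
    distinct = defaultdict(set)
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for record in reader:
            rows += 1
            for column, value in record.items():
                if value is None or value.strip() == "":
                    nulls[column] += 1
                else:
                    distinct[column].add(value)
    print(f"{rows} rows")
    for column in (reader.fieldnames or []):
        print(f"{column}: {nulls[column]} empty, {len(distinct[column])} distinct values")

if __name__ == "__main__":
    profile("customers.csv")  # hypothetical input file

A profile like this quickly exposes columns that are mostly empty or that contain far more (or fewer) distinct values
than the business expects, which is exactly the kind of finding that may justify pausing the project.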
Other scholars have added more factors to the list than these three. In his thesis "Critical Success Factors of BI
Implementation",[16] Naveen Vodapalli researches different factors that can impact the final BI product. He
lists seven crucial success factors for the implementation of a BI project:
1. Business-driven methodology and project management
2. Clear vision and planning
3. Committed management support & sponsorship
4. Data management and quality
5. Mapping solutions to user requirements
6. Performance considerations of the BI system
7. Robust and expandable framework
User aspect
Some considerations must be made in order to successfully integrate the usage of business intelligence systems in a
company. Ultimately the BI system must be accepted and utilized by the users in order for it to add value to the
organization.[17] [18] If the usability of the system is poor, the users may become frustrated and spend a considerable
amount of time figuring out how to use the system, or may not be able to really use the system. If the system does not
add value to the users' mission, they will simply not use it.[18]
In order to increase the user acceptance of a BI system, it may be advisable to consult the business users at an early
stage of the DW/BI lifecycle, for example at the requirements gathering phase.[17] This can provide an insight
into the business process and what the users need from the BI system. There are several methods for gathering this
information, such as questionnaires and interview sessions.
When gathering the requirements from the business users, the local IT department should also be consulted in order
to determine to which degree it is possible to fulfill the business's needs based on the available data.[17]
Taking a user-centered approach throughout the design and development stage may further increase the chance of
rapid user adoption of the BI system.[18]
Besides focusing on the user experience offered by the BI applications, it may also be possible to motivate the users to
utilize the system by adding an element of competition. Kimball[17] suggests implementing a function on the
Business Intelligence portal website where reports on system usage can be found. By doing so, managers can see
how well their departments are doing and compare themselves to others, and this may spur them to encourage their
staff to utilize the BI system even more.
In a 2007 article, H. J. Watson gives an example of how the competitive element can act as an incentive.[19] Watson
describes how a large call centre implemented performance dashboards for all the call agents and tied monthly
incentive bonuses to the performance metrics. Furthermore, the agents can see how their own
performance compares to the other team members. The implementation of this type of performance measurement
and competition significantly improved the performance of the agents.
Other elements which may increase the success of BI include involving senior management in order to make BI a
part of the organizational culture, and providing the users with the necessary tools, training and support.[19]
By offering user training, more people may actually use the BI application.[17]
Providing user support is necessary in order to maintain the BI system and assist users who run into problems.[18]
User support can be incorporated in many ways, for example by creating a website. The website should contain
relevant content and tools for finding the necessary information. Furthermore, helpdesk support can be used. The
helpdesk can be manned by, for example, power users or the DW/BI project team.[17]
Marketplace
There are a number of business intelligence vendors, often categorized into the remaining independent "pure-play"
vendors and the consolidated "megavendors" which have entered the market through a recent trend of acquisitions in
the BI industry.[20]
Some companies adopting BI software decide to pick and choose from different product offerings (best-of-breed)
rather than purchase one comprehensive integrated solution (full-service).[21]
Industry-specific
Specific considerations have to be taken into account for business intelligence systems in regulated sectors such as
banking. The information collected by banking institutions and analyzed with BI software must be
protected from some groups or individuals, while being fully available to other groups or individuals. Therefore BI
solutions must be sensitive to those needs and be flexible enough to adapt to new regulations and changes to existing
laws.
Semi-structured or unstructured data
Businesses create a huge amount of valuable information in the form of e-mails, memos, notes from call-centers,
news, user groups, chats, reports, web-pages, presentations, image-files, video-files, and marketing material.
According to Merrill Lynch, more than 85 percent of all business information exists in these forms. These
information types are called either semi-structured or unstructured data. However, organizations often only use these
documents once.[22]
The management of semi-structured data is recognized as a major unsolved problem in the information technology
industry.[23] According to projections from Gartner (2003), white collar workers will spend anywhere from 30 to 40
percent of their time searching, finding and assessing unstructured data. BI uses both structured and unstructured
data, but the former is easy to search, while the latter contains a large quantity of the information needed for analysis
and decision making.[23] [24] Because of the difficulty of properly searching, finding and assessing unstructured or
semi-structured data, organizations may not draw upon these vast reservoirs of information, which could influence a
particular decision, task or project. This can ultimately lead to poorly informed decision making.[22]
Therefore, when designing a business intelligence/DW solution, the specific problems associated with
semi-structured and unstructured data must be accommodated as well as those for the structured data.[24]
Unstructured data vs. Semi-structured data
Unstructured and semi-structured data have different meanings depending on their context. In the context of
relational database systems, unstructured data refers to data that cannot be stored in columns and rows. It must be
stored in a BLOB (binary large object), a catch-all data type available in most relational database management systems.
But many of these data types, like e-mails, word processing text files, PPTs, image-files, and video-files, conform to
a standard that offers the possibility of metadata. Metadata can include information such as author and time of
creation, and this can be stored in a relational database. Therefore it may be more accurate to talk about this as
semi-structured documents or data,[23] but no specific consensus seems to have been reached.
Problems with semi-structured or unstructured data
There are several challenges to developing BI with semi-structured data. According to Inmon & Nesavich,[25] some
of those are:
1. Physically accessing unstructured textual data – unstructured data is stored in a huge variety of formats.
2. Terminology – Among researchers and analysts, there is a need to develop a standardized terminology.
3. Volume of data – As stated earlier, up to 85% of all data exists as semi-structured data. Couple that with the need
for word-to-word and semantic analysis.
4. Searchability of unstructured textual data – A simple search on some data, e.g. apple, results in links where there
is a reference to that precise search term. Inmon & Nesavich (2008)[25] give an example: "a search is made on
the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to
an unstructured document is made. But a simple search is crude. It does not find references to crime, arson,
murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies."
The use of metadata
To solve the problem with the searchability and assessment of the data, it is necessary to know something about the
content. This can be done by adding context through the use of metadata.[22] Many systems already capture some
metadata (e.g. filename, author, size), but much more useful would be metadata about the actual content, e.g.
summaries, topics, and people or companies mentioned. Two technologies designed for generating metadata about
content are automatic categorization and information extraction; a naive sketch of the idea follows.
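The following is a minimal, illustrative sketch (not from the original text) of generating simple content metadata for a
document: file-level attributes that many systems already capture, plus a naive keyword-frequency count standing in
for topic extraction. Real automatic categorization and information extraction systems are far more sophisticated;
the file name and stopword list here are hypothetical.

import os
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def extract_metadata(path, top_n=5):
    # File-level metadata that many systems already capture.
    stat = os.stat(path)
    meta = {"filename": os.path.basename(path), "size_bytes": stat.st_size}
    # Naive content metadata: the most frequent non-stopword terms.
    with open(path, encoding="utf-8", errors="ignore") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    terms = Counter(w for w in words if w not in STOPWORDS)
    meta["topics"] = [term for term, _ in terms.most_common(top_n)]
    return meta

if __name__ == "__main__":
    print(extract_metadata("memo.txt"))  # hypothetical document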
Future
A 2009 Gartner paper predicted[26] these developments in the business intelligence market:
• Because of lack of information, processes, and tools, through 2012, more than 35 percent of the top 5,000 global
companies will regularly fail to make insightful decisions about significant changes in their business and markets.
• By 2012, business units will control at least 40 percent of the total budget for business intelligence.
• By 2012, one-third of analytic applications applied to business processes will be delivered through coarse-grained
application mashups.
A 2009 Information Management special report predicted the top BI trends: "green computing, social networking,
data visualization, mobile BI, predictive analytics, composite applications, cloud computing and multitouch."[27]
According to a study by the Aberdeen Group, there has been increasing interest in Software-as-a-Service (SaaS)
business intelligence over the past years, with twice as many organizations using this deployment approach as one
year ago – 15% in 2009 compared to 7% in 2008.
An article by InfoWorld's Chris Kanaracus points out similar growth data from research firm IDC, which predicts
the SaaS BI market will grow 22 percent each year through 2013 thanks to increased product sophistication, strained
IT budgets, and other factors.[28]
References
[1] "BusinessDictionary.com definition" (http:/ / www.businessdictionary. com/ definition/business-intelligence-BI. html). . Retrieved 17 March
2010.
[2] D. J. Power (2007-03-10). "A Brief History of Decision Support Systems, version 4.0" (http:/ / dssresources. com/ history/ dsshistory. html).
DSSResources.COM. . Retrieved 2008-07-10.
[3] Kobielus, James (April 30, 2010). "What’s Not BI? Oh, Don’t Get Me Started....Oops Too Late...Here Goes...." (http:/ / blogs. forrester.com/
james_kobielus/ 10-04-30-what s_not_bi_oh_don t_get_me_startedoops_too_latehere_goes). . "“Business” intelligence is a
non-domain-specific catchall for all the types of analytic data that can be delivered to users in reports, dashboards, and the like. When you
specify the subject domain for this intelligence, then you can refer to “competitive intelligence,” “market intelligence,” “social intelligence,”
“financial intelligence,” “HR intelligence,” “supply chain intelligence,” and the like."
[4] H. P. Luhn (October 1958). "A Business Intelligence System" (http:// www.research.ibm. com/ journal/rd/024/ ibmrd0204H. pdf) (PDF).
IBM Journal. . Retrieved 2008-07-10.
[5] Power, D. J.. "A Brief History of Decision Support Systems" (http:// dssresources. com/ history/ dsshistory. html). . Retrieved November 1,
2010.
[6] Evelson, Boris (November 21, 2008). "Topic Overview: Business Intelligence" (http:/ / www. forrester.com/ rb/Research/
topic_overview_business_intelligence/q/ id/ 39218/ t/ 2). .
[7] Evelson, Boris (April 29, 2010). "Want to know what Forrester's lead data analysts are thinking about BI and the data domain?" (http://
blogs.forrester.com/ boris_evelson/ 10-04-29-want_know_what_forresters_lead_data_analysts_are_thinking_about_bi_and_data_domain). .
[8] Tom Davenport. Interview. Analytics at Work: Q&A with Tom Davenport (http:// intelligent-enterprise.informationweek.com/ showArticle.
jhtml;jsessionid=1XNPBXHF0WN3XQE1GHPSKH4ATMY32JVN?articleID=222200096). January 4, 2010.
[9] Kimball et al., 2008: 63
[10] Ralph Kimball et al. "The Data warehouse Lifecycle Toolkit" (2nd ed.), page 29
[11] Jeanne W. Ross, Peter Weil, David C. Robertson (2006) "Enterprise Architecture As Strategy", page 117.
[12] Kimball et al. 2008: pp. 298
[13] Kimball et al., 2008: 16
[14] Kimball et al., 2008: 18
[15] Kimball et al., 2008: 17
[16] Naveen K Vodapalli (2009-11-02). "Critical Success Factors of BI Implementation" (http:// mit. itu.dk/ ucs/ pb/ download/ BI Thesis
Report-New. pdf?file_id=871821). IT University of Copenhagen. . Retrieved 2009-11-12.
[17] Ralph Kimball et al. "The Data warehouse Lifecycle Toolkit" (2nd ed.)
[18] Swain Scheps "Business Intelligence For Dummies", 2008, ISBN 978-0-470-12726-0
[19] H.J. Watson and B.H. Wixom "The Current State of Business Intelligence", Computer Volume 40 Issue 9, September 2007
[20] Pendse, Nigel (March 7, 2008). "Consolidations in the BI industry" (http://www.bi-verdict.com/ fileadmin/ FreeAnalyses/ consolidations.
htm). The OLAP Report. .
[21] Imhoff, Claudia (April 4, 2006). "Three Trends in Business Intelligence Technology" (http:/ / www. b-eye-network.com/ view/ 2608). .
[22] R. Rao "From Unstructured Data to Actionable Information", IT Pro, November | December 2003, p. 14-16
[23] Blumberg, R. & S. Atre "The Problem with Unstructured Data", DM Review November 2003b
[24] Negash, S "Business Intelligence", Communications of the Association of Information Systems, vol. 13, 2004, p. 177.195.
[25] Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization" from "Managing Unstructured data in the organization", Prentice
Hall 2008, p. 1-13
[26] "Gartner Reveals Five Business Intelligence Predictions for 2009 and Beyond", http:// www.gartner.com/ it/ page.jsp?id=856714
[27] Campbell, Don (June 23, 2009). "10 Red Hot BI Trends" (http:/ / www.information-management.com/ specialreports/ 2009_148/
business_intelligence_data_vizualization_social_networking_analytics-10015628-1.html). Information Management. .
[28] http:// infoworld.com/ d/ cloud-computing/saas-bi-growth-will-soar-in-2010-511
Business Intelligence Project Planning
When building a Business Intelligence application, before the requirements are defined, you have to plan the project.
In doing so, a variety of tasks have to be completed and multiple steps taken.
Project Prerequisites
First and foremost, the project needs strong sponsors within the business. A sponsor is an important figure within the
company who has the power to drive the project forward. An ideal sponsor is someone who is both realistic and
supportive.[1] It is best if they have a broad understanding of Business Intelligence applications, but it is advisable
to choose a sponsor that does not know too much about BI applications: they might interfere with the project,
demanding that their ideas be brought to life and their methods used in the management, design and implementation
phases. Without a key sponsor to keep the project afloat, the business can soon lose interest and track of what is
actually going on in the development unit.
In addition to sponsor support, support from the business community is also important. If the community is strongly
against the project, they might misuse it on purpose, act overly unknowing about the nature of the system or might
refuse to use it at all. In other words, critical mass[2] becomes impossible to reach, and a discontinuance of the
project becomes likely.[3]
Lastly, the business data that is in the scope of the project has to be ready for the BI application in question. In other
words, the project has to be feasible. This is not to say that the data has to be presented in a clean spreadsheet format
with perfectly sorted columns and rows, but the data must at least be collectable and sortable.[4]
Besides these three factors, there are other circumstances that have to be taken into consideration. Among these
are the financial perspective of the project, the technical hardware and an accurate initial scoping. Financial
scoping in particular can be hard, because it is difficult to measure the economic outcome of an application that sorts
data and makes it available to the business.
Project planning
First, a project identity has to be established by the project lead.[5] It is a good idea to choose a sound acronym for
the project, something that does not sound too silly and does not confuse users. Project identity has a lot to do with
branding, promotion and commercialization of the project.
An important task, if not the most important, is staffing of the project. Without the right hands and eyes, it can be
very difficult to see the project completed successfully. The key figures in the process are
listed below:
Sponsors and drivers – the business line-up
• Business sponsor: The “cheerleader” of the project. As described earlier, this is a key figure that all projects must
secure before being attempted. The sponsor is the business owner of the project, and often has the financial
responsibility.
• BI director: This figure complements the sponsor as the primary leader of the project. They are the drivers of the
project, involved in both developing and continuously selling the business case and maintaining it. This role
will most often report to the CIO (Chief Information Officer).[6]
Project Managers and Leads – the coaches
• Project Manager: The project manager is the head coach of the project and its day-to-day manager. They must
understand every aspect of the process and be able to direct everyone in the right direction when discrepancies
arise. A project manager should emphasize management tools, without becoming a “slave” to them. The project
manager also needs to be good with people, since most of what he/she will be doing is talking to people and
sorting out issues.
• Business project lead: This is a part-time role. This person has detailed information about the project
requirements, and constantly strives to keep the project on track.[7]
Core Team – the bulk of the design
• Business Analyst: This person, who in small projects may well be the same as the business project lead, is the one
who actually takes the requirements and turns them into concrete models. As with the business project lead,
the business analyst needs to be respected amongst the business community, since they will operate in the continued
gap between business and project staff.
• Data Steward: The caretaker of the data. This task should ideally be carried out by someone from
the business, since they will have the most detailed understanding of how the data actually should be handled.
Some data might be more or less sensitive, and such tacit knowledge can be impossible for an external consultant
to gather and accumulate.
• Data Modeler: This is the administrator of the database, responsible for its setup and maintenance. The data
modeler will participate in dimensional modeling and must as such have knowledge of this field.
• Metadata Manager: The steward of metadata, responsible for collecting, coordinating and inputting accurate
metadata.
• ETL architect: The true workhorse of the project, who is responsible for the extract-transform-load process. They
must be experienced coders with an understanding of agile coding methods. As the trend dictates, a programmer
should be sure to stay in touch with other units of the process so as not to get lost in the work.
• BI Architect: This person creates the user interface to the system. They must have a strong knowledge of the business
requirements, as well as a basic understanding of interface design and HCI.[8]
Special Teams – part-time actors
• Technical architect: A specialist in technical infrastructure who assists the project in being compatible with existing
technical setups in the company.
• Security Manager: This role can be filled by many different specialists, since security differs at each layer
of the project. The interface must be able to handle things like SQL injections and bad requests, while the deeper
layers should provide protection and firewalls against trojans and other malware.
• Lead Tester: A consultant responsible for coordinating tests among users or perhaps external testers.
• Data Mining specialist: The expert in statistical theory who drives the data mining and provides invaluable input on
how to structure, cluster and interpret data.
• Educator: The educator does not have to be one person; it can be a role that each specialist takes on
when the other teams have to be brought up to speed on a certain part of the project.[9]
References
[1] Ralph Kimball et al. "The Data warehouse Lifecycle Toolkit" (2nd ed.) p 16
[2] Everett M. Rogers "Diffusion of Innovations" (5th ed.) p 344
[3] Everett M. Rogers "Diffusion of Innovations" (5th ed.) p 190
[4] Ralph Kimball et al. "The Data warehouse Lifecycle Toolkit" (2nd ed.) p 17
[5] Ralph Kimball et al. "The Data warehouse Lifecycle Toolkit" (2nd ed.) p 31
[6] Ralph Kimball et al. "The Data warehouse Lifecycle Toolkit" (2nd ed.) p 33
[7] Ralph Kimball et al. "The Data warehouse Lifecycle Toolkit" (2nd ed.) p 34
[8] Ralph Kimball et al. "The Data warehouse Lifecycle Toolkit" (2nd ed.) p 35
[9] Ralph Kimball et al. "The Data warehouse Lifecycle Toolkit" (2nd ed.) p 38
Change data capture
In databases, change data capture (CDC) is a set of software design patterns used to determine (and track) the data
that has changed so that action can be taken using the changed data. CDC is also an approach
to data integration that is based on the identification, capture and delivery of the changes made to enterprise data
sources.
CDC solutions occur most often in data-warehouse environments since capturing and preserving the state of data
across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data
repository system.
Methodology
System developers can set up CDC mechanisms in a number of ways and in any one or a combination of system
layers from application logic down to physical storage.
In a simplified CDC context, one computer system has data believed to have changed from a previous point in time,
and a second computer system needs to take action based on that changed data. The former is the source, the latter is
the target. It is possible that the source and target are the same system physically, but that does not change the design
patterns logically.
Not uncommonly, multiple CDC solutions can exist in a single system.
Timestamps on rows
Tables whose changes must be captured may have a column that represents the time of last change. Names such as
LAST_UPDATE, etc. are common. Any row in any table that has a timestamp in that column that is more recent
than the last time data was captured is considered to have changed.
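A minimal sketch of this pattern, using Python's built-in sqlite3 module and a hypothetical orders table with a
last_update column (table and column names are illustrative, not from the original text):

import sqlite3

# Illustrative in-memory source table; in practice this would be an existing
# operational table that already carries a LAST_UPDATE-style column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, last_update TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "NEW", "2024-01-01 09:00:00"), (2, "SHIPPED", "2024-01-03 14:30:00")],
)

def capture_changes(since):
    # Any row whose last_update is more recent than the previous capture
    # time is considered to have changed.
    cur = conn.execute("SELECT * FROM orders WHERE last_update > ?", (since,))
    return cur.fetchall()

print(capture_changes("2024-01-02 00:00:00"))  # -> [(2, 'SHIPPED', '2024-01-03 14:30:00')]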
Version Numbers on rows
Database designers give tables whose changes must be captured a column that contains a version number. Names
such as VERSION_NUMBER, etc. are common. When data in a row changes, its version number is updated to the
current version. A supporting construct such as a reference table with the current version in it is needed. When a
change capture occurs, all data with the latest version number is considered to have changed. When the change
capture is complete, the reference table is updated with a new version number.
Three or four major techniques exist for doing CDC with version numbers; the above paragraph describes just one.
Status indicators on rows
This technique can either supplant or complement timestamps and versioning. It can act as an alternative if, for
example, a status column is set up on a table row indicating that the row has changed (e.g. a boolean column that,
when set to true, indicates that the row has changed). Otherwise, it can act as a complement to the previous methods,
indicating that a row, despite having a new version number or an earlier date, still shouldn't be updated on the target
(for example, the data may require human validation).
Time/Version/Status on rows
This approach combines the three previously discussed methods. As noted, it is not uncommon to see multiple CDC
solutions at work in a single system; however, the combination of time, version, and status provides a particularly
powerful mechanism and programmers should utilize them as a trio where possible. The three elements are not
redundant or superfluous. Using them together allows for such logic as, "Capture all data for version 2.1 that
changed between 6/1/2005 12:00 a.m. and 7/1/2005 12:00 a.m. where the status code indicates it is ready for
production." A sketch of such a capture query is shown below.
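For illustration only, the quoted capture rule might translate into a parameterized query like the following sketch; the
table and column names (orders, row_version, last_update, status) are assumptions, not from the original text:

# Parameterized query expressing the quoted time/version/status capture rule.
CAPTURE_SQL = """
SELECT *
FROM orders
WHERE row_version = ?          -- version 2.1
  AND last_update >= ?         -- 2005-06-01 00:00:00
  AND last_update <  ?         -- 2005-07-01 00:00:00
  AND status = 'READY_FOR_PRODUCTION'
"""
params = ("2.1", "2005-06-01 00:00:00", "2005-07-01 00:00:00")
print(CAPTURE_SQL.strip(), params)
# rows = conn.execute(CAPTURE_SQL, params).fetchall()  # conn as in the earlier sketch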
Triggers on tables
This approach may include a publish/subscribe pattern to communicate the changed data to multiple targets. Here,
triggers log events that happen to the transactional table into another queue table that can later be "played back". For
example, imagine an Accounts table; when transactions are taken against this table, triggers would fire that would
then store a history of the event or even the deltas into a separate queue table. The queue table might have a schema
with the following fields: Id, TableName, RowId, TimeStamp, Operation. The data inserted for our Accounts sample
might be: 1, Accounts, 76, 11/02/2008 12:15am, Update. More complicated designs might log the actual data that
changed. This queue table could then be "played back" to replicate the data from the source system to a target; a
minimal sketch of this queue-table pattern is shown below.
An example of this technique is the pattern known as the log trigger.
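A minimal, self-contained sketch of the trigger-plus-queue-table idea, using Python's sqlite3 module; the table and
column names mirror the hypothetical Accounts example above and are not part of the original text:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
-- Queue table recording which row changed, when, and how.
CREATE TABLE change_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT, row_id INTEGER, ts TEXT, operation TEXT
);
-- Trigger fires on every update and appends an event to the queue.
CREATE TRIGGER accounts_update AFTER UPDATE ON accounts
BEGIN
    INSERT INTO change_queue (table_name, row_id, ts, operation)
    VALUES ('accounts', NEW.id, datetime('now'), 'Update');
END;
""")

conn.execute("INSERT INTO accounts VALUES (76, 100.0)")
conn.execute("UPDATE accounts SET balance = 250.0 WHERE id = 76")

# "Play back" the queue to see what changed since the last capture.
for event in conn.execute("SELECT * FROM change_queue"):
    print(event)  # e.g. (1, 'accounts', 76, '2024-01-01 12:15:00', 'Update')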
Log scanners on databases
Most database management systems manage a transaction log that records changes made to the database contents
and to metadata. By scanning and interpreting the contents of the database transaction log one can capture the
changes made to the database in a non-intrusive manner.
Using transaction logs for change data capture poses a challenge in that the structure, contents and use of a
transaction log are specific to the database management system. Unlike data access, no standard exists for transaction
logs. Most database management systems do not document the internal format of their transaction logs, although
some provide programmatic interfaces to their transaction logs (for example: Oracle, DB2, SQL/MP and SQL Server
2008).
Other challenges in using transaction logs for change data capture include:
• Coordinating the reading of the transaction logs and the archiving of log files (database management software
typically archives log files off-line on a regular basis).
• Translation between physical storage formats that are recorded in the transaction logs and the logical formats
typically expected by database users (e.g., some transaction logs save only minimal buffer differences that are not
directly useful for change consumers).
• Dealing with changes to the format of the transaction logs between versions of the database management system.
• Eliminating uncommitted changes that the database wrote to the transaction log and later rolled back.
• Dealing with changes to the metadata of tables in the database.
CDC solutions based on transaction log files have distinct advantages that include:
• minimal impact on the database (even more so if one uses log shipping to process the logs on a dedicated host).
• no need for programmatic changes to the applications that use the database.
• low latency in acquiring changes.
• transactional integrity: log scanning can produce a change stream that replays the original transactions in the order
they were committed. Such a change stream includes changes made to all tables participating in the captured
transaction.
• no need to change the database schema
Several off-the-shelf products perform change data capture using database transaction log files. These include:
• Attunity Stream
• Centerprise Data Integrator from Astera
• DatabaseSync from WisdomForce
• Golden Gate Transactional Data Integration
• HVR from HVR Software
• DBMoto from HiT Software
• Shadowbase from Gravic
• IBM InfoSphere Change Data Capture (previously DataMirror Transformation Server)
• Informatica PowerExchange CDC Option (previously Striva)
• Oracle Streams [1]
• Oracle Data Guard [2]
• Replicate1 from Vision Solutions
• SharePlex from Quest Software
• FlexCDC, part of Flexviews for MySQL [3]
Functionality of CDC
Confounding factors
As often occurs in complex domains, the final solution to a CDC problem may have to balance many competing
concerns.
Push versus pull
CDC Tool Comparison
External links
• Best Practices for Change Data Capture and replication [4]
• IBM InfoSphere CDC [5]
• Tutorial on setting up CDC in Oracle 9i [6]
• Tutorial on setting up SQL Azure Change Data Capture [7]
• Details of the CDC facility included in Microsoft SQL Server 2008 Feb '08 CTP [8]
References
[1] Van de Wiel, Mark (September 2007). "Asynchronous Change Data Capture Cookbook" (http://www.oracle.com/technology/products/bi/db/10g/pdf/twp_cdc_cookbook_0206.pdf) (PDF). Oracle Corporation. p. 6. Retrieved 2009-02-04. "Oracle Streams provides the underlying infrastructure for this CDC method."
[2] Schupmann, Vivian; et al. (2008). "What's New in Oracle Data Guard?" (http://download.oracle.com/docs/cd/B19306_01/server.102/b14239/whatsnew.htm). Oracle Data Guard Concepts and Administration 10g Release 2 (10.2). Oracle Corporation. Retrieved 2009-02-04. "Data Guard enhancements in Oracle Enterprise Manager: [...] New support for Change Data Capture and Streams: [...] Distributed (heterogeneous) Asynchronous Change Data Capture"
[3] Swanhart, Justin (2011-02-01). "Flexviews Google Code Homepage" (http://flexvie.ws). Justin Swanhart. Retrieved 2011-02-24. "Flexviews includes FlexCDC, a change data capture utility for MySQL 5.1+"
[4] http://www.wisdomforce.com/resources/docs/databasesync/DatabaseSyncBestPracticesforTeradata.pdf
[5] http://www-01.ibm.com/software/data/infosphere/change-data-capture/
[6] http://www.oracle.com/technology/oramag/oracle/03-nov/o63tech_bi.html
[7] http://social.technet.microsoft.com/wiki/contents/articles/how-to-enable-sql-azure-change-data-capture.aspx
[8] http://msdn2.microsoft.com/en-us/library/bb522489(SQL.100).aspx
Chunked transfer encoding
Chunked transfer encoding is a data transfer mechanism in version 1.1 of the Hypertext Transfer Protocol (HTTP)
in which a web server serves content in a series of chunks. It uses the Transfer-Encoding HTTP response header in
place of the Content-Length header, which the protocol would otherwise require. Because the Content-Length header
is not used, the server does not need to know the length of the content before it starts transmitting a response to the
client (usually a web browser). Web servers can begin transmitting responses with dynamically-generated content
before knowing the total size of that content.
The size of each chunk is sent right before the chunk itself so that a client can tell when it has finished receiving data
for that chunk. The data transfer is terminated by a final chunk of length zero.
Rationale
The introduction of chunked encoding into HTTP 1.1 provided a number of benefits:
• Chunked transfer encoding allows a server to maintain an HTTP persistent connection for dynamically generated
content. Normally, persistent connections require the server to send a Content-Length field in the header before
starting to send the entity body, but for dynamically generated content this is usually not known before the
content is created.[1]
• Chunked encoding allows the sender to send additional header fields after the message body. This is important in
cases where values of a field cannot be known until the content has been produced, such as when the content of
the message must be digitally signed. Without chunked encoding, the sender would have to buffer the content
until it was complete in order to calculate a field value and send it before the content.
• HTTP servers sometimes use compression (gzip) or deflate methods to optimize transmission. Chunked transfer
encoding can be used to delimit parts of the compressed object. In this case the chunks are not individually
compressed. Instead, the complete payload is compressed and the output of the compression process is chunk
encoded. In the case of compression, chunked encoding has the benefit that the compression can be performed on
the fly while the data is delivered, as opposed to completing the compression process beforehand to determine the
final size.
Applicability
For version 1.1 of the HTTP protocol, the chunked transfer mechanism is considered to be always acceptable, even if
not listed in the TE request header field, and when used with other transfer mechanisms, should always be applied
last to the transferred data and never more than one time. This transfer coding method also allows additional entity
header fields to be sent after the last chunk if the client specified the "trailers" parameter as an argument of the TE
field. The origin server of the response can also decide to send additional entity trailers even if the client did not
specify the "trailers" option in the TE request field, but only if the metadata is optional (i.e. the client can use the
received entity without them). Whenever trailers are used, the server should list their names in the Trailer header
field; three header field types are specifically prohibited from appearing as a trailer field: Transfer-Encoding,
Content-Length and Trailer.
Format
If a Transfer-Encoding field with a value of chunked is specified in an HTTP message (either a request
sent by a client or the response from the server), the body of the message consists of an unspecified number of
chunks, a terminating last-chunk, an optional trailer of entity header fields, and a final CRLF sequence.
Each chunk starts with the number of octets of the data it embeds expressed in hexadecimal followed by optional
parameters (chunk extension) and a terminating CRLF (carriage return and line feed) sequence, followed by the
chunk data. The chunk is terminated by CRLF. If chunk extensions are provided, the chunk size is terminated by a
semicolon followed with the extension name and an optional equal sign and value.
The last chunk is a zero-length chunk, with the chunk size coded as 0, but without any chunk data section.
The final chunk may be followed by an optional trailer of additional entity header fields that are normally delivered
in the HTTP header to allow the delivery of data that can only be computed after all chunk data has been generated.
The sender may indicate in a Trailer header field which additional fields it will send in the trailer after the chunks.
Example
Encoded response
HTTP/1.1 200 OK
Content-Type: text/plain
Transfer-Encoding: chunked

25
This is the data in the first chunk
1C
and this is the second one
3
con
8
sequence
0
Anatomy of encoded response
The first two chunks in the above sample contain explicit \r\n characters in the chunk data.
"This is the data in the first chunk\r\n" (37 chars => hex: 0x25)
"and this is the second one\r\n" (28 chars => hex: 0x1C)
"con" (3 chars => hex: 0x03)
"sequence" (8 chars => hex: 0x08)
The response ends with a zero-length last chunk: "0\r\n" and the final "\r\n".
Decoded data
This is the data in the first chunk
and this is the second one
consequence
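As an illustration of the framing rules described above, a minimal chunked-body decoder is sketched below in Python.
It ignores chunk extensions and trailers for brevity, and the test input reproduces the encoded example from this
section (with the CRLFs written out explicitly).

def decode_chunked(body: bytes) -> bytes:
    # Decode an HTTP/1.1 chunked-encoded message body.
    decoded = b""
    pos = 0
    while True:
        # The chunk size is a hexadecimal number terminated by CRLF;
        # anything after ';' is an (ignored) chunk extension.
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end].split(b";")[0], 16)
        pos = line_end + 2
        if size == 0:
            break                      # zero-length chunk terminates the body
        decoded += body[pos:pos + size]
        pos += size + 2                # skip chunk data and its trailing CRLF
    return decoded

encoded = (b"25\r\nThis is the data in the first chunk\r\n\r\n"
           b"1C\r\nand this is the second one\r\n\r\n"
           b"3\r\ncon\r\n8\r\nsequence\r\n0\r\n\r\n")
print(decode_chunked(encoded).decode())  # prints the "Decoded data" shown above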
References
[1] Roy T. Fielding (10 Oct 1995). "Keep-Alive Notes" (http://ftp.ics.uci.edu/pub/ietf/http/hypermail/1995q4/0063.html). HTTP Working Group (HTTP-WG) mailing list.
• See RFC 2616 for further details; section 3.6.1 (http://tools.ietf.org/html/rfc2616#section-3.6.1) in particular.
Client-side persistent data
Client-side persistent data or CSPD is a term used in computing for storing data required by web applications to
complete internet tasks on the client-side as needed rather than exclusively on the server. As a framework it is one
solution to the needs of Occasionally Connected Computing or OCC.
A major challenge for HTTP as a stateless protocol has been asynchronous tasks. The AJAX pattern using
XMLHttpRequest was first introduced by Microsoft in the context of the Outlook e-mail product.
The first CSPDs were the 'cookies' introduced by Netscape Navigator. ActiveX components which have entries in
the Windows registry can also be viewed as a form of client-side persistence.
External links
• CSPD [1]
• Safari [2] preview
• Netscape [3] on persistent client state
References
[1] http://www.curl.com/developer/faq/cspd/
[2] http://safari.ciscopress.com/0596101996/jscript5-CHP-19-SECT-6
[3] http://wp.netscape.com/newsref/std/cookie_spec.html
Clone (database)
A database clone is a complete and separate copy of a database system that includes the business data, the DBMS
software and any other application tiers that make up the environment. Cloning is a different kind of operation from
replication and backup in that the cloned environment is both fully functional and separate in its own right.
Additionally, the cloned environment may be modified at its inception due to configuration changes or data
subsetting.
Cloning can also refer to replicating a server in order to have a backup or to upgrade the environment.
Cognos Reportnet
Cognos ReportNet
Original author(s): Cognos, an IBM Company
Initial release: September 2003
Stable release: Cognos ReportNet 1.3
Development status: Active
Operating system: Multiple
Platform: Multiple
Available in: Multi-lingual
Type: Business Intelligence
Website: IBM Cognos ReportNet [1]
Cognos ReportNet (CRN) is a web-based software product for creating and managing ad hoc and custom-made reports. ReportNet is developed by the Ottawa, Canada-based business intelligence (BI) and performance management solutions company Cognos (formerly Cognos Incorporated), an IBM company. The web-based reporting tool was launched in September 2003. Since IBM's acquisition of Cognos, ReportNet, like all other Cognos products, has been renamed IBM Cognos ReportNet.
ReportNet uses web services standards such as XML and the Simple Object Access Protocol, and also supports dynamic HTML and Java [2]. ReportNet is compatible with multiple databases, including Oracle, SAP, Teradata, Microsoft SQL Server, DB2 and Sybase [3] [4]. The product provides an interface in over 10 languages, [5] has a Web Services architecture to meet the needs of multi-national, diversified enterprises, and helps reduce total cost of ownership.
Multiple versions of Cognos ReportNet have since been released by the company. Cognos ReportNet won the Software and Information Industry Association (SIIA) 2005 Codie Award in the "Best Business Intelligence or Knowledge Management Solution" category [6]. CRN's capabilities have been carried forward into Cognos 8, the latest reporting tool. [7]
CRN comes with its own software development kit (SDK).
Launch
Early adopters of Cognos ReportNet for their corporate reporting needs included Bear Stearns, BMW and Alfred Publishing. Around the time of the launch, Cognos competitor Business Objects released version 6.1 of its enterprise reporting tool. Cognos ReportNet was successful from its launch, raising revenues in 2004 from licensing fees. [8] Subsequently, other major corporations such as McDonald's adopted Cognos ReportNet. [9]
Performance
Cognos rival Business Objects announced in 2005 that BusinessObjects XI significantly outperformed Cognos ReportNet in benchmark tests conducted by VeriTest, an independent software testing firm. The tests showed Cognos ReportNet performing poorly when processing styled reports, complex business reports and combinations of both [10]. The tests reported 21 times higher report throughput for BusinessObjects XI than for Cognos ReportNet at capacity loads [11]. Cognos soon dismissed the claims, stating that Business Objects had dictated the environment and testing criteria and that Cognos had not provided the software to participate in the benchmark test [12]. Cognos later performed its own tests to demonstrate Cognos ReportNet's capabilities [13].
Components
• Cognos Report Studio – a web-based product for creating complex, professional-looking reports. [14]
• Cognos Query Studio – a web-based product for creating ad hoc reports. [15]
• Cognos Framework Manager – a metadata modeling tool used to create BI metadata for reporting and dashboard applications. [16]
• Cognos Connection – the main portal used to access reports, schedule reports and perform administrative activities. [17]
Versions
• Cognos ReportNet 1.1 – Java EE style professional web-based authoring tool (base version).
• Cognos ReportNet IBM Special Edition – comes with an embedded version of IBM WebSphere as its application server and IBM DB2 as its data store. [18]
• Cognos Linux – for Intel-based Linux platforms. [19]
References
[1] http://www.cognos.com/products/business_intelligence/reporting/index.html
[2] Cognos ReportNet in news (http://www.vnunet.com/vnunet/news/2123232/bear-sterns-chooses-cognos-reportnet)
[3] Data sources (http://www.cognos.com/solutions/data/ibm/advantages.html)
[4] CRN Environment details (http://support.cognos.com/en/support/products/crn101_software_environments.html)
[5] CRN Features (http://www.cognos.com/products/business_intelligence/reporting/features.html)
[6] Cognos ReportNet wins award (http://www.mywire.com/pubs/PRNewswire/2005/06/08/885642?extID=10051)
[7] Cognos8 (http://www.cognos.com/products/cognos8businessintelligence)
[8] Cognos ReportNet delivers $30 Million in License Revenue in one Quarter (http://www.highbeam.com/doc/1G1-131525446.html)
[9] ReportNet and fries (http://www.ebizq.net/news/5538.html)
[10] BO XI Vs Cognos ReportNet (http://www.crm2day.com/news/crm/114773.php)
[11] BO XI outperforms Cognos ReportNet (http://goliath.ecnext.com/coms2/summary_0199-4404821_ITM)
[12] Cognos dismisses the Test results (http://www.cognos.com/news/releases/2005/0624_3.html)
[13] Cognos scalability results (http://www.cognos.com/pdfs/whitepapers/wp_cognos_reportnet_scalability_benchmakrs_ms_windows.pdf)
[14] Refer definition in introduction page (http://web.princeton.edu/sites/datamall/documents/ug_cr_rptstd.pdf)
[15] Refer Introduction page (http://web.princeton.edu/sites/datamall/documents/ug_cr_qstd.pdf)
[16] Framework Manager Services (http://www.cognos.com/products/framework_services)
[17] Refer page 9 (http://web.princeton.edu/sites/datamall/documents/ug_cr_qstd.pdf)
[18] IBM and Cognos join hands (http://www.cbronline.com/article_news.asp?guid=C20418EA-2AAF-46B1-9E91-C59ACEB1E038)
[19] ReportNet on Linux (http://www.ebizq.net/news/5688.html)
Commit (data management)
In computer science and data management, commit refers to the idea of making a set of tentative changes permanent, most commonly at the end of a transaction. A commit is the act of committing.
Data management
A COMMIT statement in SQL ends a transaction within a relational database management system (RDBMS) and
makes all changes visible to other users. The general format is to issue a BEGIN WORK statement, one or more
SQL statements, and then the COMMIT statement. Alternatively, a ROLLBACK statement can be issued, which
undoes all the work performed since BEGIN WORK was issued. A COMMIT statement will also release any
existing savepoints that may be in use.
In terms of transactions, the opposite of commit is to discard the tentative changes of a transaction, a rollback.
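As an illustration of the BEGIN/COMMIT/ROLLBACK pattern described above, the following Python sketch uses the standard sqlite3 module; the in-memory database and the table are illustrative assumptions, and exact SQL syntax (e.g., BEGIN WORK) varies by RDBMS.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()                      # make the initial setup permanent

try:
    # Tentative changes: invisible to other connections until COMMIT is issued.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()                  # COMMIT: both updates become permanent together
except sqlite3.Error:
    conn.rollback()                # ROLLBACK: discard all work since the transaction began

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 70, 'bob': 80}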
Revision control
Commits are also used in revision control systems for source code, such as Subversion or the Concurrent Versions System (CVS). A commit in the context of these version control systems refers to submitting the latest changes of the source code to the repository, making these changes part of the head revision of the repository. Thus, when other users do an update or a checkout from the repository, they will receive the latest committed version, unless they specify that they wish to retrieve a previous version of the source code from the repository. Version control systems also offer functionality similar to that of SQL databases in that they allow rolling back to previous versions easily. In this context, a commit in a version control system is less risky, as it allows an easy rollback even after the commit has been done.
Commitment ordering
In concurrency control of databases, transaction processing (transaction management), and related applications, Commitment ordering (or Commit ordering; CO; Raz 1990, 1992, 1994, 2009) is a class of interoperable Serializability techniques, both centralized and distributed. It allows optimistic (non-blocking) implementations. CO is also the name of the resulting transaction schedule (history) property, which was defined earlier (1988; CO was discovered independently) under the name Dynamic atomicity.[1] In a CO compliant schedule the chronological order of commitment events of transactions is compatible with the precedence order of the respective transactions. CO is a broad special case of Conflict serializability, and an effective means (reliable, high-performance, distributed, and scalable) to achieve Global serializability (Modular serializability) across any collection of database systems that possibly use different concurrency control mechanisms (CO also makes each system serializability compliant, if it is not already).
Achieving this had been characterized as an open problem until the public disclosure of CO in 1991 by its inventor Yoav Raz (see Global serializability). Each non-CO-compliant database system is augmented with a CO component (the Commitment Order Coordinator - COCO), which orders the commitment events for CO compliance, with neither data-access nor any other transaction operation interference. As such, CO provides a low-overhead, general solution
for global serializability (and distributed serializability), instrumental for Global concurrency control (and
Distributed concurrency control) of multi database systems and other transactional objects, possibly highly
distributed (e.g., within Cloud computing, Grid computing, and networks of smartphones). An Atomic commitment
protocol (ACP; of any type) is a fundamental part of the solution, utilized to break global cycles in the conflict
(precedence, serializability) graph. CO is the most general property (a necessary condition) that guarantees global
serializability, if the database systems involved do not share concurrency control information beyond atomic
commitment protocol (unmodified) messages, and have no knowledge whether transactions are global or local (the
database systems are autonomous). Thus CO (with its variants) is the only general technique that does not require the
typically costly distribution of local concurrency control information (e.g., local precedence relations, locks,
timestamps, or tickets). It generalizes the popular Strong strict two-phase locking (SS2PL) property, which in
conjunction with the Two-phase commit protocol (2PC) is the de facto standard to achieve global serializability
across (SS2PL based) database systems. As a result CO compliant database systems (with any, different concurrency
control types) can transparently join such SS2PL based solutions for global serializability.
In addition, locking based global deadlocks are resolved automatically in a CO based multi-database environment,
an important side-benefit (including the special case of a completely SS2PL based environment; a previously
unnoticed fact for SS2PL).
Furthermore, Strict commitment ordering (SCO; Raz 1991c), the intersection of Strictness and CO, provides better
performance (shorter average transaction completion time and resulting better transaction throughput) than SS2PL
whenever read-write conflicts are present (identical blocking behavior for write-read and write-write conflicts;
comparable locking overhead). The advantage of SCO is especially significant during lock contention. Strictness
allows both SS2PL and SCO to use the same effective database recovery mechanisms.
Two major generalizing variants of CO exist, Extended CO (ECO; Raz 1993a) and Multi-version CO (MVCO;
Raz 1993b). They as well provide global serializability without local concurrency control information distribution,
can be combined with any relevant concurrency control, and allow optimistic (non-blocking) implementations. Both
use additional information for relaxing CO constraints and achieving better concurrency and performance. Vote
ordering (VO or Generalized CO (GCO); Raz 2009) is a container schedule set (property) and technique for CO and
all its variants. Local VO is a necessary condition for guaranteeing Global serializability, if the Atomic commitment
protocol (ACP) participants do not share concurrency control information (have the Generalized autonomy
property). CO and its variants inter-operate transparently, guaranteeing global serializability and automatic global
deadlock resolution also together in a mixed, heterogeneous environment with different variants.
Comments:
1. This article utilizes concepts and terminology introduced in the article Serializability.
2. For CO's evolution and utilization, see The History of Commitment Ordering and Global serializability.
3. Several books that have appeared since 2010, with "Commitment ordering" in their titles, are collections of
related Wikipedia articles.
Overview
The Commitment ordering (CO; Raz 1990, 1992, 1994, 2009) schedule property has also been referred to as Dynamic atomicity (since 1988[1]), commit ordering, commit order serializability, and strong recoverability (since
1991; see The History of Commitment Ordering and the references there). The latter is a misleading name since CO
is incomparable with recoverability, and the term "strong" implies a special case. This means that a schedule with a
strong recoverability property does not necessarily have the CO property, and vice versa.
In 2009 CO was characterized as a major concurrency control method, together with the three previously known (since the 1980s) major methods: Locking, Time-stamp ordering, and Serialization graph testing, and as an enabler for the interoperability of systems using different concurrency control mechanisms.[2]
In a federated database system or any other more loosely defined multidatabase system, which are typically
distributed in a communication network, transactions span multiple (and possibly distributed) databases. Enforcing
global serializability in such a system is problematic. Even if every local schedule of a single database is serializable,
still, the global schedule of a whole system is not necessarily serializable. The massive communication exchanges of
conflict information needed between databases to reach conflict serializability would lead to unacceptable
performance, primarily due to computer and communication latency. The problem of achieving global serializability
effectively had been characterized as open until the end of 1991 (see Global serializability).
Enforcing CO is an effective way to enforce conflict serializability globally in a distributed system, since enforcing
CO locally in each database (or other transactional object) also enforces it globally. Each database may use any,
possibly different, type of concurrency control mechanism. With a local mechanism that already provides conflict
serializability, enforcing CO locally does not cause any additional aborts, since enforcing CO locally does not affect
the data access scheduling strategy of the mechanism (this scheduling determines the serializability related aborts;
such a mechanism typically does not consider the commitment events or their order). The CO solution requires no
communication overhead, since it uses (unmodified) atomic commitment protocol messages only, already needed by
each distributed transaction to reach atomicity. An atomic commitment protocol plays a central role in the distributed
CO algorithm, which enforces CO globally, by breaking global cycles (cycles that span two or more databases) in
the global conflict graph. CO, its special cases, and its generalizations are interoperable, and achieve global
serializability while transparently being utilized together in a single heterogeneous distributed environment
comprising objects with possibly different concurrency control mechanisms. As such, Commitment ordering,
including its special cases, and together with its generalizations (see CO variants below), provides a general, high
performance, fully distributed solution (no central processing component or central data structure are needed) for
guaranteeing global serializability in heterogeneous environments of multidatabase systems and other multiple
transactional objects (objects with states accessed and modified only by transactions; e.g., in the framework of
transactional processes, and within Cloud computing and Grid computing). The CO solution scales up with network
size and the number of databases without any negative impact on performance (assuming the statistics of a single
distributed transaction, e.g., the average number of databases involved with a single transaction, are unchanged).
Optimistic CO (OCO) has been also increasingly utilized to achieve serializability in Software transactional memory
(STM; a paradigm in Concurrent computing), and tens of STM articles and patents utilizing "commit order" have already been published (e.g., Zhang et al. 2006[3]).
The commitment ordering solution for global serializability
General characterization of CO
Commitment ordering (CO) is a special case of conflict serializability. CO can be enforced with non-blocking
mechanisms (each transaction can complete its task without having its data-access blocked, which allows optimistic
concurrency control; however, commitment could be blocked). In a CO schedule the commitment events' (partial)
precedence order of the transactions corresponds to the precedence (partial) order of the respective transactions in the
(directed) conflict graph (precedence graph, serializability graph), as induced by their conflicting access operations
(usually read and write (insert/modify/delete) operations; CO also applies to higher level operations, where they are
conflicting if noncommutative, as well as to conflicts between operations upon multi-version data).
• Definition - Commitment ordering
Let T1, T2 be two committed transactions in a schedule, such that T1 is in a conflict with T2 (T1 precedes T2). The schedule has the Commitment ordering (CO) property if, for every two such transactions, T1 commits before T2 commits.
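To make the definition concrete, the following illustrative Python sketch (not from the source) checks whether a given commit order is CO compliant with respect to a set of conflicts; the transaction names and the conflict relation are hypothetical inputs.

def is_commitment_ordered(commit_order, conflicts):
    # commit_order: committed transactions listed in their commit order.
    # conflicts: set of pairs (t1, t2) meaning t1 is in conflict with and precedes t2.
    # Returns True iff every conflicting pair commits in precedence order (CO).
    position = {t: i for i, t in enumerate(commit_order)}
    return all(position[t1] < position[t2]
               for t1, t2 in conflicts
               if t1 in position and t2 in position)

# T1 precedes T2, so CO requires T1 to commit first.
print(is_commitment_ordered(["T1", "T2"], {("T1", "T2")}))  # True
print(is_commitment_ordered(["T2", "T1"], {("T1", "T2")}))  # False: violates CO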
The commitment decision events are generated by either a local commitment mechanism, or an atomic commitment
protocol, if different processes need to reach consensus on whether to commit or abort. The protocol may be
distributed or centralized. Transactions may be committed concurrently, if the commit partial order allows (if they do
not have conflicting operations). If different conflicting operations induce different partial orders of same
transactions, then the conflict graph has cycles, and the schedule will violate serializability when all the transactions
on a cycle are committed. In this case no partial order for commitment events can be found. Thus, cycles in the
conflict graph need to be broken by aborting transactions. However, any conflict serializable schedule can be made
CO without aborting any transaction, by properly delaying commit events to comply with the transactions'
precedence partial order.
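The observation that any conflict serializable schedule can be made CO by delaying commit events can be illustrated with a topological sort of the (acyclic) conflict graph. The Python sketch below is illustrative only, assuming the conflict edges are known.

from graphlib import TopologicalSorter  # Python 3.9+

def co_commit_order(transactions, conflict_edges):
    # conflict_edges: set of (t1, t2) with t1 preceding t2; acyclic when the
    # schedule is conflict serializable. Returns a commit order compatible
    # with the precedence partial order, i.e., a CO-compliant commit order.
    ts = TopologicalSorter({t: set() for t in transactions})
    for t1, t2 in conflict_edges:
        ts.add(t2, t1)              # t1 must commit before t2
    return list(ts.static_order())

print(co_commit_order({"T1", "T2", "T3"}, {("T1", "T2"), ("T1", "T3")}))
# e.g. ['T1', 'T3', 'T2']: T1 commits before both of its successors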
CO enforcement by itself is not sufficient as a concurrency control mechanism, since CO lacks the recoverability
property, which should be supported as well.
The distributed CO algorithm
A fully distributed Global commitment ordering enforcement algorithm exists that uses the local CO of each participating database, and needs only (unmodified) Atomic commitment protocol messages with no further communication. The distributed algorithm is the combination of local (to each database) CO algorithm processes and an atomic commitment protocol (which can be fully distributed). An atomic commitment protocol is essential to
enforce atomicity of each distributed transaction (to decide whether to commit or abort it; this procedure is always
carried out for distributed transactions, independently of concurrency control and CO). A common example of an
atomic commitment protocol is the two-phase commit protocol, which is resilient to many types of system failure. In
a reliable environment, or when processes usually fail together (e.g., in the same integrated circuit), a simpler
protocol for atomic commitment may be used (e.g., a simple handshake of distributed transaction's participating
processes with some arbitrary but known special participant, the transaction's coordinator, i.e., a type of one-phase
commit protocol). An atomic commitment protocol reaches consensus among participants on whether to commit or
abort a distributed (global) transaction that spans these participants. An essential stage in each such protocol is the
YES vote (either explicit, or implicit) by each participant, which means an obligation of the voting participant to
obey the decision of the protocol, either commit or abort. Otherwise a participant can unilaterally abort the
transaction by an explicit NO vote. The protocol commits the transaction only if YES votes have been received from
all participants, and thus typically a missing YES vote of a participant is considered a NO vote by this participant.
Otherwise the protocol aborts the transaction. The various atomic commit protocols only differ in their abilities to
handle different computing environment failure situations, and the amounts of work and other computing resources
needed in different situations.
The entire CO solution for global serializability is based on the fact that in case of a missing vote for a distributed
transaction, the atomic commitment protocol eventually aborts this transaction.
Enforcing global CO
In each database system a local CO algorithm determines the needed commitment order for that database. By the
characterization of CO above, this order depends on the local precedence order of transactions, which results from
the local data access scheduling mechanisms. Accordingly YES votes in the atomic commitment protocol are
scheduled for each (unaborted) distributed transaction (in what follows "a vote" means a YES vote). If a precedence
relation (conflict) exists between two transactions, then the second will not be voted on before the first is completed
(either committed or aborted), to prevent possible commit order violation by the atomic commitment protocol. Such
can happen since the commit order by the protocol is not necessarily the same as the voting order. If no precedence
relation exists, both can be voted on concurrently. This vote ordering strategy ensures that also the atomic
commitment protocol maintains commitment order, and it is a necessary condition for guaranteeing Global CO (and
the local CO of a database; without it both Global CO and Local CO (a property meaning that each database is CO
compliant) may be violated).
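The vote ordering strategy just described can be sketched as a small gate that a participating database consults before sending its YES vote to the atomic commitment protocol. The Python sketch below is illustrative only; the data structures and names are assumptions, not an actual protocol implementation.

def may_vote_yes(txn, local_preceders, decided):
    # local_preceders: dict mapping a transaction to the set of transactions
    #     that precede it locally (conflict edges into the transaction).
    # decided: set of transactions that have already ended (committed or aborted).
    # A YES vote on txn is delayed while any preceding conflicting transaction
    # is still undecided, so the commit order produced by the atomic commitment
    # protocol cannot violate the local precedence order.
    return all(pred in decided for pred in local_preceders.get(txn, set()))

graph = {"T2": {"T1"}}                              # T1 precedes T2 locally
print(may_vote_yes("T2", graph, decided=set()))     # False: vote on T2 is delayed
print(may_vote_yes("T2", graph, decided={"T1"}))    # True: T1 ended, T2 may be voted on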
However, since database systems schedule their transactions independently, it is possible that the transactions'
precedence orders in two databases or more are not compatible (no global partial order exists that can embed the
respective local partial orders together). With CO precedence orders are also the commitment orders. When
participating databases in a same distributed transaction do not have compatible local precedence orders for that
transaction (without "knowing" it; typically no coordination between database systems exists on conflicts, since the
needed communication is massive and unacceptably degrades performance) it means that the transaction resides on a
global cycle (involving two or more databases) in the global conflict graph. In this case the atomic commitment
protocol will fail to collect all the votes needed to commit that transaction: By the vote ordering strategy above at
least one database will delay its vote for that transaction indefinitely, to comply with its own commitment
(precedence) order, since it will be waiting to the completion of another, preceding transaction on that global cycle,
delayed indefinitely by another database with a different order. This means a voting-deadlock situation involving the
databases on that cycle. As a result the protocol will eventually abort some deadlocked transaction on this global
cycle, since each such transaction is missing at least one participant's vote. Selection of the specific transaction on
the cycle to be aborted depends on the atomic commitment protocol's abort policies (a timeout mechanism is
common, but it may result in more than one needed abort per cycle; both preventing unnecessary aborts and abort
time shortening can be achieved by a dedicated abort mechanism for CO). Such abort will break the global cycle
involving that distributed transaction. Both the deadlocked transactions and possibly others in conflict with the deadlocked ones (and thus blocked) will then be free to be voted on. It is worth noting that each database involved with
the voting-deadlock continues to vote regularly on transactions that are not in conflict with its deadlocked
transaction, typically almost all the outstanding transactions. Thus, in case of incompatible local (partial)
commitment orders, no action is needed since the atomic commitment protocol resolves it automatically by aborting
a transaction that is a cause of incompatibility. This means that the above vote ordering strategy is also a sufficient
condition for guaranteeing Global CO.
The following is concluded:
• The Vote ordering strategy for Global CO Enforcing Theorem
Let T1, T2 be undecided (neither committed nor aborted) transactions in a database system that enforces CO for local transactions, such that T1 is global and in conflict with T2 (T1 precedes T2). Then, having T1 ended (either committed or aborted) before T2 is voted on to be committed (the vote ordering strategy), in each such database system in a multidatabase environment, is a necessary and sufficient condition for guaranteeing Global CO (the condition guarantees Global CO, which may be violated without it).
Comments:
1. The vote ordering strategy that enforces global CO is referred to as in (Raz 1992).
2. The Local CO property of a global schedule means that each database is CO compliant. From the necessity
discussion part above it directly follows that the theorem is true also when replacing "Global CO" with "Local
CO" when global transactions are present. Together it means that Global CO is guaranteed if and only if Local
CO is guaranteed (which is untrue for Global conflict serializability and Local conflict serializability: Global
implies Local, but not the opposite).
Global CO implies Global serializability.
The Global CO algorithm comprises enforcing (local) CO in each participating database system by ordering
commits of local transactions (see Enforcing CO locally below) and enforcing the vote ordering strategy in the
theorem above (for global transactions).
Exact characterization of voting-deadlocks by global cycles
The above global cycle elimination process by a voting deadlock can be explained in detail by the following
observation:
First it is assumed, for simplicity, that every transaction reaches the ready-to-commit state and is voted on by at least
one database (this implies that no blocking by locks occurs). Define a "wait for vote to commit" graph as a directed
graph with transactions as nodes, and a directed edge from any first transaction to a second transaction if the first
transaction blocks the vote to commit of the second transaction (opposite to conventional edge direction in a wait-for
graph). Such blocking happens only if the second transaction is in a conflict with the first transaction (see above).
Thus this "wait for vote to commit" graph is identical to the global conflict graph (has exactly the same definition -
see in Serializability). A cycle in the "wait for vote to commit" graph means a deadlock in voting. Hence there is a
deadlock in voting if and only if there is a cycle in the conflict graph. Local cycles (confined to a single database) are
eliminated by the local serializability mechanisms. Consequently only global cycles are left, which are then
eliminated by the atomic commitment protocol when it aborts deadlocked transactions with missing (blocked)
respective votes.
Secondly, local commits are also dealt with: note that when enforcing CO, waiting for a regular local commit of a local transaction can likewise block local commits and votes of other transactions upon conflicts, and the situation for global transactions does not change even without the simplifying assumption above: the final result is the same also with local commitment for local transactions, without voting in atomic commitment for them.
Finally, blocking by a lock (which has been excluded so far) needs to be considered: a lock blocks a conflicting operation and prevents a conflict from being materialized. If the lock is released only after transaction end, it may indirectly block either a vote or a local commit of another transaction (which now cannot reach the ready state), with the same effect as a direct blocking of a vote or a local commit. In this case a cycle is generated in the conflict graph only if such blocking by a lock is also represented by an edge. With such added edges representing blocking-by-a-lock events, the conflict graph becomes an augmented conflict graph.
• Definition - Augmented conflict graph
An augmented conflict graph is a conflict graph with added edges: In addition to the original edges, a directed edge exists from transaction T1 to transaction T2 if two conditions are met:
1. T2 is blocked by a data-access lock applied by T1 (the blocking prevents the conflict of T2 with T1 from being materialized and having an edge in the regular conflict graph), and
2. This blocking will not stop before T1 ends (commits or aborts; true for any locking-based CO).
The graph can also be defined as the union of the (regular) conflict graph with the (reversed edge, regular) wait-for graph.
Comments:
1. Here, unlike the regular conflict graph, which has edges only for materialized conflicts, all conflicts, both
materialized and non-materialized, are represented by edges.
2. Note that all the new edges are all the (reversed to the conventional) edges of the wait-for graph. The wait-for
graph can be defined also as the graph of non-materialized conflicts. By the common conventions edge direction
in a conflict graph defines time order between conflicting operations which is opposite to the time order defined
by an edge in a wait-for graph.
3. Note that such a global graph contains (has embedded) all the (reversed-edge) regular local wait-for graphs, and may also include locking-based global cycles (which cannot exist in the local graphs). For example, if all the
databases on a global cycle are SS2PL based, then all the related vote blocking situations are caused by locks (this
is the classical, and probably the only global deadlock situation dealt with in the database research literature).
This is a global deadlock case where each related database creates a portion of the cycle, but the complete cycle
does not reside in any local wait-for graph.
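The definition above (augmented conflict graph = conflict graph plus reversed wait-for edges) can be expressed directly. The Python sketch below is illustrative only; edges are pairs (t1, t2) following the conflict-graph convention that t1 precedes (or blocks) t2.

def augmented_conflict_graph(conflict_edges, wait_for_edges):
    # conflict_edges: set of (t1, t2) materialized conflicts, t1 precedes t2.
    # wait_for_edges: set of (waiter, holder) - waiter is blocked by holder's lock.
    # Returns the augmented graph: materialized conflicts plus the reversed
    # wait-for edges (holder -> waiter), i.e., the non-materialized conflicts.
    reversed_waits = {(holder, waiter) for waiter, holder in wait_for_edges}
    return conflict_edges | reversed_waits

def has_cycle(edges):
    # Simple depth-first cycle check over the edge set.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    def visit(node):
        color[node] = GREY
        for nxt in graph.get(node, ()):
            c = color.get(nxt, WHITE)
            if c == GREY or (c == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False
    return any(color.get(n, WHITE) == WHITE and visit(n) for n in graph)

# Two non-materialized conflicts that form a global cycle, i.e., a
# locking-based global deadlock (compare the distributed SS2PL example below).
g = augmented_conflict_graph(set(), {("T2", "T1"), ("T1", "T2")})
print(has_cycle(g))   # True: a voting-deadlock / global deadlock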
In the presence of CO the augmented conflict graph is in fact a (reversed edge) local-commit and voting wait-for
graph: An edge exists from a first transaction, either local or global, to a second, if the second is waiting for the first
to end in order to be either voted on (if global), or locally committed (if local). All global cycles (across two or more
databases) in this graph generate voting-deadlocks. The graph's global cycles provide complete characterization for
voting deadlocks and may include any combination of materialized and non-materialized conflicts. Only cycles of
(only) materialized conflicts are also cycles of the regular conflict graph and affect serializability. One or more (lock
related) non-materialized conflicts on a cycle prevent it from being a cycle in the regular conflict graph, and make it
a locking related deadlock. All the global cycles (voting-deadlocks) need to be broken (resolved) to both maintain
global serializability and resolve global deadlocks involving data access locking, and indeed they are all broken by
the atomic commitment protocol due to missing votes upon a voting deadlock.
Comment: This observation also explains the correctness of Extended CO (ECO) below: Global transactions' voting
order must follow the conflict graph order with vote blocking when order relation (graph path) exists between two
global transactions. Local transactions are not voted on, and their (local) commits are not blocked upon conflicts.
This results in same voting-deadlock situations and resulting global cycle elimination process for ECO.
The voting-deadlock situation can be summarized as follows:
• The CO Voting-Deadlock Theorem
Let a multidatabase environment comprise CO compliant (which eliminates local cycles) database systems that
enforce, each, Global CO (using the condition in the theorem above). Then a voting-deadlock occurs if and
only if a global cycle (spans two or more databases) exists in the Global augmented conflict graph (also
blocking by a data-access lock is represented by an edge). If the cycle does not break by any abort, then all the
global transactions on it are involved with the respective voting-deadlock, and eventually each has its vote
blocked (either directly, or indirectly by a data-access lock); if a local transaction resides on the cycle,
eventually it has its (local) commit blocked.
Comment: A rare situation of a voting deadlock (by missing blocked votes) can happen, with no voting for
any transaction on the related cycle by any of the database systems involved with these transactions. This can
occur when local sub-transactions are multi-threaded. The highest probability instance of such rare event
involves two transactions on two simultaneous opposite cycles. Such global cycles (deadlocks) overlap with
local cycles which are resolved locally, and thus typically resolved by local mechanisms without involving
atomic commitment. Formally it is also a global cycle, but practically it is local (portions of local cycles
generate a global one; to see this, split each global transaction (node) to local sub-transactions (its portions
confined each to a single database); a directed edge exists between transactions if an edge exists between any
respective local sub-transactions; a cycle is local if all its edges originate from a cycle among sub-transactions
of the same database, and global if not; global and local can overlap: a same cycle among transactions can
result from several different cycles among sub-transactions, and be both local and global).
Also the following locking based special case is concluded:
• The CO Locking-based Global-Deadlock Theorem
In a CO compliant multidatabase system a locking-based global-deadlock, involving at least one data-access
lock (non-materialized conflict), and two or more database systems, is a reflection of a global cycle in the
Global augmented conflict graph, which results in a voting-deadlock. Such cycle is not a cycle in the (regular)
Global conflict graph (which reflects only materialized conflicts, and thus such cycle does not affect
serializability).
Comments:
1. Any blocking (edge) in the cycle that is not by a data-access lock is a direct blocking of either voting or local
commit. All voting-deadlocks are resolved (almost all by Atomic commitment; see comment above), including this
locking-based type.
2. Locking-based global-deadlocks can be generated also in a completely-SS2PL based distributed environment
(special case of CO based), where all the vote blocking (and voting-deadlocks) are caused by data-access locks.
Many research articles have dealt for years with resolving such global deadlocks, but none (except the CO
articles) is known (as of 2009) to notice that atomic commitment automatically resolves them. Such automatic
resolutions are regularly occurring unnoticed in all existing SS2PL based multidatabase systems, often bypassing
dedicated resolution mechanisms.
Voting-deadlocks are the key for the operation of distributed CO. See Examples below.
Global cycle elimination (here voting-deadlock resolution by atomic commitment) and resulting aborted transactions'
re-executions are time consuming, regardless of concurrency control used. If databases schedule transactions
independently, global cycles are unavoidable (in a complete analogy to cycles/deadlocks generated in local SS2PL;
with distribution, any transaction or operation scheduling coordination results in autonomy violation, and typically
also in substantial performance penalty). However, in many cases their likelihood can be made very low by
implementing database and transaction design guidelines that reduce the number of conflicts involving a global
transaction. This is achieved primarily by properly handling hot spots (database objects with frequent access), and by avoiding conflicts through the use of commutativity when possible (e.g., when extensively using counters, as in finance, and especially multi-transaction accumulation counters, which are typically hot spots).
Atomic commitment protocols are intended and designed to achieve atomicity without considering database
concurrency control. They abort upon detecting or heuristically finding (e.g., by timeout; sometimes mistakenly,
unnecessarily) missing votes, and are typically unaware of global cycles. These protocols can be specially enhanced for
CO (including CO's variants below) both to prevent unnecessary aborts, and to accelerate aborts used for breaking
global cycles in the global augmented conflict graph (for better performance by earlier release upon transaction-end
of computing resources and typically locked data). For example, existing locking based global deadlock detection
methods, other than timeout, can be generalized to consider also local commit and vote direct blocking, besides data
access blocking. A possible compromise in such mechanisms is effectively detecting and breaking the most frequent
and relatively simple to handle length-2 global cycles, and using timeout for undetected, much less frequent, longer
cycles.
Enforcing CO locally
Commitment ordering can be enforced locally (in a single database) by a dedicated CO algorithm, or by any
algorithm/protocol that provides any special case of CO. An important such protocol, being utilized extensively in
database systems, which generates a CO schedule, is the strong strict two phase locking protocol (SS2PL: "release
transaction's locks only after the transaction has been either committed or aborted"; see below). SS2PL is a proper
subset of the intersection of 2PL and strictness.
A generic local CO algorithm
A generic local CO algorithm is an algorithm independent of implementation details, that enforces exactly the CO
property. It does not block data access (nonblocking), and consists of aborting transactions (only if needed) before
committing a transaction. It aborts a (uniquely determined at any given time) minimal set of other undecided (neither
committed, nor aborted) transactions that run locally and can cause serializability violation in the future (can later
generate cycles of committed transactions in the conflict graph). This set consists of all undecided transactions with
directed edges in the conflict graph to the transaction to be committed. The size of this set cannot increase when that
transaction is waiting to be committed (in ready state: processing has ended), and typically decreases in time as its
transactions are being decided. Thus, unless real-time constraints exist to complete that transaction, it is preferred to
wait with committing that transaction and let this set decrease in size. If another serializability mechanism exists
locally (which eliminates cycles in the local conflict graph), or if no cycle involving that transaction exists, the set
will be empty eventually, and no abort of set member is needed. Otherwise the set will stabilize with transactions on
local cycles, and aborting set members will have to occur to break the cycles. Since in the case of CO conflicts
generate blocking on commit, local cycles in the augmented conflict graph (see above) indicate local
commit-deadlocks, and deadlock resolution techniques as in SS2PL can be used (e.g., like timeout and wait-for
graph). A local cycle in the augmented conflict graph with at least one non-materialized conflict reflects a
locking-based deadlock. The local algorithm above, applied to the local augmented conflict graph rather than the
regular local conflict graph, comprises the generic enhanced local CO algorithm, a single local cycle elimination
mechanism, for both guaranteeing local serializability and handling locking based local deadlocks. Practically an
additional concurrency control mechanism is always utilized, even solely to enforce recoverability. The generic CO
algorithm does not affect the local data access scheduling strategy when it runs alongside any other local concurrency
control mechanism. It affects only the commit order, and for this reason it does not need to abort more transactions
than those needed to be aborted for serializability violation prevention by any combined local concurrency control
mechanism. The net effect of CO may be, at most, a delay of commit events (or voting in a distributed environment),
to comply with the needed commit order (but not more delay than its special cases, for example, SS2PL, and on the
average significantly less).
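A minimal sketch of the abort-set computation described above, in Python (illustrative only; the conflict-graph representation and names are assumptions): when a transaction is chosen for commit, the set to abort consists of all undecided transactions with a conflict edge into it.

def co_commit(txn, undecided, conflict_edges, commit_log):
    # undecided: set of transactions that are neither committed nor aborted.
    # conflict_edges: set of (t1, t2) conflict edges, meaning t1 precedes t2.
    # commit_log: list recording the commit order produced so far.
    # Returns the set of transactions aborted to preserve CO.
    # Abort every undecided transaction that precedes txn in the conflict graph;
    # otherwise such a transaction could later commit after txn and violate CO.
    abort_set = {t1 for (t1, t2) in conflict_edges if t2 == txn and t1 in undecided}
    undecided -= abort_set | {txn}
    commit_log.append(txn)
    return abort_set

undecided = {"T1", "T2", "T3"}
edges = {("T2", "T1")}                              # T2 precedes T1
order = []
print(co_commit("T1", undecided, edges, order))     # {'T2'}: aborted to keep CO
print(co_commit("T3", undecided, edges, order))     # set(): nothing to abort

Waiting before committing T1 (as the text recommends) would have allowed T2 to commit first and avoided the abort.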
The following theorem is concluded:
• The Generic Local CO Algorithm Theorem
When running alone or alongside any concurrency control mechanism in a database system then
1. The Generic local CO algorithm guarantees (local) CO (a CO compliant schedule).
2. The Generic enhanced local CO algorithm guarantees both (local) CO and (local) locking based deadlock
resolution.
and (when not using timeout, and no real-time transaction completion constraints are applied) neither
algorithm aborts more transactions than the minimum needed (which is determined by the transactions'
operations scheduling, out of the scope of the algorithms).
Implementation considerations - The Commitment Order Coordinator (COCO)
A database system in a multidatabase environment is assumed. From a software architecture point of view a CO
component that implements the generic CO algorithm locally, the Commitment Order Coordinator (COCO), can be
designed in a straightforward way as a mediator between a (single) database system and an atomic commitment
protocol component (Raz 1991b). However, the COCO is typically an integral part of the database system. The
COCO's functions are to vote to commit on ready global transactions (processing has ended) according to the local
commitment order, to vote to abort on transactions for which the database system has initiated an abort (the database
system can initiate abort for any transaction, for many reasons), and to pass the atomic commitment decision to the
database system. For local transactions (when they can be identified) no voting is needed. For determining the
commitment order the COCO maintains an updated representation of the local conflict graph (or local augmented
conflict graph for capturing also locking deadlocks) of the undecided (neither committed nor aborted) transactions as
a data structure (e.g., utilizing mechanisms similar to locking for capturing conflicts, but with no data-access
blocking). The COCO component has an interface with its database system to receive "conflict," "ready" (processing
has ended; readiness to vote on a global transaction or commit a local one), and "abort" notifications from the
database system. It also interfaces with the atomic commitment protocol to vote and to receive the atomic
commitment protocol's decision on each global transaction. The decisions are delivered from the COCO to the
database system through their interface, as well as local transactions' commit notifications, at a proper commit order.
The COCO, including its interfaces, can be enhanced, if it implements another variant of CO (see below), or plays a
role in the database's concurrency control mechanism beyond voting in atomic commitment.
The COCO also guarantees CO locally in a single, isolated database system with no interface with an atomic
commitment protocol.
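The interface just described can be outlined as a class skeleton. The Python sketch below is illustrative only (the method names and graph representation are assumptions, not the design of Raz 1991b): it records conflicts and releases YES votes only when all locally preceding transactions have ended.

class CommitmentOrderCoordinator:
    # Illustrative COCO skeleton: tracks local conflicts among undecided
    # transactions and releases YES votes in local precedence order.

    def __init__(self, vote_yes):
        self.vote_yes = vote_yes      # callback into the atomic commitment protocol
        self.preceders = {}           # txn -> set of undecided preceding txns
        self.ready = set()            # txns whose processing has ended

    def on_conflict(self, t1, t2):    # database reports: t1 precedes t2
        self.preceders.setdefault(t2, set()).add(t1)
        self.preceders.setdefault(t1, set())

    def on_ready(self, txn):          # database reports: txn is ready (to vote or commit)
        self.ready.add(txn)
        self._try_vote()

    def on_decided(self, txn):        # txn has ended (committed or aborted)
        self.preceders.pop(txn, None)
        self.ready.discard(txn)
        for preds in self.preceders.values():
            preds.discard(txn)
        self._try_vote()

    def _try_vote(self):              # vote on every ready txn with no undecided preceder
        for txn in list(self.ready):
            if not self.preceders.get(txn):
                self.ready.discard(txn)
                self.vote_yes(txn)

coco = CommitmentOrderCoordinator(vote_yes=lambda t: print("YES vote:", t))
coco.on_conflict("T1", "T2")   # T1 precedes T2
coco.on_ready("T2")            # no vote yet: T1 is still undecided
coco.on_ready("T1")            # -> YES vote: T1
coco.on_decided("T1")          # -> YES vote: T2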
CO is a necessary condition for global serializability across autonomous database systems
If the databases that participate in distributed transactions (i.e., transactions that span more than a single database) do
not use any shared concurrency control information and use unmodified atomic commitment protocol messages (for
reaching atomicity), then maintaining (local) commitment ordering or one of its generalizing variants (see below) is a
necessary condition for guaranteeing global serializability (a proof technique can be found in (Raz 1992), and a
different proof method for this in (Raz 1993a)); it is also a sufficient condition. This is a mathematical fact derived
from the definitions of serializability and a transaction. It means that if not complying with CO, then global
serializability cannot be guaranteed under this condition (the condition of no local concurrency control information
sharing between databases beyond atomic commit protocol messages). Atomic commitment is a minimal
requirement for a distributed transaction since it is always needed, which is implied by the definition of transaction.
(Raz 1992) defines database autonomy and independence as complying with this requirement without using any
additional local knowledge:
• Definition - (Concurrency control based) Autonomous Database System
A database system is Autonomous, if it does not share with any other entity any concurrency control
information beyond unmodified atomic commitment protocol messages. In addition it does not use for
concurrency control any additional local information beyond conflicts (the last sentence does not appear
explicitly but rather implied by further discussion in Raz 1992).
Using this definition the following is concluded:
• The CO and Global serializability Theorem
1. CO compliance of every autonomous database system (or transactional object) in a multidatabase environment is
a necessary condition for guaranteeing Global serializability (without CO Global serializability may be violated).
2. CO compliance of every database system is a sufficient condition for guaranteeing Global serializability.
However, the definition of autonomy above implies, for example, that transactions are scheduled in a way that local
transactions (confined to a single database) cannot be identified as such by an autonomous database system. This is
realistic for some transactional objects, but too restrictive and less realistic for general purpose database systems. If
autonomy is augmented with the ability to identify local transactions, then compliance with a more general property,
Extended commitment ordering (ECO, see below), makes ECO the necessary condition.
Only in (Raz 2009) does the notion of Generalized autonomy capture the intended notion of autonomy:
• Definition - Generalized autonomy
A database system has the Generalized autonomy property, if it does not share with any other database system
any local concurrency information beyond (unmodified) atomic commit protocol messages (however any local
information can be utilized).
This definition is probably the broadest such definition possible in the context of database concurrency control, and
it makes CO together with any of its (useful: No concurrency control information distribution) generalizing variants
(Vote ordering (VO); see CO variants below) the necessary condition for Global serializability (i.e., the union of CO
and its generalizing variants is the necessary set VO, which may include also new unknown useful generalizing
variants).
Summary
The Commitment ordering (CO) solution (technique) for Global serializability can be summarized as follows:
If each database (or any other transactional object) in a multidatabase environment complies with CO, i.e., arranges
its local transactions' commitments and its votes on (global, distributed) transactions to the atomic commitment
protocol according to the local (to the database) partial order induced by the local conflict graph (serializability
graph) for the respective transactions, then Global CO and Global serializability are guaranteed. A database's CO
compliance can be achieved effectively with any local conflict serializability based concurrency control mechanism,
with neither affecting any transaction's execution process or scheduling, nor aborting it. Also the database's
autonomy is not violated. The only low overhead incurred is detecting conflicts (e.g., as with locking, but with no
data-access blocking; if not already detected for other purposes), and ordering votes and local transactions' commits
according to the conflicts.
[Figure: Schedule classes containment. An arrow from class A to class B indicates that class A strictly contains B; a lack of a directed path between classes means that the classes are incomparable. A property is inherently blocking if it can be enforced only by blocking a transaction's data-access operations until certain events occur in other transactions. (Raz 1992)]
In case of incompatible partial orders of two or more databases (no global partial order can embed the respective local partial orders together), a global cycle (spanning two or more databases) in the global conflict graph is generated. This, together with CO, results in a cycle of blocked votes, and a voting-deadlock occurs for the databases on that cycle (however, concurrent voting in each database, typically for almost all the outstanding votes, continues to execute). In this case the atomic commitment protocol fails to collect all the votes needed for the blocked transactions on that global cycle, and consequently the protocol aborts some transaction with a missing vote. This breaks the global cycle, the voting-deadlock is resolved, and the related blocked votes are free to be executed. Breaking the global cycle in the global conflict graph ensures that both Global CO and Global serializability are maintained. Thus, in case of incompatible local (partial) commitment orders no action is needed, since the atomic commitment protocol resolves it automatically by aborting a transaction that is a cause of the incompatibility. Furthermore, global deadlocks due to locking (global cycles in the
augmented conflict graph with at least one data access blocking) result in voting deadlocks and are resolved
automatically by the same mechanism.
Local CO is a necessary condition for guaranteeing Global serializability, if the databases involved do not share any
concurrency control information beyond (unmodified) atomic commitment protocol messages, i.e., if the databases
are autonomous in the context of concurrency control. This means that every global serializability solution for
autonomous databases must comply with CO. Otherwise global serializability may be violated (and thus, is likely to
be violated very quickly in a high-performance environment).
The CO solution scales up with network size and the number of databases without performance penalty when it
utilizes common distributed atomic commitment architecture.
See Examples below.
Distributed serializability and CO
The previous section explains how CO guarantees Global serializability in a multidatabase and multi
transactional-object environment. This section explores the conditions under which CO provides Distributed
serializability, i.e., Serializability in a general distributed transactional environment (e.g., a distributed database
system), achieving a major goal of distributed concurrency control.
Distributed CO
A distinguishing characteristic of the CO solution for distributed serializability, compared with other techniques, is the fact that it requires no distribution of conflict information (e.g., local precedence relations, locks, timestamps, tickets), which makes it uniquely effective. Instead, it utilizes (unmodified) atomic commitment protocol messages, which are already in use.
A common way to achieve distributed serializability in a (distributed) system is by a distributed lock manager
(DLM). DLMs, which communicate lock (non-materialized conflict) information in a distributed environment,
typically suffer from computer and communication latency, which reduces the performance of the system. CO allows distributed serializability to be achieved under very general conditions, without a distributed lock manager, exhibiting the
benefits already explored above for multidatabase environments; in particular: reliability, high performance,
scalability, possibility of using optimistic concurrency control when desired, no conflict information related
communications over the network (which have incurred overhead and delays), and automatic distributed deadlock
resolution.
All distributed transactional systems rely on some atomic commitment protocol to coordinate atomicity (whether to
commit or abort) among processes in a distributed transaction. Also, typically recoverable data (i.e., data under
transactions' control, e.g., database data; not to be confused with the recoverability property of a schedule) are
directly accessed by a single transactional data manager component (also referred to as a resource manager) that
handles local sub-transactions (the distributed transaction's portion in a single location, e.g., network node), even if
these data are accessed indirectly by other entities in the distributed system during a transaction (i.e., indirect access
requires a direct access through a local sub-transaction). Thus recoverable data in a distributed transactional system
are typically partitioned among transactional data managers. In such system these transactional data managers
typically comprise the participants in the system's atomic commitment protocol. If each participant complies with
CO (e.g., by using SS2PL, or COCOs, or a combination; see above), then the entire distributed system provides CO
(by the theorems above; each participant can be considered a separate transactional object), and thus (distributed)
serializability. Furthermore: When CO is utilized together with an atomic commitment protocol also distributed
deadlocks (i.e., deadlocks that span two or more data managers) caused by data-access locking are resolved
automatically. Thus the following corollary is concluded:
• The CO Based Distributed Serializability Theorem
Let a distributed transactional system (e.g., a distributed database system) comprise transactional data
managers (also called resource managers) that manage all the system's recoverable data. The data managers
meet three conditions:
1. Data partition: Recoverable data are partitioned among the data managers, i.e., each recoverable datum (data
item) is controlled by a single data manager (e.g., as common in a Shared nothing architecture; even copies of a
same datum under different data managers are physically distinct, replicated).
2. Participants in atomic commitment protocol: These data managers are the participants in the system's atomic
commitment protocol for coordinating distributed transactions' atomicity.
3. CO compliance: Each such data manager is CO compliant (or some CO variant compliant; see below).
Then
1. The entire distributed system guarantees (distributed CO and) serializability, and
2. Data-access based distributed deadlocks (deadlocks involving two or more data managers with at least one
non-materialized conflict) are resolved automatically.
Furthermore: The data managers being CO compliant is a necessary condition for (distributed) serializability
in a system meeting conditions 1, 2 above, when the data managers are autonomous, i.e., do not share
concurrency control information beyond unmodified messages of atomic commitment protocol.
This theorem also means that when SS2PL (or any other CO variant) is used locally in each transactional data
manager, and each data manager has exclusive control of its data, no distributed lock manager (which is often
utilized to enforce distributed SS2PL) is needed for distributed SS2PL and serializability. It is relevant to a wide
range of distributed transactional applications, which can be easily designed to meet the theorem's conditions.
Distributed optimistic CO (DOCO)
For implementing Distributed Optimistic CO (DOCO) the generic local CO algorithm above is utilized in all the
atomic commitment protocol participants in the system with no data access blocking and thus with no local
deadlocks. The previous theorem has the following corollary:
• The Distributed optimistic CO (DOCO) Theorem
If DOCO is utilized, then:
1. No local deadlocks occur, and
2. Global (voting) deadlocks are resolved automatically (and all are serializability related (with non-blocking
conflicts) rather than locking related (with blocking and possibly also non-blocking conflicts)).
Thus, no deadlock handling is needed.
Examples
Distributed SS2PL
A distributed database system that utilizes SS2PL resides on two remote nodes, A and B. The database system has
two transactional data managers (resource managers), one on each node, and the database data are partitioned
between the two data managers in a way that each has an exclusive control of its own (local to the node) portion of
data: each handles its own data and locks without any knowledge of the other manager's. For each distributed
transaction such data managers need to execute the available atomic commitment protocol.
Two distributed transactions, and , are running concurrently, and both access data x and y. x is under the
exclusive control of the data manager on A (B's manager cannot access x), and y under that on B.
reads x on A and writes y on B, i.e., when using notation common for
concurrency control.
reads y on B and writes x on A, i.e.,
The respective local sub-transactions on A and B (the portions of T1 and T2 on each of the nodes) are the
following:

Local sub-transactions
Transaction \ Node    A        B
T1                    r1[x]    w1[y]
T2                    w2[x]    r2[y]

The database system's schedule at a certain point in time is the following:
r1[x] r2[y] w2[x] w1[y] (also r2[y] r1[x] w2[x] w1[y] is possible)

T1 holds a read-lock on x and T2 holds a read-lock on y. Thus w2[x] and w1[y] are blocked by the lock
compatibility rules of SS2PL and cannot be executed. This is a distributed deadlock situation, which is also a
voting-deadlock (see below) with a distributed (global) cycle of length 2 (number of edges, conflicts; 2 is the most
frequent length). The local sub-transactions are in the following states:
T1's sub-transaction on A is ready (execution has ended) and voted (in atomic commitment)
T2's sub-transaction on A is running and blocked (a non-materialized conflict situation; no vote on it can occur)
T2's sub-transaction on B is ready and voted
T1's sub-transaction on B is running and blocked (a non-materialized conflict; no vote).
Since the atomic commitment protocol cannot receive votes for blocked sub-transactions (a voting-deadlock), it will
eventually abort some transaction with a missing vote by timeout, either T1 or T2 (or both, if the timeouts fall
very close). This will resolve the global deadlock. The remaining transaction will complete running, be voted on, and
committed. An aborted transaction is immediately restarted and re-executed.
Comments:
1. The data partition (x on A; y on B) is important since without it, for example, x could be accessed directly from B.
If a transaction T3 is running on B concurrently with T1 and T2 and directly writes x, then, without a distributed
lock manager, the read-lock for x held by T1 on A is not visible on B and cannot block the write of T3 (or signal
a materialized conflict for a non-blocking CO variant; see below). Thus serializability can be violated.
2. Due to the data partition, x cannot be accessed directly from B. However, functionality is not limited, and a
transaction running on B can still issue a write or read request for x (not common). This request is communicated
to the transaction's local sub-transaction on A (which is generated if it does not already exist), which issues the
request to the local data manager on A.
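For illustration only, the following toy Python simulation of the scenario above makes the simplifying assumptions of one read-lock table per node and the transaction names T1 and T2 (all names are invented for the example); it only shows why neither global transaction can gather votes from both nodes, i.e., the voting deadlock.

    class Node:
        def __init__(self, name):
            self.name = name
            self.read_locks = {}   # datum -> set of transactions holding a read lock
            self.blocked = {}      # tid -> datum whose write is blocked

        def read(self, tid, datum):
            self.read_locks.setdefault(datum, set()).add(tid)

        def write(self, tid, datum):
            holders = self.read_locks.get(datum, set()) - {tid}
            if holders:                     # another transaction holds a read lock:
                self.blocked[tid] = datum   # under SS2PL the write waits until it ends
                return False
            return True

        def can_vote(self, tid):
            return tid not in self.blocked  # a blocked sub-transaction cannot reach "ready"

    A, B = Node("A"), Node("B")
    A.read("T1", "x"); B.read("T2", "y")    # r1[x] on A, r2[y] on B
    A.write("T2", "x"); B.write("T1", "y")  # w2[x] and w1[y] are blocked

    # Neither global transaction has votes from both nodes: a voting deadlock.
    missing_vote = {t for t in ("T1", "T2") if not (A.can_vote(t) and B.can_vote(t))}
    assert missing_vote == {"T1", "T2"}
    # The atomic commitment protocol eventually aborts one of them by timeout,
    # releasing its locks so the other can complete, be voted on, and commit.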
Variations
In the scenario above both conflicts are non-materialized, and the global voting-deadlock is reflected as a cycle in
the global wait-for graph (but not in the global conflict graph; see Exact characterization of voting-deadlocks by
global cycles above). However, the database system can utilize any CO variant with exactly the same conflicts and
voting-deadlock situation, and the same resolution. Conflicts can be either materialized or non-materialized, depending
on the CO variant used. For example, if SCO (below) is used by the distributed database system instead of SS2PL, then
the two conflicts in the example are materialized, all local sub-transactions are in ready states, and vote blocking
occurs in the two transactions, one on each node, because of the CO voting rule applied independently on both A and
B: due to the conflicts, T2 is not voted on before T1 ends (on A), and T1 is not
voted on before T2 ends (on B) (see Enforcing global CO above), which is a voting-deadlock. Now the
conflict graph has the global cycle T1 → T2 → T1 (all conflicts are materialized), and again it is resolved by the atomic commitment
protocol, and distributed serializability is maintained. Unlikely for a distributed database system, but possible in
principle (and occurs in a multi-database), A can employ SS2PL while B employs SCO. In this case the global cycle
is neither in the wait-for graph nor in the serializability graph, but still in the augmented conflict graph (the union of
the two). The various combinations are summarized in the following table:
Voting-deadlock situations

In every case a possible issued schedule is r1[x] r2[y] w2[x] w1[y]; blocked operations are indicated by the
local sub-transaction states listed below.

Case   Node A   Node B   Materialized conflicts on cycle   Non-materialized conflicts on cycle
1      SS2PL    SS2PL    0                                 2
2      SS2PL    SCO      1                                 1
3      SCO      SS2PL    1                                 1
4      SCO      SCO      2                                 0

Local sub-transaction states:
Case 1: T1 on A - ready, voted; T1 on B - running (blocked); T2 on A - running (blocked); T2 on B - ready, voted.
Case 2: T1 on A - ready, voted; T1 on B - ready, vote blocked; T2 on A - running (blocked); T2 on B - ready, voted.
Case 3: T1 on A - ready, voted; T1 on B - running (blocked); T2 on A - ready, vote blocked; T2 on B - ready, voted.
Case 4: T1 on A - ready, voted; T1 on B - ready, vote blocked; T2 on A - ready, vote blocked; T2 on B - ready, voted.
Comments:
1. Conflicts and thus cycles in the augmented conflict graph are determined by the transactions and their initial
scheduling only, independently of the concurrency control utilized. With any variant of CO, any global cycle (i.e.,
spans two databases or more) causes a voting deadlock. Different CO variants may differ on whether a certain
conflict is materialized or non-materialized.
2. Some limited operation order changes in the schedules above are possible, constrained by the orders inside the
transactions, but such changes do not change the rest of the table.
3. As noted above, only case 4 describes a cycle in the (regular) conflict graph which affects serializability. Cases
1-3 describe cycles of locking based global deadlocks (at least one lock blocking exists). All cycle types are
equally resolved by the atomic commitment protocol. Case 1 is the common Distributed SS2PL, utilized since the
1980s. However, as of 2009 no research article except the CO articles is known to have noted this automatic
resolution of locking-based global deadlocks. Such global deadlocks have typically been dealt with by dedicated mechanisms.
4. Case 4 above is also an example for a typical voting-deadlock when Distributed optimistic CO (DOCO) is used
(i.e., Case 4 is unchanged when Optimistic CO (OCO; see below) replaces SCO on both A and B): No data-access
blocking occurs, and only materialized conflicts exist.
Hypothetical Multi Single-Threaded Core (MuSiC) environment
Comment: While the examples above describe real, recommended utilization of CO, this example is hypothetical,
for demonstration only.
Certain experimental distributed memory-resident databases advocate multi single-threaded core (MuSiC)
transactional environments. "Single-threaded" refers to transaction threads only, and to serial execution of
transactions. The purpose is a possible orders-of-magnitude gain in performance (e.g., H-Store[4] and VoltDB)
relative to conventional transaction execution in multiple threads on the same core. In what is described below MuSiC
is independent of the way the cores are distributed. They may reside in one integrated circuit (chip), or in many
chips, possibly distributed geographically in many computers. In such an environment, if recoverable (transactional)
data are partitioned among threads (cores), and distributed CO is implemented in the conventional way, as
described in the previous sections, then DOCO and Strictness exist automatically. However, downsides exist with this
straightforward implementation of such an environment, and its practicality as a general-purpose solution is
questionable. On the other hand, a tremendous performance gain can be achieved in applications that can bypass these
downsides in most situations.
Comment: The MuSiC straightforward implementation described here (which uses, for example, as usual in
distributed CO, voting (and transaction thread) blocking in atomic commitment protocol when needed) is for
demonstration only, and has no connection to the implementation in H-Store or any other project.
In a MuSiC environment local schedules are serial. Thus both local Optimistic CO (OCO; see below) and the Global
CO enforcement vote ordering strategy condition for the atomic commitment protocol (see The Vote ordering
strategy for Global CO Enforcing Theorem above) are met automatically. This results in both distributed CO
compliance (and thus distributed serializability) and automatic global (voting) deadlock resolution.
Furthermore, local Strictness also follows automatically in a serial schedule. By Theorem 5.2 in (Raz 1992; page
307), when the CO vote ordering strategy is applied, Global Strictness is guaranteed as well. Note that local serial
execution is the only mode that allows strictness and "optimistic" behavior (no data-access blocking) together.
The following is concluded:
• The MuSiC Theorem
In MuSiC environments, if recoverable (transactional) data are partitioned among cores (threads), then
1. OCO (and implied Serializability; i.e., DOCO and Distributed serializability)
2. Strictness (allowing effective recovery; 1 and 2 implying Strict CO - see SCO below) and
3. (voting) deadlock resolution
automatically exist globally with unbounded scalability in the number of cores used.
Comment: However, two major downsides, which need special handling, may exist:
1. Local sub-transactions of a global transaction are blocked until commit, which makes the respective cores idle.
This reduces core utilization substantially, even if scheduling of the local sub-transactions attempts to execute all
of them in time proximity, almost together. It can be overcome by detaching execution from commit (with some
atomic commitment protocol) for global transactions, at the cost of possible cascading aborts.
2. Increasing the number of cores for a given amount of recoverable data (database size) decreases the average
amount of (partitioned) data per core. This may leave some cores idle while others are very busy, depending on the
data utilization distribution. Also, a local (to a core) transaction may become global (multi-core) to reach its needed
data, with additional incurred overhead. Thus, as the number of cores increases, the amount and type of data
assigned to each core should be balanced according to data usage, so that a core neither becomes a bottleneck by
being overwhelmed, nor becomes idle too frequently and underutilized in a busy system. Another consideration is putting
in the same core partition all the data that are usually accessed by the same transaction (if possible), to maximize the
number of local transactions (and minimize the number of global, distributed transactions). This may be achieved
by occasional data re-partitioning among cores based on load balancing (data access balancing) and patterns of data
usage by transactions. Another way to considerably mitigate this downside is proper physical data replication
among some core partitions, so that read-only global transactions are possibly (depending on usage patterns)
completely avoided, and replication changes are synchronized by a dedicated commit mechanism.
CO variants: Interesting special cases and generalizations
Special case schedule property classes (e.g., SS2PL and SCO below) are strictly contained in the CO class. The
generalizing classes (ECO and MVCO) strictly contain the CO class (i.e., include also schedules that are not CO
compliant). The generalizing variants also guarantee global serializability without distributing local concurrency
control information (each database has the generalized autonomy property: it uses only local information), while
relaxing CO constraints and utilizing additional (local) information for better concurrency and performance: ECO
uses knowledge about transactions being local (i.e., confined to a single database), and MVCO uses the availability of
data versions (multiple values of a datum). Like CO, both generalizing variants are non-blocking, do not interfere with any transaction's
operation scheduling, and can be seamlessly combined with any relevant concurrency control mechanism.
The term CO variant refers in general to CO, ECO, MVCO, or a combination of each of them with any relevant
concurrency control mechanism or property (including Multi-version based ECO, MVECO). No other interesting
generalizing variants (which guarantee global serializability with no local concurrency control information
distribution) are known, but may be discovered.
Strong strict two phase locking (SS2PL)
Strong Strict Two Phase Locking (SS2PL; also referred to as Rigorousness or Rigorous scheduling) means that
both read and write locks of a transaction are released only after the transaction has ended (either committed or
aborted). The set of SS2PL schedules is a proper subset of the set of CO schedules. This property is widely utilized
in database systems, and since it implies CO, databases that use it and participate in global transactions generate
together a serializable global schedule (when using any atomic commitment protocol, which is needed for atomicity
in a multi-database environment). No database modification or addition is needed in this case to participate in a CO
distributed solution: The set of undecided transactions to be aborted before committing in the local generic CO
algorithm above is empty because of the locks, and hence such an algorithm is unnecessary in this case. A
transaction can be voted on by a database system immediately after entering a "ready" state, i.e., completing running
its task locally. Its locks are released by the database system only after it is decided by the atomic commitment
protocol, and thus the condition in the Global CO enforcing theorem above is kept automatically. Interestingly, if a
local timeout mechanism is used by a database system to resolve (local) SS2PL deadlocks, then aborting blocked
transactions breaks not only potential local cycles in the global conflict graph (real cycles in the augmented conflict
graph), but also the database system's potential global cycles, as a side effect, if the atomic commitment protocol's abort
mechanism is relatively slow. Such independent aborts by several entities may typically result in unnecessary aborts
of more than one transaction per global cycle. The situation is different for local wait-for-graph based
mechanisms: these cannot identify global cycles, and the atomic commitment protocol will break the global cycle if
the resulting voting deadlock is not resolved earlier in another database.
That local SS2PL together with atomic commitment implies global serializability can also be deduced directly: all
transactions, including distributed ones, obey the 2PL (SS2PL) rules. The atomic commitment protocol mechanism is not
needed here for consensus on commit, but rather as the end-of-phase-two synchronization point. Probably for this
reason, without considering the atomic commitment voting mechanism, automatic global deadlock resolution had not
been noticed before CO (see Exact characterization of voting-deadlocks by global cycles above).
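As an illustration of the SS2PL discipline described above (a sketch with invented class and method names, not an actual database engine component), the following Python fragment holds every lock until the transaction is decided, lets a "ready" transaction be voted on immediately, and releases locks only after the atomic commitment decision.

    class SS2PLDataManager:
        def __init__(self):
            self.locks = {}  # datum -> (mode, set of holder transaction ids)

        def acquire(self, tid, datum, mode):
            """Grant the lock or return False, in which case the caller blocks
            until the conflicting holder is decided (committed or aborted)."""
            held = self.locks.get(datum)
            if held is None:
                self.locks[datum] = (mode, {tid})
                return True
            held_mode, holders = held
            if holders == {tid}:                        # sole holder may upgrade
                new_mode = "write" if "write" in (mode, held_mode) else "read"
                self.locks[datum] = (new_mode, {tid})
                return True
            if mode == "read" and held_mode == "read":  # shared read locks are compatible
                holders.add(tid)
                return True
            return False

        def ready(self, tid):
            # A transaction that finished its operations may be voted on at once:
            # under SS2PL no undecided transaction can precede it locally, so the
            # to-abort set of the generic CO algorithm is empty.
            return True

        def decide(self, tid):
            # Locks are released only here, after the atomic commitment decision,
            # which keeps the commit order compatible with the precedence order.
            for datum in list(self.locks):
                mode, holders = self.locks[datum]
                holders.discard(tid)
                if not holders:
                    del self.locks[datum]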
Strict CO (SCO)
[Figure: Read-write conflict, SCO vs. SS2PL. The duration of transaction T2 is longer with SS2PL than with SCO.
SS2PL delays write operation w2[x] of T2 until T1 commits, due to a lock on x by T1 following read operation r1[x].
If t time units are needed for transaction T2 after starting write operation w2[x] in order to reach the ready state,
then T2 commits t time units after T1 commits. However, SCO does not block w2[x], and T2 can commit
immediately after T1 commits. (Raz 1991c)]
Strict Commitment Ordering (SCO; (Raz 1991c)) is the intersection of strictness (a special case of recoverability)
and CO, and provides an upper bound on a schedule's concurrency when both properties exist. It can be implemented
using blocking mechanisms (locking) similar to those used for the popular SS2PL, with similar overheads.

Unlike SS2PL, SCO does not block on a read-write conflict but possibly blocks on commit instead. SCO and SS2PL
have identical blocking behavior for the other two conflict types: write-read and write-write. As a result SCO has
shorter average blocking periods and more concurrency (e.g., performance simulations of a single database for the
most significant variant of locks with ordered sharing, which is identical to SCO, clearly show this, with
approximately a 100% gain for some transaction loads; also, for identical transaction loads SCO can reach higher
transaction rates than SS2PL before lock thrashing occurs). More concurrency means that with given computing
resources more transactions are completed per time unit (higher transaction rate, throughput), and the average duration
of a transaction is shorter (faster completion; see chart). The advantage of SCO is especially significant during lock
contention.
• The SCO Vs. SS2PL Performance Theorem
SCO provides shorter average transaction completion time than SS2PL, if read-write conflicts exist. SCO and
SS2PL are identical otherwise (have identical blocking behavior with write-read and write-write conflicts).
SCO is as practical as SS2PL since, like SS2PL, it provides, besides serializability, also strictness, which is widely
utilized as a basis for efficient recovery of databases from failure. An SS2PL mechanism can be converted to an
SCO one for better performance in a straightforward way without changing recovery methods. A description of an
SCO implementation can be found in (Perrizo and Tatarinov 1998).[5] See also Semi-optimistic database scheduler in
The History of Commitment Ordering.
SS2PL is a proper subset of SCO (which is another explanation why SCO is less constraining and provides more
concurrency than SS2PL).
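A hypothetical sketch of the difference described above (class and method names are invented for the example): on a read-write conflict SCO lets the write proceed and only delays the writer's commit until the reader is decided, whereas SS2PL would block the write itself.

    class SCOScheduler:
        """Sketch of SCO's treatment of read-write conflicts: the later write is
        not blocked; only the writer's commit waits for the reader's decision."""

        def __init__(self):
            self.commit_waits = {}  # tid -> set of tids that must be decided first
            self.decided = set()

        def read_write_conflict(self, reader, writer, datum):
            # The write is NOT blocked; only the commit order is constrained.
            if reader not in self.decided:
                self.commit_waits.setdefault(writer, set()).add(reader)

        def may_commit(self, tid):
            return not (self.commit_waits.get(tid, set()) - self.decided)

        def decide(self, tid):
            self.decided.add(tid)

    sched = SCOScheduler()
    sched.read_write_conflict("T1", "T2", "x")  # r1[x] precedes w2[x]
    assert not sched.may_commit("T2")           # T2 waits only at commit time
    sched.decide("T1")
    assert sched.may_commit("T2")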
Optimistic CO (OCO)
For implementing Optimistic commitment ordering (OCO) the generic local CO algorithm above is utilized
without data-access blocking, and thus without local deadlocks. OCO without transaction or operation scheduling
constraints covers the entire CO class; it is not a proper special case of the CO class, but rather a useful
characterization of a CO variant and mechanism.
See also Distributed optimistic CO (DOCO) above: All transactional data managers (resource managers) there
employ OCO.
Extended CO (ECO)
General characterization of ECO
Extended Commitment Ordering (ECO; (Raz 1993a)) generalizes CO. When local transactions (transactions
confined to a single database) can be distinguished from global (distributed) transactions (transactions that span two
databases or more), commitment order is applied to global transactions only. Thus, for a local (to a database)
schedule to have the ECO property, the chronological (partial) order of commit events of global transactions only
(the commit order of local transactions is unimportant) must be consistent with their precedence order in the respective local conflict graph.
• Definition - Extended commitment ordering
Let T1, T2 be two committed global transactions in a schedule, such that a directed path of unaborted
transactions exists in the conflict graph (precedence graph) from T1 to T2 (T1 precedes T2, possibly
transitively, indirectly). The schedule has the Extended commitment ordering (ECO) property, if for every
two such transactions T1 commits before T2 commits.
A distributed algorithm to guarantee global ECO exists. As for CO, the algorithm needs only (unmodified) atomic
commitment protocol messages. In order to guarantee global serializability, each database needs to guarantee also
the conflict serializability of its own transactions by any (local) concurrency control mechanism.
• The ECO and Global Serializability Theorem
1. (Local, which implies global) ECO together with local conflict serializability, is a sufficient condition to
guarantee global conflict serializability.
2. When no concurrency control information beyond atomic commitment messages is shared outside a database
(autonomy), and local transactions can be identified, it is also a necessary condition.
See a necessity proof in (Raz 1993a).
This condition (ECO with local serializability) is weaker than CO, and allows more concurrency at the cost of a slightly
more complicated local algorithm (however, no practical overhead difference with CO exists).
When all the transactions are assumed to be global (e.g., if no information is available about transactions being
local), ECO reduces to CO.
The ECO algorithm
Before a global transaction is committed, a generic local (to a database) ECO algorithm aborts a minimal set of
undecided transactions (neither committed nor aborted; either local transactions, or global transactions that run locally) that
could later cause a cycle in the conflict graph. This set of aborted transactions (not unique, contrary to CO) can be
optimized if each transaction is assigned a weight (which can be determined by the transaction's importance and by
the computing resources already invested in the running transaction; the optimization can be carried out, for example,
by a reduction from the Max-flow-in-networks problem (Raz 1993a)). As for CO, such a set is time dependent and
becomes empty eventually. Practically, in almost all needed implementations a transaction should be committed only
when the set is empty (and no set optimization is applicable). The local (to the database) concurrency control
mechanism (separate from the ECO algorithm) ensures that local cycles are eliminated (unlike with CO, which
implies serializability by itself; however, in practice also for CO a local concurrency mechanism is utilized, at least
to ensure Recoverability). Local transactions can always be committed concurrently (even if a precedence relation
exists, unlike with CO). When the overall transactions' local partial order (which is determined by the local conflict graph,
now only with possible temporary local cycles, since cycles are eliminated by a local serializability mechanism)
allows, global transactions can also be voted on to be committed concurrently (when all their transitively (indirectly)
preceding (via conflict) global transactions are committed, while transitively preceding local transactions can be at
any state; this is in analogy to the distributed CO algorithm's stronger concurrent voting condition, where all the
transitively preceding transactions need to be committed).
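For illustration, a hypothetical Python sketch of the resulting ECO vote rule (function and argument names are invented; aborted transactions are assumed to have already been removed from the local conflict graph): a global transaction may be voted on once all global transactions that transitively precede it locally have committed, while preceding local transactions may still be in any state.

    def may_vote_eco(tid, preceders, is_global, committed):
        # preceders: tid -> iterable of unaborted transactions that precede it
        #            in the local conflict graph (aborted ones assumed removed)
        seen, stack = set(), list(preceders.get(tid, ()))
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            if is_global(t) and t not in committed:
                return False          # an uncommitted global predecessor blocks the vote
            stack.extend(preceders.get(t, ()))
        return True

    preceders = {"G2": ["L1"], "L1": ["G1"]}      # G1 -> L1 -> G2 in the local graph
    is_global = lambda t: t.startswith("G")
    assert not may_vote_eco("G2", preceders, is_global, committed=set())
    assert may_vote_eco("G2", preceders, is_global, committed={"G1"})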
The condition for guaranteeing Global ECO can be summarized similarly to CO:
• The Global ECO Enforcing Vote ordering strategy Theorem
Let T1, T2 be undecided (neither committed nor aborted) global transactions in a database system that
ensures serializability locally, such that a directed path of unaborted transactions exists in the local conflict
graph (that of the database itself) from T1 to T2. Then, having T1 ended (either committed or aborted)
before T2 is voted on to be committed, in every such database system in a multidatabase environment, is a
necessary and sufficient condition for guaranteeing Global ECO (the condition guarantees Global ECO, which
may be violated without it).
Global ECO (all global cycles in the global conflict graph are eliminated by atomic commitment) together with
Local serializability (i.e., each database system maintains serializability locally; all local cycles are eliminated) imply
Global serializability (all cycles are eliminated). This means that if each database system in a multidatabase
environment provides local serializability (by any mechanism) and enforces the vote ordering strategy in the
theorem above (a generalization of CO's vote ordering strategy), then Global serializability is guaranteed (no local
CO is needed anymore).
Similarly to CO as well, the ECO voting-deadlock situation can be summarized as follows:
• The ECO Voting-Deadlock Theorem
Let a multidatabase environment comprise database systems that enforce, each, both Global ECO (using the
condition in the theorem above) and local conflict serializability (which eliminates local cycles in the global
conflict graph). Then, a voting-deadlock occurs if and only if a global cycle (spans two or more databases)
exists in the Global augmented conflict graph (also blocking by a data-access lock is represented by an edge).
If the cycle does not break by any abort, then all the global transactions on it are involved with the respective
voting-deadlock, and eventually each has its vote blocked (either directly, or indirectly by a data-access lock).
If a local transaction resides on the cycle, it may be in any unaborted state (running, ready, or committed;
unlike CO no local commit blocking is needed).
As with CO this means that also global deadlocks due to data-access locking (with at least one lock blocking) are
voting deadlocks, and are automatically resolved by atomic commitment.
Multi-version CO (MVCO)
Multi-version Commitment Ordering (MVCO; (Raz 1993b)) is a generalization of CO for databases with
multi-version resources (data). With such resources read-only transactions neither block nor are blocked, for better
performance. Utilizing such resources is nowadays a common way to increase concurrency and performance, by
generating a new version of a database object each time the object is written and allowing transactions' read
operations to use several of the last relevant versions (of each object). MVCO implies One-copy-serializability (1SER or 1SR),
which is the generalization of serializability for multi-version resources. Like CO, MVCO is non-blocking, and can
be combined with any relevant multi-version concurrency control mechanism without interfering with it. In the
underlying theory introduced for MVCO, conflicts are generalized for different versions of the same resource
(differently from earlier multi-version theories). For different versions, chronological conflict order is replaced by
version order, and possibly reversed, while keeping the usual definitions for conflicting operations. Results for the
regular and augmented conflict graphs remain unchanged, and similarly to CO a distributed MVCO enforcing
algorithm exists, now for a mixed environment with both single-version and multi-version resources (now
single-version is a special case of multi-version). As for CO, the MVCO algorithm needs only (unmodified) atomic
commitment protocol messages with no additional communication overhead. Locking-based global deadlocks
translate to voting deadlocks and are resolved automatically. In analogy to CO the following holds:
• The MVCO and Global one-copy-serializability Theorem
1. MVCO compliance of every autonomous database system (or transactional object) in a mixed multidatabase
environment of single-version and multi-version databases is a necessary condition for guaranteeing Global
one-copy-serializability (1SER).
2. MVCO compliance of every database system is a sufficient condition for guaranteeing Global 1SER.
3. Locking-based global deadlocks are resolved automatically.
Comment: Now a CO compliant single-version database system is automatically also MVCO compliant.
MVCO can be further generalized to employ the generalization of ECO (MVECO).
See also Multiversion concurrency control.
Example: CO based snapshot isolation (COSI)
CO based snapshot isolation (COSI) is the intersection of Snapshot isolation (SI) with MVCO. SI is a multiversion
concurrency control method widely utilized due to good performance and similarity to serializability (1SER) in
several aspects. The theory in (Raz 1993b) for MVCO described above is utilized later in (Fekete et al. 2005) and
other articles on SI, e.g., (Cahill et al. 2008);[6] see also Making snapshot isolation serializable and the references
there, for analyzing conflicts in SI in order to make it serializable. The method presented in (Cahill et al. 2008),
Serializable snapshot isolation (SerializableSI), a low-overhead modification of SI, provides good performance
results versus SI, with only a small penalty for enforcing serializability.
and does not reference it, but later articles on the subject reference it. These articles utilize the theory without using
MVCO itself. A different method, by combining SI with MVCO (COSI), makes SI serializable as well, with a
relatively low overhead, similarly to combining the generic CO algorithm with single-version mechanisms.
Furthermore, the resulting combination, COSI, being MVCO compliant, allows COSI compliant database systems to
inter-operate and transparently participate in a CO solution for distributed/global serializability (see below).
However, no performance results about COSI are known yet, and it is unclear how it compares with SerializableSI.
Besides overheads, the protocols' behaviors also need to be compared quantitatively: on the one hand, all serializable SI
schedules can be made MVCO by COSI (by possible commit delays when needed, a minus) without aborting
transactions (a plus); on the other hand, SerializableSI is known to unnecessarily abort and restart certain
percentages of transactions even in serializable SI schedules (a minus).
CO and its variants are transparently interoperable for global serializability
With CO and its variants (e.g., SS2PL, SCO, OCO, ECO, and MVCO above) global serializability is achieved via
atomic commitment protocol based distributed algorithms. For CO and all its variants atomic commitment protocol is
the instrument to eliminate global cycles (cycles that span two or more databases) in the global augmented (and thus
also regular) conflict graph (implicitly; no global data structure implementation is needed). In cases of either
incompatible local commitment orders in two or more databases (when no global partial order can embed the
respective local partial orders together), or a data-access locking related voting deadlock, both implying a global
cycle in the global augmented conflict graph and missing votes, the atomic commitment protocol breaks such cycle
by aborting an undecided transaction on it (see The distributed CO algorithm above). Differences between the
various variants exist at the local level only (within the participating database systems). Each local CO instance of
any variant has the same role, to determine the position of every global transaction (a transaction that spans two or
more databases) within the local commitment order, i.e., to determine when it is the transaction's turn to be voted on
locally in the atomic commitment protocol. Thus, all the CO variants exhibit the same behavior in regard to atomic
commitment. This means that they are all interoperable via atomic commitment (using the same software interfaces,
typically provided as services, some already standardized for atomic commitment, primarily for the two phase
commit protocol, e.g., X/Open XA) and transparently can be utilized together in any distributed environment (while
each CO variant instance is possibly associated with any relevant local concurrency control mechanism type).
In summary, any single global transaction can participate simultaneously in databases that may each employ any,
possibly different, CO variant (while running as concurrent processes in each such database, concurrently with local
and other global transactions in each such database). The atomic commitment protocol is
indifferent to CO, and does not distinguish between the various CO variants. Any global cycle generated in the
augmented global conflict graph may span databases of different CO variants, and generate (if not broken by any
local abort) a voting deadlock that is resolved by atomic commitment exactly the same way as in a single-CO-variant
environment. Local cycles (now possibly with mixed materialized and non-materialized conflicts, both serializability
and data-access-locking deadlock related, e.g., with SCO) are resolved locally (each by its respective variant instance's
own local mechanisms).
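As an illustration of this variant-indifference (a sketch with invented participant interfaces, not a description of any standard such as X/Open XA), a coordinator only needs to collect votes and decide; each participant applies its own CO variant's vote ordering locally and simply withholds its vote until its local rule allows it, so a voting deadlock surfaces as missing votes and is broken by a timeout abort.

    import time

    def two_phase_commit(tid, participants, timeout_s=5.0):
        """Variant-agnostic atomic commitment round (sketch): participants may
        run SS2PL, SCO, OCO, ECO, MVCO, ...; try_vote() and decide() are
        assumed, invented participant methods."""
        deadline = time.monotonic() + timeout_s
        pending = list(participants)
        while pending and time.monotonic() < deadline:
            # Poll withheld votes; a participant answers True only when its
            # local vote-ordering rule allows voting on tid.
            pending = [p for p in pending if not p.try_vote(tid)]
            time.sleep(0.01)
        decision = "commit" if not pending else "abort"  # timeout breaks a voting deadlock
        for p in participants:
            p.decide(tid, decision)
        return decision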
Vote ordering (VO or Generalized CO (GCO); Raz 2009), the union of CO and all its variants above, is a useful
concept and Global serializability technique. To comply with VO, local serializability (in its most general form,
commutativity based, and including multi-versioning) and the vote ordering strategy (voting by local precedence order)
are needed.
Combining results for CO and its variants, the following is concluded:
• The CO Variants Interoperability Theorem
1. In a multi-database environment, where each database system (transactional object) is compliant with some CO
variant property (VO compliant), any global transaction can participate simultaneously in databases of possibly
different CO variants, and Global serializability is guaranteed (sufficient condition for Global serializability; and
Global one-copy-serializability (1SER), for a case when a multi-version database exists).
2. If only local (to a database system) concurrency control information is utilized by every database system (each
has the generalized autonomy property, a generalization of autonomy), then compliance of each with some (any)
CO variant property (VO compliance) is a necessary condition for guaranteeing Global serializability (and Global
1SER; otherwise they may be violated).
3. Furthermore, in such an environment data-access-locking related global deadlocks are resolved automatically (each
such deadlock is generated by a global cycle in the augmented conflict graph (i.e., a voting deadlock; see above),
involving at least one data-access lock (a non-materialized conflict) and two database systems; thus it is not a cycle in
the regular conflict graph and does not affect serializability).
References
• Yoav Raz (1992): "The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous
Environment of Multiple Autonomous Resource Managers Using Atomic Commitment."[7] Proceedings of the
Eighteenth International Conference on Very Large Data Bases (VLDB), pp. 292-312, Vancouver, Canada,
August 1992. (Also DEC-TR 841, Digital Equipment Corporation, November 1990.)
• Download/view the VLDB 1992 article (PDF)[8]
• Yoav Raz (1994): "Serializability by Commitment Ordering."[9] Information Processing Letters (IPL), Volume
51, Number 5,[10] pp. 257-264, September 1994. (Received August 1991.)
• Yoav Raz (2009): Theory of Commitment Ordering - Summary.[11] Google Sites - Site of Yoav Raz. Retrieved 1
Feb, 2011.
• Yoav Raz (1990): On the Significance of Commitment Ordering[12] - Call for patenting, Memorandum, Digital
Equipment Corporation, November 1990.
• Yoav Raz (1991a): US patents 5,504,899 (ECO),[13] 5,504,900 (CO),[14] 5,701,480 (MVCO).[15]
• Yoav Raz (1991b): "The Commitment Order Coordinator (COCO) of a Resource Manager, or Architecture for
Distributed Commitment Ordering Based Concurrency Control", DEC-TR 843, Digital Equipment Corporation,
December 1991.
• Yoav Raz (1991c): "Locking Based Strict Commitment Ordering, or How to improve Concurrency in Locking
Based Resource Managers", DEC-TR 844, December 1991.
• Yoav Raz (1993a): "Extended Commitment Ordering or Guaranteeing Global Serializability by Applying
Commitment Order Selectivity to Global Transactions."[16] Proceedings of the Twelfth ACM Symposium on
Principles of Database Systems (PODS), Washington, DC, pp. 83-96, May 1993. (Also DEC-TR 842, November
1991.)
• Yoav Raz (1993b): "Commitment Ordering Based Distributed Concurrency Control for Bridging Single and
Multi Version Resources."[17] Proceedings of the Third IEEE International Workshop on Research Issues on
Data Engineering: Interoperability in Multidatabase Systems (RIDE-IMS), Vienna, Austria, pp. 189-198, April
1993. (Also DEC-TR 853, July 1992.)
[1] Alan Fekete, Nancy Lynch, Michael Merritt, William Weihl (1988): Commutativity-based locking for nested transactions (PDF)
(http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA200980&Location=U2&doc=GetTRDoc.pdf), MIT, LCS lab, Technical report
MIT/LCS/TM-370, August 1988.
[2] Philip A. Bernstein, Eric Newcomer (2009): Principles of Transaction Processing, 2nd Edition
(http://www.elsevierdirect.com/product.jsp?isbn=9781558606234), Morgan Kaufmann (Elsevier), June 2009, ISBN 978-1-55860-623-4
(pages 145, 360).
[3] Lingli Zhang, Vinod K. Grover, Michael M. Magruder, David Detlefs, John Joseph Duffy, Goetz Graefe (2006): Software transaction commit
order and conflict management (http://www.freepatentsonline.com/7711678.html), United States Patent 7711678, granted 05/04/2010.
[4] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alex Rasin, Stanley Zdonik, Evan Jones, Yang Zhang, Samuel Madden,
Michael Stonebraker, John Hugg, Daniel Abadi (2008): "H-Store: A High-Performance, Distributed Main Memory Transaction Processing
System" (http://portal.acm.org/citation.cfm?id=1454211), Proceedings of the 2008 VLDB, pages 1496-1499, Auckland, New Zealand,
August 2008.
[5] William Perrizo, Igor Tatarinov (1998): "A Semi-Optimistic Database Scheduler Based on Commit Ordering"
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.7318) (PDF
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.7318&rep=rep1&type=pdf)), 1998 Int'l Conference on Computer
Applications in Industry and Engineering, pp. 75-79, Las Vegas, November 11, 1998.
[6] Michael J. Cahill, Uwe Röhm, Alan D. Fekete (2008): "Serializable isolation for snapshot databases"
(http://portal.acm.org/citation.cfm?id=1376690), Proceedings of the 2008 ACM SIGMOD international conference on Management of
data, pp. 729-738, Vancouver, Canada, June 2008, ISBN 978-1-60558-102-6 (SIGMOD 2008 best paper award).
[7] http://www.informatik.uni-trier.de/~ley/db/conf/vldb/Raz92.html
[8] http://www.vldb.org/conf/1992/P292.PDF
[9] http://linkinghub.elsevier.com/retrieve/pii/0020019094900051
[10] http://www.informatik.uni-trier.de/~ley/db/journals/ipl/ipl51.html#Raz94
[11] http://sites.google.com/site/yoavraz2/home/theory-of-commitment-ordering
[12] http://yoavraz.googlepages.com/DEC-CO-MEMO-90-11-16.pdf
[13] http://patft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=3&f=G&l=50&co1=AND&d=PTXT&s1=%22commitment+ordering%22.TI.&OS=TTL/
[14] http://patft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=2&f=G&l=50&co1=AND&d=PTXT&s1=%22commitment+ordering%22.TI.&OS=TTL/
[15] http://patft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=%22commitment+ordering%22.TI.&OS=TTL/
[16] http://portal.acm.org/citation.cfm?id=153858
[17] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=281924
External links
• Yoav Raz's Commitment ordering page (http://sites.google.com/site/yoavraz2/the_principle_of_co)
The History of Commitment Ordering
Comments:
1. Besides the misunderstanding of the Commitment ordering (CO) solution by many researchers for a long time (see
Global serializability), misconceptions have also existed regarding the history of CO. This article intends to
clarify this subject.
2. The article establishes chronological order of events related to Commitment ordering by relevant publications'
dates and contents, possibly including their references. It neither attempts to describe any connection between
events beyond publications' references, nor to hint to such connections.
Commitment ordering (or Commit ordering, CO; Raz 1990a, 1990b), publicly introduced in May 1991, has
solved the years-old open fundamental Global serializability problem (e.g., Silberschatz et al. 1991, page 120; see
the second quotation) through a generalization of the de-facto standard for Global serializability, SS2PL, and has provided
insight into it. Though published in a prestigious academic journal (Raz 1994 - IPL) and refereed conferences ((Raz
1992 - VLDB), (Raz 1993a - ACM PODS), (Raz 1993b - IEEE RIDE-IMS)), as well as in three patents (Raz 1991a),
the CO solution has been largely misunderstood and misrepresented (e.g., Weikum and Vossen 2001, pages 102,
700, and Breitbart and Silberschatz 1992), and to a great extent ignored in relevant database research texts (e.g., the
text-books Liu and Özsu 2009, and Silberschatz et al. 2010), until recently being referred to as a fourth major
concurrency control method in addition to the three earlier known major methods (Bernstein and Newcomer 2009,
pages 145, 360). On the other hand, in some research articles the CO solution has been utilized without using the
name CO, without referencing the CO work, and without explaining properly how and why CO achieves Global
serializability (e.g., Haller et al. 2005, Voicu and Schuldt 2009b). However, in the last few years the number of
publications that use and reference CO has been increasing (e.g., see Google Scholar).
In concurrency control of databases and transaction processing (transaction management), CO is a class of
interoperable Serializability techniques, both centralized and distributed. It is also the name of the resulting
transaction schedule property. In a CO compliant schedule the chronological order of commitment events of
transactions is compatible with the precedence order of the respective transactions. CO provides an effective, general
solution to the Global serializability problem, i.e., achieving serializability in a distributed environment of multiple
autonomous database systems and other transactional objects, that possibly utilize a different concurrency control
mechanism each (e.g., as in Grid computing and Cloud computing). CO also provides an effective solution for
Distributed serializability in general. The concept of CO has evolved in three threads of development, seemingly
initially independent:
1. Dynamic atomicity (DA) at the MIT (Weihl 1984),
2. Analogous execution and serialization order (AESO) at the University of Houston (Georgakopoulus and
Rusinkiewics 1989),
3. Commitment ordering (CO) at Digital Equipment Corporation (Raz 1990a, 1990b).
Similarity between the initial concepts above and their final merging in equivalent or identical definitions caused
researchers in the past to refer to them as "the same" (e.g., Weikum and Vossen 2001, page 720). However essential
differences exist between their respective final research results and time-lines:
1. The originally defined DA is close to CO, but is strictly contained in CO (which makes an essential difference in
implementability; unlike CO, no general algorithm for DA is known). Only in (Fekete et al. 1988; precedes CO)
DA is defined to be equivalent to CO (in a model that supports sub-transactions), but without a full-fledged
generic distributed algorithm.
2. The definition of the AESO schedule property evolved to the definition of Strong recoverability (STRC) in
(Breitbart et al. 1991). The definition of STRC is identical to the CO definition. However, this should not be
confused with the CO algorithmic techniques and theoretical results: No such techniques and results have been
proposed for STRC, in particular the key result that local CO implies Global serializability with no concurrency
control information distribution (preserving local autonomy: no need for external information locally). (Breitbart et
al. 1991) appeared in September 1991. An already accepted for publication version of the article, circulated
before its publication, does not have a word about Strong recoverability, and does not include the Abstract part
which describes it and uses the term "commitment order".
3. Atomic commitment protocol (ACP) and voting (or an equivalent mechanism), which are crucial for the CO
solution and CO's efficient distribution, are utilized neither for Dynamic atomicity (DA, which has a partial
replacement though) nor for Strong recoverability (STRC).
Three patents filed for the CO techniques in 1991-3 were granted in 1996-7 (Raz 1991a). They have been
extensively referenced in other patents since.
All three development threads have converged at definitions of schedule properties identical or equivalent to CO,
and noticed that Strong strict two-phase locking (SS2PL or Rigorousness) possesses their respective properties. The
DA work has provided additional examples of algorithms that generate DA compliant schedules, as well as implying
that local DA (the original DA) implies global serializability while using only local concurrency control information.
STRC is shown to imply Serializability but no proof is given that local STRC implies Global serializability with only
local concurrency control information. General algorithms are given neither for DA nor for STRC. Only the CO
articles have provided (patented) general algorithms for CO and methods to turn any concurrency control mechanism
into a CO compliant one, for achieving global serializability across autonomous transactional objects (i.e., using only
local concurrency control information; e.g., autonomous databases) with possibly different concurrency controls.
The CO articles have also provided generalizations of CO that guarantee Global serializability with more
concurrency and better performance by using additional local information (ECO in Raz 1993a, and MVCO in Raz
1993b).
A unique and novel element in the CO techniques and patents, besides ordering commit events, is the utilization of
the atomic commitment protocol voting mechanism to break global cycles (also referred to as distributed cycles;
span two or more transactional objects) in the conflict graph, for guaranteeing global serializability. It is achieved by
applying a specific voting strategy: voting in local precedence order. Also locking based global deadlocks are
resolved automatically by the same mechanism as a side benefit. This allows effective implementations of
distributed CO (and resulting distributed serializability), while allowing any, uninterrupted transaction operations
scheduling without any conflict information distribution (e.g., local precedence relations, locks, timestamps, tickets).
Furthermore, CO does not use any additional, artificial transaction access operations (e.g., "take timestamp" or "take
ticket"), which typically result in additional, artificial conflicts that reduce concurrency.
The CO solution has been utilized extensively since 1997 in numerous works within the subject of Transactional
processes (e.g., Schuldt et al. 2002, where CO is referenced). Some of them include descriptions of CO utilization in
commercial distributed software products. Recently CO has been utilized as the basis for the concurrency control of
Re:GRIDiT (e.g., Voicu et al. 2009a, Voicu and Schuldt 2009b), a proposed approach to transaction management in
Grid computing and Cloud computing. The latter two articles and all other related articles by the same authors
neither mention the name CO nor reference CO's articles.
CO has been utilized in a work on fault tolerance of transactional systems with replication (Vandiver et al. 2007).
CO is also increasingly utilized in Concurrent programming (Parallel programming) and Transactional memory (and
especially in Software transactional memory, STM) for achieving serializability optimistically by "commit order"
(e.g., Ramadan et al. 2009, Zhang et al. 2006, von Parun et al. 2007).
Background
Serializability has been identified as the major criterion for the correctness of transactions executing concurrently.
Consequently a general, practical variant of it, Conflict serializability, is supported in all general-purpose database
systems. However, if several database systems (or any other transactional objects) inter-operate, Global
serializability is not maintained in general, even if each database system (transactional object) provides conflict
serializability, and overall correctness is not guaranteed. The problem of guaranteeing Global serializability in a
heterogeneous distributed system effectively, with reasonable performance, has been researched for many years and
characterized as open (Silberschatz et al. 1991, page 120). Commitment ordering (CO; Raz 1990b, 1992, 1994) has
provided a general solution to the problem. CO was disclosed publicly by Yoav Raz in May 1991, immediately after
its patent filing (see CO patents below). It was disclosed through lectures and through submission and distribution of
the publication to tens of database researchers. More details about the CO solution can be found in Global serializability.
A description of CO and some bibliographic notes are given in (Weikum and Vossen 2001). This is the first known
text-book on concurrency control that deals with CO. The description of CO in two separate sections titled
"Commitment ordering" (Ibid, pages 102, 700) lacks several fundamental aspects of the CO technique (e.g., the use of
voting in an atomic commitment protocol and voting deadlocks) and thus is an unsatisfactory and misleading
description. Also the bibliographic notes there are inaccurate, since the original DA (referred to
in the quotation below, in Weihl 1989) is different from CO:
"Commitment order-preserving conflict serializability, as mentioned in Chapter 3, was proposed, using
different terminology, by Weihl 1989, Breitbart et al. 1991, and Breibart and Silberschats 1992, as well as Raz
1992, 1993, 1994."
(Ibid, page 720)
The bibliographic notes, as well as other CO related text in the book, ignore the different ways the respective
properties' definitions are utilized by the three evolvement threads (works), and the different results of each work
(see the summaries of the main results of each below). Also, some theorems about CO given in the book are significantly
weaker than what is implied by the CO work, miss its essence, and again are misleading.
Such misleading inaccuracies, omissions of fundamentals, and misrepresentation appear also in several other
research articles that have mentioned and referenced the CO publications, and it is evident that the CO solution for
global serializability has been misunderstood by many database researchers even years after its public introduction in
1991 (see Quotations in Global serializability). Many articles and books published after 1991 that discuss distributed
and global concurrency control have not mentioned CO at all. CO is neither referenced nor mentioned even in the
new 6th edition of a database textbook in 2010, Silberschatz et al. 2010, which deals with distributed concurrency
control and global serializability, and resorts to serializability methods much less effective than CO (see the benefits of
CO in Global serializability), or to methods that violate Global serializability (e.g., Two-level serializability, which is
argued there to be correct under certain narrow conditions).
On the other hand, CO is referenced in (Bernstein and Newcomer 2009, page 145) as follows:
"Not all concurrency control algorithms use locks... Three other techniques are timestamp ordering,
serialization graph testing, and commit ordering. Timestamp ordering assigns each transaction a timestamp
and ensures that conflicting operations execute in timestamp order. Serialization graph testing tracks
conflicts and ensures that the serialization graph is acyclic. Commit ordering ensures that conflicting
operations are consistent with the relative order in which their transactions commit, which can enable
interoperability of systems using different concurrency control mechanisms."
"Commit ordering is presented in Raz (1992)." (Ibid, page 360)
Comments:
1. Beyond the common locking based algorithm SS2PL, which is a CO variant itself, additional variants of CO
that use locks also exist (see below). However, generic, or "pure", CO indeed does not use locks (but can be
implemented with similar, non-blocking mechanisms to keep conflict information).
2. Since CO mechanisms order the commit events according to conflicts that already have occurred, it is better to
describe CO as "Commit ordering ensures that the relative order in which transactions commit is consistent with
the order of their respective conflicting operations."
Also, on the other hand, it seems that for some researchers CO has become an obvious method, and in many cases it
has been utilized by them without reference to the CO work. CO has been utilized extensively in research, most
of the time unreferenced (and under different names, e.g., Re:GRIDiT in Voicu and Schuldt 2009b).
In what follows the evolution of CO is described. An additional section below briefly describes later utilization of
CO.
Three threads of development
Commitment ordering (CO) has evolved in three, seemingly initially independent, threads of development:
1. Dynamic atomicity (DA) at the MIT (Weihl 1984),
2. Analogous execution and serialization order (AESO) at the University of Houston (Georgakopoulus and
Rusinkiewics 1989),
3. Commitment ordering (CO) at Digital Equipment Corporation (Raz 1990a, 1990b).
Similarity between the initial concepts above and their final merging into equivalent or identical definitions caused
researchers in the past to refer to them as "the same." However, essential differences exist between their respective final
research results and time-lines:
1. The originally defined DA is close to CO, but is strictly contained in CO (which makes an essential difference in
implementability; unlike CO, no general algorithm for DA is known). Only in (Fekete et al. 1988; precedes CO)
DA is defined to be equivalent to CO (in a model that supports sub-transactions), but without a full-fledged
generic distributed algorithm.
2. The definition of the AESO schedule property evolved to the definition of Strong recoverability (STRC) in
(Breitbart et al. 1991). The definition of STRC is identical to the CO definition. However, this should not be
confused with the CO algorithmic techniques and theoretical results: No such techniques and results have been
proposed for STRC, in particular the key result that local CO implies Global serializability with no concurrency
control information distribution (preserving local autonomy: no need for external information locally). (Breitbart et
al. 1991) appeared in September 1991. An already accepted for publication version of the article, circulated
before its publication, does not have a word about Strong recoverability, and does not include the Abstract part
which describes it and uses the term "commitment order".
3. Atomic commitment protocol (ACP) and voting, which are crucial for the CO solution and CO's efficient
distribution, are utilized neither for Dynamic atomicity (DA, which has a partial replacement though) nor for
Strong recoverability (STRC).
These evolvement threads are described in what follows with respective summaries of main results relevant to CO.
Dynamic atomicity
Dynamic atomicity (DA) appears in the Ph.D dissertation (Weihl 1984) and possibly in earlier publications by the
same author. It uses a variant of input-output automata, a formalism developed at the MIT to deal with systems in
the context of abstract data types. DA has been described and utilized in numerous articles later, in its original
version which is different from CO, e.g., (Weihl 1988, Weihl 1989), and in its enhanced version, equivalent to CO,
e.g., in (Fekete et al. 1988, Fekete et al. 1990), and in a book (Lynch et al. 1993). DA is defined as a schedule property
(to be checked for existence), without a full-fledged generic algorithm or distributed algorithm.
DA is strictly contained in CO
While DA has not been originally defined as CO, under a certain transaction-model translation from the input-output
automata model to the model common for dealing with database concurrency control it appears very close to CO,
but strictly contained in CO: with CO, the order of commit events of every two transactions with conflicts needs to be
compatible with their precedence order. With DA the first commit needs to precede at least one (non-commit, as
demonstrated) operation of the second transaction (Weihl 1989, page 264):
"If the sequence of operations executed by one committed transaction conflicts with the operations executed
by another committed transaction, then some operations executed by one of the transactions [explicitly not
commit; by following examples there commit does not count as operation] must occur after the other
transaction has committed."
I.e., DA has an additional restriction over CO (see also the footnote about this in the linked last version of (Raz 1990b,
page 5)), and thus it is a proper subset of CO.
The additional operation needed for DA makes a difference in implementability, and no effective general DA
algorithm is known (i.e., covering the entire DA set), versus an effective general CO algorithm that exists.
Local DA implies global serializability
With such a definition of DA, the proof of achieving global serializability by local DA can be done without involving
an atomic commitment protocol and voting. In (Weihl 1989) DA is shown to provide global serializability when applied
locally in transactional objects. Some protocols are shown to have the DA property but no general mechanisms to
enforce DA and thus global serializability are shown. Atomic commitment protocol and related voting are not part of
the formal model used. Global (voting) deadlock resolution by an atomic commitment protocol, which is the
foundation of the distributed CO algorithm, is not mentioned.
Comment: The commitment event is a synchronization point across all the local sub-transactions of a distributed
transaction. Thus, with the DA definition, when two transaction are in conflict, the extra operation in the second
transaction (in any of its local sub-transactions), after the commitment of the first, guarantees proper commitment
order of the two transactions, which implies global serializability. Thus local DA guarantees global serializability.
The same arguments also imply that local SS2PL guarantees global serializability. However, with CO no extra
operation is available to enforce proper commitment order, and hence an atomic commitment protocol mechanism
and a voting strategy , referred to as in (Raz 1990b, 1992), are needed to enforce the correct commitment
order for distributed transactions (i.e., globally; assuming autonomy, i.e., no entity exists that "knows" the global
precedence order to determine the correct global commit order. Maintaining such an entity, either centralized or
distribute, is typically expensive).
DA is changed to be equivalent to CO
Only a later definition of DA, in (Fekete et al. 1988, Fekete et al. 1990 (prior to CO), Lynch et al. 1993, page 201), is
equivalent to the definition of CO, using only the order of commit events. A mechanism that enforces global DA and
serializability when DA exists locally is given: a Generic scheduler (e.g., Lynch et al. 1993, page 207). However,
no full-fledged generic algorithm or distributed algorithm is given (unlike CO and its patents).
Comments (which are secondary and do not criticize the solution):
1. In this solution, as with the CO solution, Global DA (and resulting Global serializability) is guaranteed only with
the possibility of global deadlocks (a deadlock that spans two objects or more; in this case these are global
deadlocks that do not necessarily result from data locking but rather from commit blocking, and thus are unique to
DA (CO) and relevant to serializability). Such global deadlocks and their resolution, which are the key to the CO
algorithmic solution, are not mentioned in conjunction with DA (but rather are probably implicitly assumed to be
resolved when they happen).
The History of Commitment Ordering
134
2. The distribution of the Generic scheduler component is not shown (but can be done similarly to the distribution of
an Atomic commitment protocol (ACP)).
In addition, as is explicitly stated in (Lynch et al. 1993, page 254), no optimality result (see below) for the new DA is
given. In comparison, the CO result that CO is necessary for guaranteeing Global serializability over autonomous
resource managers is such an optimality result.
Both the original DA and CO are optimal
(Weihl 1989) also shows DA to be optimal in the sense that no broader property (super-class) provides global
serializability when applied locally under the assumption of dynamic transaction operations' scheduling (which is the
common way in transactional systems). This is similar to a result in (Raz 1990b) that CO is a necessary condition for
guaranteeing global serializability across autonomous resource managers (resource managers that do not share any
concurrency control information beyond (unmodified) atomic commitment protocol messages). However, since CO
strictly contains the original DA there, CO is the optimal property in this sense, and not DA. The apparent
contradiction in the optimality results stems from the fact that DA optimality is proven in a formal model without
voting in an atomic commitment protocol. Without a voting mechanism and a voting strategy (referred to in
(Raz 1990b, 1992)), commit order can be enforced only with an operation in a second transaction after the commit of
a first, preceding transaction, as in the original definition of DA. This makes the optimum without voting, DA, a
smaller class (since an additional constraint, the extra operation, exists) than the optimum in a model where voting
exists and commit order can be enforced by a voting strategy (without the additional constraint), which is CO.
Main DA results
1. DA implies Serializability
2. Local DA implies Global serializability, even when utilizing local concurrency control information only
3. Certain concurrency control mechanisms (algorithms) provide the DA property
4. The original DA is an optimal property (in a formal model without voting in atomic commitment protocol; CO,
which strictly contains the original DA, is the optimal property under the condition of local autonomy in a model
where voting exists; no optimality result is given for the new DA - see above)
Analogous execution and serialization order and Strong recoverability
AESO requires serializability as a prerequisite
Analogous (or Similar) execution and serialization order (AESO) is defined in a technical report (Georgakopoulos
and Rusinkiewicz 1989), and possibly in more technical reports. The AESO property is equivalent to CO, but in its
definition serializability is (unnecessarily) required as a prerequisite. Thus it is clear from the definition that the fact
that AESO without the prerequisite implies serializability has been overlooked. For this reason the ordering of
commit events could not have been thought of as a serializability-implying mechanism in the context of AESO.
AESO is modified to Strong recoverability (CO)
In (Breitbart et al. 1991) a new term appears, Strong recoverability (STRC), which is identical to AESO but drops
the (unnecessary) serializability prerequisite. STRC is identical to CO. It is presented there together with the (now
redundant) AESO concept. A draft version of this article, circulated in 1991 and already accepted for publication in
the September 1991 issue of TSE ("to appear in IEEE Transactions on Software Engineering, September, 1991" is
printed at the top), includes AESO but not STRC. Nor does it include the abstract, which describes STRC and
uses the term "commitment order". Thus STRC was added for the published TSE version (the STRC text was
mainly added on top of the original text, with the now redundant AESO retained). It is shown there that STRC implies
serializability, and STRC is utilized there to assist in proving that a federated database with local Rigorousness
(SS2PL) ensures global serializability. It is not shown there how local STRC in general, other than Rigorousness,
implies global serializability. With local Rigorousness (SS2PL), global serializability is achieved automatically (see
SS2PL in Commitment ordering), while with other STRC (CO) types a certain condition on voting in the atomic
commitment protocol should be met (a voting strategy). Neither an atomic commitment protocol nor a voting strategy is
mentioned or dealt with in the STRC articles.
No algorithm for STRC beyond SS2PL is given. Atomic commitment protocol, voting, and global deadlocks, which
are fundamental elements in the CO solution, are not mentioned there. It is (mistakenly) claimed there that Strong
recoverability (STRC) implies Recoverability, hence the (misleading) name of the property. STRC is the focus of
(Breitbart and Silberschatz 1992), and there too no proper algorithm or method beyond Rigorousness (SS2PL) is
given. In the abstracts of both these STRC papers the following sentence appears:
"The class of transaction scheduling mechanisms in which the transaction serialization order can be
determined by controlling their commitment order, is defined."
The idea here is opposite to the CO solution: in all the proposed CO mechanisms the serialization order, which is
determined by data-access scheduling (the common way), determines the commit order. (Breitbart et al. 1991) does
not reference (Raz 1990b), but (Breitbart and Silberschatz 1992) does.
Main STRC results
1. Local Rigorousness (SS2PL) implies Global serializability (a long-known result by the time it was published in
(Breitbart et al. 1991), though not widely published earlier; e.g., see the explicit description in (Raz 1990a) and
references in (Raz 1990b, 1992))
2. Several concurrency control mechanisms provide Rigorousness
3. Rigorousness implies STRC (CO)
4. STRC implies Serializability (but neither a proof, nor a proof outline, nor a proof idea, of local STRC (CO)
implying Global serializability when utilizing local concurrency control information only, has been introduced)
Commitment ordering
Early versions of (Raz 1990b) reference neither DA nor STRC, but later versions and other CO articles reference
both.
The discovery of CO
"In the beginning of 1990 S-S2PL was believed to be the most general history property that guarantees global
serializability over autonomous RMs. The discovery of Commitment Ordering (CO) resulted from an attempt
to generalize S-S2PL, i.e., to find a super-set of S-S2PL that guarantees global serializability while allowing to
maintain RM autonomy. An intermediate step was discovering a property that I named Separation (SEP;
"separating" conflicting operations by a commit event), which is a special case of CO...
...SEP is less restrictive than S-S2PL, and allows optimistic implementations. However, it is far more
restrictive than CO. It was noticed later that the Optimistic 2PL scheduler described in [Bern 87] spans exactly
the SEP class. A paper on separation, similar to this one, was written by me in July–August 1990, and
included results for SEP, parallel to most main results for CO, described here. The first version of the current
CO paper was a rewrite of the separation paper. The separation paper included an erronous theorem, claiming
that SEP was the most general property (a necessary condition for) guaranteeing global serializability over
autonomous RMs. The proof was almost identical to the proof of theorem 6.2 here. CO was formulated two
days after I noticed a mistake in the proof of that theorem: SEP requires aborting more transactions than the
minimum necessary to guarantee global serializability. This minimum is exactly defined by
, when T is committed. Extended CO (ECO; [Raz 91b] was formulated a few days later. Y.
R."
From the Preface in (Raz 1990b)
Comment: The DA work in the framework of abstract data types was unknown at that time to most database people.
The new DA article (Fekete et al. 1988) appeared (possibly with some modifications) also in (Fekete et al. 1990).
DA has no known published explicit general algorithm, and no mention of integration with other concurrency
control mechanisms.
The CO algorithms
A general effective technique is provided for generating CO-compliant schedules and guaranteeing both Global
serializability (for environments with multiple transactional objects) and Distributed serializability (for distributed
transactional systems in general). A fundamental element in the technique is an Atomic commitment protocol (ACP;
any such protocol). With a certain voting strategy utilized with the ACP, a voting-deadlock occurs (i.e., voting of
transactions for the ACP is blocked) whenever a global cycle is generated in the system's augmented conflict graph
(the union of the (regular) conflict graph and the (reversed-edge regular) wait-for graph; see more below). The ACP
resolves such a deadlock by aborting a deadlocked transaction with a missing vote. This abort breaks the global cycle.
Breaking all such cycles guarantees both global serializability and automatic locking-based global deadlock
resolution. No local conflict information needs to be distributed.
A generic local CO algorithm orders both local commit events for local transactions (transactions confined to a
single transactional object) and voting events for global transactions (transactions that span two or more objects),
in order to implement the voting strategy above and thus guarantee both local and global CO, as well as global
serializability.
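As a rough illustration only (the class and method names below are assumptions invented here, and this is one simple way to realize the rule just described, not the patented generic algorithm itself), the following Python sketch delays the commit of a local transaction, or the YES vote of a global transaction, until every transaction that locally precedes it has ended.

class LocalCOScheduler:
    # Tracks the local precedence (conflict) relation and decides when a
    # transaction's commit (local tx) or YES vote to the ACP (global tx) may
    # be released: only after every locally preceding transaction has ended.

    def __init__(self):
        self.preceders = {}   # tx -> set of transactions that precede it
        self.ended = set()    # transactions already committed or aborted

    def record_conflict(self, earlier_tx, later_tx):
        # Called by the data-access layer whenever an operation of later_tx
        # conflicts with an earlier operation of earlier_tx on the same item.
        self.preceders.setdefault(later_tx, set()).add(earlier_tx)
        self.preceders.setdefault(earlier_tx, set())

    def may_release(self, tx):
        # True when tx's commit event (or its YES vote) may be scheduled now;
        # otherwise the event is delayed, which is what creates the
        # voting-deadlocks resolved by the ACP in the distributed case.
        return all(p in self.ended for p in self.preceders.get(tx, ()))

    def end(self, tx):
        # Mark tx as committed or aborted, possibly unblocking its successors.
        self.ended.add(tx)

# T1 precedes T2, so T2's commit/vote waits until T1 has ended.
sched = LocalCOScheduler()
sched.record_conflict('T1', 'T2')
print(sched.may_release('T1'), sched.may_release('T2'))   # True False
sched.end('T1')
print(sched.may_release('T2'))                            # True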
CO automatically resolves global deadlocks
(Raz 1990b) and other CO publications show that in a CO-compliant environment a global deadlock is generated
during the atomic commitment protocol's (ACP) execution if the local precedence orders in two or more objects are not
compatible (i.e., no global precedence order can embed the local orders together). This generates a cycle in the
augmented conflict graph (the union of the (regular) conflict graph and the (reversed-edge regular) wait-for graph)
of the multi-object system. That global deadlock is a voting-deadlock, which means that the voting of the distributed
transactions on the cycle is blocked. The ACP resolves such a voting-deadlock by aborting some transaction with a
missing vote and breaking the cycle. If all the cycle's edges represent materialized conflicts, then this cycle is also a
cycle in the (regular) conflict graph, and breaking it maintains Serializability. If at least one edge represents a
non-materialized conflict, then this is not a cycle in the (regular) conflict graph, but rather reflects a locking-based
global deadlock. Such a cycle is also broken automatically by the same mechanism, and the deadlock is resolved.
The same result also applies to an entirely SS2PL-based distributed environment, where all conflicts are
non-materialized (locking-based), since SS2PL is a special case of CO. Many research articles about global
deadlocks in SS2PL and about resolving them have been published since the 1970s. However, no reference except the CO
papers is known (as of today, 2009) to notice such automatic global deadlock resolution by an ACP (which is always
utilized for distributed transactions).
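A toy Python sketch of this idea follows (the graph representation and function names are assumptions made here for illustration, not taken from the CO papers): the augmented conflict graph is built as the union of the conflict graph and the reversed wait-for graph, a cycle in it corresponds to a voting-deadlock, and aborting any transaction on the cycle, whose vote then goes missing, breaks it.

from itertools import chain

def augmented_edges(conflict_edges, wait_for_edges):
    # Augmented conflict graph: the conflict graph united with the
    # reversed-edge wait-for graph (edge (a, b) means "a precedes b").
    return set(chain(conflict_edges, ((b, a) for a, b in wait_for_edges)))

def find_cycle(edges):
    # Plain depth-first search for any directed cycle; returns its nodes or None.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visiting, done = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for nxt in graph[node]:
            if nxt in visiting:
                return path[path.index(nxt):]
            if nxt not in done:
                cycle = dfs(nxt, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for node in graph:
        if node not in done:
            cycle = dfs(node, [])
            if cycle:
                return cycle
    return None

# Two objects with incompatible local precedence orders (T1 before T2 on one
# object, T2 before T1 on the other) yield a global cycle; aborting either
# transaction -- its vote never arrives -- removes its edges and breaks it.
edges = augmented_edges({('T1', 'T2'), ('T2', 'T1')}, set())
print(find_cycle(edges))                                   # e.g. ['T1', 'T2']
victim = 'T2'
print(find_cycle({e for e in edges if victim not in e}))   # None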
CO is a necessary condition for global serializability across autonomous databases
(Raz 1990b, 1992) show that enforcing CO locally is a necessary condition for guaranteeing serializability globally
across autonomous objects. This means that if any autonomous object in a multi-object environment does not comply
with CO, then global serializability can be violated. It also means that CO is optimal in the sense of (Weihl 1989):
No schedule property exists that both contains CO (i.e., defining a super-class of the class of CO compliant
schedules) and guarantees global serializability across autonomous objects.
CO variants
CO variants are special cases and generalizations of CO. Several interesting variants have been developed and
investigated in the years 1990-1992. All CO variants can transparently inter-operate in a mixed distributed
environment with different variants, guaranteeing Global serializability and automatic locking-based global deadlock
resolution. The following are interesting CO variants:
• SS2PL
Strong strict two-phase locking (SS2PL, a name coined in (Raz 1990b); also the name of the resulting
schedule property, which is also called Rigorousness or Rigorous scheduling in (Breitbart et al. 1991)) has been a
common serializability technique since the 1970s. It had been known for many years before 1990 that a
multi-database environment where all databases are SS2PL compliant guarantees Global serializability. As a
matter of fact it has been the only known practical method for global serializability, and it became a de facto
standard, well known among practitioners in the 1980s (Raz 1990a, 1990b, 1992). SS2PL has many variants
and implementations. It provides both Strictness and CO.
• SCO
Strict commitment ordering (SCO; (Raz 1991b)) is the intersection of Strictness and CO. SCO has a shorter
average transaction duration than that of SS2PL, and thus better concurrency. Both have similar locking
overhead. An SS2PL compliant database system can be converted to an SCO compliant one in a
straightforward way, without changing strictness-based recovery mechanisms.
• OCO
Optimistic commitment ordering (OCO) spans the entire CO class. The name OCO is used to emphasize the
fact that this is the optimistic version of CO, versus other, non-optimistic variants of CO.
• ECO
Extended commitment ordering (ECO; (Raz 1993a)) is a generalization of CO that guarantees global
serializability as well. It utilizes information about transactions being local (i.e., confined to a single
transactional object) to provide more concurrency. With ECO, local transactions do not need to obey the CO
rule; global transactions (i.e., non-local ones) obey a similar, generalized rule. Like CO it is not blocking, and it
can be integrated seamlessly with any relevant concurrency control mechanism.
• MVCO
Multi-version commitment ordering (MVCO; (Raz 1993b)) is a generalization of CO that guarantees global
serializability as well. It utilizes multiple versions of data to provide more concurrency. It allows
implementations where read-only transactions do not block, and are not blocked by, updaters (read-write
transactions). Like CO it is not blocking, and it can be integrated seamlessly with any relevant concurrency
control mechanism.
A new theory of conflicts in multiversion concurrency control is introduced to define MVCO properly. An
identical theory has later been utilized by others to analyze Snapshot isolation (SI) in order to make it
serializable. MVCO can also make SI serializable (CO-based snapshot isolation (COSI)) with a relatively low
overhead, but this has not yet been tested and compared with the successfully tested SerializableSI (which is
not MVCO compliant, and cannot participate in a CO distributed solution for global serializability).
• VO
Vote ordering (VO or Generalized CO (GCO); Raz 2009) is the property that contains CO and all its
mentioned variants. It is the most general property that does not need concurrency control information
distribution. It consists of local conflict serializability (by any mechanism, in its most general form, i.e.,
commutativity-based, including multi-versioning) and voting by precedence order (a voting strategy).
CO distributed architecture
A distributed architecture for CO has been designed (Raz 1991c), which is an extension of the common architecture
for the Two-phase commit protocol (2PC; and for an Atomic commitment protocol in general). The additional
component in the architecture is the Commitment Order Coordinator (COCO), which is typically an integral part of a
resource manager (e.g., Database system). The COCO orders the resource manager's local commit and voting events
to enforce CO.
CO patents
Commitment ordering has been quite widely known inside the transaction processing and databases communities at
Digital Equipment Corporation (DEC) since 1990. It has been under company confidentiality due to patenting
processes of CO methods for guaranteeing both (local) Serializability and Global serializability which started in
1990 (Raz 1990a). Patents (Raz 1991a) for methods using CO and ECO were filed in 1991 and granted in 1996, and
using MVCO filed in 1993 and granted in 1997. CO was disclosed outside of DEC through lectures and the distribution
of technical reports to tens of database researchers in May 1991, immediately after its first patent filing.
A unique and novel element in the CO technique and its patents, besides ordering commit events, is the utilization of
the atomic commitment protocol (ACP) voting mechanism to break global cycles (cycles that span two or more
transactional objects) in the conflict graph, for guaranteeing global serializability. Locking-based global deadlocks are also
resolved automatically by the same mechanism. This allows effective implementations of distributed CO, while
allowing any, uninterrupted transaction operations scheduling, without any conflict information distribution (e.g., by
locks, timestamps, tickets), and without any additional, artificial transaction access operations (e.g., "take timestamp"
or "take ticket"), which typically result in additional, artificial conflicts that reduce concurrency.
Enhanced theory of CO
An enhanced theory of CO, briefly summarized in (Raz 2009), has been developed by Yoav Raz later, after the CO
publication in the early 1990s. The enhanced theory does not provide new results about CO but rather allows a
clearer formal presentation of the CO subject and techniques. This theory is utilized in the Wikipedia articles
Commitment ordering and Serializability, as well as in this article above. Several new terms are introduced by the
enhanced theory:
• A unified theory of conflicts: The terms Materialized conflict and Non-materialized conflict are used for conflicts
with other transactions. A conflict may occur when a transaction requests (invokes) access to a database object
already accessed by another transaction (when access operations are not commutative; the common theory uses
the term "conflict" only for a materialized one).
• Augmented conflict graph, which is the graph of all conflicts, the union of the (regular) conflict graph and the
reversed-edge (regular) wait-for graph.
• Voting strategy for the corresponding term in (Raz 1992), meaning voting on (global) transactions in a chronological
order compatible with the local precedence order of the respective transactions.
• Voting-deadlock for a deadlock in the atomic commitment protocol (ACP) that blocks and prevents voting (a
deadlock situation described in (Raz 1992) and other CO articles without using this name).
• Vote ordering (VO or Generalized CO (GCO)) is the property that contains CO and all its mentioned variants. It
is the most general property that does not need concurrency control information distribution. It consists of local
conflict serializability (by any mechanism, in its most general form, i.e., commutativity-based, including
multi-versioning) and voting by precedence order (a voting strategy).
Main CO results
1. SS2PL implies CO
2. CO implies Serializability
3. Local CO implies Global serializability, even when utilizing local concurrency information only (by using voting
deadlocks)
4. Locking based global deadlocks are resolved automatically (using voting deadlocks)
5. General, generic algorithms for CO, both local and global, for guaranteeing Global serializability while
maintaining object autonomy (i.e., without local concurrency control information distribution; Atomic
commitment protocol (ACP) is utilized instead; patented)
6. Distributed atomic commitment protocol based architecture for CO
7. A method for transforming any local concurrency control mechanism to be CO compliant, without interfering
with transaction operations scheduling, for participating in a distributed algorithm for Global serializability
(patented)
8. Local CO is a necessary condition for guaranteeing Global serializability when using autonomous resource
managers (CO is optimal)
9. SCO locking mechanism provides shorter average transaction duration than that of SS2PL, and thus better
concurrency and throughput; Usually SS2PL can be converted to SCO in a straightforward way.
10. Generalizations of CO for better concurrency by using additional information: ECO and MVCO, with main
results very similar to CO (e.g., also ECO and MVCO are necessary conditions for global serializability (optimal)
across autonomous resource managers within their respective defined levels of information, "knowledge"; both
are not blocking, and can be integrated seamlessly with any relevant concurrency control mechanisms; related
methods patented)
11. New theory of conflicts in multiversion concurrency control (enhancement of previous multi-version theories) in
order to define MVCO
12. CO variants inter-operate transparently, guaranteeing Global serializability and automatic global deadlock
resolution in a mixed distributed environment with different variants
13. Enhanced newer theory of CO (Augmented conflict graph based)
Later utilization of CO
CO has been referenced in multiple research articles, patents and patents pending. The following are examples of CO
utilization.
An object-oriented extensible transaction management system
(Xiao 1995), a Ph.D. Thesis at the University of Illinois at Urbana-Champaign, is the first known explicit description
of CO utilization (based on Raz 1992) in a (research) system to achieve Global serializability across transactional
objects.
Semi-optimistic database scheduler
(Perrizo and Tatarinov 1998) presents a database scheduler, described as "semi-optimistic," which implements Strict
CO (SCO; Raz 1991b). (Raz 1992) is quoted there multiple times (however (Raz 1991b), which introduced SCO,
is not; that publication appeared only as a DEC technical report). Both SCO and SS2PL provide both CO and
Strictness (which is utilized for effective database recovery from failure). SCO does not block on read-write conflicts
(unlike SS2PL; possibly blocking on commit instead), while blocking on the other conflict types (exactly as SS2PL),
and hence the term "semi-optimistic." As a result SCO provides better concurrency than SS2PL, which blocks on all
conflict types (see Strict CO).
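A rough Python sketch of this behavior follows (illustrative only; the class name and conflict labels are assumptions made here, not code from the cited paper): a read-write conflict lets the requesting operation proceed and only records a commit-order constraint, while the other conflict types block as under SS2PL.

class SCOConflictHandler:
    # Records which transaction must commit before which, instead of blocking
    # the data access, for read-write conflicts; other conflicts block as in SS2PL.

    def __init__(self):
        self.commit_before = set()   # pairs (first, second): first commits first

    def on_conflict(self, kind, holder_tx, requester_tx):
        # kind: 'rw' for a read-write conflict, anything else for the other types.
        if kind == 'rw':
            # Semi-optimistic: the access proceeds now; only the commit order of
            # the two transactions is constrained (any blocking moves to commit time).
            self.commit_before.add((holder_tx, requester_tx))
            return 'proceed'
        # As under SS2PL: the requester waits for the holder's lock release.
        return 'block'

    def may_commit(self, tx, ended):
        # tx may commit only after every transaction it must follow has ended.
        return all(first in ended
                   for first, second in self.commit_before if second == tx)

handler = SCOConflictHandler()
print(handler.on_conflict('rw', 'T1', 'T2'))   # proceed -- no data-access blocking
print(handler.on_conflict('ww', 'T1', 'T3'))   # block   -- exactly as SS2PL
print(handler.may_commit('T2', ended=set()))   # False: T2 waits at commit for T1
print(handler.may_commit('T2', ended={'T1'}))  # True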
Transactional processes
Transactional processes are processes that cooperate within atomic transactions. The solutions given in articles for
imposing Global serializability across them are completely based on CO. (Schuldt et al. 1999) also demonstrates the
utilization of CO in the product SAP R/3, the former name of the main Enterprise Resource Planning (ERP) software
product of SAP AG.
Only (Schuldt et al. 2002) references (Raz 1992); the other articles, even later ones, do not reference the CO
work, e.g., (Haller et al. 2005). Early articles use the name "Commit-order-serializability" for CO, e.g., (Schuldt et
al. 1999). Many articles provide only a description of a CO algorithm without using the CO name, or using another
name. E.g., (Haller et al. 2005) uses the name "decentralized serialization graph protocol" (DSGT protocol) for
an implementation of CO. The protocol is described there (ibid., page 29) as follows:
"In contrast, each transaction owns a local serialization graph which comprises the conflicts in which the
transaction is involved. Essentially, the graph contains at least all conflicts that cause the transaction to be
dependent on other transactions. This partial knowledge is sufficient for transactions to decide whether they
are allowed to commit. Note that a transaction can only commit after all transactions on which it depends have
committed. ...It is important to note that our system model does not require a component that maintains a
global serialization graph."
It is obvious that CO is enforced, by its definition. The quote above fits exactly the description of the CO algorithm
in (Raz 1992).
In a distributed environment a voting mechanism is a must in order to reach consensus on whether to commit or
abort a distributed transaction (i.e., an atomic commitment protocol mechanism). No such mechanism is explicitly
mentioned. With such a mechanism voting deadlocks occur and are typically resolved by the same mechanism. Such
automatic Global deadlock resolution is not noticed, and the utilization of known dedicated methods for deadlock
resolution is described there. The related articles on transactional processes which use CO are unaware of the
possibility of a voting deadlock in case of incompatibility of transactions' local precedence orders, a fundamental
misunderstanding of the CO mechanism, and thus their arguments for correctness are incorrect (and they do not rely
on already proven CO results).
Tolerating byzantine faults in transaction processing systems using commit barrier scheduling
CO has been utilized in a work on fault tolerance in transactional systems with replication (Vandiver et al. 2007,
page 14):
"The first two rules are needed for correctness. The query ordering rule ensures that each individual
transaction is executed properly. The commit-ordering rule ensures that secondaries serialize transactions in
the same order as the primary."
The CO work is not referenced there.
Middleware Architecture with Patterns and Frameworks
CO is described in (Krakowiak 2007, page 9-15, Distributed transactions) as follows:
"...In addition to commitment proper, an atomic commitment protocol can be used for concurrency control,
through a method called commitment ordering [Raz 1992]. Atomic commitment is therefore a central issue for
distributed transactions..."
Grid computing, Cloud computing, and Re:GRIDiT
Re:GRIDiT (e.g., (Voicu et al. 2009a), (Voicu and Schuldt 2009b)) is an approach to support transaction
management with data replication in the Grid and the Cloud. This approach extends the DSGT protocol approach of
(Haller et al. 2005) mentioned above, which utilizes Commitment ordering (CO). The following are quotes from
(Voicu and Schuldt 2008) which provide details on Re:GRIDiT:
"Our approach extends and generalizes the approaches presented in [ 5 ]... In this paper we define Re:GRIDiT,
a transaction protocol for replicated Grid data, that generalizes the approach proposed in [ 5 ] by extending it
to support replication."
(page 4; [ 5 ] is (Haller et al. 2005) which implements the DSGT protocol mentioned above)
"3. The commit phase: A transaction t in the commit phase has sufficient knowledge to deduce from its own
local serialization graph that it is safe to commit. This is the case when it does not depend on any active
transaction, i.e., when there is no incoming edge to t in the serialization graph of t..."
(page 9)
The second quote describes the CO algorithm in (Raz 1992), and it is obvious that Re:GRIDiT is based on CO.
An explicit characterization of CO appears in (Voicu and Schuldt 2009c), another article on Re:GRIDiT:
"Through our protocol a globally serializable schedule is produced (although no central scheduler exists and
thus no schedule is materialized in the system) [ 24 ]. Moreover, the update transactions' serialization order is
their commit order."
(page 207; [ 24 ] is (Voicu and Schuldt 2008))
Re:GRIDiT utilizes an optimistic version of CO. It uses internal (system) local sub-transactions for replication, which
makes replication for high availability transparent to a user. Replication is done within the boundaries of each write
transaction. Each such write transaction turns into a "super-transaction" with replicating local sub-transactions. The
approach does not suggest using an external atomic commitment protocol, but rather uses an integrated solution,
which must include some form of atomic commitment protocol to achieve atomicity of distributed transactions
(however no such protocol is explicitly mentioned, and neither are voting and voting deadlocks, which are crucial for
the CO solution). No benefit of an integrated atomic commitment seems to exist. Also, no concurrency control
alternatives for different transactional objects in the Grid/Cloud are suggested, contrary to a general CO solution,
which allows any CO compliant transactional object (i.e., using any CO variant optimal for the object) to participate
in the Grid/Cloud environment. Automatic Global deadlock resolution, which results from the utilization of CO with
any atomic commitment protocol over data partitioned in the Grid/Cloud, is not noticed, and the utilization of known
dedicated methods for deadlock resolution is described there. The related articles on Re:GRIDiT which use CO are
unaware of the possibility of a voting deadlock in case of incompatibility of transactions' local precedence orders, a
fundamental misunderstanding of the CO mechanism, and thus their arguments for correctness are incorrect (and
they do not rely on already proven CO results).
A performance comparison between Re:GRIDiT and SS2PL/2PC (the de facto standard for global serializability; they
use the name Strict 2PL for SS2PL) is done there, with a resulting advantage for Re:GRIDiT (while running the same
transaction loads, i.e., the same transaction mixes of the same respective transactions). This comparison is not quite
meaningful, since Re:GRIDiT comprises an optimistic version of CO, while SS2PL is the most constraining,
blocking variant of CO. It is well known that for some transaction loads an optimistic approach is better, while for other
loads 2PL is better. For a meaningful comparison between Re:GRIDiT and a common parallel CO approach,
OCO/2PC (OCO is Optimistic CO; see above) should be used instead, and then it can be seen whether the integrated
solution of Re:GRIDiT provides any advantage over a straightforward implementation of OCO/2PC (now correctly
a comparison of mechanism overhead only; transaction data-access operation blocking should not happen with either
of the two solutions, if Re:GRIDiT properly implements an optimistic version of CO and OCO/2PC with data
replication is implemented effectively).
The Re:GRIDiT articles neither reference CO articles nor mention CO, though most of the Re:GRIDiT authors have
referenced CO in their previous articles.
In (Voicu et al. 2010) the Re:FRESHiT mechanism, an enhancement of the replication mechanism of Re:GRIDiT,
is discussed. Here replication is separated from concurrency control, and no specific concurrency control mechanism
is mentioned.
Concurrent programming and Transactional memory
CO is also increasingly utilized in Concurrent programming, Transactional memory, and especially in Software
transactional memory (STM) for achieving serializability optimistically by "commit order" (e.g., Ramadan et al.
2009, Zhang et al. 2006, von Praun et al. 2007). Tens of related articles and patents utilizing "commit order" have
already been published.
Zhang et al. 2006 is a US patent entitled "Software transaction commit order and conflict management" (which
references CO US patent 5701480, Raz 1991a). Its abstract is the following:
"Various technologies and techniques are disclosed for applying ordering to transactions in a software
transactional memory system. A software transactional memory system is provided with a feature to allow a
pre-determined commit order to be specified for a plurality of transactions. The pre-determined commit order
is used at runtime to aid in determining an order in which to commit the transactions in the software
transactional memory system. A contention management process is invoked when a conflict occurs between a
first transaction and a second transaction. The pre-determined commit order is used in the contention
management process to aid in determining whether the first transaction or the second transaction should win
the conflict and be allowed to proceed."
von Praun et al. 2007 explicitly uses the term "Commit ordering" and utilizes it for achieving serializability. E.g.
(page 1):
"Moreover, IPOT inherits commit ordering from TLS, hence ordered transactions. The key idea is that
ordering enables sequential reasoning for the programmer without precluding concurrency (in the common
case) on a runtime platform with speculative execution."
To enforce optimistic CO, some implementation of the Generic local CO algorithm needs to be utilized. The patent
abstract quoted above describes a general implementation of the algorithm with a pre-determined commit order (this
falls into the category of "CO generic algorithm with real-time constraints").
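As a small illustration only (a hypothetical manager class invented here, not code from the patent or from TLS/IPOT), the following Python sketch shows the idea of a pre-determined commit order being used both for contention management and for releasing commits in that order.

class OrderedSTMCommitManager:
    # A pre-determined commit order is consulted both to pick the winner of a
    # conflict (contention management) and to release commits in that order.

    def __init__(self, predetermined_order):
        self.rank = {tx: i for i, tx in enumerate(predetermined_order)}
        self.next_rank = 0

    def resolve_conflict(self, tx_a, tx_b):
        # The transaction earlier in the pre-determined order wins the conflict;
        # the loser would typically be aborted and re-executed.
        return tx_a if self.rank[tx_a] < self.rank[tx_b] else tx_b

    def try_commit(self, tx):
        # A transaction commits only when its turn in the order has come.
        if self.rank[tx] == self.next_rank:
            self.next_rank += 1
            return True
        return False

mgr = OrderedSTMCommitManager(['T1', 'T2', 'T3'])
print(mgr.resolve_conflict('T2', 'T1'))   # T1 -- earlier in the order, wins
print(mgr.try_commit('T2'))               # False: T2 must wait for T1
print(mgr.try_commit('T1'))               # True
print(mgr.try_commit('T2'))               # True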
CO continues to be ignored
The CO techniques were invented in 1990-1991 and have provided the only known scalable general Global
serializability solution (with no need for concurrency control information distribution, which handicaps all other
known, CO-noncompliant methods). CO also allows the effective distribution of all known concurrency control
methods and achieves Global serializability also in a heterogeneous environment with different concurrency control
methods. Though CO ("Commitment ordering" or "Commit ordering") and its patents had already been referenced
by 2009 in tens of academic articles and tens of patents (e.g., they can be found via Google Scholar and Google Patents
by the patent numbers), CO has been ignored in major relevant academic texts that have intended to provide updated
coverage of the database concurrency control fundamentals and more. Here are examples since 2009:
• Liu, Ling; Özsu, M. Tamer (Eds.) (2009): Encyclopedia of Database Systems [1], 1st Edition, 3752 pages,
Springer, ISBN 978-0-387-49616-0
An encyclopedia on a subject is usually exhaustive, covering all aspects of the subject. Concurrency control is
thoroughly covered. CO, an important concept for database concurrency control which emerged publicly 18
years prior to this book, is missing.
• Avi Silberschatz, Henry F. Korth, S. Sudarshan (2010): Database System Concepts [2], 6th Edition, McGraw-Hill,
ISBN 0-07-295886-3
Two-level serializability, which does not guarantee Global serializability (but rather comprises a relaxed form
of it) and relies on a centralized component (clear major disadvantages), is described in detail and proposed as
a distributed concurrency control method. No evidence is given of a performance or any other advantage over
the unmentioned CO, which does not have these disadvantages.
• Özsu, M. Tamer; Valduriez, Patrick (2011): Principles of Distributed Database Systems [3], Third Edition,
Springer, ISBN 978-1-4419-8833-1, Chapter 11: Distributed Concurrency Control, pages 361-403
This chapter has not been changed much (if at all) since the book's 1999 second edition. The old, traditional
concurrency control methods, which do not scale in distributed database systems (the subject of the book), are
described (but not the unmentioned CO, which scales and thus is the most relevant serializability method for
distributed databases).
References
• Abraham Silberschatz, Michael Stonebraker, and Jeffrey Ullman (1991): "Database Systems: Achievements and
Opportunities"
[4]
, Communications of the ACM, Vol. 34, No. 10, pp. 110-120, October 1991
• Gerhard Weikum, Gottfried Vossen (2001): Transactional Information Systems
[5]
, Elsevier, ISBN
1-55860-508-8
• Liu, Ling; Özsu, M. Tamer (Eds.) (2009): Encyclopedia of Database Systems
[1]
, 1st Edition, 3752 pages,
Springer, ISBN 978-0-387-49616-0
• Avi Silberschatz, Henry F Korth, S. Sudarshan (2010): Database System Concepts
[2]
, 6th Edition, McGraw-Hill,
ISBN 0-07-295886-3
• Philip A. Bernstein, Eric Newcomer (2009): Principles of Transaction Processing, 2nd Edition
[6]
, Morgan
Kaufmann (Elsevier), ISBN 978-1-55860-623-4
Dynamic atomicity
• William Edward Weihl (1984): "Specification and Implementation of Atomic Data Types"
[7]
, Ph.D. Thesis,
MIT-LCS-TR-314, March 1984, MIT, LCS lab.
• William Edward Weihl (1988): "Commutativity-based concurrency control for abstract data types"
[8]
,
Proceedings of the Twenty-First Annual Hawaii International Conference on System Sciences, Software Track,
1988, Volume 2, pp. 205-214, Kailua-Kona, HI, USA, ISBN 0-8186-0842-0
• Alan Fekete, Nancy Lynch, Michael Merritt, William Weihl (1988): Commutativity-based locking for nested
transactions (PDF)
[9]
MIT, LCS lab, Technical report MIT/LCS/TM-370, August 1988.
• William Edward Weihl (1989): "Local atomicity properties: modular concurrency control for abstract data types"
[10]
, ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 11, Issue 2, April 1989,
pp. 249 - 282, ISSN:0164-0925
• Alan Fekete, Nancy Lynch, Michael Merritt, William Weihl (1990): "Commutativity-based locking for nested
transactions"
[11]
(PDF
[12]
), Journal of Computer and System Sciences (JCSS), Volume 41 Issue 1, pp 65-156,
August 1990, Academic Press, Inc., doi:10.1016/0022-0000(90)90034-I.
• Nancy Lynch, Michael Merritt, William Weihl, Alan Fekete (1993): Atomic Transactions In Concurrent and
Distributed Systems
[13]
, Morgan Kaufmann (Elsevier), August 1993, ISBN 978-1-55860-104-8, ISBN
1-55860-104-X
Analogous execution and serialization order
• Dimitrios Georgakopoulos, Marek Rusinkiewicz (1989): "Transaction Management in Multidatabase Systems",
Technical Report #UH-CS-89-20, September 1989, University of Houston, Department of Computer Science.
• Yuri Breitbart, Dimitrios Georgakopoulos, Marek Rusinkiewicz, Abraham Silberschatz (1991): "On Rigorous
Transaction Scheduling"
[14]
, IEEE Transactions on Software Engineering (TSE), September 1991, Volume 17,
Issue 9, pp. 954-960, ISSN: 0098-5589
• Yuri Breitbart, Abraham Silberschatz (1992): "Strong recoverability in multidatabase systems"
[15]
, Second
International Workshop on Research Issues on Data Engineering: Transaction and Query Processing,
(RIDE-TQP), pp. 170-175, 2–3 February 1992, Tempe, AZ, USA, ISBN 0-8186-2660-7
Commitment ordering
• Yoav Raz (1990a): On the Significance of Commitment Ordering
[12]
- Call for patenting, Memorandum, Digital
Equipment Corporation, November 1990.
• Yoav Raz (1990b): "The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous
Environment of Multiple Autonomous Resource Managers Using Atomic Commitment", Technical Report
DEC-TR 841, Digital Equipment Corporation, November 1990 (1995 last version of the technical report can be
found here
[16]
).
• Yoav Raz (1991a): US patents 5,504,899 (ECO)
[13]
5,504,900 (CO)
[14]
5,701,480 (MVCO)
[15]
• Yoav Raz (1991b): "Locking Based Strict Commitment Ordering, or How to improve Concurrency in Locking
Based Resource Managers", Technical Report DEC-TR 844, Digital Equipment Corporation, December 1991.
• Yoav Raz (1991c): "The Commitment Order Coordinator (COCO) of a Resource Manager, or Architecture for
Distributed Commitment Ordering Based Concurrency Control", DEC-TR 843, Digital Equipment Corporation,
December 1991.
• Yoav Raz (1992): "The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous
Environment of Multiple Autonomous Resource Managers Using Atomic Commitment"
[7]
, Proceedings of the Eighteenth International Conference on Very Large Data Bases (VLDB), pp. 292-312,
Vancouver, Canada, August 1992 (an abridged version of (Raz 1990b)).
• Yoav Raz (1993a): "Extended Commitment Ordering or Guaranteeing Global Serializability by Applying
Commitment Order Selectivity to Global Transactions",
[16]
Proceedings of the Twelfth ACM Symposium on
Principles of Database Systems (PODS), Washington, DC, pp. 83-96, May 1993 (also DEC-TR 842, November
1991).
• Yoav Raz (1993b): "Commitment Ordering Based Distributed Concurrency Control for Bridging Single and
Multi Version Resources",
[17]
Proceedings of the Third IEEE International Workshop on Research Issues on
Data Engineering: Interoperability in Multidatabase Systems (RIDE-IMS), Vienna, Austria, pp. 189-198, April
1993 (also DEC-TR 853, July 1992).
• Yoav Raz (1994): "Serializability by Commitment Ordering."
[9]
Information Processing Letters (IPL), Volume
51, Number 5
[10]
, pp. 257-264, September 1994 (Received August 1991).
• Yoav Raz (2009): Theory of Commitment Ordering - Summary
[11]
GoogleSites - Site of Yoav Raz. Retrieved 1
Feb, 2011.
Later utilization of CO
An object-oriented extensible transaction management system
• Lun Xiao (1995): An object-oriented extensible transaction management system
[17]
, Ph.D. Thesis, University of
Illinois at Urbana-Champaign, 1995.
Semi-optimistic database scheduler
• William Perrizo, Igor Tatarinov (1998): "A Semi-Optimistic Database Scheduler Based on Commit Ordering"
[18]
(PDF
[19]
), 1998 Int'l Conference on Computer Applications in Industry and Engineering, pp. 75-79, Las Vegas,
November 11, 1998.
Transactional processes
• Heiko Schuldt, Hans-Jörg Schek, and Gustavo Alonso (1999): "Transactional Coordination Agents for Composite
Systems"
[20]
, In Proceedings of the 3rd International Database Engineering and Applications Symposium
(IDEAS’99), IEEE Computer Society Press, Montreal, Canada, pp. 321–331.
• Heiko Schuldt, Gustavo Alonso, Catriel Beeri, Hans-Jörg Schek (2002): "Atomicity and isolation for transactional
processes",
[21]
ACM Transactions on Database Systems (ACM TODS), 27(1): pp. 63-116, 2002.
• Klaus Haller, Heiko Schuldt, Can Türker (2005): "Decentralized coordination of transactional processes in
peer-to-peer environments",
[22]
Proceedings of the 2005 ACM CIKM, International Conference on Information
and Knowledge Management, pp. 28-35, Bremen, Germany, October 31 - November 5, 2005, ISBN
1-59593-140-6
Tolerating byzantine faults in transaction processing systems using commit barrier scheduling
• Ben Vandiver, Hari Balakrishnan, Barbara Liskov, Sam Madden (2007): "Tolerating byzantine faults in
transaction processing systems using commit barrier scheduling"
[23]
(PDF
[24]
), Proceedings of twenty-first ACM
SIGOPS symposium on Operating systems principles (SOSP '07), ACM New York ©2007, ISBN
978-1-59593-591-5, doi 10.1145/1294261.1294268
Middleware Architecture with Patterns and Frameworks
• Sacha Krakowiak (2007): Middleware Architecture with Patterns and Frameworks
[25]
(2009 PDF
[26]
), eBook,
427 pages, ScientificCommons
[27]
, Retrieved May 25, 2011.
Grid computing, Cloud computing, and Re:GRIDiT
• Laura Voicu and Heiko Schuldt (2008): "The Re:GRIDiT Protocol: Correctness of Distributed Concurrency
Control in the Data Grid in the Presence of Replication"
[28]
, Technical Report CS-2008-002 Department of
Computer Science, DBIS UNI BASEL, 2008/9.
• Laura Cristiana Voicu, Heiko Schuldt, Fuat Akal, Yuri Breitbart, Hans Jörg Schek (2009a): "Re:GRIDiT –
Coordinating Distributed Update Transactions on Replicated Data in the Grid"
[29]
, 10th IEEE/ACM International
Conference on Grid Computing (Grid 2009), Banff, Canada, 2009/10.
• Laura Cristiana Voicu and Heiko Schuldt (2009b): "How Replicated Data Management in the Cloud can benefit
from a Data Grid Protocol — the Re:GRIDiT Approach"
[30]
, Proceedings of the 1st International Workshop on
Cloud Data Management (CloudDB 2009), Hong Kong, China, 2009/11.
• Laura Cristiana Voicu and Heiko Schuldt (2009c): "Load-Aware Dynamic Replication Management in a Data
Grid"
[31]
, On the Move to Meaningful Internet Systems: OTM 2009 Workshops Proceedings
[32]
, Vilamoura,
Portugal, November 1–6, 2009, pp. 201-218, Lecture Notes in Computer Science Vol. 5872, Springer, ISBN
978-3-642-05289-7.
• Laura Cristiana Voicu, Heiko Schuldt, Yuri Breitbart, Hans Jörg Schek (2010): "Flexible Data Access in a Cloud
based on Freshness Requirements"
[33]
, Proceedings of the 3rd International Conference on Cloud Computing
(IEEE CLOUD 2010), Miami, Florida, USA, 2010/7.
Concurrent programming and Transactional memory
• Hany E. Ramadan, Indrajit Roy, Maurice Herlihy, Emmett Witchel (2009): "Committing conflicting transactions
in an STM"
[34]
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel
programming (PPoPP '09), ISBN 978-1-60558-397-6
• Lingli Zhang, Vinod K.Grover, Michael M. Magruder, David Detlefs, John Joseph Duffy, Goetz Graefe (2006):
Software transaction commit order and conflict management
[35]
United States Patent 7711678, Granted
05/04/2010.
• Christoph von Praun, Luis Ceze, Calin Cascaval (2007) "Implicit Parallelism with Ordered Transactions"
[36]
(PDF
[37]
), Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel
programming (PPoPP '07), ACM New York ©2007, ISBN 978-1-59593-602-8 doi 10.1145/1229428.1229443
External links
• Yoav Raz's Commitment ordering page
[38]
References
[1] http://www.springer.com/computer/database+management+%26+information+retrieval/book/978-0-387-49616-0
[2] http://highered.mcgraw-hill.com/sites/0073523321/
[3] http://www.springer.com/computer/database+management+%26+information+retrieval/book/978-1-4419-8833-1
[4] http://www.informatik.uni-trier.de/~ley/db/journals/cacm/SilberschatzSU91.html
[5] http://www.elsevier.com/wps/find/bookdescription.cws_home/677937/description#description
[6] http://www.elsevierdirect.com/product.jsp?isbn=9781558606234
[7] http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-314.pdf
[8] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=11807
[9] http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA200980&Location=U2&doc=GetTRDoc.pdf
[10] http://portal.acm.org/citation.cfm?id=63518
[11] http://portal.acm.org/citation.cfm?id=78713
[12] http://groups.csail.mit.edu/tds/papers/Lynch/jcss90.pdf
[13] http://www.elsevier.com/wps/find/bookdescription.cws_home/680521/description#description
[14] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=92915
[15] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=227409
[16] http://yoavraz.googlepages.com/CO63.pdf
[17] http://srg.cs.uiuc.edu/Bib/LXiao-PhD.thesis.pdf
[18] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.7318
[19] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.7318&rep=rep1&type=pdf
[20] http://portal.acm.org/citation.cfm?id=853907
[21] http://portal.acm.org/citation.cfm?doid=507234.507236
[22] http://portal.acm.org/citation.cfm?doid=1099554.1099563
[23] http://portal.acm.org/citation.cfm?id=1294268
[24] http://people.csail.mit.edu/benmv/hrdb-sosp07.pdf
[25] http://en.scientificcommons.org/45657614
[26] http://proton.inrialpes.fr/~krakowia/MW-Book/main-onebib.pdf
[27] http://en.scientificcommons.org/
[28] http://dbis.cs.unibas.ch/publications/2008/cs-2008-002/dbis_publication_view
[29] http://dbis.cs.unibas.ch/publications/2009/grid2009/dbis_publication_view
[30] http://dbis.cs.unibas.ch/publications/2009/clouddb09/dbis_publication_view
[31] http://dbis.cs.unibas.ch/publications/2009/coopis09
[32] http://www.springer.com/computer/database+management+&+information+retrieval/book/978-3-642-05289-7
[33] http://dbis.cs.unibas.ch/publications/2010/cloud2010
[34] http://portal.acm.org/citation.cfm?id=1504201
[35] http://www.freepatentsonline.com/7711678.html
[36] http://portal.acm.org/citation.cfm?id=1229443
[37] http://www.cs.washington.edu/homes/luisceze/publications/ipot_ppopp07.pdf
[38] http://sites.google.com/site/yoavraz2/the_principle_of_co
Comparison of ADO and ADO.NET
Note: The following content requires a knowledge of database technologies.
The following is a comparison of two different database access technologies from Microsoft, namely, ActiveX Data
Objects (ADO) and ADO.NET. Before comparing the two technologies, it is essential to get an overview of
Microsoft Data Access Components (MDAC) and the .NET Framework. Microsoft Data Access Components
provide a uniform and comprehensive way of developing applications for accessing almost any data store entirely
from unmanaged code. The .NET Framework is an application virtual machine-based software environment that
provides security mechanisms, memory management, and exception handling and is designed so that developers
need not consider the capabilities of the specific CPU that will execute the .NET application. The .NET application
virtual machine turns intermediate language (IL) into machine code. High-level language compilers for C#, VB.NET
and C++ are provided to turn source code into IL. ADO.NET is shipped with the Microsoft .NET Framework.
ADO relies on COM whereas ADO.NET relies on managed providers defined by the .NET CLR. ADO.NET does
not replace ADO for the COM programmer; rather, it provides the .NET programmer with access to relational data
sources, XML, and application data.
Feature | ADO | ADO.NET
Business model | Connection-oriented models used mostly | Disconnected models are used: message-like models
Disconnected access | Provided by the Recordset | Provided by the DataAdapter and DataSet
XML support | Limited | Robust support
Connection model | The client application needs to stay connected to the data server while working on the data, unless using client-side cursors or a disconnected Recordset | The client is disconnected as soon as the data is processed; the DataSet is disconnected at all times
Data passing | ADO objects communicate in binary mode | ADO.NET uses XML for passing the data
Control of data access behaviors | Includes implicit behaviors that may not always be required in an application and that may therefore limit performance | Provides well-defined, factored components with predictable behavior, performance, and semantics
Design-time support | Derives information about data implicitly at run time, based on metadata that is often expensive to obtain | Leverages known metadata at design time in order to provide better run-time performance and more consistent run-time behavior
References
• ADO.NET for the ADO programmer
[2]
Comparison of OLAP Servers
The following tables compare general and technical information for a number of Online analytical processing
(OLAP) servers. Please see the individual products' articles for further information.
General information
OLAP Server | Company | Website | Latest stable version | Software license | License pricing
Essbase | Oracle | [1] | 11.1.2.0 | Proprietary | [2]
icCube | MISConsulting SA | [3] | 1.0 | Proprietary | free
Microsoft Analysis Services | Microsoft | [4] | 2008 R2 | Proprietary | [5]
MicroStrategy Intelligence Server | MicroStrategy | [6] | 9 | Proprietary | -
Mondrian OLAP server | Pentaho | [7] | 3.2 | EPL | free
Oracle Database OLAP Option | Oracle | [8] | 11g R2 | Proprietary | [2]
Palo | Jedox | [9] | 3.2 SR3 | GPL v2 or EULA | -
SAS OLAP Server | SAS Institute | [10] | 9.2 | Proprietary | -
SAP NetWeaver BW | SAP | [11] | 7.20 | Proprietary | -
TM1 | IBM | [12] | 9.5 | Proprietary | -
Data storage modes
OLAP Server | MOLAP | ROLAP | HOLAP | Offline
Essbase | Yes | Yes | Yes |
icCube | Yes | No | No | GWT Offline Pivot [13]
Microsoft Analysis Services | Yes | Yes | Yes | Local cubes, PowerPivot for Excel
MicroStrategy Intelligence Server | Yes | Yes | Yes | MicroStrategy Office [14], Dynamic Dashboards [15]
Mondrian OLAP server | No | Yes | No |
Oracle Database OLAP Option | Yes | Yes | Yes |
Palo | Yes | No | No |
SAS OLAP Server | Yes | Yes | Yes |
TM1 | Yes | No | No |
SAP NetWeaver BW | Yes | Yes | No |
APIs and query languages
APIs and query languages the OLAP servers support.
OLAP Server | XML for Analysis | OLE DB for OLAP | MDX | Stored procedures | Custom functions | SQL
Essbase | Yes | Yes | Yes | Java | Yes | No
icCube | Yes | Yes | Yes | Java | Yes | No
Microsoft Analysis Services | Yes | Yes | Yes | .NET [16] | Yes [17] | No
MicroStrategy Intelligence Server | Yes | No | Yes | Yes | Yes | Yes
Mondrian OLAP server | Yes | Yes [18] | Yes | No | Yes [19] | No
Palo | Yes | Yes | Yes | ? | Yes | No
Oracle Database OLAP Option | No | Yes [20] | Yes [20] | Java, PL/SQL, OLAP DML | Yes | Yes [21]
SAS OLAP Server | Yes | Yes | Yes | No | No | No
SAP NetWeaver BW | Yes | Yes | Yes | No | Yes | No
TM1 | Yes | Yes | Yes | ? | Yes | No
OLAP features
OLAP Server | Parent-child hierarchies | Multiple time hierarchies | Semi-additive measures | Write-back | Measure groups | Partitioning
Essbase | Yes | Yes | Yes | Yes | Yes | Yes
icCube | Yes | Yes | Yes | Yes | Yes | Planned
Microsoft Analysis Services | Yes | Yes | Yes | Yes | Yes | Yes
MicroStrategy Intelligence Server | Yes | Yes | Yes | Yes [22] | ? | Yes
Mondrian OLAP server | Yes | No | Yes | Planned | ? | No
Oracle Database OLAP Option | Yes | ? | Yes | Yes | ? | Yes
Palo | Yes | ? | ? | Yes | ? | ?
TM1 | Yes | Yes [23] | Yes | Yes | ? | No
SAP NetWeaver BW | Yes | Yes | Yes | Yes | ? | Yes
System limits
OLAP Server | # cubes | # measures | # dimensions | # hierarchies in dimension | # levels in hierarchy | # dimension members
Essbase [24] | ? | ? | ? | 256 | ? | 20,000,000 (ASO), 1,000,000 (BSO)
icCube | 2,147,483,647 | 2,147,483,647 | 2,147,483,647 | 2,147,483,647 | 2,147,483,647 | 2,147,483,647
Microsoft Analysis Services [25] | 2,147,483,647 | 2,147,483,647 | 2,147,483,647 | 2,147,483,647 | 2,147,483,647 | 2,147,483,647
MicroStrategy Intelligence Server | Unrestricted | Unrestricted | Unrestricted | Unrestricted | Unrestricted | Unrestricted
SAS OLAP Server [26] | ? | 1024 | 128 | 128 | 19 | 2,147,483,648
Security
OLAP Server | Authentication | Network encryption | Cell security | Dimension security | Visual totals
Essbase | Essbase authentication, LDAP authentication | SSL | Yes | Yes | Yes
Microsoft Analysis Services | NTLM, Kerberos | SSL and SSPI | Yes | Yes | Yes
MicroStrategy Intelligence Server | Host authentication, database authentication, LDAP, Microsoft Active Directory, NTLM, SiteMinder, Tivoli, SAP, Kerberos | SSL, AES [27] | Yes | Yes | Yes
Oracle Database OLAP Option | Oracle Database authentication | SSL | Yes | Yes | ?
SAS OLAP Server [28] | Host authentication, SAS token authentication, LDAP, Microsoft Active Directory | Yes [29] | Yes | Yes | Yes
Operating systems
The OLAP servers can run on the following operating systems:
OLAP Server Windows Linux UNIX z/OS
Essbase Yes Yes Yes No
icCube Yes Yes Yes Yes
Microsoft Analysis Services Yes No No No
MicroStrategy Intelligence Server Yes Yes Yes No
Mondrian OLAP server Yes Yes Yes Yes
Oracle Database OLAP Option Yes Yes Yes Yes
Palo Yes Yes Yes No
SAS OLAP Server Yes Yes Yes Yes
SAP NetWeaver BW Yes Yes Yes Yes
TM1 Yes Yes Yes No
Note (1): The server's availability depends on the Java Virtual Machine, not on the operating system.
Support information
OLAP Server | Issue tracking system | Forum/Blog | Roadmap | Source code
Essbase | myOracle Support [30] | [31] | [32] | Closed
icCube | Bugzilla [33] | [34] | [35] | Open
Microsoft Analysis Services | Connect [36] | [37] | - | Closed
MicroStrategy Intelligence Server | MicroStrategy Resource Center [38] | [39] | - | Closed
Mondrian OLAP server | Jira [40] | [41] | [42] | Open
Oracle Database OLAP Option | myOracle Support [30] | [31] | | Closed
Palo | Mantis [43] | [44] | | Open
SAS OLAP Server | Support [45] | | | Closed
SAP NetWeaver BW | OSS [46] | [47] | [48] | Closed
TM1 | no | [49] | | Closed
References
[1] "Oracle Essbase" (http://www.oracle.com/us/solutions/ent-performance-bi/business-intelligence/essbase/index.html).
[2] http://www.oracle.com/us/corporate/pricing/index.htm
[3] "icCube OLAP Server" (http://www.icCube.com).
[4] "Microsoft SQL Server 2008 Analysis Services" (http://www.microsoft.com/Sqlserver/2008/en/us/analysis-services.aspx).
[5] http://www.microsoft.com/sqlserver/2008/en/us/pricing.aspx
[6] "MicroStrategy Intelligence Server" (http://www.microstrategy.com/Software/Products/Intelligence_Server/).
[7] "Pentaho Analysis Services: Mondrian Project" (http://mondrian.pentaho.org).
[8] "Oracle OLAP Documentation" (http://www.oracle.com/technology/documentation/olap.html).
[9] "Jedox AG Business Intelligence" (http://www.jedox.com/en/home/overview.html).
[10] "SAS OLAP Server" (http://www.sas.com/technologies/dw/storage/mddb/index.html).
[11] "Components & Tools" (http://www.sap.com/usa/platform/netweaver/components/businesswarehouse/index.epx).
[12] "Cognos Business Intelligence and Financial Performance Management" (http://www-01.ibm.com/software/data/cognos/index.html).
[13] http://www.iccube.com/products/offline-pivot-table
[14] http://www.microstrategy.com/Software/Products/User_Interfaces/Office/
[15] http://www.microstrategy.com/Software/Products/Service_Modules/Report_Services/
[16] "SQL Server 2008 Books Online (October 2009): Defining Stored Procedures" (http://msdn.microsoft.com/en-us/library/ms176113.aspx). MSDN.
[17] "SQL Server 2008 Books Online (October 2009): Using Stored Procedures" (http://msdn.microsoft.com/en-us/library/ms145486.aspx). MSDN.
[18] "Pentaho and Simba Technologies Partner to Bring World's Most Popular Open Source OLAP Project to Microsoft Excel Users" (http://www.simba.com/news/Pentaho-Simba-Partner-for-Excel-Connectivity.htm).
[19] "How to Define a Mondrian Schema" (http://mondrian.pentaho.org/documentation/schema.php#User-defined_function). Pentaho.
[20] "Oracle and Simba Technologies Introduce MDX Provider for Oracle OLAP" (http://www.oracle.com/us/corporate/press/036550).
[21] "Querying Oracle OLAP Cubes: Fast Answers to Tough Questions Using Simple SQL" (http://www.oracle.com/technology/products/bi/olap/11g/demos/olap_sql_demo.html).
[22] "Common Extensions of the MicroStrategy Platform" (http://www.microstrategy.com/Software/Products/Dev_Tools/SDK/extensions.asp).
[23] "How to add Current Month and Current Year level into Time dimension" (http://www.ibm.com/developerworks/forums/message.jspa?messageID=14495600).
[24] "Essbase Server Limits" (http://download.oracle.com/docs/cd/E12825_01/epm.111/esb_dbag/frameset.htm?limits.htm). Oracle.
[25] "SQL Server 2008 Books Online (October 2009): Maximum Capacity Specifications (Analysis Services - Multidimensional Data)" (http://technet.microsoft.com/en-us/library/ms365363.aspx). Microsoft.
[26] "SAS OLAP Cube Size Specifications" (http://support.sas.com/documentation/cdl/en/olapug/59574/HTML/default/a003302815.htm).
[27] MicroStrategy Intelligence Server™ Features (http://latam.microstrategy.com/Software/Products/Intelligence_Server/features.asp)
[28] "SAS OLAP Security Totals and Permission Conditions" (http://support.sas.com/documentation/cdl/en/mdxag/59575/HTML/default/a003230130.htm).
[29] "How to Change Over-the-Wire Encryption Settings for SAS Servers" (http://support.sas.com/documentation/cdl/en/bisecag/61133/HTML/default/a003275910.htm).
[30] http://support.oracle.com
[31] http://forums.oracle.com/forums/main.jspa?categoryID=84
[32] http://communities.ioug.org/Portals/2/Oracle_Essbase_Roadmap_Sep_09.pdf
[33] http://issues.iccube.com/
[34] http://www.iccube.com/support/forum
[35] http://www.iccube.com/support/qa/roadmap
[36] https://connect.microsoft.com/SQLServer
[37] http://social.msdn.microsoft.com/Forums/en-US/sqlanalysisservices/threads
[38] https://resource.microstrategy.com/Support/MainSearch.aspx
[39] https://resource.microstrategy.com/Forum/
[40] http://jira.pentaho.com/browse/MONDRIAN
[41] http://forums.pentaho.org/forumdisplay.php?f=79
[42] http://mondrian.pentaho.org/documentation/roadmap.php
[43] http://bugs.palo.net/mantis/main_page.php
[44] http://www.jedox.com/community/palo-forum/board.php?boardid=9
[45] http://support.sas.com/forums/index.jspa
[46] http://service.sap.com/
[47] http://forums.sdn.sap.com/index.jspa
[48] http://esworkplace.sap.com/socoview(bD1lbiZjPTAwMSZkPW1pbg==)/render.asp?id=2270EAD629814D05A7ECECECECC8D002&fragID=&packageid=DEE98D07DF9FA9F1B3C7001A64D3F462
[49] http://forums.olapforums.com/
Comparison of structured storage software
Structured storage is computer storage for structured data, often in the form of a distributed database.[1] Computer
software systems formally known as structured storage systems include Apache Cassandra,[2] Google's BigTable[3]
and HBase.[4]
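Before the feature-by-feature comparison, a toy example may help make the "Type" and "Persistence" columns concrete. The Python sketch below is a minimal key-value store that rebuilds its in-memory state from an append-only log at startup; it is an illustration only, with an invented file name and record format, and it has none of the replication, high-availability, transaction or rack-awareness features compared below.

    # Toy key-value store with persistence via an append-only log (illustration only).
    import json, os

    class TinyKV:
        def __init__(self, log_path="tinykv.log"):
            self.log_path = log_path
            self.data = {}
            self._replay()                      # rebuild in-memory state from the log

        def _replay(self):
            if not os.path.exists(self.log_path):
                return
            with open(self.log_path) as f:
                for line in f:
                    record = json.loads(line)
                    if record["op"] == "put":
                        self.data[record["key"]] = record["value"]
                    elif record["op"] == "delete":
                        self.data.pop(record["key"], None)

        def _append(self, record):
            with open(self.log_path, "a") as f:
                f.write(json.dumps(record) + "\n")   # persisted once written to disk

        def put(self, key, value):
            self._append({"op": "put", "key": key, "value": value})
            self.data[key] = value

        def get(self, key, default=None):
            return self.data.get(key, default)

        def delete(self, key):
            self._append({"op": "delete", "key": key})
            self.data.pop(key, None)

    store = TinyKV()
    store.put("user:1", {"name": "Ada"})
    print(store.get("user:1"))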
Comparison
The following is a comparison of notable structured storage systems.

Project Name | Type | Persistence | Replication | High Availability | Transactions | Rack-locality Awareness | Implementation Language | Influences, Sponsors | License
AllegroGraph | Graph database | Yes | No - v5, 2010 | Yes | Yes | No | Common Lisp | Franz Inc. [5] | Commercial, Limited Free Version
Apache Jackrabbit | Key-value & Hierarchical & Document | Yes | Yes | Yes | Yes | likely | Java | Apache, Roy Fielding, Day Software | Apache 2.0
Berkeley DB/Dbm/Ndbm (bdb) 1.x | Key-value | Yes | No | No | No | No | C | old school | Various
Berkeley DB Sleepycat/Oracle Berkeley DB 2.x | Key-value | Yes | Yes | Unknown | Yes | No | C, C++, or Java | dbm, Sleepycat/Oracle | dual BSD-like Sleepycat License/commercial
Cassandra | Key-value | Yes | Yes | Distributed | Eventually consistent | Yes | Java | Dynamo and BigTable, Facebook/Digg/Rackspace | Apache 2.0
CouchDB | Document | Yes | Yes | replication + load balancing | Atomicity is per document, per CouchDB instance | No | Erlang | Lotus Notes / Ubuntu, Mozilla, IBM | Apache 2.0
Extensible Storage Engine (ESE/NT) | Document or Key-value | Yes | No | No | Yes | No | C++, Assembly | Microsoft | per Windows License
GigaSpaces [6] | Tuple Space & Relational & Document & Key-value | Yes | Yes | Yes | Yes | Depends on user configuration | Java | Tuple space | commercial
GT.M | Key-value | Yes | Yes | Yes | Yes | Depends on user configuration | C (small bits of assembly language) | FIS [7] | AGPL v3
HBase | Key-value | Yes; major version upgrades require re-import | See HDFS, S3 or EBS | maybe with Zookeeper in 0.21? | Unknown | See HDFS, S3 or EBS | Java | BigTable | Apache 2.0
Hypertable | Key-value | Yes | Yes, with KosmosFS and Ceph | coming in 2.0 | coming | Yes, with KosmosFS | C++ | BigTable | GPL 2.0
Information Management System (IBM IMS aka DB1) | Key-value, multi-level | Yes | Yes | Yes, with HALDB | Yes, with IMS TM | Unknown | Assembler | IBM since 1966 | Proprietary
Memcache | Key-value | No | No | No | Yes | No | C | Six Apart/NorthScale/Fotolog/Facebook | BSD-like permissive copyright by Danga
MongoDB | Document (JSON) | Yes | Yes | fail-over | Single document atomicity | No | C++ | 10gen [8] | GNU AGPL v3.0
Neo4j | Graph database | Yes | Yes | Yes | Yes | No | Java | Neo Technology [9] | GNU GPL v3.0
Redis | Key-value | Yes, but last few queries can be lost | Yes | No | Yes | No | Ansi-C | Memcache | BSD
SimpleDB (Amazon.com) | Document & Key-value | Yes | Yes (automatic) | Yes | Unknown | likely | Erlang | Amazon.com | Amazon internal only
References
[1] Hamilton, James (3 November 2009). "Perspectives: One Size Does Not Fit All" (http://perspectives.mvdirona.com/CommentView,guid,afe46691-a293-4f9a-8900-5688a597726a.aspx). Retrieved 13 November 2009.
[2] Lakshman, Avinash; Malik, Prashant. Cassandra - A Decentralized Structured Storage System (http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf). Cornell University. Retrieved 13 November 2009.
[3] Chang, Fay; Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data (http://labs.google.com/papers/bigtable-osdi06.pdf). Google. Retrieved 13 November 2009.
[4] Kellerman, Jim. "HBase: structured storage of sparse data for Hadoop" (http://www.rapleaf.com/pdfs/hbase_part_2.pdf). Retrieved 13 November 2009.
[5] http://www.franz.com
[6] http://gigaspaces.com/
[7] http://fis-gtm.com
[8] http://www.10gen.com
[9] http://www.neotechnology.com/
Computer-aided software engineering
[Figure: Example of a CASE tool.]
Computer-aided software
engineering (CASE) is the scientific
application of a set of tools and
methods to a software system which is
meant to result in high-quality,
defect-free, and maintainable software
products.
[1]
It also refers to methods for
the development of information systems
together with automated tools that can
be used in the software development
process.
[2]
Overview
The term "computer-aided software
engineering" (CASE) can refer to the
software used for the automated
development of systems software, i.e.,
computer code. The CASE functions include analysis, design, and programming. CASE tools automate methods for
designing, documenting, and producing structured computer code in the desired programming language.
CASE software supports the software process activities such as requirement engineering, design, program
development and testing. Therefore, CASE tools include design editors, data dictionaries, compilers, debuggers,
system building tools, etc.
CASE also refers to the methods dedicated to an engineering discipline for the development of information system
using automated tools.
CASE is mainly used for the development of quality software which will perform effectively.
History
The ISDOS project at the University of Michigan initiated a great deal of interest in the whole concept of using
computer systems to help analysts in the very difficult process of analysing requirements and developing systems.
Several papers by Daniel Teichroew fired a whole generation of enthusiasts with the potential of automated systems
development. His PSL/PSA tool was a CASE tool although it predated the term. His insights into the power of
meta-meta-models were inspiring, particularly to a former student, Dr. Hasan Sayani, currently Professor, Program
Director at University of Maryland University College.
Another major thread emerged as a logical extension to the DBMS directory. By extending the range of meta-data
held, the attributes of an application could be held within a dictionary and used at runtime. This "active dictionary"
became the precursor to the more modern "model driven execution" (MDE) capability. However, the active
dictionary did not provide a graphical representation of any of the meta-data. It was the linking of the concept of a
dictionary holding analysts' meta-data, as derived from the use of an integrated set of techniques, together with the
graphical representation of such data that gave rise to the earlier versions of I-CASE.
The term CASE was originally coined by software company Nastec Corporation of Southfield, Michigan in 1982
with their original integrated graphics and text editor GraphiText, which also was the first microcomputer-based
system to use hyperlinks to cross-reference text strings in documents—an early forerunner of today's web page link.
GraphiText's successor product, DesignAid, was the first microprocessor-based tool to logically and semantically
evaluate software and system design diagrams and build a data dictionary.
Under the direction of Albert F. Case, Jr. vice president for product management and consulting, and Vaughn Frick,
director of product management, the DesignAid product suite was expanded to support analysis of a wide range of
structured analysis and design methodologies, notably Ed Yourdon and Tom DeMarco, Chris Gane & Trish Sarson,
Ward-Mellor (real-time) SA/SD and Warnier-Orr (data driven).
The next entrant into the market was Excelerator from Index Technology in Cambridge, Mass. While DesignAid ran
on Convergent Technologies and later Burroughs Ngen networked microcomputers, Index launched Excelerator on
the IBM PC/AT platform. While, at the time of launch, and for several years, the IBM platform did not support
networking or a centralized database as did the Convergent Technologies or Burroughs machines, the allure of IBM
was strong, and Excelerator came to prominence. Hot on the heels of Excelerator were a rash of offerings from
companies such as Knowledgeware (James Martin, Fran Tarkenton and Don Addington), Texas Instrument's IEF and
Accenture's FOUNDATION toolset (METHOD/1, DESIGN/1, INSTALL/1, FCP).
CASE tools were at their peak in the early 1990s. At the time IBM had proposed AD/Cycle, which was an alliance of
software vendors centered around IBM's Software repository using IBM DB2 in mainframe and OS/2:
The application development tools can be from several sources: from IBM, from vendors, and from the
customers themselves. IBM has entered into relationships with Bachman Information Systems, Index
Technology Corporation, and Knowledgeware, Inc. wherein selected products from these vendors will be
marketed through an IBM complementary marketing program to provide offerings that will help to achieve
complete life-cycle coverage.
[3]
With the decline of the mainframe, AD/Cycle and the Big CASE tools died off, opening the market for the
mainstream CASE tools of today. Nearly all of the leaders of the CASE market of the early 1990s ended up being
purchased by Computer Associates, including IEW, IEF, ADW, Cayenne, and Learmonth & Burchett Management
Systems (LBMS).
Types of CASE tools include:
1. Analysis tools
2. A repository to store all diagrams, forms, models, report definitions, etc.
3. Diagramming tools
4. Screen and report generators
5. Code generators
6. Documentation generators
Supporting software
Alfonso Fuggetta classified CASE into 3 categories:
[4]
1. Tools support only specific tasks in the software process.
2. Workbenches support only one or a few activities.
3. Environments support (a large part of) the software process.
Workbenches and environments are generally built as collections of tools. Tools can therefore be either stand alone
products or components of workbenches and environments.
Tools
CASE tools are a class of software that automate many of the activities involved in various life cycle phases. For
example, when establishing the functional requirements of a proposed application, prototyping tools can be used to
develop graphic models of application screens to assist end users to visualize how an application will look after
development. Subsequently, system designers can use automated design tools to transform the prototyped functional
requirements into detailed design documents. Programmers can then use automated code generators to convert the
design documents into code. Automated tools can be used collectively, as mentioned, or individually. For example,
prototyping tools could be used to define application requirements that get passed to design technicians who convert
the requirements into detailed designs in a traditional manner using flowcharts and narrative documents, without the
assistance of automated design software.
[5]
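As a toy illustration of the code-generator category mentioned above (not the behavior of any particular CASE product), the following Python sketch emits class source code from a small, invented data-model description:

    # Toy "code generator": emit a Python class from a simple data-model description.
    # The model format here is invented purely for illustration.
    MODEL = {
        "name": "Customer",
        "fields": ["customer_id", "name", "email"],
    }

    def generate_class(model):
        lines = ["class {}:".format(model["name"])]
        args = ", ".join(model["fields"])
        lines.append("    def __init__(self, {}):".format(args))
        for field in model["fields"]:
            lines.append("        self.{0} = {0}".format(field))
        return "\n".join(lines)

    print(generate_class(MODEL))   # the "generated" source code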
Existing CASE tools can be classified along 4 different dimensions:
1. Life-cycle support
2. Integration dimension
3. Construction dimension
4. Knowledge-based CASE dimension
[6]
Let us take the meaning of these dimensions along with their examples one by one:
Life-Cycle Based CASE Tools
This dimension classifies CASE Tools on the basis of the activities they support in the information systems life
cycle. They can be classified as Upper or Lower CASE tools.
• Upper CASE Tools support strategic planning and construction of concept-level products and ignore the design
aspect. They support traditional diagrammatic languages such as ER diagrams, Data flow diagram, Structure
charts, Decision Trees, Decision tables, etc.
• Lower CASE Tools concentrate on the back end activities of the software life cycle, such as physical design,
debugging, construction, testing, component integration, maintenance, reengineering and reverse engineering.
Integration dimension
Three main CASE Integration dimensions have been proposed:
[7]
1. CASE Framework
2. ICASE Tools
3. Integrated Project Support Environment (IPSE)
Workbenches
Workbenches integrate several CASE tools into one application to support specific software-process activities.
Hence they achieve:
• a homogeneous and consistent interface (presentation integration).
• easy invocation of tools and tool chains (control integration).
• access to a common data set managed in a centralized way (data integration).
CASE workbenches can be further classified into following 8 classes:
[4]
1. Business planning and modeling
2. Analysis and design
3. User-interface development
4. Programming
5. Verification and validation
6. Maintenance and reverse engineering
7. Configuration management
8. Project management
Environments
An environment is a collection of CASE tools and workbenches that supports the software process. CASE
environments are classified based on the focus/basis of integration
[4]
1. Toolkits
2. Language-centered
3. Integrated
4. Fourth generation
5. Process-centered
Toolkits
Toolkits are loosely integrated collections of products easily extended by aggregating different tools and
workbenches. Typically, the support provided by a toolkit is limited to programming, configuration management and
project management. A toolkit itself is an environment extended from a basic set of operating system tools, for
example the Unix Programmer's Work Bench and the VMS VAX Set. In addition, the loose integration of toolkits
requires users to activate tools by explicit invocation or simple control mechanisms. The resulting files are
unstructured and may be in different formats, so accessing a file from different tools may require explicit file
format conversion. However, since the only constraint for adding a new component is the format of the files,
toolkits can be easily and incrementally extended.
[4]
Language-centered
The environment itself is written in the programming language for which it was developed, thus enabling users to
reuse, customize and extend the environment. Integration of code in different languages is a major issue for
language-centered environments. Lack of process and data integration is also a problem. The strengths of these
environments include a good level of presentation and control integration. Interlisp, Smalltalk, Rational, and KEE are
examples of language-centered environments.
[4]
Integrated
These environments achieve presentation integration by providing uniform, consistent, and coherent tool and
workbench interfaces. Data integration is achieved through the repository concept: they have a specialized database
managing all information produced and accessed in the environment. Examples of integrated environment are IBM
AD/Cycle and DEC Cohesion.
[4]
Fourth-generation
Fourth-generation environments were the first integrated environments. They are sets of tools and workbenches
supporting the development of a specific class of program: electronic data processing and business-oriented
applications. In general, they include programming tools, simple configuration management tools, document
handling facilities and, sometimes, a code generator to produce code in lower level languages. Informix 4GL, and
Focus fall into this category.
[4]
Process-centered
Environments in this category focus on process integration with other integration dimensions as starting points. A
process-centered environment operates by interpreting a process model created by specialized tools. They usually
consist of tools handling two functions:
• Process-model execution
• Process-model production
Examples are East, Enterprise II, Process Wise, Process Weaver, and Arcadia.
[4]
Applications
All aspects of the software development life cycle can be supported by software tools, and so the use of tools from
across the spectrum can, arguably, be described as CASE; from project management software through tools for
business and functional analysis, system design, code storage, compilers, translation tools, test software, and so on.
However, tools that are concerned with analysis and design, and with using design information to create parts (or all)
of the software product, are most frequently thought of as CASE tools. CASE applied, for instance, to a database
software product, might normally involve:
• Modeling business / real-world processes and data flow
• Development of data models in the form of entity-relationship diagrams
• Development of process and function descriptions
Risks and associated controls
Common CASE risks and associated controls include:
• Inadequate standardization: Linking CASE tools from different vendors (design tool from Company X,
programming tool from Company Y) may be difficult if the products do not use standardized code structures and
data classifications. File formats can be converted, but usually not economically. Controls include using tools
from the same vendor, or using tools based on standard protocols and insisting on demonstrated compatibility.
Additionally, if organizations obtain tools for only a portion of the development process, they should consider
acquiring them from a vendor that has a full line of products to ensure future compatibility if they add more
tools.
[5]
• Unrealistic expectations: Organizations often implement CASE technologies to reduce development costs.
Implementing CASE strategies usually involves high start-up costs. Generally, management must be willing to
accept a long-term payback period. Controls include requiring senior managers to define their purpose and
strategies for implementing CASE technologies.
[5]
• Slow implementation: Implementing CASE technologies can involve a significant change from traditional
development environments. Typically, organizations should not use CASE tools the first time on critical projects
or projects with short deadlines because of the lengthy training process. Additionally, organizations should
consider using the tools on smaller, less complex projects and gradually implementing the tools to allow more
training time.
[5]
• Weak repository controls: Failure to adequately control access to CASE repositories may result in security
breaches or damage to the work documents, system designs, or code modules stored in the repository. Controls
include protecting the repositories with appropriate access, version, and backup controls.
[5]
References
[1] Kuhn, D.L (1989). "Selecting and effectively using a computer aided software engineering tool". Annual Westinghouse computer symposium; 6–7 Nov 1989; Pittsburgh, PA (U.S.); DOE Project.
[2] P. Loucopoulos and V. Karakostas (1995). System Requirements Engineering. McGraw-Hill.
[3] "AD/Cycle strategy and architecture", IBM Systems Journal, Vol 29, NO 2, 1990; p. 172.
[4] Alfonso Fuggetta (December 1993). "A classification of CASE technology" (http://www2.computer.org/portal/web/csdl/abs/mags/co/1993/12/rz025abs.htm). Computer 26 (12): 25–38. doi:10.1109/2.247645. Retrieved 2009-03-14.
[5] Software Development Techniques (http://www.ffiec.gov/ffiecinfobase/booklets/d_a/10.html). In: FFIEC InfoBase. Retrieved 26 Oct 2008.
[6] Software Engineering: Tools, Principles and Techniques by Sangeeta Sabharwal, Umesh Publications
[7] Evans R. Rock. Case Analyst Workbenches: A Detailed Product Evaluation. Volume 1, pp. 229–242 by
External links
• CASE Tools (http://case-tools.org/) - A CASE tools community with comments, tags, forums, articles, reviews, etc.
• CASE tool index (http://www.unl.csi.cuny.edu/faqs/software-enginering/tools.html) - A comprehensive list of CASE tools
• UML CASE tools (http://www.objectsbydesign.com/tools/umltools_byProduct.html) - A comprehensive list of UML CASE tools, mainly resources for choosing a UML CASE tool, with some related to MDA CASE tools.
Concurrency control
In information technology and computer science, especially in the fields of computer programming (see also
concurrent programming, parallel programming), operating systems (see also parallel computing), multiprocessors,
and databases, concurrency control ensures that correct results for concurrent operations are generated, while
getting those results as quickly as possible.
Computer systems, both software and hardware, consist of modules, or components. Each component is designed to
operate correctly, i.e., to obey or meet certain consistency rules. When components that operate concurrently
interact by messaging or by sharing accessed data (in memory or storage), a certain component's consistency may be
violated by another component. The general area of concurrency control provides rules, methods, design
methodologies, and theories to maintain the consistency of components operating concurrently while interacting, and
thus the consistency and correctness of the whole system. Introducing concurrency control into a system means
applying operation constraints which typically result in some performance reduction. Operation consistency and
correctness should be achieved with the best possible efficiency, without reducing performance below reasonable levels.
See also Concurrency (computer science).
Concurrency control in databases
Comments:
1. This section is applicable to all transactional systems, i.e., to all systems that use database transactions (atomic
transactions; e.g., transactional objects in Systems management and in networks of smartphones which typically
implement private, dedicated database systems), not only general-purpose database management systems
(DBMSs).
2. DBMSs need to deal also with concurrency control issues not typical just to database transactions but rather to
operating systems in general. These issues (e.g., see Concurrency control in operating systems below) are out of
the scope of this section.
Concurrency control in Database management systems (DBMS; e.g., Bernstein et al. 1987, Weikum and Vossen
2001), other transactional objects, and related distributed applications (e.g., Grid computing and Cloud computing)
ensures that database transactions are performed concurrently without violating the data integrity of the respective
databases. Thus concurrency control is an essential element for correctness in any system where two database
transactions or more, executed with time overlap, can access the same data, e.g., virtually in any general-purpose
database system. Consequently a vast body of related research has been accumulated since database systems have
emerged in the early 1970s. A well established concurrency control theory for database systems is outlined in the
references mentioned above: serializability theory, which makes it possible to effectively design and analyze concurrency
control methods and mechanisms. An alternative theory for concurrency control of atomic transactions over abstract
data types is presented in (Lynch et al. 1993), and not utilized below. This theory is more refined, complex, with a
wider scope, and has been less utilized in the Database literature than the classical theory above. Each theory has its
pros and cons, emphasis and insight. To some extent they are complementary, and their merging may be useful.
To ensure correctness, a DBMS usually guarantees that only serializable transaction schedules are generated, unless
serializability is intentionally relaxed to increase performance, but only in cases where application correctness is not
harmed. For maintaining correctness in cases of failed (aborted) transactions (which can always happen for many
reasons) schedules also need to have the recoverability (from abort) property. A DBMS also guarantees that no
effect of committed transactions is lost, and no effect of aborted (rolled back) transactions remains in the related
database. Overall transaction characterization is usually summarized by the ACID rules below. As databases have
become distributed, or needed to cooperate in distributed environments (e.g., Federated databases in the early 1990s,
and Cloud computing currently), the effective distribution of concurrency control mechanisms has received special
attention.
Database transaction and the ACID rules
The concept of a database transaction (or atomic transaction) has evolved in order to enable both a well understood
database system behavior in a faulty environment where crashes can happen any time, and recovery from a crash to a
well understood database state. A database transaction is a unit of work, typically encapsulating a number of
operations over a database (e.g., reading a database object, writing, acquiring lock, etc.), an abstraction supported in
database and also other systems. Each transaction has well defined boundaries in terms of which program/code
executions are included in that transaction (determined by the transaction's programmer via special transaction
commands). Every database transaction obeys the following rules (by support in the database system; i.e., a database
system is designed to guarantee them for the transactions it runs):
• Atomicity - Either the effects of all or none of its operations remain ("all or nothing" semantics) when a
transaction is completed (committed or aborted respectively). In other words, to the outside world a committed
transaction appears (by its effects on the database) to be indivisible, atomic, and an aborted transaction does not
leave effects on the database at all, as if it had never existed.
• Consistency - Every transaction must leave the database in a consistent (correct) state, i.e., maintain the
predetermined integrity rules of the database (constraints upon and among the database's objects). A transaction
must transform a database from one consistent state to another consistent state (it is the responsibility of the
transaction's programmer to make sure that the transaction itself is correct, i.e., performs correctly what it intends
to perform while maintaining the integrity rules). Thus since a database can be normally changed only by
transactions, all the database's states are consistent. An aborted transaction does not change the state.
• Isolation - Transactions cannot interfere with each other. Moreover, usually the effects of an incomplete
transaction are not visible to another transaction. Providing isolation is the main goal of concurrency control.
• Durability - Effects of successful (committed) transactions must persist through crashes (typically by recording
the transaction's effects and its commit event in a non-volatile memory).
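A minimal way to see the atomicity rule in practice is Python's built-in sqlite3 module: either both updates of a transfer are committed, or a rollback leaves the database unchanged. The account table and amounts below are made up for the example.

    # Atomicity demo with Python's built-in sqlite3: commit applies both updates,
    # a rollback after a failure applies neither. Example data is made up.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()

    def transfer(conn, amount, fail=False):
        try:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
            if fail:
                raise RuntimeError("simulated crash before commit")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
            conn.commit()
        except RuntimeError:
            conn.rollback()        # "all or nothing": the partial update is undone

    transfer(conn, 40, fail=True)
    print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
    # [('alice', 100), ('bob', 0)]  -- the aborted transfer left no effect
    transfer(conn, 40)
    print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
    # [('alice', 60), ('bob', 40)]  -- the committed transfer took full effect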
The concept of atomic transaction has been extended during the years to what has become Business transactions
which actually implement types of Workflow and are not atomic. However also such enhanced transactions typically
utilize atomic transactions as components.
Why is concurrency control needed?
If transactions are executed serially, i.e., sequentially with no overlap in time, no transaction concurrency exists.
However, if concurrent transactions with interleaving operations are allowed in an uncontrolled manner, some
unexpected, undesirable result may occur. Here are some typical examples:
1. The lost update problem: A second transaction writes a second value of a data-item (datum) on top of a first value
written by a first concurrent transaction, and the first value is lost to other transactions running concurrently
which need, by their precedence, to read the first value. The transactions that have read the wrong value end with
incorrect results.
2. The dirty read problem: Transactions read a value written by a transaction that has been later aborted. This value
disappears from the database upon abort, and should not have been read by any transaction ("dirty read"). The
reading transactions end with incorrect results.
3. The incorrect summary problem: While one transaction takes a summary over the values of all the instances of a
repeated data-item, a second transaction updates some instances of that data-item. The resulting summary does
not reflect a correct result for any (usually needed for correctness) precedence order between the two transactions
(if one is executed before the other), but rather some random result, depending on the timing of the updates, and
whether certain update results have been included in the summary or not.
These failure modes are commonly grouped into three classes: the lost update problem, the dirty read problem, and the incorrect summary problem.
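The lost update problem is easy to reproduce outside a DBMS as well. In the Python sketch below, two threads perform uncontrolled read-modify-write "transactions" on a shared balance, so many increments overwrite each other and are lost; the iteration counts are arbitrary.

    # Demonstration of the lost update problem: two uncontrolled read-modify-write
    # "transactions" overwrite each other's updates, so the final total is too small.
    import threading, time

    balance = 0

    def deposit_many(n):
        global balance
        for _ in range(n):
            current = balance          # read
            time.sleep(0)              # encourage a thread switch between read and write
            balance = current + 1      # write based on a possibly stale read

    threads = [threading.Thread(target=deposit_many, args=(50_000,)) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()

    print("expected 100000, got", balance)   # usually less: some updates were lost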
Concurrency control
162
Concurrency control mechanisms
Categories
The main categories of concurrency control mechanisms are:
• Optimistic - Delay the checking of whether a transaction meets the isolation and other integrity rules (e.g.,
serializability and recoverability) until its end, without blocking any of its (read, write) operations ("...and be
optimistic about the rules being met..."), and then abort a transaction to prevent the violation, if the desired rules
are to be violated upon its commit. An aborted transaction is immediately restarted and re-executed, which incurs
an obvious overhead (versus executing it to the end only once). If not too many transactions are aborted, then
being optimistic is usually a good strategy.
• Pessimistic - Block an operation of a transaction, if it may cause violation of the rules, until the possibility of
violation disappears. Blocking operations is typically involved with performance reduction.
• Semi-optimistic - Block operations in some situations, if they may cause violation of some rules, and do not
block in other situations while delaying rules checking (if needed) to transaction's end, as done with optimistic.
Different categories provide different performance, i.e., different average transaction completion rates (throughput),
depending on transaction types mix, computing level of parallelism, and other factors. If selection and knowledge
about trade-offs are available, then category and method should be chosen to provide the highest performance.
The mutual blocking between two transactions (where each one blocks the other) or more results in a deadlock,
where the transactions involved are stalled and cannot reach completion. Most non-optimistic mechanisms (with
blocking) are prone to deadlocks which are resolved by an intentional abort of a stalled transaction (which releases
the other transactions in that deadlock), and its immediate restart and re-execution. The likelihood of a deadlock is
typically low.
Both blocking, deadlocks, and aborts result in performance reduction, and hence the trade-offs between the
categories.
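A minimal sketch of the optimistic idea (a teaching illustration, not any product's implementation) is to let each transaction read a snapshot together with a version number and validate that version at commit time, aborting and retrying when another transaction has committed in between:

    # Minimal optimistic concurrency sketch: validate a version counter at commit,
    # abort (and let the caller retry) if another transaction committed in between.
    import threading

    class VersionedCell:
        def __init__(self, value=0):
            self.value = value
            self.version = 0
            self._commit_lock = threading.Lock()   # protects only the short commit step

        def begin(self):
            # Read phase: take a snapshot; no blocking of other readers or writers.
            return {"snapshot": self.value, "version": self.version}

        def commit(self, txn, new_value):
            # Validation + write phase: succeed only if nobody committed since begin().
            with self._commit_lock:
                if txn["version"] != self.version:
                    return False                   # conflict detected -> abort
                self.value = new_value
                self.version += 1
                return True

    cell = VersionedCell()

    def add_one_optimistically(cell):
        while True:                                # retry loop after an abort
            txn = cell.begin()
            if cell.commit(txn, txn["snapshot"] + 1):
                return

    threads = [threading.Thread(target=lambda: [add_one_optimistically(cell) for _ in range(10_000)])
               for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(cell.value)   # 40000: no updates lost, despite no blocking during the read phase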
Methods
Many methods for concurrency control exist. Most of them can be implemented within either main category above.
The major methods, which have each many variants, and in some cases may overlap or be combined, are:
1. Locking (e.g., Two-phase locking - 2PL) - Controlling access to data by locks assigned to the data. Access of a
transaction to a data item (database object) locked by another transaction may be blocked (depending on lock type
and access operation type) until lock release.
2. Serialization graph checking (also called Serializability, or Conflict, or Precedence graph checking) - Checking
for cycles in the schedule's graph and breaking them by aborts.
3. Timestamp ordering (TO) - Assigning timestamps to transactions, and controlling or checking access to data by
timestamp order.
4. Commitment ordering (or Commit ordering; CO) - Controlling or checking transactions' chronological order of
commit events to be compatible with their respective precedence order.
Other major concurrency control types that are utilized in conjunction with the methods above include:
• Multiversion concurrency control (MVCC) - Increasing concurrency and performance by generating a new
version of a database object each time the object is written, and allowing transactions' read operations of several
last relevant versions (of each object) depending on scheduling method.
• Index concurrency control - Synchronizing access operations to indexes, rather than to user data. Specialized
methods provide substantial performance gains.
• Private workspace model (Deferred update) - Each transaction maintains a private workspace for its accessed
data, and its changed data become visible outside the transaction only upon its commit (e.g., Weikum and Vossen
2001). This model provides a different concurrency control behavior with benefits in many cases. MVCC can be
Concurrency control
163
viewed as a special case where the changed/new data in the workspace (or a copy of it) join the database itself
upon commit (for unchanged data no difference exists).
The most common mechanism type in database systems since their early days in the 1970s has been Strong strict
Two-phase locking (SS2PL; also called Rigorous scheduling or Rigorous 2PL) which is a special case (variant) of
both Two-phase locking (2PL) and Commitment ordering (CO). It is pessimistic. In spite of its long name (for
historical reasons) the idea of the SS2PL mechanism is simple: "Release all locks applied by a transaction only after
the transaction has ended." SS2PL (or Rigorousness) is also the name of the set of all schedules that can be generated
by this mechanism, i.e., these are SS2PL (or Rigorous) schedules, have the SS2PL (or Rigorousness) property.
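A minimal Python sketch of the SS2PL idea follows (an illustration only, not a DBMS implementation; real systems also distinguish shared from exclusive locks and detect deadlocks): each item is locked when first accessed, and every lock is released only when the transaction ends.

    # Minimal strong strict two-phase locking (SS2PL) sketch: lock each item on
    # access, hold everything until commit/abort, then release all locks at once.
    import threading

    class SS2PLTransaction:
        def __init__(self, lock_table):
            self.lock_table = lock_table     # shared dict: item name -> threading.Lock
            self.held = []                   # locks this transaction currently holds

        def access(self, item):
            lock = self.lock_table[item]
            if lock not in self.held:
                lock.acquire()               # may block until the other transaction ends
                self.held.append(lock)

        def end(self):                       # commit or abort
            for lock in reversed(self.held):
                lock.release()               # release *all* locks only at the very end
            self.held.clear()

    lock_table = {"x": threading.Lock(), "y": threading.Lock()}

    txn = SS2PLTransaction(lock_table)
    txn.access("x")     # read or write x: x stays locked
    txn.access("y")     # read or write y: y stays locked too
    txn.end()           # only now do other transactions see x and y unlocked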
Major goals of concurrency control mechanisms
Concurrency control mechanisms firstly need to operate correctly, i.e., to maintain each transaction's integrity rules
(as related to concurrency; application-specific integrity rules are out of scope here) while transactions are running
concurrently, and thus the integrity of the entire transactional system. Correctness needs to be achieved with as good
performance as possible. In addition, increasingly a need exists to operate effectively while transactions are
distributed over processes, computers, and computer networks. Other subjects that may affect concurrency control
are recovery and replication.
Correctness
Serializability
For correctness, a common major goal of most concurrency control mechanisms is generating schedules with the
Serializability property. Without serializability undesirable phenomena may occur, e.g., money may disappear from
accounts, or be generated from nowhere. Serializability of a schedule means equivalence (in the resulting database
values) to some serial schedule with the same transactions (i.e., in which transactions are sequential with no overlap
in time, and thus completely isolated from each other: No concurrent access by any two transactions to the same data
is possible). Serializability is considered the highest level of isolation among database transactions, and the major
correctness criterion for concurrent transactions. In some cases compromised, relaxed forms of serializability are
allowed for better performance (e.g., the popular Snapshot isolation mechanism) or to meet availability requirements
in highly distributed systems (see Eventual consistency), but only if application's correctness is not violated by the
relaxation (e.g., no relaxation is allowed for money transactions, since by relaxation money can disappear, or appear
from nowhere).
Almost all implemented concurrency control mechanisms achieve serializability by providing Conflict serializability,
a broad special case of serializability (i.e., it covers, enables most serializable schedules, and does not impose
significant additional delay-causing constraints) which can be implemented efficiently.
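Conflict serializability can be tested mechanically by building a precedence graph over conflicting operations and checking it for cycles. The following Python sketch does this for a toy schedule encoded as (transaction, operation, item) triples; the schedule format is invented for the example.

    # Conflict-serializability test: build a precedence graph from conflicting
    # operations (same item, different transactions, at least one write) and
    # report whether the graph is acyclic.
    def precedence_graph(schedule):
        edges = set()
        for i, (t1, op1, x1) in enumerate(schedule):
            for t2, op2, x2 in schedule[i + 1:]:
                if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                    edges.add((t1, t2))      # t1's op precedes and conflicts with t2's
        return edges

    def has_cycle(edges):
        graph = {}
        for a, b in edges:
            graph.setdefault(a, set()).add(b)
        visited, on_stack = set(), set()
        def dfs(node):
            visited.add(node); on_stack.add(node)
            for nxt in graph.get(node, ()):
                if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                    return True
            on_stack.discard(node)
            return False
        return any(dfs(n) for n in graph if n not in visited)

    # T1 reads x, T2 writes x, T2 writes y, T1 writes y -> cycle -> not serializable.
    schedule = [("T1", "r", "x"), ("T2", "w", "x"), ("T2", "w", "y"), ("T1", "w", "y")]
    edges = precedence_graph(schedule)
    print(edges)                                             # {('T1', 'T2'), ('T2', 'T1')}
    print("conflict serializable:", not has_cycle(edges))    # False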
Recoverability
See Recoverability in Serializability
Comment: While in the general area of systems the term "recoverability" may refer to the ability of a system to
recover from failure, within concurrency control of database systems this term has received a specific meaning.
Concurrency control typically also ensures the Recoverability property of schedules for maintaining correctness in
cases of aborted transactions (which can always happen for many reasons). Recoverability (from abort) means that
no committed transaction in a schedule has read data written by an aborted transaction. Such data disappear from the
database (upon the abort) and are parts of an incorrect database state. Reading such data violates the consistency rule
of ACID. Unlike Serializability, Recoverability cannot be compromised or relaxed in any case, since any relaxation
results in quick database integrity violation upon aborts. The major methods listed above provide serializability
mechanisms. None of them in its general form automatically provides recoverability, and special considerations and
mechanism enhancements are needed to support recoverability. A commonly utilized special case of recoverability is
Concurrency control
164
Strictness, which allows efficient database recovery from failure (but excludes optimistic implementations; e.g.,
Strict CO (SCO) cannot have an optimistic implementation, but has semi-optimistic ones).
Comment: Note that the Recoverability property is needed even if no database failure occurs and no database
recovery from failure is needed. It is rather needed to correctly automatically handle transaction aborts, which may
be unrelated to database failure and recovery from it.
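As a small illustration (using the same kind of toy schedule encoding as above, not a DBMS API), recoverability can be checked by verifying that no transaction commits after reading data written by a transaction that later aborts, or before the transaction it read from has committed; the sketch assumes every transaction in the schedule eventually commits or aborts.

    # Toy recoverability check. Events: ("w", T, x), ("r", T, x), ("c", T), ("a", T).
    def is_recoverable(schedule):
        last_writer = {}          # item -> transaction that last wrote it
        read_from = {}            # reader -> set of writers it read from
        committed, aborted = set(), set()
        commit_order = []
        for event in schedule:
            kind, txn = event[0], event[1]
            if kind == "w":
                last_writer[event[2]] = txn
            elif kind == "r":
                writer = last_writer.get(event[2])
                if writer and writer != txn:
                    read_from.setdefault(txn, set()).add(writer)
            elif kind == "c":
                committed.add(txn); commit_order.append(txn)
            elif kind == "a":
                aborted.add(txn)
        for reader in committed:
            for writer in read_from.get(reader, ()):
                if writer in aborted:
                    return False  # committed reader used data of an aborted writer
                if writer in committed and commit_order.index(writer) > commit_order.index(reader):
                    return False  # reader committed before the writer it depends on
        return True

    # T2 reads x written by T1 and commits, then T1 aborts -> not recoverable.
    print(is_recoverable([("w", "T1", "x"), ("r", "T2", "x"), ("c", "T2"), ("a", "T1")]))  # False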
Distribution
With the fast technological development of computing the difference between local and distributed computing over
low latency networks or buses is blurring. Thus the quite effective utilization of local techniques in such distributed
environments is common, e.g., in computer clusters and multi-core processors. However the local techniques have
their limitations and need multi-processes (or threads) supported by multi-processors (or cores) to scale. This often
turns them into distributed, if the transactions themselves need to span multi-processes. In these cases local
concurrency control techniques do not scale well.
Distributed serializability and Commitment ordering
See Distributed serializability in Serializability
As database systems have become distributed, or started to cooperate in distributed environments (e.g., Federated
databases in the early 1990s, and nowadays Grid computing, Cloud computing, and networks with smartphones),
some transactions have become distributed. A distributed transaction means that the transaction spans processes, and
may span computers and geographical sites. This generates a need for effective distributed concurrency control
mechanisms. Achieving the Serializability property of a distributed system's schedule (see Distributed serializability
and Global serializability (Modular serializability)) effectively poses special challenges typically not met by most of
the regular serializability mechanisms, originally designed to operate locally. This is especially due to the need for
costly distribution of concurrency control information amid communication and computer latency. The only known
general effective technique for distribution is Commitment ordering, which was disclosed publicly in 1991 (after
being patented). Commitment ordering (Commit ordering, CO; Raz 1992) means that transactions' chronological
order of commit events is kept compatible with their respective precedence order. CO does not require the
distribution of concurrency control information and provides a general effective solution (reliable, high-performance,
and scalable) for both distributed and global serializability, also in a heterogeneous environment with database
systems (or other transactional objects) with different (any) concurrency control mechanisms. CO is indifferent to
which mechanism is utilized, since it does not interfere with any transaction operation scheduling (which most
mechanisms control), and only determines the order of commit events. Thus, CO enables the efficient distribution of
all other mechanisms, and also the distribution of a mix of different (any) local mechanisms, for achieving
distributed and global serializability. The existence of such a solution has been considered "unlikely" until 1991, and
by many experts also later, due to misunderstanding of the CO solution (see Quotations in Global serializability). An
important side-benefit of CO is automatic distributed deadlock resolution. Contrary to CO, virtually all other
techniques (when not combined with CO) are prone to distributed deadlocks (also called global deadlocks) which
need special handling. CO is also the name of the resulting schedule property: A schedule has the CO property if the
chronological order of its transactions' commit events is compatible with the respective transactions' precedence
(partial) order.
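Stated as a check over a completed toy schedule (an illustration only, not the published CO algorithms), the CO property requires that whenever two committed transactions conflict, their commit order agrees with their precedence order:

    # Toy commitment ordering (CO) check: for conflicting committed transactions,
    # the order of their commit events must match their precedence order.
    def satisfies_co(operations, commit_order):
        # operations: list of (transaction, "r"/"w", item) in execution order
        # commit_order: list of transactions in the order their commits occurred
        pos = {t: i for i, t in enumerate(commit_order)}
        for i, (t1, op1, x1) in enumerate(operations):
            for t2, op2, x2 in operations[i + 1:]:
                conflict = t1 != t2 and x1 == x2 and "w" in (op1, op2)
                if conflict and t1 in pos and t2 in pos and pos[t1] > pos[t2]:
                    return False      # t1 preceded t2 on the item but committed after it
        return True

    ops = [("T1", "w", "x"), ("T2", "r", "x")]
    print(satisfies_co(ops, ["T1", "T2"]))   # True: commit order matches precedence
    print(satisfies_co(ops, ["T2", "T1"]))   # False: T2 committed before the T1 it read from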
SS2PL mentioned above is a variant (special case) of CO and thus also effective to achieve distributed and global
serializability. It also provides automatic distributed deadlock resolution (a fact overlooked in the research literature
even after CO's publication), as well as Strictness and thus Recoverability. Possessing these desired properties
together with known efficient locking based implementations explains SS2PL's popularity. SS2PL has been utilized
to efficiently achieve Distributed and Global serializability since the 1980s, and has become the de facto standard for
it. However, SS2PL is blocking and constraining (pessimistic), and with the proliferation of distribution and
utilization of systems different from traditional database systems (e.g., as in Cloud computing), less constraining
types of CO (e.g., Optimistic CO) may be needed for better performance.
Comments:
1. The Distributed conflict serializability property in its general form is difficult to achieve efficiently, but it is
achieved efficiently via its special case Distributed CO: Each local component (e.g., a local DBMS) needs both to
provide some form of CO, and enforce a special voting strategy for the Two-phase commit protocol (2PC: utilized
to commit distributed transactions). Differently from the general Distributed CO, Distributed SS2PL exists
automatically when all local components are SS2PL based (in each component CO exists, implied, and the voting
strategy is now met automatically). This fact has been known and utilized since the 1980s (i.e., that SS2PL exists
globally, without knowing about CO) for efficient Distributed SS2PL, which implies Distributed serializability
and strictness (e.g., see Raz 1992, page 293; it is also implied in Bernstein et al. 1987, page 78). Less constrained
Distributed serializability and strictness can be efficiently achieved by Distributed Strict CO (SCO), or by a mix
of SS2PL based and SCO based local components.
2. About the references and Commitment ordering: (Bernstein et al. 1987) was published before the discovery of
CO in 1990. The CO schedule property is called Dynamic atomicity in (Lynch et al. 1993, page 201). CO is
described in (Weikum and Vossen 2001, pages 102, 700), but the description is partial and misses CO's essence.
(Raz 1992) was the first refereed and accepted for publication article about CO algorithms (however, publications
about an equivalent Dynamic atomicity property can be traced to 1988). Other CO articles followed.
Distributed recoverability
Unlike Serializability, Distributed recoverability and Distributed strictness can be achieved efficiently in a
straightforward way, similarly to the way Distributed CO is achieved: In each database system they have to be
applied locally, and employ a voting strategy for the Two-phase commit protocol (2PC; Raz 1992, page 307).
As has been mentioned above, Distributed SS2PL, including Distributed strictness (recoverability) and Distributed
commitment ordering (serializability), automatically employs the needed voting strategy, and is achieved (globally)
when employed locally in each (local) database system (as has been known and utilized for many years; as a matter
of fact locality is defined by the boundary of a 2PC participant (Raz 1992) ).
Other major subjects of attention
The design of concurrency control mechanisms is often influenced by the following subjects:
Recovery
All systems are prone to failures, and handling recovery from failure is a must. The properties of the generated
schedules, which are dictated by the concurrency control mechanism, may have an impact on the effectiveness and
efficiency of recovery. For example, the Strictness property (mentioned in the section Recoverability above) is often
desirable for an efficient recovery.
Replication
For high availability database objects are often replicated. Updates of replicas of a same database object need to be
kept synchronized. This may affect the way concurrency control is done (e.g., Gray et al. 1996
[1]
).
References
• Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman (1987): Concurrency Control and Recovery in Database Systems [2] (free PDF download), Addison Wesley Publishing Company, 1987, ISBN 0-201-10715-5
• Gerhard Weikum, Gottfried Vossen (2001): Transactional Information Systems [5], Elsevier, ISBN 1-55860-508-8
• Nancy Lynch, Michael Merritt, William Weihl, Alan Fekete (1993): Atomic Transactions in Concurrent and Distributed Systems [13], Morgan Kaufmann (Elsevier), August 1993, ISBN 978-1-55860-104-8, ISBN 1-55860-104-X
• Yoav Raz (1992): "The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous Environment of Multiple Autonomous Resource Managers Using Atomic Commitment." [7] (PDF [8]), Proceedings of the Eighteenth International Conference on Very Large Data Bases (VLDB), pp. 292-312, Vancouver, Canada, August 1992. (also DEC-TR 841, Digital Equipment Corporation, November 1990)
[1] Gray, J.; Helland, P.; O'Neil, P.; Shasha, D. (1996). "The dangers of replication and a solution" (ftp://ftp.research.microsoft.com/pub/tr/tr-96-17.pdf). Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. pp. 173–182. doi:10.1145/233269.233330.
[2] http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx
Concurrency control in operating systems
Multitasking operating systems, especially real-time operating systems, need to maintain the illusion that all tasks
running on top of them are all running at the same time, even though only one or a few tasks really are running at
any given moment due to the limitations of the hardware the operating system is running on. Such multitasking is
fairly simple when all tasks are independent from each other. However, when several tasks try to use the same
resource, or when tasks try to share information, it can lead to confusion and inconsistency. The task of concurrent
computing is to solve that problem. Some solutions involve "locks" similar to the locks used in databases, but they
risk causing problems of their own such as deadlock. Other solutions are Non-blocking algorithms.
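One classic discipline behind such lock-based solutions is to acquire locks in a single global order, which removes the circular wait needed for a deadlock. The Python sketch below illustrates the idea with two locks and many threads; the lock names and task bodies are placeholders.

    # Deadlock avoidance by global lock ordering: every task acquires the locks it
    # needs in the same (fixed) order, so a circular wait cannot form.
    import threading

    lock_a = threading.Lock()
    lock_b = threading.Lock()
    LOCK_ORDER = {id(lock_a): 0, id(lock_b): 1}      # fixed global ordering

    def with_locks(locks, work):
        ordered = sorted(locks, key=lambda l: LOCK_ORDER[id(l)])
        for l in ordered:
            l.acquire()
        try:
            work()
        finally:
            for l in reversed(ordered):
                l.release()

    def task_1():
        with_locks([lock_a, lock_b], lambda: None)   # requests a then b

    def task_2():
        with_locks([lock_b, lock_a], lambda: None)   # requests in the opposite order, still safe

    threads = [threading.Thread(target=t) for t in (task_1, task_2) for _ in range(100)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("finished without deadlock")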
References
• Andrew S. Tanenbaum, Albert S Woodhull (2006): Operating Systems Design and Implementation, 3rd Edition,
Prentice Hall, ISBN 0-131-42938-8
• Silberschatz, Avi; Galvin, Peter; Gagne, Greg (2008). Operating Systems Concepts, 8th edition. John Wiley &
Sons. ISBN 0-470-12872-0.
Conference on Innovative Data Systems Research
The Conference on Innovative Data Systems Research (CIDR) is a biennial computer science conference focused
on research into new techniques for data management. It was started in 2002 by Michael Stonebraker, Jim Gray, and
David DeWitt, and is held at the Asilomar Conference Grounds in Pacific Grove, California.
CIDR focuses on presenting work that is more speculative, radical, or provocative than what is typically accepted by
the traditional database research conferences (such as the International Conference on Very Large Data Bases
(VLDB) and the ACM SIGMOD Conference).
External links
• CIDR website
[1]
References
[1] http://www-db.cs.wisc.edu/cidr/
Consumer Relationship System
Consumer Relationship Systems (CRS) are specialized Customer Relationship Management (CRM) software
applications used to handle a consumer products and services company's dealings with consumers and customers.[1]
They are used by the Consumer Affairs and Customer Relations contact centers within these organizations, which are
typically Consumer Packaged Goods (CPG) companies providing consumers with packaged items, such as foods and
beverages, household consumable products and durable goods, as well as travel services, e.g., passenger airlines and
cruise ship lines.
The companies' trained contact center representatives handle in-bound contacts from anonymous consumers and
customers, replying to inquiries and fulfilling responses. Representatives capture consumer contact information,
issues, and verbatim feedback which is stored in the CRM and made available to company stakeholders such as
marketing, product management and development, legal, public relations, etc., for input to product and service
improvements. The CRS workflow processing and reporting enable issuing of early warning alerts to product
problems in the marketplace (e.g., product recalls) and capture of current consumer sentiment ('voice of the
customer'). The system has been used to effectively create best-practice actionable voice of the customer (VOC)
processes (See ICMI's Customer Management Insight Magazine, September 2007, pp 44–50.)[2]
The first such CRS was developed in the 1980s. In 1981 Michael Wilke and Robert Thornton founded
Wilke/Thornton, Inc. in Columbus, Ohio, to develop new software application systems.[3] By 1983 Wilke/Thornton
was creating and implementing a specialized CRM application system for inbound consumer contact call centers of
CPG companies, including Kimberly-Clark, Quaker Oats, and Yokohama Tire [4]. This consumer/customer contact
management system was designed to capture and record consumer response (inquiries, reactions, and related issues,
such as how-to-use and where-to-buy items) to packaged products and services.
The Consumer Relationship System (CRS) established a niche category, the consumer relationship system, and,
because Wilke/Thornton was the earliest developer, its system became the de facto standard for the consumer
packaged goods industry niche. Hundreds of consumer packaged goods companies with global consumer response
handling operations now use consumer relationship systems (Consumer Relationship CRM).
[5]
Some 10,000 contact center representatives use these systems daily worldwide. The CRS facilitates the workflow of
the representatives who receive the calls, letters, email, and online chat messages from consumers in
many locations, across time zones, communicating in many languages, having different postal address formats, and a
vast multitude of different product items and issues. These representatives assign item, issue, status and action codes
to contacts and carry out the appropriate replies and fulfillment actions.
While the goal of these consumer response handling operations is to increase customer satisfaction, loyalty, and
retention, the consumer relationship systems also collect consumer response for early detection of local market
product problems and of consumer preference trends to provide detailed response data, including verbatim
transcripts, for analysis and insight development reported internally to support product and service improvements.
Many of the consumer contact center managers who use the CRS are members of the Society of Consumer Affairs
Professionals International, a global organization that supports the aims and purposes of these customer care
professionals. [6]
Current consumer relationship systems integrate with telephone and call recording systems and with corporate enterprise systems for input and reporting. Consumer response flows from consumer products companies' branded Web sites
directly into the response systems. These systems are popular because they can deliver the ‘voice of the consumer’
that contributes to product quality improvement and corporate success. (See ICMI's Customer Management Insight,
December 2007, pp. 45–50.)[2] The CRS is available by traditional perpetual license and by online subscription service. In recent years Consumer Relationship CRM vendors, like Wilke/Thornton mentioned above, have begun adopting the software-as-a-service (SaaS) delivery model, which benefits users by reducing up-front capital costs and by being scalable to meet user needs. With the on-demand Web-based CRS subscription service, subscribers pay only for their usage of the system, like traditional utility services.
References
[1] Customer relationship management
[2] http://www.nxtbook.com/nxtbooks/cmp/cmi_200709/index.php
[3] http://www.wilke-thornton.com/WTI/Pages/products.html
[4] http://www.google.com/search?hl=en&rls=com.microsoft%3Aen-us&q=Yokohama+Tire+CRS
[5] Customer relationship management#Consumer Relationship CRM
[6] http://www.socap.org/
Content Engineering
Content Engineering is a term applied to an engineering speciality dealing with the issues around the use of content
in computer-facilitated environments. Content production, content management, content modelling, content
conversion, and content use and repurposing are all areas involving this speciality. It is not a speciality with wide
industry recognition and is often performed on an ad hoc basis by members of software development or content
production staff, but it is beginning to be recognized as a necessary function in any complex content-centric project involving both content production and software system development.
Content engineering tends to bridge the gap between groups involved in the production of content (Publishing and
Editorial staff, Marketing, Sales, HR) and more technologically-oriented departments (such as Software
Development, or IT) that put this content to use in web or other software-based environments, and requires an
understanding of the issues and processes of both sides.
Typically, Content Engineering involves extensive use of XML technologies, XML being the most widespread
language for representing structured content. Content Management Systems are often a key technology used in this practice, though frequently Content Engineering fills the gap where no formal CMS has been put in place.
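As a rough illustration of the kind of structured-content modelling described above, the following minimal Python sketch (not part of the original text; the element names "article", "title", "author" and "body" are illustrative assumptions) builds a small XML record of the sort a content engineer might hand to a development team or a CMS import job:

    import xml.etree.ElementTree as ET

    def build_article(title, author, body):
        # Model one piece of structured content as an XML element tree.
        article = ET.Element("article")
        ET.SubElement(article, "title").text = title
        ET.SubElement(article, "author").text = author
        ET.SubElement(article, "body").text = body
        return article

    if __name__ == "__main__":
        doc = build_article("Example piece", "Jane Doe", "Body text goes here.")
        # Serialize to an XML string that downstream tooling could consume.
        print(ET.tostring(doc, encoding="unicode"))

In practice the model itself would be agreed between the content-producing and software teams, which is exactly the bridging role content engineering plays.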
Content format
Graphical representations of electrical data: analog audio content format (red), 4-bit digital pulse code modulated content format (black).
Chinese calligraphy written in a language content format by Song Dynasty (A.D. 1051-1108) poet Mi Fu.
A content format is an encoded format for converting a specific type of data to displayable information. Content formats are used in recording and transmission to prepare data for observation or interpretation.[1][2] This includes both analog and digitized content. Content formats may be recorded and read by either natural or manufactured tools and mechanisms.
In addition to converting data to information, a content format may include the encryption and/or scrambling of that information.[3] Multiple content formats may be contained within a single section of a storage medium (e.g. track, disk sector, computer file, document, page, column) or transmitted via a single channel (e.g. wire, carrier wave) of a transmission medium. With multimedia, multiple tracks containing multiple content formats are presented simultaneously. Content formats may either be recorded in secondary signal processing methods such as a software container format (e.g. digital audio, digital video) or recorded in the primary format (e.g. spectrogram, pictogram).
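To make the digitized case concrete, here is a minimal Python sketch (an illustration added for this edition, not from the source article) of converting an analog-style signal into the 4-bit pulse-code-modulated content format mentioned in the figure caption above; the sample rate and the test signal are arbitrary assumptions:

    import math

    def pcm_4bit(samples):
        # Quantize samples in the nominal range [-1.0, 1.0] to 4-bit codes (0..15).
        codes = []
        for s in samples:
            s = max(-1.0, min(1.0, s))                 # clip to the nominal range
            codes.append(round((s + 1.0) / 2.0 * 15))  # map onto 16 levels
        return codes

    # A short 440 Hz tone sampled at 8 kHz stands in for the analog content.
    analog = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(16)]
    print(pcm_4bit(analog))   # sixteen 4-bit codes representing the waveform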
Observable data is often known as raw data, or raw content.[4] A primary raw content format may be directly observable (e.g. image, sound, motion, smell, sensation) or physical data which only requires hardware to display it, such as a phonographic needle and diaphragm or a projector lamp and magnifying glass.
There have been countless content formats throughout history. The following are examples of some common content formats and content format categories:
A series of numbers encoded in a Universal Product Code digital numeric content format.
• Audio data encoding[5]
• Analog audio data
• Stereophonic sound formats
• Digital audio data
• Audio data compression
• Synthesizer sequences
• Visual data encoding
• Hand rendering materials
• Film speed formats
• Pixel coordinates data
• Color space data
• Vector graphic coordinates/dimensions
• Texture mapping formats
• 3D display formats
• Holographic formats
• Display resolution formatting
• Motion graphics encoding
• Frame rate data
• Video data[6]
• Computer animation formats
• Instruction encoding
• Musical notation
• Computer language
• Traffic signals
• Natural languages formats
• Writing systems
• Phonetic
• Sign languages
• Communication signaling formats
• Code formats
• Information mapping
• Graphic organizer
• Statistical model
• Table of elements
• DNA sequence
• Human anatomy
• Biometric data
• Chemical formulas
• Aroma compound
• Drug chart
• Electromagnetic spectrum
• Time standard
• Numerical weather prediction
• Capital asset pricing model
• National income and output
• Celestial coordinate system
• Military mapping
• Geographic information system
• Interstate Highway System
References
[1] Bob Boiko, Content Management Bible, Nov 2004, pp. 79, 240, 830
[2] Ann Rockley, Managing Enterprise Content: A Unified Content Strategy, Oct 2002, pp. 269, 320, 516
[3] Jessica Keyes, Technology Trendlines, Jul 1995, p. 201
[4] Oge Marques and Borko Furht, Content-Based Image and Video Retrieval, April 2002, p. 15
[5] David Austerberry, The Technology of Video and Audio Streaming, Second Edition, Sep 2004, p. 328
[6] M. Ghanbari, Standard Codecs: Image Compression to Advanced Video Coding, Jun 2003, p. 364
Content inventory
A content inventory is the process and the result of cataloging the entire contents of a website.[1] An allied practice—a content audit—is the process of evaluating that content.[2][3][4] A content inventory and a content audit are closely related concepts, and they are often conducted in tandem.
Description
A content inventory typically includes all information assets on a website, such as web pages (html), meta elements (e.g., keywords, description, page title), images, audio and video files, and document files (e.g., .pdf, .doc, .ppt).[5][6][7][8][9][10]
A content inventory is a quantitative analysis of a website. It simply logs what is on a website. The content inventory will answer the question: “What is there?”
A content audit is a qualitative analysis of information assets on a website. It is the assessment of that content and its place in relationship to surrounding Web pages and information assets. The content audit will answer the question: “Is it any good?”[3][4]
Over the years, techniques for creating and managing a content inventory have been developed and refined by many experts in the field of website content management.[1][2][11]
A spreadsheet application (e.g., Microsoft Excel) is the preferred tool for keeping a content inventory; the data can
be easily configured and manipulated. Typical categories in a content inventory include the following:
• Link — The URL for the page
• Format — For example, .html, .pdf, .doc, .ppt
• Meta page title — Page title as it appears in the <title> tag
• Meta keywords — Keywords as they appear in the meta name="keywords" tag element
• Meta description — Text as it appears in the meta name="description" tag element
• Content owner — Person responsible for maintaining page content
• Date page last updated — Date of last page update
• Audit Comments (or Notes) — Audit findings and notes
There are other descriptors that may need to be captured on the inventory sheet. Content management experts advise
capturing information that might be useful for both short- and long-term purposes. Other information could include:
• the overall topic or area to which the page belongs
• a short description of the information on the page
• when the page was created, date of last revision, and when the next page review is due
• pages this page links to
• pages that link to this page
• page status – keep, delete, revise, in revision process, planned, being written, being edited, in review, ready for
posting, or posted
• rank of page on the website – is it a top 50 page? a bottom 50 page? Initial efforts might be more focused on
those pages that visitors use the most and least.
Other tabs in the inventory workbook can be created to track related information, such as meta keywords, new Web
pages to develop, website tools and resources, or content inventories for sub-areas of the main website. Creating a
single, shared location for information related to a website can be helpful for all website content managers, writers,
editors, and publishers.
Populating the spreadsheet is a painstaking task, but some up-front work can be automated with software, and other
tools and resources can assist the audit work.
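As a rough sketch of that kind of automation (added here for illustration; the URL, the output filename, and the regular-expression approach to pulling out the title and meta description are simplifying assumptions rather than a recommendation of any particular tool), a few of the spreadsheet columns listed above can be filled in programmatically:

    import csv
    import re
    import urllib.request

    def inventory_row(url):
        # Fetch one page and capture a few of the inventory columns.
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        title = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        desc = re.search(
            r'<meta[^>]+name=["\']description["\'][^>]+content=["\'](.*?)["\']',
            html, re.I | re.S)
        return {
            "Link": url,
            "Format": ".html",
            "Meta page title": title.group(1).strip() if title else "",
            "Meta description": desc.group(1).strip() if desc else "",
        }

    if __name__ == "__main__":
        row = inventory_row("https://example.com/")
        with open("content_inventory.csv", "a", newline="") as f:
            csv.DictWriter(f, fieldnames=row.keys()).writerow(row)

Content owner, audit comments, and the other qualitative columns still have to be filled in by hand, which is why the audit itself remains a manual exercise.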
Value
A content inventory and a content audit are performed to understand what is on a website and why it is there. The inventory sheet, once completed and revised as the site is updated with new content and information assets, can also become a resource for help in maintaining website governance.
For an existing website, the information cataloged in a content inventory and content audit will be a resource to help manage all of the information assets on the website.[12] The information gathered in the inventory can also be used to plan a website re-design or site migration to a web content management system.[10] When planning a new website, a content inventory can be a useful project management tool: as a guide to map information architecture and to track new pages, page revision dates, content owners, and so on.
References
[1] Veen, Jeffrey (June 2002). "Doing a Content Inventory (Or, A Mind-Numbingly Detailed Odyssey Through Your Web Site)" (http://www.adaptivepath.com/ideas/essays/archives/000040.php). Retrieved 27 April 2010.
[2] Halverson, Kristina (August 2009). "Content Strategy for the Web: Why You Must Do a Content Audit" (http://www.peachpit.com/articles/article.aspx?p=1388961). Retrieved 6 May 2010.
[3] Baldwin, Scott (January 2010). "Doing a content audit or inventory" (http://nform.ca/blog/2010/01/doing-a-content-audit-or-inven). Retrieved 29 April 2010.
[4] Marsh, Hilary (January 2005). "How to do a content audit" (http://www.contentcompany.biz/articles/content_audit.html). Retrieved 3 May 2010.
[5] Spencer, Donna (January 2006). "Taking a content inventory" (http://maadmob.net/donna/blog/2006/taking-a-content-inventory). Retrieved 27 April 2010.
[6] Doss, Glen (January 2007). "Content Inventory: Sometimes referred to as Web Content Inventory or Web Audit" (http://www.fatpurple.com/2010/02/26/content-inventory/). Retrieved 27 April 2010.
[7] Jones, Colleen (August 2009). "Content Analysis: A Practical Approach" (http://www.uxmatters.com/mt/archives/2009/08/content-analysis-a-practical-approach.php). Retrieved 27 April 2010.
[8] Leise, Fred (March 2007). "Content Analysis Heuristics" (http://boxesandarrows.com/view/content-analysis). Retrieved 27 April 2010.
[9] Baldwin, Scott (January 2010). "Doing a content audit or inventory" (http://nform.ca/blog/2010/01/doing-a-content-audit-or-inven). Retrieved 27 April 2010.
[10] Krozser, Kassia (April 2005). "The Content Inventory: Roadmap to a Successful CMS Implementation" (http://www.alttags.org/content-management/the-content-inventory-roadmap-to-a-succesful-cms-implementation/). Retrieved 27 April 2010.
[11] Bruns, Don (March 2010). "Automatically Index a Content Inventory with GetUXIndex()" (http://donbruns.net/index.php/how-to-automatically-index-a-content-inventory/). Retrieved 6 May 2010.
[12] "Content Inventory" (http://www.usability.gov/methods/design_site/inventory.html). U.S. Department of Health & Human Services. 26 May 2009. Retrieved 4 May 2010.
Further reading
• In his article A Map-Based Approach to a Content Inventory (http://www.boxesandarrows.com/view/a-map-based-approach), Patrick Walsh describes how to use Microsoft Access and Microsoft Excel to link a data attribute with a structural attribute to create “a tool that can be used throughout the lifetime of a website.”
• In the article The Rolling Content Inventory (http://www.louisrosenfeld.com/home/bloug_archive/000448.html), author Louis Rosenfeld argues that “ongoing, partial content inventories” are more cost-effective and realistic to implement.
• Colleen Jones writes from a UX design perspective in Content Analysis: A Practical Approach (http://www.uxmatters.com/mt/archives/2009/08/content-analysis-a-practical-approach.php).
External links
• Xenu's Link Sleuth (http://home.snafu.de/tilman/xenulink.html)
• SiteOrbiter (http://siteorbiter.com/)
• Similar Page Checker (http://www.webconfs.com/similar-page-checker.php)
• Link Checker Tools (http://www.cryer.co.uk/resources/link_checkers.htm)
Content management
Content management, or CM, is the set of processes and technologies that support the collection, management, and publishing of information in any form or medium. In recent times this information is typically referred to as content or, to be precise, digital content. Digital content may take the form of text (such as documents), multimedia files (such as audio or video files), or any other file type that follows a content lifecycle requiring management.
The process of content management
Content management practices and goals vary with mission. News organizations, e-commerce websites, and
educational institutions all use content management, but in different ways. This leads to differences in terminology
and in the names and number of steps in the process.
For example, an instance of digital content is created by one or more authors. Over time that content may be edited.
One or more individuals may provide some editorial oversight thereby approving the content for publication.
Publishing may take many forms. Publishing may be the act of pushing content out to others, or simply granting
digital access rights to certain content to a particular person or group of persons. Later that content may be
superseded by another form of content and thus retired or removed from use.
Content management is an inherently collaborative process. It often consists of the following basic roles and
responsibilities:
• Creator - responsible for creating and editing content.
• Editor - responsible for tuning the content message and the style of delivery, including translation and
localization.
• Publisher - responsible for releasing the content for use.
• Administrator - responsible for managing access permissions to folders and files, usually accomplished by
assigning access rights to user groups or roles. Admins may also assist and support users in various ways.
• Consumer, viewer or guest - the person who reads or otherwise takes in content after it is published or shared.
A critical aspect of content management is the ability to manage versions of content as it evolves (see also version
control). Authors and editors often need to restore older versions of edited products due to a process failure or an
undesirable series of edits.
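As a bare-bones illustration of that idea (a sketch written for this edition, not tied to any particular CMS; the class and method names are invented), a versioned content store only needs to keep every revision and let an author pull an earlier one back:

    class VersionedContent:
        """Keep every saved revision of each content item."""

        def __init__(self):
            self._versions = {}            # item id -> list of revisions

        def save(self, item_id, text):
            # Store a new revision and return its 1-based version number.
            self._versions.setdefault(item_id, []).append(text)
            return len(self._versions[item_id])

        def restore(self, item_id, version):
            # Return an earlier revision without discarding later ones.
            return self._versions[item_id][version - 1]

    store = VersionedContent()
    store.save("about-page", "First draft.")
    store.save("about-page", "Edit that the author later regrets.")
    print(store.restore("about-page", 1))   # -> "First draft."

Real systems add authorship, timestamps, and diffing on top of this, but the ability to restore an older version is the core requirement.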
Another equally important aspect of content management involves the creation, maintenance, and application of
review standards. Each member of the content creation and review process has a unique role and set of
responsibilities in the development and/or publication of the content. Each review team member requires clear and
concise review standards which must be maintained on an ongoing basis to ensure the long-term consistency and
health of the knowledge base.
A content management system is a set of automated processes that may support the following features:
• Import and creation of documents and multimedia material.
• Identification of all key users and their roles.
• The ability to assign roles and responsibilities to different instances of content categories or types.
Content management
174
• Definition of workflow tasks often coupled with messaging so that content managers are alerted to changes in
content.
• The ability to track and manage multiple versions of a single instance of content.
• The ability to publish the content to a repository to support access to the content. Increasingly, the repository is an
inherent part of the system, and incorporates enterprise search and retrieval.
Content management systems take the following forms:
• a web content management system is software for web site management - which is often what is implicitly meant
by this term
• the work of a newspaper editorial staff organization
• a workflow for article publication
• a document management system
• a single source content management system - where content is stored in chunks within a relational database
Implementation
Content management implementations must be able to manage content distribution and digital rights over the content life cycle. Content management systems are usually involved with Digital Rights Management in order to control user access and digital rights. The read-only structures of Digital Rights Management systems place limitations on content management implementations, as they do not allow the protected content to be changed during its life cycle. Creating new content from managed (protected) content is a further issue, as it can take the protected content outside the controlling management system. A few content management implementations cover all of these issues.
External links
• Boiko, Bob (2004-11-26). Content Management Bible. Wiley. pp. 1176. ISBN 0764573713.
• Rockley, Ann (2002-10-27). Managing Enterprise Content: A Unified Content Strategy. New Riders Press.
pp. 592. ISBN 0735713065.
• Hackos, JoAnn T. (2002-2-14). Content Management for Dynamic Web Delivery. Wiley. pp. 432.
ISBN 0471085863.
• Glushko, Robert J.; Tim McGrath (2005). Document Engineering: Analyzing and Designing Documents for
Business Informatics and Web Services. MIT Press. pp. 728. ISBN 0262572451.
Content Migration
Content Migration is the process of moving information stored on a Web content management system (CMS), Digital asset management (DAM), Document management system (DMS), or flat HTML based system to a new system. Flat HTML content can entail HTML files, Active Server Pages (ASP), JavaServer Pages (JSP), PHP, or content stored in some type of HTML/JavaScript based system, and can be either static or dynamic content.
Content Migrations can solve a number of issues, ranging from:
• Consolidation from one or more CMS systems into one system to allow for more centralized control and governance of content, and better Knowledge management and sharing.
• Reorganizing content due to mergers and acquisitions to assimilate as much content as possible from the source systems for a unified look and feel.
• Converting content that has grown organically either in a CMS or Flat HTML and standardizing the formatting so standards can be applied for a unified branding of the content.
There are many ways to access the content stored in a CMS. Depending on the vendor, a CMS may offer an Application programming interface (API), Web services, XML exports, the ability to rebuild a record by writing SQL queries, or access through the web interface.
1. The API[1] requires a developer to read and understand how to interact with the source CMS’s API layer and then develop an application that extracts the content and stores it in a database, XML file, or Excel. Once the content is extracted, the developer must read and understand the target CMS API and develop code to push the content into the new system. The same can be said for Web Services.
2. Most CMSs use a database to store and associate content so if no API exists the SQL programmer must reverse
engineer the table structure. Once the structure is reverse engineered, very complex SQL queries are written to
pull all the content from multiple tables into an intermediate table or into some type of Comma-separated values
(CSV) or XML file. Once the developer has the files or database the developer must read and understand the
target CMS API and develop code to push the content into the new System. The same can be said for Web
Services.
3. XML export creates XML files of the content stored in a CMS but after the files are exported they need to be
altered to fit the new scheme of the target CMS system. This is typically done by a developer by writing some
code to do the transformation.
4. HTML files, JSP, ASP, PHP, or other application server file formats are the most difficult. The structure of flat HTML files is based on a combination of folder structure, HTML file structure, and image locations. In the early days of content migration, developers had to use programming languages to parse the HTML files and save the content as a structured database, XML, or CSV. Typically PERL, JAVA, C++, or C# were used because of their regular expression handling capability. JSP, ASP, PHP, ColdFusion, and other Application Server technologies usually rely on server side includes to help simplify development, but this makes it very difficult to migrate content because the content is not assembled until the user looks at it in their web browser. This makes it very difficult to look at the files and extract the content from the file structure.
5. Web Scraping allows users to access most of the content directly from the Web User Interface. Since a web
interface is visual (this is the point of a CMS) some Web Scrapers leverage the UI to extract content and place it
into a structure like a Database, XML, or CSV formats. All CMSs, DAMs, and DMSs use web interfaces so
extracting the content for one or many source sites is basically the same process. In some cases it is possible to
push the content into the new CMS using the web interface, but some CMSs use Java applets or ActiveX controls, which are not supported by most web scrapers. In that case the developer must read and understand the
target CMS API and develop code to push the content into the new System. The same can be said for Web
Services.
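As a small illustration of the flat-HTML case in options 4 and 5 above (a sketch added for this edition; the choice of fields and the rule of keeping all visible text are simplifying assumptions), Python's standard HTML parser can pull a page's title and text into a structured record ready for a database, CSV, or XML:

    from html.parser import HTMLParser

    class ContentExtractor(HTMLParser):
        """Collect the <title> text and the visible body text of a page."""

        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""
            self.text = []

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data
            elif data.strip():
                self.text.append(data.strip())

    def extract(path):
        parser = ContentExtractor()
        with open(path, encoding="utf-8", errors="replace") as f:
            parser.feed(f.read())
        return {"source": path, "title": parser.title, "body": " ".join(parser.text)}

A real migration would also filter out script and navigation text and resolve server-side includes, which is where most of the effort described above goes.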
The basic content migration flow
1. Obtain an inventory of the content.
2. Obtain an inventory of Binary content like Images, PDFs, CSS files, Office Docs, Flash, and any binary objects.
3. Find any broken links in the content or content resources.
4. Determine the Menu Structure of the Content.
5. Find the parent/sibling connection to the content so the links to other content and resources are not broken when
moving them.
6. Extract the Resources from the pages and store them into a Database or File structure. Store the reference in a
database or a File.
7. Extract the HTML content from the site and store locally.
8. Upload the resources to the new CMS either by using the API or the web interface and store the new location in a
Database or XML.
9. Transform the HTML to meet the new CMS's standards and reconnect any resources.
10. Upload the transformed content into the new system.
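Tying several of those steps together, the following Python sketch (added for illustration; inventory_pages, transform_html, and upload_to_cms are hypothetical placeholders, since the real upload depends entirely on the target CMS's API or web interface) shows the overall shape of steps 1, 7, 9, and 10:

    from pathlib import Path

    def inventory_pages(root):
        # Step 1: list the HTML pages under the source directory.
        return sorted(Path(root).rglob("*.html"))

    def extract_page(path):
        # Step 7: read the raw HTML and keep it locally (here, in memory).
        return {"source": str(path),
                "html": path.read_text(encoding="utf-8", errors="replace")}

    def transform_html(record, rules):
        # Step 9: apply simple find/replace rules to meet the new CMS's standards.
        html = record["html"]
        for old, new in rules.items():
            html = html.replace(old, new)
        return {**record, "html": html}

    def upload_to_cms(record):
        # Step 10 placeholder: push the content via the target CMS API or web UI.
        print("would upload", record["source"], len(record["html"]), "bytes")

    def migrate(root, rules):
        for page in inventory_pages(root):
            upload_to_cms(transform_html(extract_page(page), rules))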
Vendors
• Active Navigation[2]
• Kapow Software[3]
• Proventeq[4]
• Tzunami[5]
• Vamosa[6]
• Nonlinearcreations[7]
• Seeunity[8]
• EntropySoft[9]
References
[1] What the Content Migration APIs Are Not (http://msdn.microsoft.com/en-us/library/ms453426.aspx)
[2] http://www.activenavigation.com
[3] http://www.kapowsoftware.com
[4] http://www.Proventeq.com
[5] http://www.Tzunami.com
[6] http://www.vamosa.com
[7] http://www.Nonlinearcreations.com
[8] http://www.Seeunity.com
[9] http://entropysoft.net
External links
• The Definitive Guide to Automating Content Migration (http://info.kapowsoftware.com/GuidetoCMWP.html)
• Content Migration Seven Steps to Success (http://www.vamosa.com/data-migration-seven-steps-to-success-a351)
• No Small Task: Migrating Content to a New CMS (http://www.cmswire.com/cms/web-publishing/no-small-task-migrating-content-to-a-new-cms-002437.php)
• A Look at Automated Content Migration: Part 1 - Kapow Technologies (http://www.cmswire.com/cms/web-cms/a-look-at-automated-content-migration-part-1-kapow-technologies-006158.php)
• A Look at Automated Content Migration: Part 2 - Vamosa (http://www.cmswire.com/cms/web-cms/a-look-at-automated-content-migration-part-2-vamosa-006275.php)
• Content Migration articles on CMS Wire (http://www.cmswire.com/cms/content-migration/)
Content re-appropriation
Fundamental to modern information architectures, and driven by semantic Web[1] technologies, content re-appropriation is the act of searching, filtering, gathering, grouping, and aggregating which allows information to be related, classified and identified. This is achieved by applying syntactic or semantic meaning through intelligent tagging or artificial interpretation of fragmented content (see Resource Description Framework). Hence, all information becomes valuable and interpretable.
Domain
Since the domain of Content applies to areas of software applications, documents, and media, these can be processed through a pipeline of generation, aggregation, transform-many, and serialization (see XML Pipeline[2]). The output of this pipeline can be viewed in the medium most effective for decision making.
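A minimal Python sketch of that pipeline (written for this edition; the toy tagging rule stands in for real syntactic or semantic tagging such as RDF, and the JSON output stands in for whatever serialization the consuming medium needs) looks like this:

    import json

    def generate(sources):
        # Generation: gather raw fragments from several (here, in-memory) sources.
        return [fragment for source in sources for fragment in source]

    def aggregate(fragments):
        # Aggregation: group fragments by a simple tag derived from their content.
        groups = {}
        for text in fragments:
            tag = "financial" if "price" in text.lower() else "general"
            groups.setdefault(tag, []).append(text)
        return groups

    def serialize(groups):
        # Serialization: emit the re-appropriated content for the consuming medium.
        return json.dumps(groups, indent=2)

    sources = [["Price list updated for Q3."], ["Team offsite announcement."]]
    print(serialize(aggregate(generate(sources))))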
The desired outcomes of content re-appropriation are:
• Seamless, Integrated, and Shared User experiences
• Visualization
• Detection, Analysis & Investigation
• Personalization unique to the User
• Inbound or Outbound Syndication of Information
• Publish or Subscribe to Information
• Dynamically adapted output to Users medium
Essentially, the aim is to make information disparities transparent to the user - getting to the bottom line quickly.
Areas of Use
Content re-appropriation is effective across the Content-Tier, that is, the places where Content exists:
• Identity & Directory Management e.g. LDAP, SAML & JNDI
• Content Management e.g. Apache Slide[3]
• Content Systems e.g. File Systems, E-mail, Network shares, SAN & Database
• Business Systems e.g. ERP & CRM
• Data Warehouse e.g. OLAP
• Internet & Web Services e.g. HTTP & SOAP
• Presence and peer-to-peer
References
[1] http://www.webreference.com/internet/semantic/
[2] http://www.w3.org/TR/xml-pipeline/
[3] http://jakarta.apache.org/slide/
Content repository
A content repository is the technical underpinning of a content application, like a Content Management System or a Document Management System. It functions as the logical storage facility for content.[1]
Content repository features
A content repository exposes, amongst others, the following facilities:
• Read/write of content
• Hierarchy and sort order management
• Query / search
• Versioning
• Access control
• Import / export
• Locking
• Life-cycle management
• Retention and hold / records management
Commonly known Content Applications / Content Management Systems
• Content Management System
• Digital Asset Management
• Source Code Management
• Web Content Management System
• Document Management System
• Social collaboration
• Records Management
Content repository or related standards and specifications
• Content repository API for Java
• WebDAV
• Content Management Interoperability Services
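As a rough example of how a client might use one of the standards above, the following Python sketch (added for illustration) reads, writes, and lists content over WebDAV using only the standard library; the repository URL is a made-up placeholder, and real repositories will additionally require authentication:

    import urllib.request

    BASE = "https://repo.example.com/webdav/docs/"   # hypothetical repository URL

    def write_content(name, text):
        # WebDAV uses plain HTTP PUT to create or update a resource.
        req = urllib.request.Request(BASE + name, data=text.encode("utf-8"),
                                     method="PUT")
        return urllib.request.urlopen(req).status    # e.g. 201 Created

    def read_content(name):
        with urllib.request.urlopen(BASE + name) as resp:
            return resp.read().decode("utf-8")

    def list_children():
        # PROPFIND with Depth: 1 returns an XML listing of the collection.
        req = urllib.request.Request(BASE, method="PROPFIND",
                                     headers={"Depth": "1"})
        return urllib.request.urlopen(req).read()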
References
[1] Content Repository Design (http://openacs.org/doc/acs-content-repository/design.html), ACS Content Repository (http://openacs.org/doc/acs-content-repository/), OpenACS.org (http://openacs.org/).
Control break
In a computer program, a control break occurs when there is a change in the value of one of the keys on which a file
is sorted which requires some extra processing. For example, with an input file sorted by post code, the number of
items found in each postal district might need to be printed on a report, and a heading shown for the next district.
Quite often there is a hierarchy of nested control breaks in a program, e.g. streets within districts within areas, with
the need for a grand total at the end. Structured programming techniques have been developed to ensure correct
processing of control breaks in languages such as COBOL and to ensure that conditions such as empty input files
and sequence errors are handled properly.
With fourth generation languages such as SQL, the programming language should handle most of the details of
control breaks automatically.
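A minimal Python sketch of the postal-district example above (written for this edition; the record layout and figures are invented) shows the classic control-break pattern of a subtotal on each key change plus a grand total:

    records = [                                  # input must already be sorted on the key
        {"postcode": "BS1", "items": 3},
        {"postcode": "BS1", "items": 5},
        {"postcode": "BS2", "items": 2},
    ]

    current_key, district_total, grand_total = None, 0, 0
    for rec in records:
        if rec["postcode"] != current_key:       # control break: the key has changed
            if current_key is not None:
                print(current_key, "district total:", district_total)
            current_key, district_total = rec["postcode"], 0
        district_total += rec["items"]
        grand_total += rec["items"]

    if current_key is not None:                  # final break for the last district
        print(current_key, "district total:", district_total)
    print("Grand total:", grand_total)

The same structure handles the empty-file case (only a zero grand total is printed) and nests naturally for hierarchies such as streets within districts within areas.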
Control flow diagram
Example of a so-called "performance seeking control flow diagram".[1]
A control flow diagram (CFD) is a diagram to describe the control flow of a business process, process or program.
Control flow diagrams were developed in the 1950s, and are widely used in multiple engineering disciplines. They are one of the classic business process modeling methodologies, along with flow charts, data flow diagrams, functional flow block diagram, Gantt charts, PERT diagrams, and IDEF.[2]
Overview
A control flow diagram can consist of a subdivision to show sequential steps, with if-then-else conditions, repetition, and/or case conditions. Suitably annotated geometrical figures are used to represent operations, data, or equipment, and arrows are used to indicate the sequential flow from one to another.[3]
There are several types of control flow diagrams, for
example:
• Change control flow diagram, used in project
management
• Configuration decision control flow diagram, used in configuration management
• Process control flow diagram, used in process management
• Quality control flow diagram, used in quality control.
In software and systems development control flow diagrams can be used in control flow analysis, data flow analysis, algorithm analysis, and simulation. Control and data flow analysis are most applicable for real time and data driven systems. These flow analyses transform logic and data requirements text into graphic flows which are easier to analyze than the text. PERT, state transition, and transaction diagrams are examples of control flow diagrams.[4]
Types of Control Flow Diagrams
Process Control Flow Diagram
A flow diagram can be developed for the process control system for each critical activity. Process control is normally a closed cycle in which a sensor provides information to a process control software application through a communications system. The application determines if the sensor information is within the predetermined (or calculated) data parameters and constraints. The results of this comparison are fed to an actuator, which controls the critical component. This feedback may control the component electronically or may indicate the need for a manual action.[5]
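A toy Python sketch of that closed cycle (added for illustration; the temperature limits, sensor readings, and actuator interface are all invented assumptions) makes the compare-and-actuate step explicit:

    LOW, HIGH = 18.0, 22.0          # predetermined data constraints (degrees C)

    def read_sensor(samples):
        # Stand-in for the sensor feeding the process control application.
        yield from samples

    def actuate(command):
        # Stand-in for the actuator controlling the critical component.
        print("actuator command:", command)

    for value in read_sensor([17.2, 19.5, 23.1]):
        if value < LOW:
            actuate("heat on")                    # automatic corrective action
        elif value > HIGH:
            actuate("manual check required")      # feedback calling for manual action
        else:
            actuate("hold")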
This closed-cycle process has many checks and balances to ensure that it stays safe. The investigation of how the process control can be subverted is likely to be extensive because all or part of the process control may be oral instructions to an individual monitoring the process. It may be fully computer controlled and automated, or it may be a hybrid in which only the sensor is automated and the action requires manual intervention. Further, some process control systems may use prior generations of hardware and software, while others are state of the art.[5]
Performance seeking control flow diagram
The figure presents an example of a performance seeking control flow diagram of the algorithm. The control law consists of estimation, modeling, and optimization processes. In the Kalman filter estimator, the inputs, outputs, and residuals were recorded. At the compact propulsion system modeling stage, all the estimated inlet and engine parameters were recorded.[1]
In addition to temperatures, pressures, and control positions, such estimated parameters as stall margins, thrust, and drag components were recorded. In the optimization phase, the operating condition constraints, optimal solution, and linear programming health status condition codes were recorded. Finally, the actual commands that were sent to the engine through the DEEC were recorded.[1]
References
This article incorporates public domain material from websites or documents[6] of the National Institute of Standards and Technology.
[1] Glenn B. Gilyard and John S. Orme (1992). Subsonic Flight Test Evaluation of a Performance Seeking Control Algorithm on an F-15 Airplane (http://www.nasa.gov/centers/dryden/pdf/88262main_H-1808.pdf). NASA Technical Memorandum 4400.
[2] Thomas Dufresne & James Martin (2003). "Process Modeling for E-Business" (http://mason.gmu.edu/~tdufresn/paper.doc). INFS 770 Methods for Information Systems Engineering: Knowledge Management and E-Business. Spring 2003.
[3] FDA glossary of terminology applicable to software development and computerized systems (http://www.fda.gov/ora/Inspect_ref/igs/gloss.html). Accessed 14 Jan 2008.
[4] Dolores R. Wallace et al. (1996). Reference Information for the Software Verification and Validation Process (http://hissa.nist.gov/HHRFdata/Artifacts/ITLdoc/234/val-proc.html), NIST Special Publication 500-234.
[5] National Institute of Justice (2002). A Method to Assess the Vulnerability of U.S. Chemical Facilities (http://www.ncjrs.gov/txtfiles1/nij/195171.txt). Series: Special Report.
[6] http://www.nist.gov
Copyright
Copyright is a set of exclusive rights granted to the author or creator of an original work, including the right to copy,
distribute and adapt the work. In most jurisdictions copyright arises upon fixation and does not need to be registered.
Copyright owners have the exclusive statutory right to exercise control over copying and other exploitation of the
works for a specific period of time, after which the work is said to enter the public domain. Uses covered under
limitations and exceptions to copyright, such as fair use, do not require permission from the copyright owner. All
other uses require permission. Copyright owners can license or permanently transfer or assign their exclusive rights
to others.
Initially copyright law applied only to the copying of books. Over time other uses such as translations and derivative
works were made subject to copyright. Copyright now covers a wide range of works, including maps, sheet music,
dramatic works, paintings, photographs, architectural drawings, sound recordings, motion pictures and computer
programs. The British Statute of Anne 1709, full title "An Act for the Encouragement of Learning, by vesting the
Copies of Printed Books in the Authors or purchasers of such Copies, during the Times therein mentioned", was the
first copyright statute. Today copyright laws are partially standardized through international and regional agreements
such as the Berne Convention and the WIPO Copyright Treaty. Although there are consistencies among nations'
copyright laws, each jurisdiction has separate and distinct laws and regulations covering copyright. National
copyright laws on licensing, transfer and assignment of copyright still vary greatly between countries and
copyrighted works are licensed on a territorial basis. Some jurisdictions also recognize moral rights of creators, such
as the right to be credited for the work.
Justification
The British Statute of Anne of 1709 was the first act to directly protect the rights of authors.[1] Under US copyright law, the justification appears in Article I, Section 8 Clause 8 of the Constitution, known as the Copyright Clause. It empowers the United States Congress "To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries."[2]
According to the World Intellectual Property Organization the purpose of copyright is twofold:
"To encourage a dynamic culture, while returning value to creators so that they can lead a dignified economic existence, and to provide widespread, affordable access to content for the public."[3]
History
Pope Alexander VI issued a bull in 1501 against the unlicensed printing of books and in 1559 the Index Expurgatorius, or List of Prohibited Books, was issued for the first time.[4]
Early European printers' monopoly
The origin of copyright law in most European countries lies in efforts by the church and governments to regulate and control printing,[5] which was widely established in the 15th and 16th centuries.[5] Before the invention of the printing press a writing, once created, could only be physically multiplied by the highly laborious and error-prone process of manual copying by scribes.[4] Printing allowed for multiple exact copies of a work, leading to a more rapid and widespread circulation of ideas and information (see print culture).[5]
John Milton's 1644 edition of Areopagitica, long title Areopagitica: A speech of Mr. John Milton for the liberty of unlicensed printing to the Parliament of England, in which he argued forcefully against the Licensing Order of 1643.
While governments and the church encouraged printing in many ways, which allowed the dissemination of Bibles and government information, works of dissent and criticism could also circulate rapidly. As a consequence, governments established controls over printers across Europe, requiring them to have official licenses to trade and produce books. The licenses typically gave printers the exclusive right to print particular works for a fixed period of years, and enabled the printer to prevent others from printing or importing the same work during that period.[5] The notion that the expression of dissent should be tolerated, not censured or punished by law, developed alongside the rise of printing and the press. The Areopagitica, published in 1644 under the full title Areopagitica: A speech of Mr. John Milton for the liberty of unlicensed printing to the Parliament of England, was John Milton's response to the British parliament re-introducing government licensing of printers, hence publishers. In doing so, Milton articulated the main strands of future discussions about freedom of expression.[6]
As the "menace" of printing spread, governments established centralized control mechanisms,[7] and in 1557 the British Crown sought to stem the flow of seditious and heretical books by chartering the Stationers' Company. The right to print was limited to the members of that guild, and thirty years later the Star Chamber was chartered to curtail the "greate enormities and abuses" of "dyvers contentyous and disorderlye persons professinge the arte or mystere of pryntinge or selling of books." The right to print was restricted to two universities and to the 21 existing printers in the city of London, which had 53 printing presses. The French crown also repressed printing, and the printer Etienne Dolet was burned at the stake in 1546. As the British took control of type founding in 1637, printers fled to the Netherlands. Confrontation with authority made printers radical and rebellious, with 800 authors, printers and book dealers being incarcerated in the Bastille before it was stormed in 1789.[7]
Early British copyright law
The Statute of Anne came into force in 1710
In England the printers, known as stationers, formed a collective organization, known as the Stationers' Company. In the 16th century the Stationers' Company was given the power to require all lawfully printed books to be entered into its register. Only members of the Stationers' Company could enter books into the register. This meant that the Stationers' Company achieved a dominant position over publishing in 17th century England (no equivalent arrangement formed in Scotland and Ireland). The monopoly came to an end in 1694, when the English Parliament did not renew the Stationers' Company's power.[5] The newly established Parliament of Great Britain passed the first copyright statute, the Statute of Anne, full title "An Act for the Encouragement of Learning, by vesting the Copies of Printed Books in the Authors or purchasers of such Copies, during the Times therein mentioned".[5]
The coming into force of the Statute of Anne in April 1710 marked a historic moment in the development of copyright law. As the world's first copyright statute it granted publishers of a book legal protection of 14 years with the commencement of the statute. It also granted 21 years of protection for any book already in print.[8] Unlike the monopoly granted to the Stationers' Company previously, the Statute of Anne was concerned with the reading public, the continued production of useful literature, and the advancement and spread of education. To encourage "learned men to compose and write useful books" the statute guaranteed the finite right to print and reprint those works. It established a pragmatic bargain involving authors, the booksellers and the public.[9] The Statute of Anne ended the old system whereby only literature that met the censorship standards administered by the booksellers could appear in print. The statute furthermore created a public domain for literature, as previously all literature belonged to the booksellers forever.[10]
Common law copyright
When the statutory copyright term provided for by the Statute of Anne began to expire in 1731, London booksellers sought to defend their dominant position by seeking injunctions from the Court of Chancery for works by authors that fell outside the statute's protection. At the same time the London booksellers lobbied parliament to extend the copyright term provided by the Statute of Anne. Eventually, in a case known as Midwinter v. Hamilton (1743–1748), the London booksellers turned to common law, starting a 30-year period known as the battle of the booksellers. The London booksellers argued that the Statute of Anne only supplemented and supported a pre-existing common law copyright. The dispute was argued out in a number of notable cases, including Millar v Kincaid (1749–1751), Tonson v Collins (1761–1762),[11] and Donaldson v Beckett (1774). Donaldson v Beckett eventually established that copyright was a "creature of statute", and that the rights and responsibilities in copyright were determined by legislation.[12] The Lords clearly voted against perpetual copyright,[13] and by confirming that the copyright term—that is the length of time a work is in copyright—did expire according to statute, the Lords also confirmed that a large number of works and books first published in Britain were in the public domain, either because the copyright term granted by statute had expired, or because they were first published before the Statute of Anne was enacted in 1709. This opened the market for cheap reprints of works from Shakespeare, John Milton and Geoffrey Chaucer, works now considered classics. The expansion of the public domain in books broke the
dominance of the London booksellers and allowed for competition, with the number of London booksellers and publishers rising threefold from 111 to 308 between 1772 and 1802.[14]
Early French copyright law
In pre-revolutionary France all books needed to be approved by official censors, and authors and publishers had to obtain a royal privilege before a book could be published. Royal privileges were exclusive and usually granted for six years, with the possibility of renewal. Over time it was established that the owner of a royal privilege had the sole right to obtain a renewal indefinitely. In 1761 the Royal Council awarded a royal privilege to the heirs of an author rather than the author's publisher, sparking a national debate on the nature of literary property similar to that taking place in Britain during the battle of the booksellers.[15]
In 1777 a series of royal decrees reformed the royal privileges. The duration of privileges was set at a minimum of 10 years or the life of the author, whichever was longer. If the author obtained a privilege and did not transfer or sell it on, he could publish and sell copies of the book himself, and pass the privilege on to his heirs, who enjoyed an exclusive right into perpetuity. If the privilege was sold to a publisher, the exclusive right would only last the specified duration. The royal decrees prohibited the renewal of privileges, and once the privilege had expired anyone could obtain a "permission simple" to print or sell copies of the work. Hence the public domain in books whose privilege had expired was expressly recognized.[15]
After the French Revolution a dispute over Comédie-Française being granted the exclusive right to the public performance of all dramatic works erupted, and in 1791 the National Assembly abolished the privilege. Anyone was allowed to establish a public theater, and the National Assembly declared that the works of any author who had died more than five years ago were public property. In the same decree the National Assembly granted authors the exclusive right to authorize the public performance of their works during their lifetime, and extended that right to the authors' heirs and assignees for five years after the author's death. The National Assembly took the view that a published work was by its nature a public property, and that an author's rights are recognized as an exception to this principle, to compensate an author for his work.[15]
In 1793 a new law was passed giving authors, composers, and artists the exclusive right to sell and distribute their works, and the right was extended to their heirs and assigns for 10 years after the author's death. The National Assembly placed this law firmly on a natural right footing, calling the law the "Declaration of the Rights of Genius" and so evoking the famous Declaration of the Rights of Man and of the Citizen. However, authors' rights were subject to the condition of depositing copies of the work with the Bibliothèque Nationale, and 19th century commentators characterized the 1793 law as utilitarian and "a charitable grant from society".[15]
Early US copyright law
The Copyright Act of 1790 in the Columbian Centinel
The Statute of Anne did not apply to the American colonies. The colonies' economy was largely agrarian, hence copyright law was not a priority, resulting in only three private copyright acts being passed in America prior to 1783. Two of the acts were limited to seven years, the other was limited to a term of five years. In 1783 several authors' petitions persuaded the Continental Congress "that nothing is more properly a man's own than the fruit of his study, and that the protection and security of literary property would greatly tend to encourage genius and to promote useful discoveries." But under the Articles of Confederation, the Continental Congress had no authority to issue copyright, instead it passed a resolution encouraging the States to "secure to the authors or publishers of any new book not hitherto printed... the copy right of such books for a certain time not less than fourteen years from the first publication; and to secure to the said authors, if they shall survive the term first mentioned,... the copy right of such books for another term of time no less than fourteen years."[16]
Three states had already enacted copyright statutes in 1783 prior to the Continental Congress resolution, and in the subsequent three years all of the remaining states except Delaware passed a copyright statute. Seven of the States followed the Statute of Anne and the Continental Congress' resolution by providing two fourteen-year terms. The five remaining States granted copyright for single terms of fourteen, twenty, and twenty-one years, with no right of renewal.[17]
At the Constitutional Convention of 1787 both James Madison of Virginia and Charles Pinckney of South Carolina submitted proposals that would allow Congress the power to grant copyright for a limited time. These proposals are the origin of the Copyright Clause in the United States Constitution, which allows the granting of copyright and patents for a limited time to serve a utilitarian function, namely "to promote the progress of science and useful arts". The first federal copyright act, the Copyright Act of 1790, granted copyright for a term of "fourteen years from the time of recording the title thereof", with a right of renewal for another fourteen years if the author survived to the end of the first term. The act covered not only books, but also maps and charts. With exception of the provision on maps and charts the Copyright Act of 1790 is copied almost verbatim from the Statute of Anne.[17]
At the time works only received protection under federal statutory copyright if the statutory formalities, such as a proper copyright notice, were satisfied. If this was not the case the work immediately entered into the public domain. In 1834 the Supreme Court ruled in Wheaton v. Peters, a case similar to the British Donaldson v Beckett of 1774, that although the author of an unpublished work had a common law right to control the first publication of that work, the author did not have a common law right to control reproduction following the first publication of the work.[17]
Latin America
Cover page of the British Copyright Act 1911, also known as the Imperial Copyright Act of 1911. "Part I Imperial Copyright. Rights. 1.(1) Subject to the provisions of this Act, copyright shall subsist throughout the parts of His Majesty's dominions to which this Act extends for the term hereinafter mentioned in every original literary dramatic musical and artistic work, if..."
Latin American countries established national copyright laws following independence from the Spanish and Portuguese colonial powers. Latin American countries were among the first countries outside Europe to establish copyright law, with Brazil being the fourth country in the world to establish national copyright laws in 1804, after the UK, France and the United States. The foundation of Brazilian copyright law[18] was the French Civil Code. Copyright law was initially established in Mexico following a Spanish court order in 1820, and in 1832 Mexico passed its first copyright statute. Copyright statutes had been established in eight Latin American countries by the 1850s.[19]
Africa, Asia, and the Pacific
Copyright law was introduced in African, Asian and Pacific countries in the late 19th century by European colonial powers, especially Britain and France. After the 1884 Congress of Berlin European colonial powers imposed new laws and institutions in their colonies, including copyright laws. The British Empire introduced copyright law in its African and Asian colonies through the Copyright Act 1911, also known as the Imperial Copyright Act of 1911. Similarly France applied its copyright law throughout its colonies, and the French National Institute for Intellectual Property (INPI) acted as the colonial intellectual property authority.[19] The introduction of copyright laws in colonies occurred in the context of colonial powers' desire to "civilize" their colonies and to protect the commercial interests of the colonial powers. While approaches varied, copyright laws were generally not adapted to fit local conditions.[20]
International copyright law
Berne Convention for the Protection of Literary and Artistic Works
Berne Convention signatory countries (in blue).
The Berne Convention was first
established in 1886, and was
subsequently re-negotiated in 1896
(Paris), 1908 (Berlin), 1928 (Rome),
1948 (Brussels), 1967 (Stockholm) and
1971 (Paris). The convention relates to
literary and artistic works, which
includes films, and the convention
requires its member states to provide
protection for every production in the
literary, scientific and artistic domain.
The Berne Convention has a number of core features, including the principle of national treatment, which holds that each member state to the Convention would give citizens of other member states the same rights of copyright that it gave to its own citizens (Article 3-5).[21]
Another core feature is the establishment of minimum standards of national copyright legislation in that each member state agrees to certain basic rules which their national laws must contain, though member states can if they wish increase the amount of protection given to copyright owners. One important minimum rule was that the term of copyright was to be a minimum of the author's lifetime plus 50 years. Another important minimum rule established by the Berne Convention is that copyright arises with the creation of a work and does not depend upon any formality such as a system of public registration (Article 5(2)). At the time some countries did require registration of copyright, and when Britain implemented the Berne Convention in the Copyright Act 1911 it had to abolish its system of registration at Stationers' Hall.[21]
The Berne Convention focuses on authors as the key figure in copyright law, and the stated purpose of the convention is "the protection of the rights of authors in their literary and artistic works" (Article 1), rather than the protection of publishers and other actors in the process of disseminating works to the public. In the 1928 revision the concept of moral rights was introduced (Article 10bis), giving authors the right to be identified as such and to object to derogatory treatment of their works. These rights, unlike economic rights such as preventing reproduction, are generally not transferrable to others.[21]
The Berne Convention also enshrined limitations and exceptions to copyright, enabling the reproduction of literary and artistic works without the copyright owner's prior permission. The detail of these exceptions was left to national copyright legislation, but the guiding principle is stated in Article 9 of the convention. The so-called three-step test holds that an exception is only permitted "in certain special cases, provided that such reproduction does not conflict with a normal exploitation of the work and does not unreasonably prejudice the legitimate interests of the author". Free use of copyrighted work is expressly permitted in the case of quotations from lawfully published works, illustration for teaching purposes, and news reporting (Article 10).[21]
European copyright law
In the 1980s the European Community started to regard copyright as an element in the creation of a single market. Since 1991 the EU has passed a number of directives on copyright, designed to harmonize copyright laws in member states in certain key areas, such as computer programs, databases and the internet. The directives aimed to reduce obstacles to the free movement of goods and services within the European Union, such as for example in rental rights, satellite broadcasting, copyright term and resale rights.[22] Key directives include the 1993 Copyright Duration Directive, Directive 2000/31/EC of the European Parliament and of the Council of 8 June 2000 on certain legal aspects of information society services, in particular electronic commerce, in the Internal Market ('Directive on electronic commerce' or E-Commerce Directive), the 2001 InfoSoc Directive, also known as the Copyright Directive, and the 2004 Directive on the enforcement of intellectual property rights.
Agreement on Trade-Related Aspects of Intellectual Property Rights (TRIPS)
Important developments on copyright at the international level in the 1990s include the 1994 Agreement on Trade-Related Aspects of Intellectual Property Rights, known as the TRIPS Agreement. TRIPS was negotiated at the end of the Uruguay Round of the General Agreement on Tariffs and Trade (GATT) and contains a number of provisions on copyright. Compliance with the TRIPS Agreement is required of states wishing to be members of the World Trade Organization (WTO). States need to be signatories of the Berne Convention and comply with all its provisions, except for the provision on moral rights (Article 9(1)). States need to bring computer programs and databases within the scope of works covered by copyright law (Article 10). States need to provide for rental rights in at least computer programs and films (Article 11). Where copyright term, that is, the duration of copyright, is calculated other than by reference to the life of a natural person, States need to give a minimum term of 50 years calculated
from either the date of authorized publication or the creation of the work.
[22]
Anti-Counterfeiting Trade Agreement
The Anti-Counterfeiting Trade Agreement (ACTA) is a proposed plurilateral trade agreement which is claimed by its
proponents to be in response "to the increase in global trade of counterfeit goods and pirated copyright protected
works."
[23]
The scope of ACTA is broad, including counterfeit physical goods, as well as "internet distribution and
information technology".
[24]
In October 2007 the United States, the European Community, Switzerland and Japan announced that they would
negotiate ACTA. Furthermore, the following countries have joined the negotiations: Australia, the Republic of Korea,
New Zealand, Mexico, Jordan, Morocco, Singapore, the United Arab Emirates and Canada.
[24] [25] [26]
The ACTA
negotiations have been largely conducted in secrecy, with very little information being officially disclosed. However,
on 22 May 2008 a discussion paper about the proposed agreement was uploaded to Wikileaks, and newspaper
reports about the secret negotiations quickly followed.
[26] [27] [28] [29]
China
The copying of software and films for unauthorized distribution in China has become an ongoing diplomatic
issue between the United States and China in the 21st century.
[30]
Copyright by country
Copyright laws have been standardized to some extent through international conventions such as the Berne
Convention. Although there are consistencies among nations' intellectual property laws, each jurisdiction has
separate and distinct laws and regulations about copyright.
[1]
The World Intellectual Property Organization
summarizes each of its member states' intellectual property laws on its website.
[31]
A copyright certificate for proof of the Fermat
theorem, issued by the State Department of
Intellectual Property of Ukraine
Obtaining copyright
© is the copyright symbol
in a copyright notice
Copyright law is different from country to country, and a copyright notice is required in
about 20 countries for a work to be protected under copyright.
[32]
Before 1989, all
published works in the US had to contain a copyright notice, the © symbol followed by
the publication date and the copyright owner's name, to be protected by copyright. This is no
longer the case: use of a copyright notice is now optional in the US, though notices are
still used,
[33]
in order to ensure copyright protection in those countries which require the
presence of the notice.
In all countries that are members of the Berne Convention, copyright is automatic and
need not be obtained through official registration with any government office. Once an
idea has been reduced to tangible form, for example by securing it in a fixed medium (such as a drawing, sheet
music, photograph, a videotape, or a computer file), the copyright holder, or rightsholder, is entitled to enforce his
or her exclusive rights. However, while registration is not required to exercise copyright, in jurisdictions where the laws
provide for registration, it serves as prima facie evidence of a valid copyright. The original owner of the
copyright may be the employer of the author rather than the author himself, if the work is a "work for hire".
In the United States, 37 CFR § 202.1(a) prohibits copyright of a single word, title
or small group of words or phrases, regardless of originality.
[34]
In some circumstances these may be covered by
trademarks.
Copyright term
Copyright subsists for a variety of lengths in different jurisdictions. The length of the term can depend on several
factors, including the type of work (e.g. musical composition or novel), whether the work has been published or not,
and whether the work was created by an individual or a corporation. In most of the world, the default length of
copyright is the life of the author plus either 50 or 70 years. In the United States, the term for most existing works
ends 70 years after the death of the author. If the work was a work for hire (e.g., those created by a
corporation) then copyright persists for 120 years after creation or 95 years after publication, whichever is shorter. In
some countries (for example, the United States
[35]
and the United Kingdom),
[36]
copyrights expire at the end of the
calendar year in question. Proposed amendments may, however, change the way that these laws are enforced.
[37]
The length and requirements for copyright duration are subject to change by legislation, and since the early 20th
century there have been a number of adjustments made in various countries, which can make determining the
duration of a given copyright somewhat difficult. For example, the United States used to require copyrights to be
renewed after 28 years to stay in force, and formerly required a copyright notice upon first publication to gain
coverage. In Italy and France, there were post-wartime extensions that could increase the term by approximately 6
years in Italy and up to about 14 in France. Many countries have extended the length of their copyright terms
(sometimes retroactively). International treaties establish minimum terms for copyrights, but individual countries
may enforce longer terms than those treaties.
[38]
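As an illustration of the duration rules just described, the following minimal Python sketch estimates when a US work enters the public domain under the simplified assumptions stated above: life of the author plus 70 years for individual works, the shorter of 120 years from creation or 95 years from publication for works for hire, with terms running to the end of the calendar year. The function name and its inputs are hypothetical, and the sketch ignores the historical complications (renewal requirements, wartime extensions and so on) noted in this section; it is not legal advice.

    from datetime import date

    def us_copyright_expiry(death_year=None, creation_year=None,
                            publication_year=None, work_for_hire=False):
        # Hypothetical helper: estimate the date a US work enters the
        # public domain under the simplified rules described above.
        if work_for_hire:
            # Works for hire: 120 years after creation or 95 years after
            # publication, whichever expires first.
            candidates = []
            if creation_year is not None:
                candidates.append(creation_year + 120)
            if publication_year is not None:
                candidates.append(publication_year + 95)
            expiry_year = min(candidates)
        else:
            # Individual authors: life of the author plus 70 years.
            expiry_year = death_year + 70
        # Terms run to the end of the calendar year, so protection
        # lapses on 1 January of the following year.
        return date(expiry_year + 1, 1, 1)

    # An author who died in 1950: protected through 2020, in the public
    # domain from 1 January 2021. A corporate work created in 1930 and
    # published in 1935: the publication-based term (1935 + 95 = 2030)
    # is shorter, so it lapses on 1 January 2031.
    print(us_copyright_expiry(death_year=1950))
    print(us_copyright_expiry(creation_year=1930, publication_year=1935,
                              work_for_hire=True))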
Exclusive rights granted by copyright
Copyright is, literally, the right to copy, though in legal terms "the right to control copying" is more accurate.
Copyrights are exclusive statutory rights to exercise control over copying and other exploitation of a work for a
specific period of time. The copyright owner is given two sets of rights: an exclusive, positive right to copy and
exploit the copyrighted work, or license others to do so, and a negative right to prevent anyone else from doing so
without consent, with the possibility of legal remedies if they do.
[39]
Copyright initially only granted the exclusive right to copy a book, allowing anybody to use the book to, for
example, make a translation, adaptation or public performance.
[40]
At the time print on paper was the only format in
which most text-based copyrighted works were distributed. Therefore, while the language of book contracts was
typically very broad, the only exclusive rights that had any significant economic value were rights to distribute the
work in print.
[41]
The exclusive rights granted by copyright law to copyright owners have been gradually expanded
over time, and now uses of the work such as dramatization, translation, and derivative works such as adaptations and
transformations fall within the scope of copyright.
[40]
With a few exceptions, the exclusive rights granted by
copyright are strictly territorial in scope, as they are granted by copyright laws in different countries. Bilateral and
multilateral treaties establish minimum exclusive rights in member states, meaning that there is some uniformity
across Berne Convention member states.
[42]
In the print-on-paper format, content is affixed onto paper and cannot be easily or conveniently
manipulated by the user. Duplication of printed works is time-consuming and generally produces a copy of
lower quality. Developments in technology have created new formats, in addition to paper, and new means of
distribution. In particular, digital formats distributed over computer networks have separated the content from its
means of delivery. Users of content are now able to exercise many of the exclusive rights granted to copyright
owners, such as reproduction, distribution and adaptation.
Types of work subject to copyright
The types of work which are subject to copyright have been expanded over time. Initially covering only books,
copyright law was revised in the 19th century to include maps, charts, engravings, prints, musical compositions,
dramatic works, photographs, paintings, drawings and sculptures. In the 20th century copyright was expanded to
cover motion pictures, computer programs, sound recordings, choreography and architectural works.
[40]
Idea–expression divide
Copyright law is typically designed to protect the fixed expression or manifestation of an idea rather than the
fundamental idea itself. Copyright does not protect ideas, only their expression; in the Anglo-American legal
tradition the idea-expression divide is a legal concept which explains the appropriate function of copyright laws.
[43]
Related rights and neighboring rights
The term related rights is used to describe database rights, public lending rights (rental rights), droit de suite and performers'
rights. Related rights may also refer to copyright in broadcasts and sound recordings.
[44]
Related rights award
copyright protection to works which are not author works, but rather technical media works which allow author
works to be communicated to a new audience in a different form. The substance of protection is usually not as great
as that for author works. In continental European copyright law, a system of neighboring rights has thus
developed and the approach was reinforced by the creation of the Rome Convention for the Protection of Performers,
Producers of Phonograms and Broadcasting Organizations in 1961.
[45]
First-sale doctrine and exhaustion of rights
Copyright law does not restrict the owner of a copy from reselling legitimately obtained copies of copyrighted
works, provided that those copies were originally produced by or with the permission of the copyright holder. It is
therefore legal, for example, to resell a copyrighted book or CD. In the United States, this is known as the first-sale
doctrine, and was established by the courts to clarify the legality of reselling books in second-hand bookstores. Some
countries may have parallel importation restrictions that allow the copyright holder to control the resale market. This
may mean for example that a copy of a book that does not infringe copyright in the country where it was printed
does infringe copyright in a country into which it is imported for retailing. The first-sale doctrine is known as
exhaustion of rights in other countries and is a principle that also applies, though somewhat differently, to patent and
trademark rights. It is important to note that the first-sale doctrine permits the transfer of the particular legitimate
copy involved. It does not permit making or distributing additional copies.
Limitations and exceptions
The expression "limitations and exceptions" refers to situations in which the exclusive rights granted to authors or
their assignees under copyright law do not apply or are limited for public interest reasons. They generally limit use
of copyrighted material to certain cases that do not require permission from the rightsholders, such as for
commentary, criticism, news reporting, research, teaching or scholarship, archiving, access by the visually impaired
etc. They essentially create a limitation of, or an exception to, the exclusive rights that are granted to the
creator of a copyright work by law. Copyright theory teaches that the balance between the monopoly granted to the
creator and the exceptions to this monopoly is at the heart of creativity: exclusive rights stimulate investment
and the production of creative works while, simultaneously, exceptions to those rights create a balance that allows
the use of creative works to support innovation, creation, competition and the public interest.
Limitations and exceptions serve a number of important public policy goals, such as addressing market failure, protecting freedom of
speech,
[46]
supporting education and ensuring equality of access (for example, for the visually impaired).
Some view "limitations and exceptions" as "user rights", seeing user rights as providing an essential balance to the rights
of copyright owners. There is no consensus amongst copyright experts as to whether they are "rights" or not. See for
example the National Research Council's Digital Agenda Report, note 1.
[47]
The concept of user rights has also been
recognized by courts, including the Canadian Supreme Court in CCH Canadian Ltd v. Law Society of Upper Canada
[48]
(2004 SCC 13), which classed "fair dealing" as such a user right. Such disagreements are
quite common in the philosophy of copyright, where debates about jurisprudential reasoning tend to act as proxies
for more substantial disagreements about good policy.
Changing technology and limitations and exceptions
The scope of copyright limitations and exceptions became a subject of significant controversy within various nations
in the late 1990s and early 2000s, largely due to the impact of digital technology, the changes in national copyright
legislations for compliance with TRIPS, and the enactment of anti-circumvention rules in response to the WIPO
Copyright Treaty. Academics and defenders of copyright exceptions fear that technology, contract law undermining
copyright law, and the failure to amend copyright law are reducing the scope of important exceptions and therefore
harming creativity. This has resulted in a number of declarations asserting that access to knowledge is
important for creativity, such as the Adelphi Charter in 2005 and, at a European level, a May 2010 declaration
entitled Copyright for Creativity - A Declaration for Europe.
[49]
The declaration was supported by industry, artist,
education and consumer groups. The declaration states that "While exclusive rights have been adapted and
harmonized to meet the challenges of the knowledge economy, copyright’s exceptions are radically out of line with
the needs of the modern information society. The lack of harmonisation of exceptions hinders the circulation of
knowledge based goods and services across Europe. The lack of flexibility within the current European exceptions
regime also prevents us from adapting to a constantly changing technological environment."
International legal instruments and limitations and exceptions
Limitations and exceptions are also the subject of significant regulation by global treaties. These treaties have
harmonized the exclusive rights which must be provided by copyright laws, and the Berne three-step test operates to
constrain the kinds of copyright exceptions and limitations which individual nations can enact. On the other hand,
international copyright treaties place almost no requirements on national governments to provide exemptions from
exclusive rights; a notable exception to this is Article 10(1) of the Berne Convention, which guarantees a limited
right to make quotations from copyrighted works. Because of this lack of balance in international treaties, in October
2004, WIPO agreed to adopt a significant proposal offered by Argentina and Brazil, the "Proposal for the
Establishment of a Development Agenda for WIPO" also known simply as the "Development Agenda" - from the
Geneva Declaration on the Future of the World Intellectual Property Organization.
[50]
This proposal was well
supported by developing countries. A number of civil society bodies have been working on a draft Access to
Knowledge,
[51]
or A2K, Treaty which they would like to see introduced.
Fair use and fair dealing
Copyright does not prohibit all copying or replication. In the United States, the fair use doctrine, codified by the
Copyright Act of 1976 as 17 U.S.C. § 107,
[52]
permits some copying and distribution without permission of the
copyright holder or payment to same. The statute does not clearly define fair use, but instead gives four
non-exclusive factors to consider in a fair use analysis. Those factors are:
1. the purpose and character of the use;
2. the nature of the copyrighted work;
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
4. the effect of the use upon the potential market for or value of the copyrighted work.
[53]
In the United Kingdom and many other Commonwealth countries, a similar notion of fair dealing was established by
the courts or through legislation. The concept is sometimes not well defined; however, in Canada, private copying for
personal use has been expressly permitted by statute since 1999. In Australia, the fair dealing exceptions under the
Copyright Act 1968 (Cth) are a limited set of circumstances under which copyrighted material can be legally copied
or adapted without the copyright holder's consent. Fair dealing uses are research and study; review and critique;
parody and satire; news reportage and the giving of professional advice (i.e. legal advice). Under current Australian
law it is still a breach of copyright to copy, reproduce or adapt copyright material for personal or private use without
permission from the copyright owner. Other technical exemptions from infringement may also apply, such as the
temporary reproduction of a work in machine readable form for a computer.
In the United States the Audio Home Recording Act (AHRA, codified in 17 U.S.C. Chapter 10 in 1992) prohibits action against
consumers making noncommercial recordings of music, in return for royalties on both media and devices plus
mandatory copy-control mechanisms on recorders.
Section 1008. Prohibition on certain infringement actions
No action ever may be brought under this title alleging infringement of copyright based on the manufacture,
importation, or distribution of a digital audio recording device, a digital audio recording medium, an analog
recording device, or an analog recording medium, or based on the non-commercial use by a consumer of such
a device or medium for making digital musical recordings or analog musical recordings.
Later acts amended US Copyright law so that for certain purposes making 10 copies or more is construed to be
commercial, but there is no general rule permitting such copying. Indeed, making one complete copy of a work, or in
many cases using a portion of it, for commercial purposes will not be considered fair use. The Digital Millennium
Copyright Act prohibits the manufacture, importation, or distribution of devices whose intended use, or only
significant commercial use, is to bypass an access or copy control put in place by a copyright owner. An appellate
court has held that fair use is not a defense to engaging in such distribution.
Educational use is regarded as "fair use" in most jurisdictions, but the restrictions vary widely from nation to
nation.
[54]
A recent Israeli District Court decision dated 2 September 2009
[55] [56]
accepted the defence of fair use for a site linking to
P2P live feeds of soccer matches. The main reasoning was based on the public importance of certain sporting events,
i.e. the public's rights as a counterweight to the copyright holders' rights.
Licensing, transfer, and assignment
DVD: All Rights Reserved
Copyright may be bought and sold much like other
properties.
[57]
In the individual licensing model the
copyright owner authorizes the use of the work in return for remuneration and under the conditions specified by the
license. The conditions of the license may be complex: the exclusive rights granted by copyright to the copyright
owner can be split territorially or with respect to language, the sequence of uses may be fixed, and the number of
copies to be made and their subsequent use may also be specified. Furthermore, sublicenses and representation
agreements may also be made.
[58]
A contractual transfer of all or some of the rights in a
copyrighted work is known as a copyright license. A
copyright assignment is an immediate and irrevocable transfer of the copyright owner's entire interest in all or some
of the rights in the copyrighted work. Copyright licensing and assignment cover only the specified geographical
region. There are significant differences in national copyright laws with regard to copyright licensing and
assignment.
[59]
Copyright licenses, as a minimum, define the copyrighted works and rights subject to the license, the territories or
geographic region in which the license applies, the term or length of the license, and the consideration (such as a
one-off payment or royalties) for the license. The exclusive rights granted by copyright law can all be licensed, but they
vary depending on local law. Depending on how the work may be used, different licenses need to be acquired. For
example, the activity of distributing videocassettes of a motion picture will require the license for the right to
reproduce the motion picture on a videocassette and the right to distribute the copies to the public. Because the aspect
ratio of a television screen is different from that of a wide-screen cinema, requiring the cutting of the wide-screen "ends",
it may also be necessary to obtain a license for the right to modify the motion picture. If the motion picture is to be
edited or modified the copyright owner may include control over or approval of the editing process, or of the final
result. Existing contractual agreements between the copyright owner and the director, may also require approval
from the director to any changes made to the copyrighted work.
[60]
Different types of exclusive licenses exist, such as licenses that exclude the licensor from use of the licensed
copyrighted work in the relevant region and for the stated time period, or exclusive licenses that prevent the licensor
from licensing other parties in the geographic region and during the license term. There are also various types of
non-exclusive licenses, including the right of first refusal should the licensor elect to offer future licenses to third
parties. If a licensing agreement does not specify that the license is exclusive it may nonetheless be deemed
exclusive depending on the language of the contract. Depending on local laws the owner of an exclusive license may
be deemed the "copyright owner" of that work and may bring an action for copyright infringement.
[61]
The term or length of the copyright license is not allowed to exceed the copyright term specified by local law.
Licenses may establish various payment arrangements, such as royalties as a percentage of sales or as a stepped up or
down percentage of sales, e.g. 5 percent of sales up to 50,000 units and 2.5 percent of sales in excess thereof. The trigger
for royalty payments may be sales, or other factors, such as the number of "hits" or views on a website. Minimum
royalty payments are arrangements whereby a minimum up-front payment is made and then recouped against the
percentage of sales. The up-front payment may be non-refundable if sales royalties do not reach the amount of the
payment.
[61]
Minimum royalty payment arrangements may be accompanied by marketing duties for the licensee, e.g.
best effort and reasonable effort to market and promote the copyrighted work.
[62]
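The stepped royalty and minimum-payment arrangements described above reduce to simple arithmetic. The short Python sketch below shows one possible reading of the example in the text (5 percent of sales on the first 50,000 units, 2.5 percent on units in excess, recouped against a non-refundable up-front minimum); the function names, the unit price and the advance figure are hypothetical and introduced only for illustration.

    def tiered_royalty(units_sold, unit_price, threshold=50_000,
                       rate_below=0.05, rate_above=0.025):
        # Stepped-down royalty: one rate on sales up to the threshold,
        # a lower rate on sales in excess of it.
        below = min(units_sold, threshold) * unit_price * rate_below
        above = max(units_sold - threshold, 0) * unit_price * rate_above
        return below + above

    def amount_due(units_sold, unit_price, minimum_advance):
        # Royalties earned are recouped against the up-front minimum
        # payment; only the excess, if any, is paid out.
        earned = tiered_royalty(units_sold, unit_price)
        return max(earned - minimum_advance, 0.0)

    # 80,000 units at a price of 10 per unit: 5% on the first 50,000
    # units (25,000) plus 2.5% on the remaining 30,000 (7,500).
    print(tiered_royalty(80_000, 10.0))          # 32500.0
    print(amount_due(80_000, 10.0, 20_000))      # 12500.0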
Collective rights management
Collective rights management is the licensing of copyright and related rights by organizations acting on behalf of
rights owners. Collective management organizations, such as collecting societies, typically represent groups of
copyright and related rights owners, such as authors, composers, publishers, writers, photographers, musicians and
performers.
[63]
The following exclusive rights granted under copyright law are commonly collectively managed by
collecting societies: the right to public performance, the right to broadcasting, the mechanical reproduction rights in
recorded music, the performing rights in dramatic works, the rights of reprographic reproduction of literary and
musical works, and related rights, for example the rights of performers and producers in recorded music when used
in broadcasts.
[63]
The collective management of copyright and related rights is undertaken by various types of collective management
organizations, most commonly collecting societies. Collecting societies act on behalf of their members, which may
be authors or performers, and issue copyright licenses to users authorizing the use of the works of their members.
[63]
Other forms of collective management organizations include rights clearance centers and one-stop shops. One-stop
shops are coalitions of collecting societies and rights clearance centers offering a centralized source for users to
obtain licenses. They have become popular in response to multi-media productions requiring users to obtain multiple
licenses for relevant copyright and related rights.
[63]
Extended collective licensing
The first extended collective licensing (ECL) laws were established in Denmark, Finland, Iceland, Norway and
Sweden in the 1960s.
[64]
ECL is a form of collective rights management whereby ECL laws allow for freely
negotiated copyright licensing contracts for the exclusive rights granted by copyright. ECL laws are designed
specifically for mass use, where negotiating individually will rarely allow a single rights owner to fully benefit financially
from their exclusive rights. Under ECL laws, collecting societies negotiate ECL agreements with users, such as a TV
broadcaster, covering the types of copyrighted works for uses specified in the ECL license.
[64]
Subject to certain conditions, collecting societies can, under ECL law, apply to represent all rights owners on a
non-exclusive basis in a specific category of copyrighted works.
[65]
The collecting society can then negotiate an ECL
agreement with a user for certain uses. This agreement applies to members of that collecting society, as well as
non-members. ECL laws require that collecting societies treat rights owners who are non-members in the same way
they treat their members. Non-members are also given the right to individual remuneration, i.e. royalty payment, by
the collecting society, and the right to exclude their work from an ECL agreement.
[66]
Compulsory licensing
In some countries copyright law provides for compulsory licenses of copyrighted works for specific uses. In many
cases the remuneration or royalties received for a copyrighted work under compulsory license are specified by local
law, but may also be subject to negotiation. Compulsory licensing may be established through negotiating licenses
that provide terms within the parameters of the compulsory license.
[67]
Article 11bis(2) and Article 13(1) of the
Berne Convention for the Protection of Literary and Artistic Works provide the legal basis for compulsory licenses.
They state that member states are free to determine the conditions under which certain exclusive rights may be
exercised in their national laws. They also provide for the minimum requirements to be set when compulsory
licenses are applied, namely that they must not prejudice the author's right to fair compensation.
[68]
Future rights under pre-existing agreements
It is commonplace in copyright licensing to license not only new uses which may be developed but also works which
are not yet created. However, local law may not always recognize that the wording in licensing agreements does
cover new uses permitted by subsequently developed technology.
[59]
Whether a license covers future, as yet
unknown, technological developments is subject to frequent disputes. Litigation over the use of a licensed
copyrighted work in a medium unknown when the license was agreed is common.
[60]
Newspaper advert: "United States and
Foreign Copyright. Patents and
Trade-Marks A Copyright will protect
you from Pirates. And make you a
fortune."
Enforcement
Copyrights are generally enforced by the holder in a civil court, but there
are also criminal infringement statutes in some jurisdictions. While central
registries are kept in some countries, which aid in proving claims of
ownership, registering does not necessarily prove ownership, nor does the fact
of copying (even without permission) necessarily prove that copyright was
infringed. Criminal sanctions are generally aimed at serious counterfeiting
activity, but are now becoming more commonplace as copyright collectives
such as the RIAA are increasingly targeting the file-sharing domestic Internet
user. (See: File sharing and the law)
Infringement
An unskippable anti-piracy film included on
movie DVDs equates copyright infringement
with theft.
Copyright infringement, or copyright violation, is the unauthorized use
of works covered by copyright law, in a way that violates one of the
copyright owner's exclusive rights, such as the right to reproduce or
perform the copyrighted work, or to make derivative works.
For electronic and audio-visual media under copyright, unauthorized
reproduction and distribution is also commonly referred to as piracy.
An early reference to piracy in the context of copyright infringement
was made by Daniel Defoe in 1703 when he said of his poem The
True-Born Englishman "Had I wrote it for the gain of the press, I
should have been concerned at its being printed again and again by
PIRATES, as they call them, and PARAGRAPHMEN: but if they do justice, and print it true, according to the copy,
they are welcome to sell it for a penny, if they please: the pence, indeed, is the end of their works."
[69]
The practice
of labeling the act of infringement as "piracy" predates statutory copyright law. Prior to the Statute of Anne 1709, the
Stationers' Company of London in 1557 received a Royal Charter giving the company a monopoly on publication
and tasking it with enforcing the charter. Those who violated the charter were labeled pirates as early as 1603.
[70]
Orphan works
An orphan work is a work under copyright protection whose copyright owner is difficult or impossible to contact.
The creator may be unknown, or, where the creator is known, it may be unknown who represents them.
[71]
Public domain
Newton's own copy of his Principia, with hand-written
corrections for the second edition
Works are in the public domain if their kind is not covered by
intellectual property rights or if the intellectual property rights
have expired,
[72]
have been forfeited, or have never been
claimed.
[73]
Examples include the English language, the formulae
of Newtonian physics, as well as the works of Shakespeare and the
patents over powered flight.
[72]
Copyright as property right
In the Anglo-American tradition copyright is understood as
property, as distinguished from the droit d'auteur understanding of
copyright.
[74]
In Britain copyright was initially conceived of as a
"chose in action", that is an intangible property, as opposed to tangible property.
[75]
In the case of tangible property
the property rights are bundled with the ownership of the property, and property rights are transferred once the
property is sold. In contrast, copyright law detaches the exclusive rights granted to the copyright
owner from ownership of the good, which is regarded as a reproduction. Hence the purchaser of a book buys
ownership of the book as a good, but not the underlying copyright in the book's content. If a derivative work based
on the content of the book is made, permission needs to be sought from the copyright owner, not all owners of a
copy of the book.
[76]
The Statute of Anne specifically referred to copyright in terms of literary property that is limited in time. Many
contemporaries did not believe that the statute was concerned with property "in the strict sense of the word" and the
question of whether copyright is a property right dates back to the Battle of the Booksellers. In 1773 Lord Gardenston
commented in Hinton v. Donaldson that "the ordinary subjects of property are well known, and easily conceived...
But property, when applied to ideas, or literary and intellectual compositions, is perfectly new and surprising..."
[77]
It was in the 19th century that the term intellectual property began to be used as an umbrella term for patents,
copyright and other laws.
[78] [79]
The expansion of copyright and of the copyright term is mirrored in the rhetoric that has
been employed in referring to copyright. Courts, when strengthening copyright, have characterized it as a type of
property. Companies have strongly emphasized copyright as property, with leaders in the music and movie industries
seeking to "protect private property from being pillaged" and making forceful assertions that copyright is an absolute
property right.
[80]
With reference to the expanding scope of copyright, one commentator noted that "We have gone
from a regime where a tiny part of creative content was controlled to a regime where most of the most useful and
valuable creative content is controlled for every significant use."
[40]
According to Graham Dutfield and Uma
Suthersanen copyright is now a "class of intangible business assets", mostly owned by companies who function as
"investor, employer, distributor and marketer". While copyright was conceived as personal property awarded to
creators, creators now rarely own the rights in their works.
[81]
Copyright and authors
Copyright law emerged in 18th century Europe in relation to printed books and a new notion of authorship. In the
European Renaissance and Neoclassical period the writer was regarded as an instrument, not as an independent
creator. The writer was seen as using external sources to create a work of inspiration. In the 18th century a changing
concept of genius located the source of inspiration within the writer, whose special talents and giftedness were the
basis for creating works of inspiration and uniqueness. The concept of the author as original creator and owner of
their work emerged partly from the new concept of property rights and John Locke's theory that individuals were
"owners of themselves". According to Locke, individuals invested their labour into natural goods, thereby creating
property. Authors were argued to be the owners of their work because they had invested their labour in creating it.
[82]
According to Patterson and Livingston there has been confusion about the nature of copyright ever since Donaldson v
Beckett, a case heard in 1774 by the British House of Lords on whether copyright is the natural law right of the
author or the statutory grant of a limited monopoly. One theory holds that copyright's origin occurs at the creation of
a work, the other that its origin exists only through the copyright statute.
[83]
Copyright and competition law
Copyright is typically thought of as a limited, legally sanctioned monopoly.
[59]
Because of this, copyright licensing
may sometimes interfere too much in free and competitive markets.
[84]
These concerns are governed by legal
doctrines such as competition law in the European Union, anti-trust law in the United States, and anti-monopoly law
in Russia and Japan.
[84]
Competition issues may arise when the licensing party unfairly leverages market power,
engages in price discrimination through its licensing terms, or otherwise uses a licensing agreement in a
discriminatory or unfair manner.
[59] [84]
Attempts to extend the copyright term granted by law – for example, by
collecting royalties for use of the work after its copyright term has expired and it has passed into the public domain –
raise such competition concerns.
[59]
In April 1995, the US published "Antitrust Guidelines for the licensing of
Intellectual Property" which apply to patents, copyright, and trade secrets. In January 1996, the European Union
published Commission Regulation No.240/96 which applies to patents, copyright, and other intellectual property
rights, especially regarding licenses. The guidelines apply mutatis mutandis to the extent possible.
[85]
Copyright and contract
In all but a few countries, private contracts can override the limitations and exceptions provided in copyright law.
[86]
Copyright and economic development
The view that a restrictive copyright benefits anybody has been challenged. According to the historian Eckhard
Höffner the 1710 introduction of copyright law in England (and later in France) acted as a barrier to economic
progress for over a century, a situation he contrasts with Germany where authors were paid by page and their work
was not protected by any copyright laws.
[87]
Höffner argues that copyright laws allowed British publishers to print
books only in limited quantities for high prices, while in Germany a proliferation of publishing took place that
benefitted authors, publishers, and the public, and may have been an important factor in Germany's economic
development.
[88] [89]
References
[1] Broussard, Sharee L. (September 2007). The copyleft movement: creative commons licensing (http:// findarticles.com/ p/ articles/ mi_7081/
is_3_26/ai_n28457434?tag=content;col1). Communication Research Trends. .
[2] Article I, Section 8, Clause 8, United States Constitution
[3] "Copyright and Related Rights" (http:// www.wipo. int/ copyright/en/ ). World Intellectual Property Organization. . Retrieved 7 February
2010.
[4] de Sola Pool, Ithiel (1983). Technologies of freedom (http:/ / www.google.com/ books?id=BzLXGUxV4CkC& pg=PA15&
dq=Areopagitica+freedom+of+speech+ britain&lr=&as_brr=3&cd=36#v=onepage&q=& f=false). Harvard University Press. p. 14.
ISBN 9780674872332. .
[5] MacQueen, Hector L; Charlotte Waelde and Graeme T Laurie (2007). Contemporary Intellectual Property: Law and Policy (http:// www.
google.com/ books?id=_Iwcn4pT0OoC& dq=contemporary+intellectual+property&source=gbs_navlinks_s). Oxford University Press.
p. 34. ISBN 9780199263394. .
[6] Sanders, Karen (2003). Ethics & Journalism (http:// www. google.com/ books?id=bnpliIUyO60C& dq=Areopagitica+freedom+of+
speech+ britain& lr=&as_brr=3&source=gbs_navlinks_s). Sage. p. 66. ISBN 9780761969679. .
[7] de Sola Pool, Ithiel (1983). Technologies of freedom (http:/ / www.google.com/ books?id=BzLXGUxV4CkC& pg=PA15&
dq=Areopagitica+freedom+of+speech+ britain&lr=&as_brr=3&cd=36#v=onepage&q=& f=false). Harvard University Press. p. 15.
ISBN 9780674872332. .
[8] Ronan, Deazley (2006). Rethinking copyright: history, theory, language (http:// www.google. com/ books?id=dMYXq9V1JBQC&
dq=statute+of+anne+ copyright&lr=&as_brr=3&source=gbs_navlinks_s). Edward Elgar Publishing. p. 13. ISBN 9781845422820. .
[9] Ronan, Deazley (2006). Rethinking copyright: history, theory, language (http:// www.google. com/ books?id=dMYXq9V1JBQC&
dq=statute+of+anne+ copyright&lr=&as_brr=3&source=gbs_navlinks_s). Edward Elgar Publishing. pp. 13–14. ISBN 9781845422820. .
[10] Jonathan, Rosenoer (1997). Cyberlaw: the law of the internet (http:// www.google.com/ books?id=HlG2esMIm7kC& dq=statute+ of+
anne+ copyright&lr=&as_brr=3&source=gbs_navlinks_s). Springer. p. 34. ISBN 9780387948324. .
[11] Ronan, Deazley (2006). Rethinking copyright: history, theory, language (http:/ / www.google. com/ books?id=dMYXq9V1JBQC&
dq=statute+of+anne+ copyright&lr=&as_brr=3&source=gbs_navlinks_s). Edward Elgar Publishing. p. 14. ISBN 9781845422820. .
[12] Rimmer, Matthew (2007). Digital copyright and the consumer revolution: hands off my iPod (http:// www. google. com/
books?id=1ONyncVruj8C&dq=statute+ of+anne+ copyright+scotland& as_brr=3&source=gbs_navlinks_s). Edward Elgar Publishing. p. 4.
ISBN 9781845429485. .
[13] Marshall, Lee (2006). Bootlegging: romanticism and copyright in the music industry (http:/ / www.google. com/
books?id=25luX89BlA0C& dq=statute+ of+anne+ copyright+scotland& lr=&as_brr=3&source=gbs_navlinks_s). Sage. p. 15.
ISBN 9780761944904. .
[14] Van Horn Melton, James (2001). The rise of the public in Enlightenment Europe (http:/ / books.google.com/ books?id=QZovusQ1SjYC&
dq="perpetual+copyright"& source=gbs_navlinks_s). Cambridge University Press. pp. 140–141. ISBN 9780521469692. .
[15] Peter K, Yu (2007). Intellectual Property and Information Wealth: Copyright and related rights (http:// www.google. com/
books?id=tgK9BzcF5WgC&dq=statute+ of+anne+ copyright&lr=&as_brr=3& source=gbs_navlinks_s). Greenwood Publishing Group.
pp. 141–142. ISBN 9780275988838. .
[16] Peter K, Yu (2007). Intellectual Property and Information Wealth: Copyright and related rights (http:/ / www.google. com/
books?id=tgK9BzcF5WgC&dq=statute+ of+anne+ copyright&lr=&as_brr=3& source=gbs_navlinks_s). Greenwood Publishing Group.
p. 142. ISBN 9780275988838. .
[17] Peter K, Yu (2007). Intellectual Property and Information Wealth: Copyright and related rights (http:/ / www.google. com/
books?id=tgK9BzcF5WgC&dq=statute+ of+anne+ copyright&lr=&as_brr=3& source=gbs_navlinks_s). Greenwood Publishing Group.
p. 143. ISBN 9780275988838. .
[18] Penal Code, Brasilian. Brasilian Penal Code (http:/ / www. planalto. gov.br/ccivil/ leis/ L9610. htm). .
[19] Deere, Carolyn (2009). The implementation game: the TRIPS agreement and the global politics of intellectual property reform in developing
countries (http:// books. google. com/ books?id=ZI3jI-YaTI0C&dq=inauthor:"Carolyn+ Deere"&hl=en&
ei=xzeFTIGBDp6TOPGQ6MAO& sa=X& oi=book_result& ct=result& resnum=1&ved=0CCsQ6AEwAA). Oxford University Press. p. 35.
ISBN 9780199550616. .
[20] Deere, Carolyn (2009). The implementation game: the TRIPS agreement and the global politics of intellectual property reform in developing
countries (http:// books. google. com/ books?id=ZI3jI-YaTI0C&dq=inauthor:"Carolyn+ Deere"&hl=en&
ei=xzeFTIGBDp6TOPGQ6MAO& sa=X& oi=book_result& ct=result& resnum=1&ved=0CCsQ6AEwAA). Oxford University Press. p. 36.
ISBN 9780199550616. .
[21] MacQueen, Hector L; Charlotte Waelde and Graeme T Laurie (2007). Contemporary Intellectual Property: Law and Policy (http:// www.
google.com/ books?id=_Iwcn4pT0OoC& dq=contemporary+intellectual+property&source=gbs_navlinks_s). Oxford University Press.
p. 37. ISBN 9780199263394. .
[22] MacQueen, Hector L; Charlotte Waelde and Graeme T Laurie (2007). Contemporary Intellectual Property: Law and Policy (http:// www.
google.com/ books?id=_Iwcn4pT0OoC& dq=contemporary+intellectual+property&source=gbs_navlinks_s). Oxford University Press.
p. 39. ISBN 9780199263394. .
[23] "Anti-Counterfeiting Trade Agreement | Intellectual Property Enforcement | Intellectual Property Policy" (http:// www.med. govt.nz/
templates/ContentTopicSummary____34357.aspx). Med.govt.nz. . Retrieved 2011-04-10.
[24] "What is ACTA?" (http:// www.eff.org/issues/ acta). Electronic Frontier Foundation (EFF). . Retrieved 1 December 2008.
[25] Geiger, Andrea (30 April 2008). "A View From Europe: The high price of counterfeiting, and getting real about enforcement" (http:/ /
thehill.com/ business--lobby/ a-view-from-europe-the-high-price-of-counterfeiting-and-getting-real-about-enforcement-2008-04-30.html).
The Hill. . Retrieved 27 May 2008.
[26] Pilieci, Vito (26 May 2008). "Copyright deal could toughen rules governing info on iPods, computers" (http:/ / www. canada.com/
vancouversun/ story. html?id=ae997868-220b-4dae-bf4f-47f6fc96ce5e&p=1). Vancouver Sun. . Retrieved 27 May 2008.
[27] "Proposed US ACTA multi-lateral intellectual property trade agreement (2007)" (http:/ / wikileaks. org/w/ index.
php?title=Proposed_US_ACTA_multi-lateral_intellectual_property_trade_agreement_(2007)&oldid=29522). Wikileaks. 22 May 2008. .
[28] Jason Mick (23 May 2008). "Wikileaks Airs U.S. Plans to Kill Pirate Bay, Monitor ISPs With Multinational ACTA Proposal" (http:// www.
dailytech. com/ article.aspx?newsid=11870). DailyTech. .
[29] Weeks, Carly (26 May 2008). "Anti-piracy strategy will help government to spy, critic says" (http:/ / www.theglobeandmail. com/ servlet/
story/LAC. 20080526. COPYRIGHT26/ / TPStory/National). The Globe and Mail. . Retrieved 27 May 2008.
[30] Lucy Montgomery, China's Creative Industries: Copyright, Social Network Markets, and the Business of Culture in a Digital Age (Edward
Elgar Publishing; 2011)
[31] WIPO Guide to Intellectual Property Worldwide (http:// www. wipo. int/ about-ip/en/ ipworldwide/country. htm)
[32] Fries, Richard C. (2006). Reliable design of medical devices (http:// www. google.com/ books?id=nO0yEZmE3ZkC&dq=copyright+
notices&lr=&as_brr=3&source=gbs_navlinks_s). CRC Press. p. 197. ISBN 0824723759, 9780824723750. . "In addition, copyright
protection is not available in some 20 foreign countries unless a work contains a copyright notice."
[33] Fries, Richard C. (2006). Reliable design of medical devices (http:/ / www. google.com/ books?id=nO0yEZmE3ZkC&dq=copyright+
notices&lr=&as_brr=3&source=gbs_navlinks_s). CRC Press. p. 196. ISBN 0824723759, 9780824723750. .
[34] Planesi v. Peters, 2005 WL 1939885 (9th Cir. 2005) No. 05-781 (Supreme Court of the United States of America)
[35] 17 U.S.C.  § 305 (http:/ / www. law. cornell.edu/ uscode/ 17/ 305. html)
[36] The Duration of Copyright and Rights in Performances Regulations 1995, part II (http:// www.opsi. gov. uk/ si/ si1995/
Uksi_19953297_en_3. htm), Amendments of the UK Copyright, Designs and Patents Act 1988
[37] UK must modernise copyright laws, report urges (http:// www.zdnet.co.uk/ news/ intellectual-property/2011/ 05/ 18/
uk-must-modernise-copyright-laws-report-urges-40092815/).
[38] Nimmer, David (2003). Copyright: Sacred Text, Technology, and the DMCA (http:// books. google.com/ books?id=RYfRCNxgPO4C).
Kluwer Law International. p. 63. ISBN 978-9041188762. OCLC 50606064. .
[39] Jones, Hugh; and Benson, Christopher (2002). Publishing law (http:// www.google. com/ books?id=CIsb4fsmJD8C& dq=uk+copyright+
law& source=gbs_navlinks_s). Routledge. pp. 12–13. ISBN 9780415261548. .
[40] Peter K, Yu (2007). Intellectual Property and Information Wealth: Copyright and related rights (http:// www.google. com/
books?id=tgK9BzcF5WgC&dq=statute+ of+anne+ copyright&lr=&as_brr=3& source=gbs_navlinks_s). Greenwood Publishing Group.
p. 346. ISBN 9780275988838. .
[41] WIPO Guide on the Licensing of Copyright and Related Rights (http:/ / www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 17. ISBN 9789280512717.
.
[42] WIPO Guide on the Licensing of Copyright and Related Rights (http:/ / www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 9. ISBN 9789280512717. .
[43] Simon, Stokes (2001). Art and copyright (http:/ / www.google.com/ books?id=h-XBqKIryaQC& dq=idea-expression+dichotomy& lr=&
as_brr=3& source=gbs_navlinks_s). Hart Publishing. pp. 48–49. ISBN 9781841132259. .
[44] The way ahead – A Strategy for Copyright in the Digital Age (http:// www.ipo.gov. uk/ c-strategy-digitalage.pdf). Intellectual Property
Office and Department for Business Innovation & Skills. October 2009. p. 10. .
[45] MacQueen, Hector L; Charlotte Waelde and Graeme T Laurie (2007). Contemporary Intellectual Property: Law and Policy (http:/ / www.
google.com/ books?id=_Iwcn4pT0OoC& dq=contemporary+intellectual+property&source=gbs_navlinks_s). Oxford University Press.
p. 38. ISBN 9780199263394. .
[46] P. Bernt Hugenholtz. Copyright And Freedom Of Expression In Europe(2001) Published in: Rochelle Cooper Dreyfuss, Harry First and
Diane Leenheer Zimmerman (eds.), Expanding the Boundaries of Intellectual Property, Oxford University Press
[47] http:// books. nap. edu/ html/ digital_dilemma/exec_summ. html#REF1
[48] http:// www. lexum. umontreal.ca/ csc-scc/ en/ pub/ 2004/ vol1/ html/ 2004scr1_0339. html
[49] "Copyright for Creativity. Broad coalition calls for European copyright to support digital creativity and innovation" (https:// www.
copyright4creativity.eu/ bin/ view/ Main/ PressRelease05May2010). 5 May 2010. .
[50] "Consumer Project on Technology web site, ''Geneva Declaration on the Future of the World Intellectual Property Organization''" (http://
www.cptech.org/ip/ wipo/ genevadeclaration. html). Cptech.org. . Retrieved 2011-04-10.
[51] "Consumer Project on Technology web site, ''Access to Knowledge (A2K)" (http:// www.cptech. org/ a2k/ ). Cptech.org. . Retrieved
2011-04-10.
[52] http:// www. law. cornell.edu/ uscode/ 17/ 107. html
[53] 17 U.S.C.  § 107 (http:// www. law. cornell.edu/ uscode/ 17/ 107. html)
[54] "International comparison of Educational "fair use" legislation" (http:// teflpedia.com/ index.
php?title=Copyright_in_English_language_teaching). Teflpedia.com. 2010-12-19. . Retrieved 2011-04-10.
[55] FAPL v. Ploni, 2 September 2009 (http:/ / info1. court.gov. il/ Prod03/ManamHTML5.nsf/ 03386E2BD41B4FF74225762500514826/
$FILE/DC517C1BE60D537E42257486003ED1E6.html?OpenElement)
[56] "A more thorough analysis of the FAPL v. Ploni decision" (http:// blog.ericgoldman.org/archives/ 2009/ 09/ israeli_judge_p.htm).
Blog.ericgoldman.org. 2009-09-21. . Retrieved 2011-04-10.
[57] WIPO Guide on the Licensing of Copyright and Related Rights (http:// www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 15. ISBN 9789280512717.
.
[58] WIPO Guide on the Licensing of Copyright and Related Rights (http:/ / www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 100.
ISBN 9789280512717. .
[59] WIPO Guide on the Licensing of Copyright and Related Rights (http:/ / www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 7. ISBN 9789280512717. .
[60] WIPO Guide on the Licensing of Copyright and Related Rights (http:/ / www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 8. ISBN 9789280512717. .
[61] WIPO Guide on the Licensing of Copyright and Related Rights (http:/ / www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. pp. 10–11.
ISBN 9789280512717. .
[62] WIPO Guide on the Licensing of Copyright and Related Rights (http:/ / www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 11. ISBN 9789280512717.
.
[63] "Collective Management of Copyright and Related Rights" (http:/ / www.wipo. int/ about-ip/en/ about_collective_mngt.html#P46_4989).
World Intellectual Property Organization. . Retrieved 14 November 2010.
[64] Gervais, Daniel (2006). Collective management of copyright and related rights (http:// books. google.com/ books?id=W_N0ctyT10wC&
source=gbs_navlinks_s). Kulwar Law International. pp. 264–265. ISBN 9789041123589. .
[65] Gervais, Daniel (June 2003). "Application of an Extended Collective Licensing Regime in Canada: Principles and Issues Relating to
Implementation" (http:// aix1. uottawa. ca/ ~dgervais/ publications/ extended_licensing. pdf) (PDF). Study prepared for the Department of
Canadian Heritage. University of Ottawa. p. 5. .
[66] Olsson, Henry (10 March 2010). "The Extended Collective License As Applied in the Nordic Countries" (http:// www.kopinor.no/ en/
copyright/extended-collective-license/ documents/ The+Extended+Collective+ License+ as+ Applied+in+ the+ Nordic+Countries.748.
cms). Presentation at Kopinor 25th Anniversary International Symposium May 2005. Kopinor. . Retrieved 14 November 2010.
[67] WIPO Guide on the Licensing of Copyright and Related Rights (http:// www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 16. ISBN 9789280512717.
.
[68] WIPO Guide on the Licensing of Copyright and Related Rights (http:/ / www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 101.
ISBN 9789280512717. .
[69] The life and times of Daniel De Foe: with remarks digressive and discursive. Google Books (http:/ / books. google. com/
books?id=kRhAAAAAYAAJ& pg=PA59#v=onepage&q& f=false)
[70] T. Dekker Wonderfull Yeare 1603 University of Oregon (http:/ / www.luminarium.org/renascence-editions/ yeare.html)
[71] The work and operation of the Copyright Tribunal: second report of session 2007-08, report, together with formal minutes, oral and written
evidence (http:/ / www.google. com/ books?id=kKw6n5kCFlkC& dq=copyright+orphan+works& source=gbs_navlinks_s). Great Britain:
Parliament: House of Commons: Innovation, Universities & Skills Committee. 2009. p. 28. ISBN 9780215514257. .
[72] Boyle, James (2008). The Public Domain: Enclosing the Commons of the Mind (http:// www.google. com/ books?id=Fn1Pl9Gv_EMC&
dq=public+domain& source=gbs_navlinks_s). CSPD. p. 38. ISBN 0300137400, 9780300137408. .
[73] Graber, Christoph Beat; and Mira Burri Nenova (2008). Intellectual Property and Traditional Cultural Expressions in a digital environment
(http:// www. google. com/ books?id=gK6OI0hrANsC& dq="public+domain"+ intellectual+ property&lr=&as_brr=3&
source=gbs_navlinks_s). Edward Elgar Publishing. p. 173. ISBN 1847209211, 9781847209214. .
[74] Deazley, Ronan; Kretschmer, Martin & Bently, Lionel (2010). Privilege and Property: Essays on the History of Copyright (http:/ / books.
google. com/ books?id=SRBkCOC8d-4C& dq=copyright+Limitations+ and+ exceptions+ history& source=gbs_navlinks_s). Open Book
Publishers. p. 347. ISBN 9781906924188. .
[75] Coyle, Michael (23 April 2002). "The History of Copyright" (http:// www. lawdit.co.uk/ reading_room/room/ view_article.asp?name=.. /
articles/ The History of Copyright.htm). Lawdit. . Retrieved 6 March 2010.
[76] Laikwan, Pang (2006). Cultural control and globalization in Asia: copyright, piracy, and cinema (http:// books.google. com/
books?id=a38gdoGOF-oC&printsec=frontcover&source=gbs_ge_summary_r& cad=0#v=onepage&q& f=false). Routledge. p. 32.
ISBN 9780415352017. .
[77] Brad, Sherman; Lionel Bently (1999). The making of modern intellectual property law: the British experience, 1760-1911 (http:// www.
google.com/ books?id=u2aMRA-eF1gC& dq=statute+ of+anne+ copyright&lr=&as_brr=3& source=gbs_navlinks_s). Cambridge
University Press. p. 19. ISBN 9780521563635. .
[78] Brad, Sherman; Lionel Bently (1999). The making of modern intellectual property law: the British experience, 1760-1911 (http:/ / www.
google. com/ books?id=u2aMRA-eF1gC& dq=statute+ of+anne+ copyright&lr=&as_brr=3& source=gbs_navlinks_s). Cambridge
University Press. p. 207. ISBN 9780521563635. .
[79] " property as a common descriptor of the field probably traces to the foundation of the World Intellectual Property Organization (WIPO) by
the United Nations." in Mark A. Lemley, Property, Intellectual Property, and Free Riding (http:// www.utexas. edu/ law/ journals/ tlr/
abstracts/ 83/ 83Lemley.pdf), Texas Law Review, 2005, Vol. 83:1031, page 1033, footnote 4.
[80] Peter K, Yu (2007). Intellectual Property and Information Wealth: Copyright and related rights (http:/ / www.google. com/
books?id=tgK9BzcF5WgC&dq=statute+ of+anne+ copyright&lr=&as_brr=3& source=gbs_navlinks_s). Greenwood Publishing Group.
pp. 345–346. ISBN 9780275988838. .
[81] Dutfield, Graham; Suthersanen, Uma (2008). Global intellectual property (http:/ / books. google. co. uk/ books?id=-Nc77dN6eDUC&
dq=copyright+designs+ and+patent+ 1988+history& source=gbs_navlinks_s). Edward Elgar Publishing. pp. vi. ISBN 9781847203649. .
[82] Van Horn Melton, James (2001). The rise of the public in Enlightenment Europe (http:// books.google.com/ books?id=QZovusQ1SjYC&
dq="perpetual+copyright"& source=gbs_navlinks_s). Cambridge University Press. p. 140. ISBN 9780521469692. .
[83] Jonathan, Rosenoer (1997). Cyberlaw: the law of the internet (http:// www.google.com/ books?id=HlG2esMIm7kC& dq=statute+ of+
anne+ copyright&lr=&as_brr=3&source=gbs_navlinks_s). Springer. pp. 34–35. ISBN 9780387948324. .
[84] Kenneth L. Port (2005). Licensing Intellectual Property in the Information Age (2nd ed.). Carolina Academic Press. pp. 425–566.
ISBN 0-89089-890-1.
[85] WIPO Guide on the Licensing of Copyright and Related Rights (http:/ / www.google.com/ books?id=LvRRvXBIi8MC& dq=copyright+
transfer+ and+licensing& as_brr=3&source=gbs_navlinks_s). World Intellectual Property Organization. 2004. p. 78. ISBN 9789280512717.
.
[86] "The Relationship Between Copyright Law and Contract Law" (http:/ / www. ipo.gov. uk/ ipresearch-coprightworkshop-201003.pdf).
October 2010. .
[87] Comparison of historical copyright situation in Britain and Germany (http:/ / www.cippm. org.uk/ downloads/ Symposium 2009/ Hoffner -
vortrag_eng-10_min.pdf)
[88] "Der Spiegel 18 August 2010 article: No Copyright Law" (http:// www.spiegel. de/ international/zeitgeist/ 0,1518,710976,00. html).
Spiegel.de. . Retrieved 2011-04-10.
[89] Geschichte und Wesen des Urheberrechts (History and nature of copyright) by Eckhard Höffner, July 2010 (in German) ISBN
3-930893-16-9
Further reading
• Dowd, Raymond J. (2006). Copyright Litigation Handbook (1st ed.). Thomson West. ISBN 0314962794.
• Gantz, John & Rochester, Jack B. (2005). Pirates of the Digital Millennium. Financial Times Prentice Hall.
ISBN 0-13-146315-2.
• Ghosemajumder, Shuman. Advanced Peer-Based Technology Business Models (http:/ / shumans. com/
p2p-business-models.pdf). MIT Sloan School of Management, 2002.
• Lehman, Bruce: Intellectual Property and the National Information Infrastructure (Report of the Working Group
on Intellectual Property Rights, 1995)
• Lindsey, Marc: Copyright Law on Campus. Washington State University Press, 2003. ISBN 978-0-87422-264-7.
• Mazzone, Jason. Copyfraud. SSRN (http:/ / ssrn. com/ abstract=787244)
• Nimmer, Melville; David Nimmer (1997). Nimmer on Copyright. Matthew Bender. ISBN 0-8205-1465-9.
• Patterson, Lyman Ray (1968). Copyright in Historical Perspective. Vanderbilt University Press.
ISBN 0826513735.
• Pievatolo, Maria Chiara. Publicness and Private Intellectual Property in Kant's Political Thought. http:// bfp. sp.
unipi.it/ ~pievatolo/ lm/ kantbraz.html
• Rosen, Ronald (2008). Music and Copyright. Oxford Oxfordshire: Oxford University Press. ISBN 0195338367.
• Shipley, David E. Thin But Not Anorexic: Copyright Protection for Compilations and Other Fact Works (http://
ssrn. com/ abstract=1076789) UGA Legal Studies Research Paper No. 08-001; Journal of Intellectual Property
Law, Vol. 15, No. 1, 2007.
• Silverthorne, Sean. Music Downloads: Pirates, or Customers? (http:/ / hbswk. hbs. edu/ item. jhtml?id=4206&
t=innovation). Harvard Business School Working Knowledge, 2004.
• Sorce Keller, Marcello. "Originality, Authenticity and Copyright", Sonus, VII(2007), no. 2, pp. 77–85.
• Steinberg, S.H. & Trevitt, John (1996). Five Hundred Years of Printing (4th ed.). London and New Castle:
The British Library and Oak Knoll Press. ISBN 1-884718-19-1.
• Story, Alan; Darch, Colin & Halbert, Deborah, eds. (2006). The Copy/South Dossier: Issues in the Economics,
Politics and Ideology of Copyright in the Global South (http:/ / copysouth. org/ en/ documents/ csdossier. pdf).
Copy/South Research Group. ISBN 978-0-9553140-1-8.
External links
• Copyright (http:// www.dmoz.org/Society/ Law/ Legal_Information/Intellectual_Property/Copyrights/ ) at the
Open Directory Project
• Collection of laws for electronic access (http:/ / www.wipo.int/ clea/ en/ ) from WIPO - intellectual property
laws of many countries
• Copyright (http:/ / ucblibraries.colorado. edu/ govpubs/ us/ copyrite. htm) from UCB Libraries GovPubs
• About Copyright (http:/ / www.ipo.gov. uk/ types/ copy.htm) at the UK Intellectual Property Office
• A Bibliography on the Origins of Copyright and Droit d'Auteur (http:// www.lawtech. jus. unitn. it/ index. php/
copyright-history/ bibliography)
• 6.912 Introduction to Copyright Law (http:// ocw. mit. edu/ courses/
electrical-engineering-and-computer-science/6-912-introduction-to-copyright-law-january-iap-2006/) taught by
Keith Winstein, MIT OpenCourseWare January IAP 2006
• IPR Toolkit - An Overview, Key Issues and Toolkit Elements (http:/ / www.jisc. ac. uk/ whatwedo/ themes/
content/ contentalliance/ reports/ipr.aspx) (Sept 2009) by Professor Charles Oppenheim and Naomi Korn at the
Strategic Content Alliance (http:/ / www. jisc. ac. uk/ whatwedo/ themes/ content/ contentalliance. aspx)
• MIT OpenCourseWare 6.912 Introduction to Copyright Law (http:// ocw. mit. edu/ courses/
electrical-engineering-and-computer-science/ 6-912-introduction-to-copyright-law-january-iap-2006/) Free
self-study course with video lectures as offered during the January, 2006, Independent Activities Period (IAP)
Core Data
Core Data
Developer(s): Apple Inc.
Stable release: 3.2.0
Operating system: Mac OS X
Type: System Utility
License: Proprietary
Website: Apple Developer Data Management [1]
"In simplest terms, Core Data is an object graph that can be persisted to Disk. [...] Core Data can do a lot more for us. It serves as the entire model layer for us. It is not just the persistence on disk, but it is also all the objects in memory that we normally consider to be data objects." [2]
—Marcus Zarra, Core Data
Core Data is part of the Cocoa API in Mac OS X, first introduced with Mac OS X 10.4 Tiger and for iOS with iPhone SDK 3.0. [3] It allows data organised by the relational entity-attribute model to be serialised into XML, binary, or SQLite stores. The data can be manipulated using higher level objects representing entities and their relationships. Core Data manages the serialised version, providing object lifecycle and object graph management, including persistence. Core Data interfaces directly with SQLite, insulating the developer from the underlying SQL. [4]
Just as Cocoa Bindings handles many of the duties of the Controller in a Model-View-Controller design, Core Data
handles many of the duties of the data Model. Among other tasks, it handles change management, serializing to disk,
memory footprint minimization, and queries against the data.
Usage
Core Data describes data with a high level data model expressed in terms of entities and their relationships plus fetch
requests that retrieve entities meeting specific criteria. Code can retrieve and manipulate this data on a purely object
level without having to worry about the details of storage and retrieval. The controller objects available in Interface
Builder can retrieve and manipulate these entities directly. When combined with Cocoa bindings the UI can display
many components of the data model without needing background code.
For example: a developer might be writing a program to handle vCards. In order to manage these, the author intends
to read the vCards into objects, and then store them in a single larger XML file. Using Core Data the developer
would drag their schema from the data designer in Xcode into an interface builder window to create a GUI for their
schema. They could then write standard Objective-C code to read vCard files and put the data into Core Data
managed entities. From that point on the author's code manipulates these Core Data objects, rather than the
underlying vCards. Connecting the Save menu item to the appropriate method in the controller object will direct the
controller to examine the object stack, determine which objects are dirty, and then re-write a Core Data document
file with these changes.
Core Data is organized into a large hierarchy of classes, though interaction is only prevalent with a small set of them.
The main classes, their use, and key methods:
• NSManagedObject (a "row" of data; access to attributes): -entity, -valueForKey:, -setValue:forKey:
• NSManagedObjectContext (actions; changes): -executeFetchRequest:error:, -save:
• NSManagedObjectModel (structure; storage): -entities, -fetchRequestTemplateForName:, -setFetchRequestTemplate:forName:
• NSFetchRequest (request data): -setEntity:, -setPredicate:, -setFetchBatchSize:
• NSPersistentStoreCoordinator (mediator; persisting the data): -addPersistentStoreWithType:configuration:URL:options:error:, -persistentStoreForURL:
• NSPredicate (specify query): +predicateWithFormat:, -evaluateWithObject:
[2] [4] [5] [6]
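To see how these classes fit together, here is a minimal sketch using the later Swift Core Data API (NSPersistentContainer arrived well after the Mac OS X releases discussed above); it assumes a compiled data model named "Contacts" containing a "Person" entity with a "name" attribute, all hypothetical names chosen for illustration.

```swift
import CoreData

// Assumes an app bundle containing a model named "Contacts" with a "Person" entity
// (hypothetical names); the container builds the model, store coordinator and context.
let container = NSPersistentContainer(name: "Contacts")
container.loadPersistentStores { _, error in
    if let error = error { fatalError("Store failed to load: \(error)") }
}
let context = container.viewContext                    // an NSManagedObjectContext

// NSManagedObject: one "row" of data, addressed through key-value coding.
let person = NSEntityDescription.insertNewObject(forEntityName: "Person", into: context)
person.setValue("Ada Lovelace", forKey: "name")
try? context.save()                                    // persist the object graph

// NSFetchRequest + NSPredicate: declare what to retrieve, not how to query the store.
let request = NSFetchRequest<NSManagedObject>(entityName: "Person")
request.predicate = NSPredicate(format: "name BEGINSWITH %@", "Ada")
let matches = (try? context.fetch(request)) ?? []
print(matches.compactMap { $0.value(forKey: "name") })
```

On the earlier API described in this article, the developer wired NSManagedObjectModel, NSPersistentStoreCoordinator and NSManagedObjectContext together by hand; NSPersistentContainer simply packages that setup.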
Storage formats
Core Data can serialize objects into XML, binary, or SQLite for storage. [4] With the release of Mac OS X 10.5
Leopard, developers can also create their own custom atomic store types. Each method carries advantages and
disadvantages, such as being human readable (XML) or more memory efficient (SQLite). This portion of Core Data
is similar to the original Enterprise Objects Framework (EOF) system, in that one can write fairly sophisticated
queries. Unlike EOF, it is not possible to write your own SQL.
Core Data schemas are standardized. If you have the Xcode Data Model file, you can read and write files in that
format freely. Unlike EOF, though, Core Data is not currently designed for multiuser or simultaneous access.
Schema migration is also non-trivial, virtually always requiring code. If other developers have access to and depend
upon your data model, you may need to provide version translation code in addition to a new data model if your
schema changes.
History and genesis
Core Data owes much of its design to an early NeXT product, Enterprise Objects Framework (EOF). [7]
EOF was specifically aimed at object-relational mapping for high-end SQL database engines such as Microsoft SQL
Server and Oracle. EOF's purpose was twofold, one to connect to the database engine and hide the implementation
details, and two to read the data out of the simple relational format and translate that into a set of objects. Developers
typically interacted with the objects only, dramatically simplifying development of complex programs for the cost of
some "setup". The EOF object model was deliberately set up to make the resulting programs "document like", in that
the user could edit the data locally in memory, and then write out all changes with a single Save command.
Throughout its history EOF "contained" a number of bits of extremely useful code that were not otherwise available
under NeXTSTEP/OpenStep. For instance, EOF required the ability to track which objects were "dirty" so the
system could later write them out, and this was presented to the developer not only as a document-like system, but
also in the form of an unlimited Undo command stack. Many developers complained that this state management
code was far too useful to be isolated in EOF, and it was moved into the Cocoa API during the transition to Mac OS
X.
Oddly what was not translated was EOF itself. EOF was used primarily along with another OpenStep-era product,
WebObjects, an application server originally based on Objective-C that was in the process of being ported to the
Java programming language. As part of this conversion EOF was also converted to Java, and thus became much
more difficult to use from Cocoa. Enough developers complained about this that Apple apparently decided to do
something about it.
One critical realization is that the object state management system in EOF did not really have anything to do with
relational databases. The same code could be, and was, used by developers to manage graphs of other objects as
well. In this role the really useful parts of EOF were those that automatically built the object sets from the raw data,
and then tracked them. It is this concept, and perhaps code, that forms the basis of Core Data.
Notes
[1] http:/ / developer.apple. com/ technologies/ mac/ data-management.html
[2] Zarra, Core Data.
[3] Apple, "Core Data Tutorial for iPhone OS".
[4] Apple, "Core Data Programming Guide".
[5] Stevenson, "Core Data Class Overview"
[6] Jurewitz, "Working With Core Data"
[7] Apple, "EOModeler User Guide"
References
• Apple Inc. (2009, 17 September). "Core Data Programming Guide". Retrieved from http:// developer.apple.
com/ iphone/ library/documentation/ Cocoa/ Conceptual/ CoreData/ cdProgrammingGuide.html (http://
developer. apple. com/ mac/ library/documentation/ cocoa/ conceptual/ CoreData/ cdProgrammingGuide.html)
• Apple Inc. (2009, 09 September). "Core Data Tutorial for iPhone OS". Retrieved from http:// developer.apple.
com/ iphone/ library/documentation/ DataManagement/ Conceptual/ iPhoneCoreData01/ Introduction/
Introduction.html (http:// developer.apple. com/ iphone/ library/documentation/ DataManagement/ Conceptual/
iPhoneCoreData01/ Introduction/Introduction.html)
• Apple Inc. (2006). "EOModeler User Guide". Retrieved from http:/ / developer.apple. com/ legacy/ mac/ library/
documentation/ WebObjects/ UsingEOModeler/ Introduction/Introduction.html#/ / apple_ref/doc/ uid/
TP30001018-CH201-TP1 (http:/ / developer.apple. com/ legacy/ mac/ library/documentation/ WebObjects/
UsingEOModeler/ Introduction/Introduction.html#/ / apple_ref/doc/ uid/ TP30001018-CH201-TP1)
• Jurewitz, M. & Apple Inc. (2010). "iPhone Development Videos: Working With Core Data". Retrieved from
http:/ / developer.apple. com/ videos/ iphone/ #video-advanced-coredata (http:// developer. apple. com/ videos/
iphone/ #video-advanced-coredata)
• Stevenson, S. (2005). "Core Data Class Overview". Retrieved from http:/ / cocoadevcentral. com/ articles/
000086.php (http:// cocoadevcentral. com/ articles/ 000086.php)
• Zarra, M. S. (2009). Core Data Apple's API for Persisting Data on Mac OS X. The Pragmatic Programmers.
• LaMarche, J., & Mark, D. (2009). More iPhone 3 Development: Tackling iPhone SDK 3. Apress.
External links
• Apple Inc. (2006). "Developing With Core Data". Retrieved from http:// developer. apple. com/ macosx/
coredata. html (http:/ / developer.apple. com/ macosx/ coredata.html)
• Apple Inc. (2009). "Web Objects Tutorial". Retrieved from http:// developer.apple.com/ legacy/ mac/ library/
documentation/ DeveloperTools/Conceptual/ WOTutorial/Introduction/Introduction.html (http:// developer.
apple. com/ legacy/ mac/ library/documentation/ DeveloperTools/Conceptual/ WOTutorial/Introduction/
Introduction.html)
• CocoaDev. (n.d.). Retrieved from http:// www.cocoadev.com/ (http:// www.cocoadev.com/ )
• Stevenson, S. (2005). "Build A Core Data Application". Retrieved from http:/ / cocoadevcentral.com/ articles/
000085. php (http:// cocoadevcentral. com/ articles/ 000085.php)
Core data integration
Core data integration is the use of data integration technology for a significant, centrally planned and managed IT
initiative within a company. Examples of core data integration initiatives could include:
• ETL (Extract, transform, load) implementations
• EAI (Enterprise Application Integration) implementations
• SOA (Service-Oriented Architecture) implementations
• ESB (Enterprise Service Bus) implementations
Core data integrations are often designed to be enterprise-wide integration solutions. They may be designed to
provide a data abstraction layer, which in turn will be used by individual core data integration implementations, such
as ETL servers or applications integrated through EAI.
Because it is difficult to promptly roll out a centrally managed data integration solution that anticipates and meets all data integration requirements across an organization, IT engineers and even business users create edge data integrations, using technology that may be incompatible with that used at the core. In contrast to a core data integration, an edge data integration is not centrally planned and is generally completed with a smaller budget and a tighter deadline.
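As a small illustration of the first kind of initiative listed above (ETL), the sketch below extracts rows from a stand-in staging source, transforms them, and loads the survivors into a stand-in target; the record layout and field names are invented for the example, and a production job would of course read from and write to real stores.

```swift
import Foundation

// Hypothetical target record for a minimal extract-transform-load pass.
struct Customer { let id: Int; let email: String }

// Extract: rows pulled from a source system (here, a stand-in array of dictionaries).
let extracted: [[String: String]] = [
    ["id": "1", "email": " Alice@Example.COM "],
    ["id": "2", "email": "bob@example.com"],
    ["id": "x", "email": "broken row"],                 // fails validation below
]

// Transform: parse, validate and normalise; rows that fail are dropped.
let transformed: [Customer] = extracted.compactMap { row in
    guard let idText = row["id"], let id = Int(idText),
          let rawEmail = row["email"], rawEmail.contains("@") else { return nil }
    return Customer(id: id, email: rawEmail.trimmingCharacters(in: .whitespaces).lowercased())
}

// Load: write into the target keyed by id, so re-running the job stays idempotent.
var warehouse: [Int: Customer] = [:]
transformed.forEach { warehouse[$0.id] = $0 }
```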
References
• http://searchsoa.techtarget.com/tip/0,289483,sid26_gci1171085,00.html
• List of Data Integration/Migration Technologies [1]
References
[1] http://www.datamigrationpro.com/?page=navigator#vendor_list
Customer data management
Customer Data Management (CDM) is a term used to describe the way in which businesses keep track of their
customer information and survey their customer base in order to obtain feedback. CDM embraces a range of
software or cloud computing applications designed to give large organizations rapid and efficient access to customer
data. Surveys and data can be centrally located and widely accessible within a company, as opposed to being
warehoused in separate departments. CDM encompasses the collection, analysis, organizing, reporting and sharing
of customer information throughout an organization. Businesses need a thorough understanding of their customers’
needs if they are to retain and increase their customer base. Efficient CDM solutions provide companies with the
ability to deal instantly with customer issues and obtain immediate feedback. As a result, customer retention and
customer satisfaction can show dramatic improvement. According to a recent study by Aberdeen Group Inc.: "Above-average and best-in-class companies... attain greater than 20% annual improvement in retention rates, revenues, data accuracy and partner/customer satisfaction rates." [1]
Customer Data Management and Cloud Computing
Cloud computing offers an attractive choice for CDM in many companies due to its accessibility and
cost-effectiveness. Businesses can decide who, within their company, should have the ability to create, adjust,
analyze or share customer information. In December 2010, 52% of Information Technology (IT) professionals
worldwide were deploying, or planning to deploy, cloud computing; [2] this percentage is far higher in many countries.
Uses for Management
Customer Data Management
• should provide a cost-effective, user-friendly solution for marketing, research, sales, HR and IT departments
• enables companies to create and email online surveys, reports and newsletters
• encompasses and simplifies Customer Relationship Management (CRM) and Customer Feedback Management
(CFM)
Background
Customer Data Management, as a term, was coined in the 1990s, pre-dating the alternative term Enterprise Feedback
Management (EFM). Customer Data Management (CDM) was introduced as a software solution that would replace
earlier disc-based or paper-based surveys and spreadsheet data. Initially, CDM solutions were marketed to
businesses as software, specific to one company, and often to one department within that company. This was
superseded by application service providers (ASPs) where software was hosted for end user organizations, thus
avoiding the necessity for IT professionals to deploy and support software. However, ASPs with their single-tenancy
architecture were, in turn, superseded by software as a service (SaaS), engineered for multi-tenancy. By 2007 SaaS
applications, giving businesses on-demand access to their customer information, were rapidly gaining popularity
compared with ASPs. Cloud computing now includes SaaS and many prominent CDM providers offer cloud-based
applications to their clients.
In recent years, there has been a push away from the term Enterprise Feedback Management (EFM), with many of
those working in this area advocating the slightly updated use of Customer Data Management (CDM). The return to
the term CDM is largely based on the greater need for clarity around the solutions offered by companies, and on the
desire to retire terminology veering on techno-jargon that customers may have a hard time understanding. [3]
References
[1] Smalltree, Hannah (2006) (http:// searchcrm.techtarget.com/ news/ 1212337/ Best-practices-in-managing-customer-data)
[2] Cisco.com (December 2010) (http:/ / newsroom. cisco. com/ dlls/ 2010/ prod_120810.html)
[3] InSiteSystems.com (December, 2010) (http:// www. insitesystems. com/ systems/ blogs/ the-problem-with-efm.html)
DAMA
DAMA (the Data Management Association) is a not-for-profit, vendor-independent, international association of
technical and business professionals dedicated to advancing the concepts and practices of information resource
management (IRM) and data resource management (DRM).
DAMA's primary purpose is to promote the understanding, development and practice of managing information and
data as a key enterprise asset.
The group is organized as a set of more than 40 chapters and members-at-large around the world, with an
International Conference held every year. DAMA currently has chapters in 16 countries, with the most substantial
presence in the United States.
DAMA has produced "The DAMA Guide to the Data Management Body of Knowledge" (DAMA-DMBOK Guide) under the guidance of a DAMA-DMBOK Editorial Board. The publication has been available since April 5, 2009.
External links
• DAMA International [1]
• Full List of DAMA Chapters [2]
• DAMA Central Virginia Chapter [3]
• DAMA National Capital Region (Washington D.C.) [4]
• DAMA Australia [5]
• San Francisco Bay Area DAMA [6]
References
[1] http:/ / www. dama. org/i4a/ pages/ index. cfm?pageid=1
[2] http:/ / www. dama. org/i4a/ pages/ index. cfm?pageid=3280
[3] http:/ / www. damacv. org/
[4] http:/ / www. dama-ncr.org/
[5] http:/ / www. dama. org.au/ Default. htm
[6] http:/ / www. sfdama. org/
Dashboard (business)
Business dashboards.
A dashboard provides at-a-glance views of key performance indicators (KPIs) relevant to a particular objective or business process (e.g. sales, marketing, human resources, or production). [1] The term dashboard originates from the automobile dashboard, where drivers can monitor the major functions at a glance: how fast the car is going, how much fuel remains, and whether the engine is overheating. Dashboards do not need to provide every piece of information and are typically limited to showing summaries, key trends, comparisons, and exceptions. Well designed business dashboards can provide a unique and powerful means to present and monitor business information.
History
Early predecessors of the modern business dashboard were first developed in the 1980s in the form of Executive Information Systems (EISs). Due to problems primarily with data refreshing and handling, it was soon realized that the approach wasn't practical, as information was often incomplete, unreliable, and spread across too many disparate sources. [2] Thus, EISs hibernated until the 1990s, when the information age quickened pace and data warehousing and online analytical processing (OLAP) allowed dashboards to function adequately. Despite the availability of these enabling technologies, dashboards did not become widely used until later in that decade, primarily due to the rise of key performance indicators (KPIs), introduced by Robert S. Kaplan and David P. Norton as the Balanced Scorecard. [3] Today, the use of dashboards forms an important part of Business Performance Management (BPM).
Dashboard Classification
Dashboards can be broken down according to role and are either strategic, analytical, operational, or informational. [4]
Strategic dashboards support managers at any level in an organization, and provide the quick overview that decision
makers need to monitor the health and opportunities of the business. Dashboards of this type focus on high level
measures of performance, and forecasts. Strategic dashboards benefit from static snapshots of data (daily, weekly,
monthly, and quarterly) that are not constantly changing from one moment to the next. Dashboards for analytical
purposes often include more context, comparisons, and history, along with subtler performance evaluators.
Analytical dashboards typically support interactions with the data, such as drilling down into the underlying details.
Dashboards for monitoring operations are often designed differently from those that support strategic decision
making or data analysis and often require monitoring of activities and events that are constantly changing and might
require attention and response at a moment's notice.
Commercial dashboards
Dashboards have long been part of large-scale applications, and with the rise of on-demand software services that provide affordable, complex applications for small and medium businesses, dashboards are now quite common. Many companies offer products designed to give businesses an accurate snapshot of their operations. While each provider offers its own set of features, there are qualities to which all aspire, such as simplicity.
See Also
• Dashboard (interface)
• Dashboard (software)
• Dashboard (disambiguation)
References
[1] Michael Alexander and John Walkenbach, Excel Dashboards and Reports (Wiley, 2010)
[2] Stephen Few, Information Dashboard Design: The Effective Visual Communication of Data (O'Reilly, 2006)
[3] Wayne W. Eckerson, Performance Dashboards: Measuring, Monitoring, and Managing Your Business (Wiley, 2010)
[4] Stephen Few, Information Dashboard Design: The Effective Visual Communication of Data (O'Reilly, 2006)
External Links
• The Birth of the Enterprise Dashboard (http:// www.toptechnews. com/ story. xhtml?story_id=1120003S4Z00)
• Dashboard Confusion (http:/ / www. intelligententerprise.com/ showArticle. jhtml?articleID=18300136)
• Stop Searching for Information – Monitor it with Dashboard Technology (http:// www.
information-management. com/ infodirect/20020208/ 4681-1. html)
Data
The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data (plural of
"datum") are typically the results of measurements and can be the basis of graphs, images, or observations of a set of
variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are
derived. Raw data, i.e. unprocessed data, refers to a collection of numbers, characters, images or other outputs from
devices that collect information to convert physical quantities into symbols.
Etymology
The word data (pronounced /ˈdeɪtə/ DAY-tə, /ˈdætə/ DA-tə, or /ˈdɑːtə/ DAH-tə) is the Latin plural of datum, neuter past participle of dare, "to give", hence "something given". In discussions of problems in geometry, mathematics, engineering, and so on, the terms givens and data are used interchangeably. Data can also be a representation of a fact, figure, or idea. Such usage is the origin of data as a concept in computer science: data are numbers, words, images, etc., accepted as they stand.
Usage in English
In English, the word datum is still used in the general sense of "an item given". In cartography, geography, nuclear magnetic resonance and technical drawing it is often used to refer to a single specific reference datum from which distances to all other data are measured. Any measurement or result is a datum, but data point is more usual, [1] albeit tautological. Both datums (see usage in datum article) and the originally Latin plural data are used as the plural of datum in English, but data is commonly treated as a mass noun and used with a verb in the singular form, especially in day-to-day usage. For example, This is all the data from the experiment. This usage is inconsistent with the rules of Latin grammar and traditional English (These are all the data from the experiment). Even when a very small quantity of data is referenced (one number, for example) the phrase piece of data is often used, as opposed to datum. The debate over appropriate usage is ongoing.
Many style guides [2] and international organizations, such as the IEEE Computer Society, [3] allow usage of data as either a mass noun or plural based on author preference. Other professional organizations and style guides [4] require that authors treat data as a plural noun. For example, the Air Force Flight Test Center specifically states that the word data is always plural, never singular. [5]
Data is accepted as a singular mass noun in everyday educated usage. [6] [7] Some major newspapers such as The New York Times use it either in the singular or plural. In the New York Times the phrases "the survey data are still being analyzed" and "the first year for which data is available" have appeared on the same day. In scientific writing data is often treated as a plural, as in These data do not support the conclusions, but it is also used as a singular mass entity like information. British usage now widely accepts treating data as singular in standard English, [8] including everyday newspaper usage [9] at least in non-scientific use. [10] UK scientific publishing still prefers treating it as a plural. [11] Some UK university style guides recommend using data for both singular and plural use [12] and some recommend treating it only as a singular in connection with computers. [13]
Meaning of data, information and knowledge
The terms information and knowledge are frequently used for overlapping concepts. The main difference is in the
level of abstraction being considered. Data is the lowest level of abstraction, information is the next level, and
finally, knowledge is the highest level among all three. Data on its own carries no meaning. For data to become
information, it must be interpreted and take on a meaning. For example, the height of Mt. Everest is generally
considered as "data", a book on Mt. Everest geological characteristics may be considered as "information", and a
report containing practical information on the best way to reach Mt. Everest's peak may be considered as
"knowledge".
Information as a concept bears a diversity of meanings, from everyday usage to technical settings. Generally
speaking, the concept of information is closely related to notions of constraint, communication, control, data, form,
instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation.
Beynon-Davies uses the concept of a sign to distinguish between data and information; data are symbols while information occurs when symbols are used to refer to something. [14] [15]
It is people and computers who collect data and impose patterns on it. These patterns are seen as information which can be used to enhance knowledge. These patterns can be interpreted as truth, and are authorized as aesthetic and ethical criteria. Events that leave behind perceivable physical or virtual remains can be traced back through data. Marks are no longer considered data once the link between the mark and observation is broken. [16]
Raw data refers to a collection of numbers, characters, images or other outputs from devices to convert physical
quantities into symbols, that are unprocessed. Such data is typically further processed by a human or input into a
computer, stored and processed there, or transmitted (output) to another human or computer (possibly through a data
cable). Raw data is a relative term; data processing commonly occurs by stages, and the "processed data" from one
stage may be considered the "raw data" of the next.
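A tiny sketch of that staging idea, with made-up sensor readings; each stage's output becomes the "raw data" of the next.

```swift
// Stage 0: symbols as they come off a device (some unusable).
let deviceOutput = ["21.5", "22.1", "n/a", "21.9"]

// Stage 1: parsed and cleaned; processed here, but raw input to the next stage.
let cleaned = deviceOutput.compactMap(Double.init)

// Stage 2: an aggregate suitable for reporting.
let mean = cleaned.reduce(0, +) / Double(cleaned.count)
print(mean)                                            // about 21.83
```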
Mechanical computing devices are classified according to the means by which they represent data. An analog
computer represents a datum as a voltage, distance, position, or other physical quantity. A digital computer
represents a datum as a sequence of symbols drawn from a fixed alphabet. The most common digital computers use a
binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1". More familiar representations,
such as numbers or letters, are then constructed from the binary alphabet.
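For instance, a quick sketch of how the single character "A" ends up as a string of symbols drawn from the binary alphabet:

```swift
// The datum "A": code point 65, written with the two-character alphabet {0, 1}.
let code = UInt8(ascii: "A")                           // 65
var bits = String(code, radix: 2)                      // "1000001"
bits = String(repeating: "0", count: 8 - bits.count) + bits
print(bits)                                            // "01000001", one byte
```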
Some special forms of data are distinguished. A computer program is a collection of data, which can be interpreted
as instructions. Most computer languages make a distinction between programs and the other data on which
programs operate, but in some languages, notably Lisp and similar languages, programs are essentially
indistinguishable from other data. It is also useful to distinguish metadata, that is, a description of other data. A
similar yet earlier term for metadata is "ancillary data." The prototypical example of metadata is the library catalog,
which is a description of the contents of books.
Experimental data refers to data generated within the context of a scientific investigation by observation and
recording. Field data refers to raw data collected in an uncontrolled in situ environment.
References
This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed
under the GFDL.
[1] Matt Dye (2001). "Writing Reports" (http:/ / www.bris. ac.uk/ Depts/ DeafStudiesTeaching/ dissert/ Writing Reports. htm). University of
Bristol. .
[2] UoN Style Book – Singular or plural – Media and Public Relations Office – The University of Nottingham (http:/ /www. nottingham. ac. uk/
public-affairs/uon-style-book/ singular-plural.htm)
[3] "IEEE Computer Society Style Guide, DEF" (http:/ / www. computer.org/portal/site/ ieeecs/ menuitem.
c5efb9b8ade9096b8a9ca0108bcd45f3/ index. jsp?& pName=ieeecs_level1& path=ieeecs/ publications/ author/style& file=def.xml&
xsl=generic.xsl& ). IEEE Computer Society. .
[4] "WHO Style Guide" (http:/ / whqlibdoc. who. int/ hq/ 2004/ WHO_IMD_PUB_04. 1.pdf). Geneva: World Health Organization. 2004. p. 43.
.
[5] The Author's Guide to Writing Air Force Flight Test Center Technical Reports. Air Force Flight Center.
[6] New Oxford Dictionary of English, 1999
[7] "...in educated everyday usage as represented by the Guardian newspaper, it is nowadays most often used as a singular." http:// www. eisu2.
bham. ac.uk/ johnstf/ revis006. htm
[8] New Oxford Dictionary of English. 1999.
[9] Tim Johns (1997). "Data: singular or plural?" (http:// www.eisu2.bham.ac. uk/ johnstf/ revis006.htm). . "...in educated everyday usage as
represented by The Guardian newspaper, it is nowadays most often used as a singular."
[10] "Data" (http:// www.askoxford. com/ concise_oed/ data?view=uk). Compact Oxford Dictionnary. .
[11] "Data: singular or plural?" (http:// www.eisu2. bham. ac.uk/ johnstf/ revis006.htm). Blair Wisconsin International University. .
[12] "Singular or plural" (http:// www.nottingham. ac. uk/ public-affairs/uon-style-book/singular-plural.htm). University of Nottingham Style
Book. University of Nottingham. .
[13] "Computers and computer systems" (http:// openlearn. open.ac. uk/ mod/ resource/view.php?id=182902). OpenLearn. .
[14] P. Beynon-Davies (2002). Information Systems: An introduction to informatics in organisations. Basingstoke, UK: Palgrave Macmillan.
ISBN 0-333-96390-3.
[15] P. Beynon-Davies (2009). Business information systems. Basingstoke, UK: Palgrave. ISBN 978-0-230-20368-6.
[16] Sharon Daniel. The Database: An Aesthetics of Dignity.
External links
• data is a singular noun (http:/ / purl.org/ nxg/ note/ singular-data) (a detailed assessment)
Data access
Data access typically refers to software and activities related to storing, retrieving, or acting on data housed in a
database or other repository. There are two types of data access, sequential access and random access.
Historically, different methods and languages were required for every repository, including each different database,
file system, etc., and many of these repositories stored their content in different and incompatible formats.
More recently, standardized languages, methods, and formats have been created to serve as interfaces between the often proprietary, and always idiosyncratic, repository-specific languages and methods. Such standards include SQL, ODBC, JDBC, ADO.NET, XML, XQuery, XPath, and Web Services.
Some of these standards enable translation of data from unstructured (such as HTML or free-text files) to structured
(such as XML or SQL).
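As an illustration of such standards-based access, the sketch below issues ordinary SQL against an in-memory SQLite database through Apple's SQLite3 module; the table and column names are invented for the example, and the same SQL text would work against any repository that understands SQL.

```swift
import SQLite3

// Open an in-memory SQLite database (no file on disk).
var db: OpaquePointer?
sqlite3_open(":memory:", &db)

// Data definition and a couple of rows, expressed in standard SQL.
sqlite3_exec(db, "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)", nil, nil, nil)
sqlite3_exec(db, "INSERT INTO person (name) VALUES ('Ada'), ('Grace')", nil, nil, nil)

// Sequential access over the result set of a declarative query.
var stmt: OpaquePointer?
if sqlite3_prepare_v2(db, "SELECT id, name FROM person ORDER BY id", -1, &stmt, nil) == SQLITE_OK {
    while sqlite3_step(stmt) == SQLITE_ROW {
        let id = sqlite3_column_int(stmt, 0)
        let name = String(cString: sqlite3_column_text(stmt, 1))
        print(id, name)
    }
}
sqlite3_finalize(stmt)
sqlite3_close(db)
```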
Data aggregator
A data aggregator is an organization, such as Acxiom or ChoicePoint, involved in compiling information from detailed databases on individuals and selling that information to others. [1]
Description
The source information for data aggregators may originate from public records and criminal databases; the
information is packaged into aggregate reports and then sold to businesses, as well as to local, state, and federal
government agencies. This information can also be useful for marketing purposes. Many data brokers' activities fall
under the Fair Credit Reporting Act (FCRA) which regulates consumer reporting agencies. The agencies then gather
and package personal information into consumer reports that are sold to creditors, employers, insurers, and other
businesses.
Various reports of information are provided by database aggregators. Individuals may request their own consumer
reports which contain basic biographical information such as name, date of birth, current address, and phone number.
Employee background check reports, which contain highly detailed information such as past addresses and length of
residence, professional licenses, and criminal history, may be requested by eligible and qualified third parties. Not
only can this data be used in employee background checks, but it may also be used to make decisions about
insurance coverage, pricing, and law enforcement. Privacy activists argue that database aggregators can provide erroneous information. [2]
Role of the Internet
The potential of the Internet to consolidate and manipulate information has a new application in data aggregation,
also known as screen scraping. The Internet gives users the opportunity to consolidate their usernames and
passwords, or PINs. Such consolidation enables consumers to access a wide variety of PIN-protected websites
containing personal information by using one master PIN on a single website. Online account providers include
financial institutions, stockbrokers, airline and frequent flyer and other reward programs, and e-mail accounts. Data
aggregators can gather account or other information from designated websites by using account holders' PINs, and
then making the users' account information available to them at a single website operated by the aggregator at an
account holder's request. Aggregation services may be offered on a standalone basis or in conjunction with other
financial services, such as portfolio tracking and bill payment provided by a specialized website, or as an additional
service to augment the online presence of an enterprise established beyond the virtual world. Many established
companies with an Internet presence appear to recognize the value of offering an aggregation service to enhance
other web-based services and attract visitors. Offering a data aggregation service to a website may be attractive
because of the potential that it will frequently draw users of the service to the hosting website.
Legal implications
Financial institutions are concerned about the possibility of liability arising from data aggregation activities, potential
security problems, infringement on intellectual property rights and the possibility of diminishing traffic to the
institution's website. The aggregator and financial institution may agree on a data feed arrangement activated on the
customer's request, using an Open Financial Exchange (OFX) standard to request and deliver information to the site
selected by the customer as the place from which they will view their account data. Agreements provide an
opportunity for institutions to negotiate to protect their customers' interests and offer aggregators the opportunity to
provide a robust service. Aggregators who agree with information providers to extract data without using an OFX
standard may reach a lower level of consensual relationship; therefore, "screen scraping" may be used to obtain
account data, but for business or other reasons, the aggregator may decide to obtain prior consent and negotiate the
terms on which customer data is made available. "Screen scraping" without consent by the content provider has the
advantage of allowing subscribers to view almost any and all accounts they happen to have opened anywhere on the
Internet through one website.
Outlook
Over time, the transfer of large amounts of account data from the account provider to the aggregator's server could
develop into a comprehensive profile of a user, detailing their banking and credit card transactions, balances,
securities transactions and portfolios, and travel history and preferences. As the sensitivity to data protection
considerations grows, it is likely there will be a considerable focus on the extent to which data aggregators may seek
to use this data either for their own purposes or to share it on some basis with the operator of a website on which the
service is offered or with other third parties. [3]
References
[1] Stanley, Jay and Steinhardt, Barry (January, 2003). Bigger Monster, Weaker Chains: The Growth of an American Surveillance Society.
American Civil Liberties Union.
[2] Pierce, Deborah and Ackerman, Linda (2005-05-19). "Data Aggregators: A Study of Data Quality and Responsiveness" (http:// web.archive.
org/ web/ 20070319220412/http:/ / www.privacyactivism. org/ docs/ DataAggregatorsStudy.html). Privacyactivism.org. Archived from the
original (http:// www. privacyactivism. org/docs/ DataAggregatorsStudy.html) on 2007-03-19. . Retrieved 2007-04-02.
[3] Ledig, Robert H. and Vartanian, Thomas P. (2002-09-11). "Scrape It, Scrub It and Show It: The Battle Over Data Aggregation" (http:/ / www.
ffhsj.com/ bancmail/ bmarts/ aba_art. htm). Fried Frank. . Retrieved 2007-04-02.
Data architect
A data architect is a person responsible for ensuring that the data assets of an organization are supported by an architecture that helps the organization achieve its strategic goals. The architecture should cover databases, data integration and the means to get to the data. Usually the data architect achieves his or her goals by setting enterprise data standards. A data architect is sometimes also referred to as a data modeler, although the role involves much more than just creating data models.
The definition of an IT architecture used in ANSI/IEEE Std 1471-2000 is: "the fundamental organization of a system, embodied in its components, their relationships to each other and the environment, and the principles governing its design and evolution." The data architect primarily focuses on the aspects of this definition related to data.
In TOGAF (the Open Group Architecture Framework) [1], architecture has two meanings depending upon its
contextual usage:
• A formal description of a system, or a detailed plan of the system at component level to guide its implementation
• The structure of components, their inter-relationships, and the principles and guidelines governing their design
and evolution over time.
According to DAMA (the Data Management Association) [2], the title of data architect is often used interchangeably with other data management roles, but the role also includes enterprise architecture considerations. A DAMA-recognized Certified Data Management Professional would have a wide range of such skills.
Translating this to data architecture helps define the role of the data architect as the person responsible for developing and maintaining a formal description of the data and data structures; this can include data definitions, data models, data flow diagrams, etc. (in short, metadata). Data architecture includes topics such as metadata management, business semantics, data modeling and metadata workflow management.
A data architect's job frequently includes setting up a metadata registry that allows domain-specific stakeholders to maintain their own data elements.
Some fundamental skills of a Data Architect are:
• Logical Data modeling
• Physical Data modeling
• Development of a data strategy and associated policies
• Selection of capabilities and systems to meet business information needs
A data strategy enumerates data policies, each of which commits the organization to codifying a best practice. A policy may address any one area, such as data standards, data security or information assurance, and data retention or data stewardship.
Data architects usually have experience in one or more of the following technologies:
• Data dictionaries
• Data warehousing
• Enterprise application integration
• Metadata registry
• Relational Databases
• Semantics
• Data retention
• Structured Query Language (SQL)
• Procedural SQL
• XML, including schema definitions and transformations.
References
[1] http:/ / www. togaf.org
[2] http:/ / www. dama. org
Data architecture
Data Architecture in enterprise architecture is the design of data for use in defining the target state and the
subsequent planning needed to achieve the target state. It is usually one of several architecture domains that form the
pillars of an enterprise architecture or solution architecture. [1]
Overview
A data architecture describes the data structures used by a business and/or its applications. There are descriptions of
data in storage and data in motion; descriptions of data stores, data groups and data items; and mappings of those
data artifacts to data qualities, applications, locations etc.
Essential to realizing the target state, Data Architecture describes how data is processed, stored, and utilized in a
given system. It provides criteria for data processing operations that make it possible to design data flows and also
control the flow of data in the system.
The Data Architect is responsible for defining the target state, alignment during development and then minor follow
up to ensure enhancements are done in the spirit of the original blueprint.
During the definition of the target state, the data architecture breaks a subject down to the atomic level and then builds it back up to the desired form. The data architect breaks the subject down by going through three traditional architectural processes (a brief sketch after this list illustrates the three levels):
• Conceptual - represents all business entities.
• Logical - represents the logic of how entities are related.
• Physical - the realization of the data mechanisms for a specific type of functionality.
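A toy sketch of one subject ("customers place orders") passing through those three levels; every entity, attribute and table name here is invented purely for illustration.

```swift
import Foundation

// Conceptual: the business entities that matter are Customer and Order,
// and the fact that a customer places orders.

// Logical: entities with attributes and an explicit relationship (customerID links them).
struct Customer { let customerID: Int; let name: String }
struct Order    { let orderID: Int; let customerID: Int; let total: Decimal }

// Physical: one possible realisation of the logical model for a specific SQL store.
let physicalDDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total       NUMERIC(10,2) NOT NULL
);
"""
```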
The "data" column of the Zachman Framework for enterprise architecture –
Layer View Data (What) Stakeholder
1 Scope/Contextual List of things important to the business (subject areas) Planner
2 Business Model/Conceptual
Semantic model or Conceptual/Enterprise Data Model
[2]
Owner
3 System Model/Logical Enterprise/Logical Data Model Designer
4 Technology Model/Physical Physical Data Model Builder
5 Detailed Representations/ out-of-context Data Definition Subcontractor
In this second, broader sense, data architecture includes a complete analysis of the relationships between an
organization's functions, available technologies, and data types.
Data architecture should be defined in the planning phase of the design of a new data processing and storage
system. The major types and sources of data necessary to support an enterprise should be identified in a manner that
is complete, consistent, and understandable. The primary requirement at this stage is to define all of the relevant data
entities, not to specify computer hardware items. A data entity is any real or abstracted thing about which an
organization or individual wishes to store data.
Data Architecture Topics
Physical data architecture
Physical data architecture of an information system is part of a technology plan. As its name implies, the technology
plan is focused on the actual tangible elements to be used in the implementation of the data architecture design.
Physical data architecture encompasses database architecture. Database architecture is a schema of the actual
database technology that will support the designed data architecture.
Elements of data architecture
There are certain elements that must be defined as the data architecture schema of an organization is designed. For
example, the administrative structure that will be established in order to manage the data resources must be
described. Also, the methodologies that will be employed to store the data must be defined. In addition, a description
of the database technology to be employed must be generated, as well as a description of the processes that will
manipulate the data. It is also important to design interfaces to the data by other systems, as well as a design for the
infrastructure that will support common data operations (i.e. emergency procedures, data imports, data backups,
external transfers of data).
Without the guidance of a properly implemented data architecture design, common data operations might be
implemented in different ways, rendering it difficult to understand and control the flow of data within such systems.
This sort of fragmentation is highly undesirable due to the potential increased cost, and the data disconnects
involved. These sorts of difficulties may be encountered with rapidly growing enterprises and also enterprises that
service different lines of business (e.g. insurance products).
Properly executed, the data architecture phase of information system planning forces an organization to specify and
delineate both internal and external information flows. These are patterns that the organization may not have
previously taken the time to conceptualize. It is therefore possible at this stage to identify costly information
shortfalls, disconnects between departments, and disconnects between organizational systems that may not have been
evident before the data architecture analysis. [3]
Constraints and influences
Various constraints and influences will have an effect on data architecture design. These include enterprise
requirements, technology drivers, economics, business policies and data processing needs.
Enterprise requirements
These will generally include such elements as economical and effective system expansion, acceptable
performance levels (especially system access speed), transaction reliability, and transparent management of
data. In addition, the conversion of raw data such as transaction records and image files into more useful
information forms through such features as data warehouses is also a common organizational requirement,
since this enables managerial decision making and other organizational processes. One of the architecture
techniques is the split between managing transaction data and (master) reference data. Another one is splitting
data capture systems from data retrieval systems (as done in a data warehouse).
Technology drivers
These are usually suggested by the completed data architecture and database architecture designs. In addition,
some technology drivers will derive from existing organizational integration frameworks and standards,
organizational economics, and existing site resources (e.g. previously purchased software licensing).
Economics
These are also important factors that must be considered during the data architecture phase. It is possible that
some solutions, while optimal in principle, may not be potential candidates due to their cost. External factors
such as the business cycle, interest rates, market conditions, and legal considerations could all have an effect
on decisions relevant to data architecture.
Business policies
Business policies that also drive data architecture design include internal organizational policies, rules of
regulatory bodies, professional standards, and applicable governmental laws that can vary by applicable
agency. These policies and rules help describe the manner in which the enterprise wishes to process its data.
Data processing needs
These include accurate and reproducible transactions performed in high volumes, data warehousing for the
support of management information systems (and potential data mining), repetitive periodic reporting, ad hoc
reporting, and support of various organizational initiatives as required (e.g. annual budgets, new product development).
References
[1] What is data architecture (http://www.learn.geekinterview.com/data-warehouse/data-architecture/what-is-data-architecture.html), GeekInterview, 2008-01-28, accessed 2011-04-28
[2] http:// www. tdan. com/ i005fe12.htm
[3] Mittal, Prashant (2009). Author (http:// books. google. com/ books?id=BpkhYDj4tm0C& dq=inauthor:"PRASHANT+MITTAL"&
source=gbs_navlinks_s). pg 256: Global India Publications. pp. 314. ISBN 9789380228204. .
Further reading
• Bass, L.; John, B.; & Kates, J. (2001). Achieving Usability Through Software Architecture, Carnegie Mellon
University.
• Lewis, G.; Comella-Dorda, S.; Place, P.; Plakosh, D.; & Seacord, R., (2001). Enterprise Information System Data
Architecture Guide Carnegie Mellon University.
• Adleman, S.; Moss, L.; Abai, M. (2005). Data Strategy Addison-Wesley Professional.
External links
• Achieving Usability Through Software Architecture (http:// www.sei.cmu. edu/ library/abstracts/ reports/
01tr005. cfm), sei.cmu.edu 2001
• The Logical Data Architecture (http:// sunsite. uakom. sk/ sunworldonline/ swol-07-1998/swol-07-itarchitect.
html), by Nirmal Baid
Data bank
In telecommunications, a data bank is a repository of information on one or more subjects that is organized in a way
that facilitates local or remote information retrieval. A data bank may be either centralized or decentralized. In this
sense, data bank is synonymous with database.
Data bank may also refer to an organization primarily concerned with the construction and maintenance of a
database.
Sources
•  This article incorporates public domain material from websites or documents of the General Services
Administration (in support of MIL-STD-188).
• The American Heritage Dictionary of the English Language, Fourth Edition. Houghton Mifflin, 2000.
Data binding
Data binding is a general technique that binds two data/information sources together and maintains synchronization
of data. This is usually done with two data/information sources with different types as in XML data binding.
However, in UI data binding, data and information objects of the same type are bound together (e.g. Java UI
elements to Java objects).
If the binding has been made in the proper manner, each data change is reflected automatically by the elements that are bound to the data. The term data binding is also used in cases where an outer representation of data in an element changes and the underlying data is automatically updated to reflect this change. As an example, a change in a TextBox element could modify the underlying data value. [1]
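A minimal sketch of the idea (the Observable type and the names below are invented for illustration and are not a particular framework's API): an observable value pushes every change to whatever has been bound to it, so data and presentation stay synchronized.

```swift
// A tiny observable box: everything bound to it is re-run whenever the value changes.
final class Observable<Value> {
    private var observers: [(Value) -> Void] = []
    var value: Value {
        didSet { observers.forEach { $0(value) } }     // propagate each change automatically
    }
    init(_ value: Value) { self.value = value }
    func bind(_ observer: @escaping (Value) -> Void) {
        observers.append(observer)
        observer(value)                                // reflect the current value immediately
    }
}

// "UI data binding": the print closure stands in for a bound interface element.
let textBoxText = Observable("initial text")
textBoxText.bind { print("label shows:", $0) }
textBoxText.value = "edited by the user"               // the bound element updates by itself
```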
References
[1] What is data binding? (http:/ / msdn. microsoft.com/ en-us/ library/ms752347.aspx#what_is_data_binding)
Data center
An operation engineer overseeing a Network
Operations Control Room of a data center.
A data center (or data centre or datacentre or datacenter) is a
facility used to house computer systems and associated components,
such as telecommunications and storage systems. It generally includes
redundant or backup power supplies, redundant data communications
connections, environmental controls (e.g., air conditioning, fire
suppression) and security devices.
History
Data centers have their roots in the huge computer rooms of the early
ages of the computing industry. Early computer systems were complex
to operate and maintain, and required a special environment in which to operate. Many cables were necessary to
connect all the components, and methods to accommodate and organize these were devised, such as standard racks to
mount equipment, elevated floors, and cable trays (installed overhead or under the elevated floor). Also, old
computers required a great deal of power, and had to be cooled to avoid overheating. Security was important –
computers were expensive, and were often used for military purposes. Basic design guidelines for controlling access
to the computer room were therefore devised.
During the boom of the microcomputer industry, and especially during the 1980s, computers started to be deployed
everywhere, in many cases with little or no care about operating requirements. However, as information technology
(IT) operations started to grow in complexity, companies grew aware of the need to control IT resources. With the
advent of client-server computing, during the 1990s, microcomputers (now called "servers") started to find their
places in the old computer rooms. The availability of inexpensive networking equipment, coupled with new
standards for network cabling, made it possible to use a hierarchical design that put the servers in a specific room
inside the company. The use of the term "data center," as applied to specially designed computer rooms, started to
gain popular recognition about this time.
The boom of data centers came during the dot-com bubble. Companies needed fast Internet connectivity and nonstop
operation to deploy systems and establish a presence on the Internet. Installing such equipment was not viable for
many smaller companies. Many companies started building very large facilities, called Internet data centers (IDCs),
which provide businesses with a range of solutions for systems deployment and operation. New technologies and
practices were designed to handle the scale and the operational requirements of such large-scale operations. These
practices eventually migrated toward the private data centers, and were adopted largely because of their practical
results.
As of 2007, data center design, construction, and operation is a well-known discipline. Standard Documents from
accredited professional groups, such as the Telecommunications Industry Association, specify the requirements for
data center design. Well-known operational metrics for data center availability can be used to evaluate the business
impact of a disruption. There is still a lot of development being done in operation practice, and also in
environmentally-friendly data center design. Data centers are typically very expensive to build and maintain. For
instance, Amazon.com's new 116,000 sq ft (10,800 m²) data center in Oregon is expected to cost up to $100 million. [1]
Requirements for modern data centers
Racks of telecommunications equipment in part
of a data center.
IT operations are a crucial aspect of most organizational operations.
One of the main concerns is business continuity; companies rely on
their information systems to run their operations. If a system becomes
unavailable, company operations may be impaired or stopped
completely. It is necessary to provide a reliable infrastructure for IT
operations, in order to minimize any chance of disruption. Information
security is also a concern, and for this reason a data center has to offer
a secure environment which minimizes the chances of a security
breach. A data center must therefore keep high standards for assuring
the integrity and functionality of its hosted computer environment.
This is accomplished through redundancy of both fiber optic cables
and power, which includes emergency backup power generation.
Telcordia GR-3160, NEBS Requirements for Telecommunications Data Center Equipment and Spaces,[2] provides
guidelines for data center spaces within telecommunications networks, and environmental requirements for the
equipment intended for installation in those spaces. These criteria were developed jointly by Telcordia and industry
representatives. They may be applied to data center spaces housing data processing or Information Technology (IT)
equipment. The equipment may be used to:
• Operate and manage a carrier’s telecommunication network
• Provide data center based applications directly to the carrier’s customers
• Provide hosted applications for a third party to provide services to their customers
• Provide a combination of these and similar data center applications.
Effective data center operation requires a balanced investment in both the facility and the housed equipment. The
first step is to establish a baseline facility environment suitable for equipment installation. Standardization and
modularity can yield savings and efficiencies in the design and construction of telecommunications data centers.
Standardization means integrated building and equipment engineering. Modularity has the benefits of scalability and
easier growth, even when planning forecasts are less than optimal. For these reasons, telecommunications data
centers should be planned in repetitive building blocks of equipment, and associated power and support
(conditioning) equipment when practical. The use of dedicated centralized systems requires more accurate forecasts
of future needs to prevent expensive over construction, or perhaps worse — under construction that fails to meet
future needs.
The "lights-out" data center, also known as a darkened or a dark data center, is a data center that, ideally, has all but
eliminated the need for direct access by personnel, except under extraordinary circumstances. Because of the lack of
need for staff to enter the data center, it can be operated without lighting. All of the devices are accessed and
managed by remote systems, with automation programs used to perform unattended operations. In addition to the
energy savings, reduction in staffing costs and the ability to locate the site further from population centers,
implementing a lights-out data center reduces the threat of malicious attacks upon the infrastructure.[3][4]
Data center classification
The TIA-942: Data Center Standards Overview[5] describes the requirements for the data center infrastructure. The
simplest is a Tier 1 data center, which is basically a server room, following basic guidelines for the installation of
computer systems. The most stringent level is a Tier 4 data center, which is designed to host mission critical
computer systems, with fully redundant subsystems and compartmentalized security zones controlled by biometric
access controls methods. Another consideration is the placement of the data center in a subterranean context, for data
security as well as environmental considerations such as cooling requirements.[6]
The Uptime Institute, a think tank and professional-services organization based in Santa Fe, New Mexico, defines and holds the copyright on the four levels. The levels describe the availability of data from the hardware at a location. The higher the tier, the greater the accessibility. The levels are:[7][8][9]
Tier 1
• Single non-redundant distribution path serving the IT equipment
• Non-redundant capacity components
• Basic site infrastructure guaranteeing 99.671% availability
Tier 2
• Fulfills all Tier 1 requirements
• Redundant site infrastructure capacity components guaranteeing 99.741% availability
Tier 3
• Fulfills all Tier 1 and Tier 2 requirements
• Multiple independent distribution paths serving the IT equipment
• All IT equipment must be dual-powered and fully compatible with the topology of the site's architecture
• Concurrently maintainable site infrastructure guaranteeing 99.982% availability
Tier 4
• Fulfills all Tier 1, Tier 2 and Tier 3 requirements
• All cooling equipment is independently dual-powered, including chillers and heating, ventilating and air-conditioning (HVAC) systems
• Fault-tolerant site infrastructure with electrical power storage and distribution facilities guaranteeing 99.995% availability
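The availability figures above translate directly into a maximum amount of downtime per year. As a rough illustration (not part of the TIA or Uptime Institute standards), the following Python sketch converts the tier availability percentages listed above into annual downtime:

# Convert tier availability percentages into the implied downtime per year.
# The percentages are taken from the tier list above; everything else is illustrative.
TIER_AVAILABILITY = {1: 99.671, 2: 99.741, 3: 99.982, 4: 99.995}  # percent
MINUTES_PER_YEAR = 365.25 * 24 * 60

def annual_downtime_minutes(availability_percent):
    """Return the maximum minutes of downtime per year for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100.0)

for tier, availability in TIER_AVAILABILITY.items():
    hours = annual_downtime_minutes(availability) / 60
    print("Tier %d: %.3f%% availability, about %.1f hours of downtime per year"
          % (tier, availability, hours))

For example, Tier 1 availability of 99.671% allows roughly 29 hours of downtime per year, while Tier 4 availability of 99.995% allows well under one hour.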
Design considerations
A typical server rack, commonly seen in
colocation.
A data center can occupy one room of a building, one or more floors,
or an entire building. Most of the equipment is often in the form of
servers mounted in 19 inch rack cabinets, which are usually placed in
single rows forming corridors (so-called aisles) between them. This
allows people access to the front and rear of each cabinet. Servers
differ greatly in size from 1U servers to large freestanding storage silos
which occupy many tiles on the floor. Some equipment such as
mainframe computers and storage devices are often as big as the racks
themselves, and are placed alongside them. Very large data centers
may use shipping containers packed with 1,000 or more servers
each;[10] when repairs or upgrades are needed, whole containers are replaced (rather than repairing individual servers).[11]
Local building codes may govern the minimum ceiling heights.
A bank of batteries in a large data center, used to
provide power until diesel generators can start.
Environmental control
The physical environment of a data center is rigorously controlled. Air
conditioning is used to control the temperature and humidity in the data
center. ASHRAE's "Thermal Guidelines for Data Processing
Environments"
[12]
recommends a temperature range of 16–24 °C
(61–75 °F) and humidity range of 40–55% with a maximum dew point
of 15°C as optimal for data center conditions.
[13]
The temperature in a
data center will naturally rise because the electrical power used heats
the air. Unless the heat is removed, the ambient temperature will rise,
resulting in electronic equipment malfunction. By controlling the air
temperature, the server components at the board level are kept within
the manufacturer's specified temperature/humidity range. Air
conditioning systems help control humidity by cooling the return space
air below the dew point. If humidity is too high, water may begin to condense on internal components. If the atmosphere is too dry, static electricity discharge can damage components, so ancillary humidification systems may be used to add water vapor. Subterranean data centers may keep computer equipment cool while expending less energy than conventional designs.
Modern data centers try to use economizer cooling, where they use outside air to keep the data center cool.[14] Washington State now has a few data centers that cool all of the servers using outside air 11 months out of the year. They do not use chillers/air conditioners, which creates potential energy savings in the millions.[15]
Telcordia GR-2930, NEBS: Raised Floor Generic Requirements for Network and Data Centers,[16] presents generic engineering requirements for raised floors that fall within the strict NEBS guidelines.
There are many types of commercially available floors that offer a wide range of structural strength and loading
capabilities, depending on component construction and the materials used. The general types of raised floors include
stringerless, stringered, and structural platforms, all of which are discussed in detail in GR-2930 and summarized
below.
• Stringerless Raised Floors - One non-earthquake type of raised floor generally consists of an array of pedestals
that provide the necessary height for routing cables and also serve to support each corner of the floor panels. With
this type of floor, there may or may not be provisioning to mechanically fasten the floor panels to the pedestals.
This stringerless type of system (having no mechanical attachments between the pedestal heads) provides
maximum accessibility to the space under the floor. However, stringerless floors are significantly weaker than
stringered raised floors in supporting lateral loads and are not recommended.
• Stringered Raised Floors - This type of raised floor generally consists of a vertical array of steel pedestal
assemblies (each assembly is made up of a steel base plate, tubular upright, and a head) uniformly spaced on
two-foot centers and mechanically fastened to the concrete floor. The steel pedestal head has a stud that is
inserted into the pedestal upright and the overall height is adjustable with a leveling nut on the welded stud of the
pedestal head.
• Structural Platforms - One type of structural platform consists of members constructed of steel angles or
channels that are welded or bolted together to form an integrated platform for supporting equipment. This design
permits equipment to be fastened directly to the platform without the need for toggle bars or supplemental
bracing. Structural platforms may or may not contain panels or stringers.
Electrical power
Backup power consists of one or more uninterruptible power supplies and/or diesel generators.[17]
To prevent single points of failure, all elements of the electrical systems, including backup systems, are typically
fully duplicated, and critical servers are connected to both the "A-side" and "B-side" power feeds. This arrangement
is often made to achieve N+1 redundancy in the systems. Static switches are sometimes used to ensure instantaneous
switchover from one supply to the other in the event of a power failure.
Data centers typically have raised flooring made up of 60 cm (2 ft) removable square tiles. The trend is towards
80–100 cm (31–39 in) void to cater for better and uniform air distribution. These provide a plenum for air to
circulate below the floor, as part of the air conditioning system, as well as providing space for power cabling.
Low-voltage cable routing
Data cabling is typically routed through overhead cable trays in modern data centers, but some operators still recommend under-floor cabling for security reasons, and to allow for the later addition of cooling systems above the racks if that enhancement becomes necessary. Smaller or less expensive data centers without raised flooring may use
anti-static tiles for a flooring surface. Computer cabinets are often organized into a hot aisle arrangement to
maximize airflow efficiency.
Fire protection
Data centers feature fire protection systems, including passive and active design elements, as well as implementation
of fire prevention programs in operations. Smoke detectors are usually installed to provide early warning of a
developing fire by detecting particles generated by smoldering components prior to the development of flame. This
allows investigation, interruption of power, and manual fire suppression using hand held fire extinguishers before the
fire grows to a large size. A fire sprinkler system is often provided to control a full scale fire if it develops. Fire
sprinklers require 18 in (46 cm) of clearance (free of cable trays, etc.) below the sprinklers. Clean agent fire
suppression gaseous systems are sometimes installed to suppress a fire earlier than the fire sprinkler system. Passive
fire protection elements include the installation of fire walls around the data center, so a fire can be restricted to a
portion of the facility for a limited time in the event of the failure of the active fire protection systems, or if they are
not installed. For critical facilities these firewalls are often insufficient to protect heat-sensitive electronic equipment,
however, because conventional firewall construction is only rated for flame penetration time, not heat penetration.
There are also deficiencies in the protection of vulnerable entry points into the server room, such as cable
penetrations, coolant line penetrations and air ducts. For mission critical data centers, fireproof vaults with a Class 125 rating are necessary to meet NFPA 75[18] standards.
Security
Physical security also plays a large role with data centers. Physical access to the site is usually restricted to selected
personnel, with controls including bollards and mantraps.[19] Video camera surveillance and permanent security guards are almost always present if the data center is large or contains sensitive information on any of the systems within. The use of fingerprint-recognition mantraps is becoming commonplace.
Energy use
Energy use is a central issue for data centers. Power draw for data centers ranges from a few kW for a rack of servers
in a closet to several tens of MW for large facilities. Some facilities have power densities more than 100 times that of
a typical office building.[20] For higher power density facilities, electricity costs are a dominant operating expense and account for over 10% of the total cost of ownership (TCO) of a data center.[21] By 2012 the cost of power for the data center is expected to exceed the cost of the original capital investment.[22]
Greenhouse gas emissions
In 2007 the entire information and communication technologies (ICT) sector was estimated to be responsible for roughly 2% of global carbon emissions, with data centers accounting for 14% of the ICT footprint.[23] The US EPA estimates that servers and data centers were responsible for up to 1.5% of total US electricity consumption,[24] or roughly 0.5% of US GHG emissions,[25] for 2007. Given a business-as-usual scenario, greenhouse gas emissions from data centers are projected to more than double from 2007 levels by 2020.[23]
Siting is one of the factors that affect the energy consumption and environmental effects of a data center. In areas where the climate favors cooling and lots of renewable electricity is available, the environmental effects will be more moderate. Thus countries with favorable conditions, such as Finland,[26] Sweden[27] and Switzerland,[28] are trying to attract cloud computing data centers.
According to an 18-month investigation by scholars at Rice University's Baker Institute for Public Policy in Houston and the Institute for Sustainable and Applied Infodynamics in Singapore, data center-related emissions will more than triple by 2020.[29]
Energy efficiency
The most commonly used metric to determine the energy efficiency of a data center is power usage effectiveness, or
PUE. This simple ratio is the total power entering the data center divided by the power used by the IT equipment.
Power used by support equipment, often referred to as overhead load, mainly consists of cooling systems, power
delivery, and other facility infrastructure like lighting. The average data center in the US has a PUE of 2.0,[24] meaning that the facility uses one watt of overhead power for every watt delivered to IT equipment. State-of-the-art data center energy efficiency is estimated to be roughly 1.2.[30] Some large data center operators like Microsoft and Yahoo! have published projections of PUE for facilities in development; Google publishes quarterly actual efficiency performance from data centers in operation.[31]
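PUE can be computed directly from two power measurements. The sketch below uses hypothetical readings in kilowatts and is only meant to illustrate the ratio described above:

# Power usage effectiveness (PUE) = total facility power / IT equipment power.
def pue(total_facility_kw, it_equipment_kw):
    """Return the PUE ratio for a facility."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw

# Hypothetical example: 1,200 kW entering the facility, 600 kW reaching the IT load.
print(pue(1200, 600))   # 2.0, i.e. one watt of overhead for every watt of IT power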
The U.S. Environmental Protection Agency has an Energy Star rating for standalone or large data centers. To qualify
for the ecolabel, a data center must be within the top quartile of energy efficiency of all reported facilities.[32]
Network infrastructure
An example of "rack mounted" servers.
Communications in data centers today are most often based on
networks running the IP protocol suite. Data centers contain a set of
routers and switches that transport traffic between the servers and to
the outside world. Redundancy of the Internet connection is often
provided by using two or more upstream service providers (see
Multihoming).
Some of the servers at the data center are used for running the basic
Internet and intranet services needed by internal users in the
organization, e.g., e-mail servers, proxy servers, and DNS servers.
Network security elements are also usually deployed: firewalls, VPN
gateways, intrusion detection systems, etc. Also common are monitoring systems for the network and some of the
applications. Additional off site monitoring systems are also typical, in case of a failure of communications inside
the data center.
Applications
Multiple racks of servers, and how a data center
commonly looks.
The main purpose of a data center is running the applications that
handle the core business and operational data of the organization. Such
systems may be proprietary and developed internally by the
organization, or bought from enterprise software vendors. Common examples of such applications are ERP and CRM systems.
A data center may be concerned with just operations architecture or it
may provide other services as well.
Often these applications will be composed of multiple hosts, each
running a single component. Common components of such
applications are databases, file servers, application servers,
middleware, and various others.
Data centers are also used for off site backups. Companies may subscribe to backup services provided by a data
center. This is often used in conjunction with backup tapes. Backups can be taken of servers locally on to tapes.
However, tapes stored on site pose a security threat and are also susceptible to fire and flooding. Larger companies
may also send their backups off site for added security. This can be done by backing up to a data center. Encrypted
backups can be sent over the Internet to another data center where they can be stored securely.
For disaster recovery, several large hardware vendors have developed mobile solutions that can be installed and
made operational in a very short time. Vendors such as Cisco Systems,[33] Sun Microsystems,[34][35] Bull,[36] IBM, and HP have developed systems that could be used for this purpose.[37]
Cold Aisle Containment
Cold Aisle Containment separates the supply and exhaust air between the hot and cold aisles, helping to manage the data centre environment. After containment is installed, the cold air is delivered directly to the servers and other active equipment. This maximises the efficiency of the mechanical infrastructure, reducing cost, and improves floor and aisle airflow balancing.
Data centre facility managers continue to face the challenge of coping with ever increasing demand for M&E capacity to support more high density IT equipment. Air containment is a low cost, high value approach to increasing the effectiveness of the existing air conditioning, enabling operators to:
• Significantly reduce the power cost associated with supporting the current IT requirement
• Defer the expensive M&E upgrade cost of supporting increased IT demand
• Improve cooling for high density racks.
How it works
Classic data centre design incorporates the concept of hot and cold aisles in order to better direct cold air from the air
conditioning system to the input side of the IT equipment. Air Containment is an engineered solution that fully
separates hot exhaust and cold input air flows within the data centre white space, preventing significant quantities of
the cold air bypassing equipment cabinets and returning mixed with the hot air to the CRAC units. Typically this unwanted mixing is compensated for by running the air conditioning system at a lower temperature than would otherwise be required, using scarce capacity and incurring significant avoidable additional energy cost.
Hot or cold aisle containment?
There are essentially two different approaches, based on containing either the hot aisles[38] or the cold aisles[39] in the data centre. The industry continues to debate the relative efficiency benefits of each option. In practice, the approach taken should depend on the specific business drivers and objectives in relation to the current data centre situation. A new data centre facility designed and commissioned with new IT equipment offers more options for achieving air flow separation; significantly improving the effectiveness of an existing data centre requires a solution that can be easily retrofitted.
Additional benefits
By increasing air pressure in the cold aisle and not allowing hot air to mix in, Air Containment also has the potential
to reduce the risk of IT service failure in two key ways:
• By accelerating the flow of cold air into servers, even those in the top third of the cabinet (where 74% of server failures occur, because the temperature there is normally higher), in turn decreasing fan loading and operating temperatures, thereby extending server life span
• By creating a cold air buffer that can delay the thermal shut down of IT equipment in the event of an air
conditioning failure.
References
[1] Amazon Building Large Data Center in Oregon « Data Center Knowledge (http:// www.datacenterknowledge.com/ archives/ 2008/ 11/ 07/
amazon-building-large-data-center-in-oregon/)
[2] http:// telecom-info.telcordia.com/ site-cgi/ ido/ docs. cgi?ID=SEARCH&DOCUMENT=GR-3160&
[3] Kasacavage, Victor (2002). Complete book of remote access: connectivity and security. The Auerbach Best Practices Series. CRC Press.
p. 227. ISBN 0849312531.
[4] Burkey, Roxanne E.; Breakfield, Charles V. (2000). Designing a total data solution: technology, implementation and deployment. Auerbach
Best Practices. CRC Press. p. 24. ISBN 0849308933.
[5] http:// www. adc. com/ Attachment/ 1270711929361/ 102264AE.pdf
[6] A ConnectKentucky article mentioning Stone Mountain Data Center Complex "Global Data Corp. to Use Old Mine for Ultra-Secure Data
Storage Facility" (http:/ / connectkentucky. org/ _documents/ connected_fall_FINAL.pdf) (PDF). ConnectKentucky. 2007-11-01. . Retrieved
2007-11-01.
[7] A definition from Webopedia "Data Center Tiers" (http:// www.webopedia.com/ TERM/ D/ data_center_tiers.html). Webopedia.
2010-02-13. . Retrieved 2010-02-13.
[8] A document from the Uptime Institute describing the different tiers (click through the download page) "Data Center Site Infrastructure Tier
Standard: Topology" (http:/ / uptimeinstitute. org/ index. php?option=com_docman&task=doc_download& gid=82) (PDF). Uptime Institute.
2010-02-13. . Retrieved 2010-02-13.
[9] The rating guidelines from the Uptime Institute "Data Center Site Infrastructure Tier Standard: Topology" (http:/ / professionalservices.
uptimeinstitute.com/ UIPS_PDF/TierStandard.pdf) (PDF). Uptime Institute. 2010-02-13. . Retrieved 2010-02-13.
[10] "Google Container Datacenter Tour (video)" (http:/ /www. youtube. com/ watch?v=zRwPSFpLX8I). .
[11] "Walking the talk: Microsoft builds first major container-based data center" (http:// web.archive.org/web/ 20080612193106/ http:/ / www.
computerworld.com/ action/ article.do?command=viewArticleBasic& articleId=9075519). Archived from the original (http://www.
computerworld. com/ action/ article.do?command=viewArticleBasic& articleId=9075519) on 2008-06-12. . Retrieved 2008-09-22.
[12] "ASHRAE's "Thermal Guidelines for Data Processing Environments"" (http:// tc99. ashraetcs. org/documents/
ASHRAE_Extended_Environmental_Envelope_Final_Aug_1_2008.pdf) (PDF). .
[13] "ServersCheck's Blog on Why Humidity Monitoring" (http:// www.serverscheck.com/ blog/ 2008/ 07/
why-monitor-humidity-in-computer-rooms. html). July 1, 2008. .
[14] Detailed explanation of economizer-based cooling "Economizer Fundamentals: Smart Approaches to Energy-Efficient Free-Cooling for Data
Centers" (http:// www. emersonnetworkpower.com/ en-US/ Brands/ Liebert/Documents/ White Papers/ Economizer Fundamentals - Smart
Approaches to Energy-Efficient Free-Cooling for Data Centers.pdf) (PDF). .
[15] "tw telecom and NYSERDA Announce Co-location Expansion" (http:// www. reuters. com/ article/pressRelease/ idUS141369+
14-Sep-2009+ PRN20090914). Reuters. 2009-09-14. .
[16] http:// telecom-info.telcordia.com/ site-cgi/ ido/ docs. cgi?ID=SEARCH&DOCUMENT=GR-2930&
[17] Detailed explanation of UPS topologies "EVALUATING THE ECONOMIC IMPACT OF UPS TECHNOLOGY" (http:// www.
emersonnetworkpower.com/ en-US/Brands/ Liebert/Documents/ White Papers/ Evaluating the Economic Impact of UPS Technology.pdf)
(PDF). .
[18] Fixen, Edward L. and Vidar S. Landa,"Avoiding the Smell of Burning Data," Consulting-Specifying Engineer, May 2006, Vol. 39 Issue 5,
p47-51
[19] 19 Ways to Build Physical Security Into a Data Center (http:/ / www. csoonline. com/ article/220665)
[20] "Data Center Energy Consumption Trends" (http:/ / www1.eere.energy.gov/femp/ program/dc_energy_consumption.html). U.S.
Department of Energy. . Retrieved 2010-06-10.
[21] J Koomey, C. Belady, M. Patterson, A. Santos, K.D. Lange. Assessing Trends Over Time in Performance, Costs, and Energy Use for
Servers (http:// www.intel. com/ assets/ pdf/general/servertrendsreleasecomplete-v25.pdf) Released on the web August 17th, 2009.
[22] "Quick Start Guide to Increase Data Center Energy Efficiency" (http:// www1. eere.energy.gov/ femp/pdfs/ data_center_qsguide. pdf).
U.S. Department of Energy. . Retrieved 2010-06-10.
[23] "Smart 2020: Enabling the low carbon economy in the information age" (http:// www.smart2020.org/ _assets/ files/
03_Smart2020Report_lo_res.pdf). The Climate Group for the Global e-Sustainability Initiative. . Retrieved 2008-05-11.
[24] "Report to Congress on Server and Data Center Energy Efficiency" (http://www.energystar.gov/ ia/ partners/prod_development/
downloads/ EPA_Datacenter_Report_Congress_Final1. pdf). U.S. Environmental Protection Agency ENERGY STAR Program. .
[25] A calculation of data center electricity burden cited in the Report to Congress on Server and Data Center Energy Efficiency (http:/ / www.
energystar.gov/ ia/partners/ prod_development/downloads/ EPA_Datacenter_Report_Congress_Final1.pdf) and electricity generation
contributions to green house gas emissions published by the EPA in the Greenhouse Gas Emissions Inventory Report (http:/ / epa. gov/
climatechange/ emissions/ downloads10/ US-GHG-Inventory-2010_ExecutiveSummary.pdf). Retrieved 2010-06-08.
[26] Finland - First Choice for Siting Your Cloud Computing Data Center. (http:/ / www.fincloud.freehostingcloud. com/ ). Retrieved 4 August
2010.
[27] Stockholm sets sights on data center customers. (http:// www. stockholmbusinessregion. se/ templates/ page____41724.
aspx?epslanguage=EN) Accessed 4 August 2010.
[28] Swiss Carbon-Neutral Servers Hit the Cloud. (http:/ / www.greenbiz.com/ news/ 2010/ 06/ 30/ swiss-carbon-neutral-servers-hit-cloud).
Retrieved 4 August 2010.
[29] Katrice R. Jalbuena (October 15, 2010). "Green business news." (http:// ecoseed. org/en/ business-article-list/ article/1-business/
8219-i-t-industry-risks-output-cut-in-low-carbon-economy). EcoSeed. . Retrieved November 11, 2010.
[30] "Data Center Energy Forecast" (https:/ / microsite. accenture.com/ svlgreport/Documents/ pdf/SVLG_Report. pdf). Silicon Valley
Leadership Group. .
[31] "Google Efficiency Update" (http:// www. datacenterknowledge. com/ archives/ 2009/ 10/ 15/ google-efficiency-update-pue-of-1-22/).
Data Center Knowledge. . Retrieved 2010-06-08.
[32] Commentary on introduction of Energy Star for Data Centers "Introducing EPA ENERGY STAR® for Data Centers" (http:// www.
emerson.com/ edc/ post/ 2010/ 06/ 15/ Introducing-EPA-ENERGY-STARc2ae-for-Data-Centers.aspx) (Web site). Jack Pouchet. 2010-09-27.
. Retrieved 2010-09-27.
[33] "Info and video about Cisco's solution" (http:/ / www.datacenterknowledge.com/ archives/ 2008/ May/ 15/
ciscos_mobile_emergency_data_center.html). Datacentreknowledge. May 15, 2007. . Retrieved 2008-05-11.
[34] "Technical specs of Sun's Blackbox" (http:// web. archive. org/web/ 20080513090300/ http:/ / www. sun.com/ products/ sunmd/ s20/
specifications. jsp). Archived from the original (http:// www.sun. com/ products/ sunmd/ s20/ specifications. jsp) on 2008-05-13. . Retrieved
2008-05-11.
[35] An English Wikipedia article on Sun's modular datacentre
[36] Kidger, Daniel. "Mobull Plug and Boot Datacenter" (http:// www. bull. com/ extreme-computing/mobull.html). Bull. . Retrieved
2011-05-24.
[37] Kraemer, Brian (June 11, 2008). "IBM's Project Big Green Takes Second Step" (http:/ / www. crn.com/ hardware/208403225).
ChannelWeb. . Retrieved 2008-05-11.
[38] http:// www. cold-aisle-containment.co.uk/ solutions/ hot-air-return/index.html
[39] http:// www. cold-aisle-containment.co.uk/ solutions/ cold-aisle-containment/index. html
External links
• Lawrence Berkeley Lab (http:// hightech. lbl.gov/ datacenters. html) - Research, development, demonstration,
and deployment of energy-efficient technologies and practices for data centers
• The Uptime Institute (http:/ / www. uptimeinstitute. org/) - Organization that defines data center reliability and
conducts site certifications.
• Timelapse BIT2BCD (https:/ / weblog. bit. nl/ blog/ 2009/ 07/ 29/ timelapse-video-bouw-bit-2bcd/) - timelapse
video of a data centre building in Ede, Netherlands.
• The Neher-McGrath Institute (http:// www. neher-mcgrath.org/) - Organization that certifies data center
underground duct bank installation to provide lower operating cost and increased up-time.
Data classification (data management)
In the field of data management, data classification, as part of the Information Lifecycle Management (ILM) process, can be defined as a tool for the categorization of data that enables or helps an organization to effectively answer the following questions:
• What data types are available?
• Where are certain data located?
• What access levels are implemented?
• What protection level is implemented and does it adhere to compliance regulations?
When implemented it provides a bridge between IT professionals and process or application owners. IT staff are informed about the value of the data, and management (usually the application owners) gains a better understanding of which segments of the data centre need investment to keep operations running effectively. This can be of particular importance in risk management, legal discovery, and compliance with government regulations. Data classification is typically a manual process; however, there are many tools from different vendors that can help gather information about the data.
How to start the process of data classification
The first step is to evaluate and divide applications and data as follows:
• Structured data (statistically around 15% of data)
  • Generally describes proprietary data which can be accessed only through an application or application programming interface (API)
  • Applications that produce structured data are usually database applications
  • This type of data usually brings complex procedures of data evaluation and migration between storage tiers
• Unstructured data (all other data that cannot be categorized as structured, around 85%)
  • Generally describes data files that have no physical interconnectivity (e.g. documents, pictures, multimedia files, ...)
  • Relatively simple process of data classification criteria assignment
  • Simple process of data migration between assigned segments of predefined storage tiers
Basic criteria for unstructured data classification
• Time criteria are the simplest and most commonly used: data are evaluated by time of creation, time of access, time of update, etc. (a simple example is sketched after this list)
• Metadata criteria such as type, name, owner, and location can be used to create a more advanced classification policy
• Content criteria, which involve the use of advanced content classification algorithms, are the most advanced form of unstructured data classification
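As an illustration of the time and metadata criteria above, the following sketch assigns files to storage tiers by age and file extension. The tier names, thresholds, and extensions are hypothetical examples, not part of any standard:

import os
import time

ARCHIVE_AGE_DAYS = 365                         # hypothetical threshold for stale data
MEDIA_EXTENSIONS = {".jpg", ".png", ".mp4"}    # hypothetical "bulky media" set

def classify(path):
    """Assign an unstructured file to a storage tier using time and metadata criteria."""
    age_days = (time.time() - os.path.getatime(path)) / 86400
    ext = os.path.splitext(path)[1].lower()
    if age_days > ARCHIVE_AGE_DAYS:
        return "archive-tier"        # time criterion: not accessed for a year
    if ext in MEDIA_EXTENSIONS:
        return "capacity-tier"       # metadata criterion: large media files
    return "performance-tier"        # default: recently used working data

A content criterion would replace or supplement these checks with an analysis of the file's contents, which is more accurate but considerably more expensive.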
Basic criteria for structured data classification
These criteria are usually initiated by application requirements such as:
• Disaster recovery and Business Continuity rules
• Data centre resources optimization and consolidation
• Hardware performance limitations and possible improvements by reorganization
Benefits of data classification
Effective implementation of appropriate data classification can significantly improve the ILM process and save data centre storage resources. If implemented systematically it can generate improvements in data centre performance and utilization. Data classification can also reduce costs and administration overhead. A "good enough" data classification can produce these results:
• Data compliance and easier risk management. Data are located where expected, on the predefined storage tier and at the expected "point in time"
• Simplification of data encryption, because not all data need to be encrypted. This saves valuable processor cycles and the associated overhead
• Data indexing to improve user access times
• Data protection is redefined, with an improved RTO (Recovery Time Objective)
Data conditioning
Data conditioning is the use of data management and optimization techniques which result in the intelligent routing,
optimization and protection of data for storage or data movement in a computer system. Data conditioning features
enable enterprise and cloud data centers to dramatically improve system utilization and increase application
performance, lowering both capital expenditures and operating costs.
Data conditioning technologies delivered through a Data Conditioning Platform optimize data as it moves through a
computer’s I/O (Input/Output) path or I/O bus -- the data path between the main processor complex and storage
subsystems. The functions of a Data Conditioning Platform typically reside on a storage controller add-in card
inserted into the PCI-e slots of a server. This enables easy integration of new features in a server or a whole data
center.
Data conditioning features delivered via a Data Conditioning Platform are designed to simplify system integration,
and minimize implementation risks associated with deploying new technologies by ensuring seamless compatibility
with all leading server and storage hardware, operating systems and applications, and meeting all current
commercial/off-the-shelf (COTS) standards. By delivering optimization features via a Data Conditioning Platform,
data center managers can improve system efficiency and reduce cost with minimal disruption, avoid the need to modify existing applications or operating systems, and leverage existing hardware systems.
Summary
Data conditioning builds on existing data storage functionality delivered in the I/O path, including RAID (Redundant Arrays of Inexpensive Disks), intelligent I/O-based power management,[1] and SSD (Solid-State Drive) performance caching techniques. Data conditioning is enabled both by advanced ASIC controller technology and intelligent software. New data conditioning capabilities can be designed into and delivered via storage controllers in the I/O path to achieve the data center's technical and business goals.
Data Conditioning strategies can also be applied to improving server and storage utilization and for better managing
a wide range of hardware and system-level capabilities.
Background and Purpose
Data conditioning principles can be applied to any demanding computing environment to create significant cost,
performance and system utilization efficiencies, and are typically deployed by data center managers, system
integrators, and storage and server OEMs seeking to optimize hardware and software utilization, achieve simplified, non-intrusive technology integration, and minimize the risks and performance hits traditionally associated with incorporating new data center technologies.
References
Adaptec MaxIQ[2]
[1] http:/ / www. adaptec. com/ en-us/ _common/ greenpower?refURL=/greenpower/
[2] http:/ / www. adaptec/ maxIQ
Data custodian
In Data Governance groups, responsibilities for data management are increasingly divided between the business
process owners and information technology (IT) departments. Two functional titles commonly used for these roles
are Data Steward and Data Custodian.
Data Stewards are commonly responsible for data content, context, and associated business rules. Data Custodians
are responsible for the safe custody, transport, storage of the data and implementation of business rules.[1][2] Simply
put, Data Stewards are responsible for what is stored in a data field, while Data Custodians are responsible for the
technical environment and database structure. Common job titles for data custodians are Database Administrator
(DBA), Data Modeler, and ETL Developer.
Data Custodian Responsibilities
A data custodian ensures:
1. Access to the data is authorized and controlled
2. Data stewards are identified for each data set
3. Technical processes sustain data integrity
4. Processes exist for data quality issue resolution in partnership with Data Stewards
5. Technical controls safeguard data
6. Data added to data sets are consistent with the common data model
7. Versions of Master Data are maintained along with the history of changes
8. Change management practices are applied in maintenance of the database
9. Data content and changes can be audited
References
[1] Carnegie Mellon - Information Security Roles and Responsibilities, http:/ / www.cmu.edu/ iso/ governance/roles/ data-custodian.html
[2] Policies, Regulations and Rules: Data Management Procedures - REG 08.00.3 - Information Technology, , NC State University, http://
www.ncsu. edu/ policies/ informationtechnology/REG08.00.3.php
Related Links
• Establishing data stewards, by Jonathan G. Geiger, Teradata Magazine Online, September 2008, http:/ / www.
teradata.com/ tdmo/ v08n03/ Features/ EstablishingDataStewards. aspx
• A Rose By Any Other Name – Titles In Data Governance, by Anne Marie Smith, Ph.D., EIMInstitute.ORG
Archives, Volume 1, Issue 13, March 2008, http:/ / www.eiminstitute. org/library/eimi-archives/
volume-1-issue-13-march-2008-edition/a-rose-by-any-other-name-2013-titles-in-data-governance
Data deduplication
In computing, data deduplication is a specialized data compression technique for eliminating coarse-grained
redundant data, typically to improve storage utilization. In the deduplication process, duplicate data is deleted,
leaving only one copy of the data to be stored, along with references to the unique copy of data. Deduplication is
able to reduce the required storage capacity since only the unique data is stored.
Depending on the type of deduplication, redundant files may be reduced, or even portions of files or other data that
are similar can also be removed. As a simple example of file based deduplication, a typical email system might
contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or
archived, all 100 instances are saved, requiring 100 MB storage space. With data deduplication, only one instance of
the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy. In this
example, the deduplication ratio is roughly 100 to 1.
Different applications and data types naturally have different levels of data redundancy. Backup applications
generally benefit the most from de-duplication due to the nature of repeated full backups of an existing file system.
Like a traditional stream-based dictionary coder, deduplication identifies identical sections of data and replaces them
by references to a single copy of the data. However, whereas standard file compression tools like LZ77 and LZ78
identify short repeated substrings inside single files, the focus of data deduplication is to take a very large volume of
data and identify large sections - such as entire files or large sections of files - that are identical, and store only one
copy of it. This copy may be additionally compressed by single-file compression techniques.
Benefits
Data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications
where many copies of very similar or even identical data are stored on a single disk—a surprisingly common
scenario.
One very good application for data deduplication is in backups. Most data in a given backup isn't changed from the
previous backup; common backup systems try to exploit this by omitting (or hard linking) files that haven't changed
or storing differences between files. Neither approach captures all redundancies, however. Hard linking does not help with large files that have changed only in small ways, such as an email database; differencing only finds redundancies in adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents).
Data deduplication allows a backup program to essentially just copy files onto a backup disk without trying to omit
or difference them; the storage system itself will ensure that only one copy of the data ends up on the disk, no matter
how many versions ago the duplicate data occurred or even if the similarity appears in a different file. This can
reduce backup storage requirements by 90% or more—making it feasible to retain data for months on a fast, readily
accessible backup medium. It also reduces the data that must be sent across a WAN for remote backups, replication,
and disaster recovery.
Data deduplication is also especially effective when used with virtual servers, allowing the nominally separate
system files for each virtual server to be coalesced into a single storage space. (At the same time, if a given server
customizes a file, deduplication will not change the files on the other servers—something that alternatives like hard
links or shared disks do not offer.) Backing up or making duplicate copies of virtual environments is similarly
improved.
By reducing the amount of storage needed, deduplication can save other resources, too: the energy use, physical
volume, cooling needs, and carbon footprint needed to store the data are all reduced. Less hardware needs to be
purchased, recycled, and replaced, further lowering costs.
Deduplication overview
When deduplication may occur
Deduplication may occur "in-line", as data is flowing, or "post-process" after it has been written.
Post-process deduplication
With post-process deduplication, new data is first stored on the storage device and then a process at a later time will
analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and
lookup to be completed before storing the data, thereby ensuring that store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be stored unnecessarily for a short time, which can be an issue if the storage system is near full capacity.
In-line deduplication
This is the process where the deduplication hash calculations are created on the target device as the data enters the
device in real time. If the device spots a block that it has already stored on the system, it does not store the new block, but simply references the existing block. The benefit of in-line deduplication over post-process deduplication is that it requires less storage, as data is not duplicated. On the negative side, it is frequently argued that because hash calculations and lookups take time, data ingestion can be slower, thereby reducing the backup
throughput of the device. However, certain vendors with in-line deduplication have demonstrated equipment with
similar performance to their post-process deduplication counterparts.
Post-process and in-line deduplication methods are often heavily debated.[1][2]
Where deduplication may occur
Deduplication can occur close to where data is created, which is often referred to as "source deduplication." It can
occur close to where the data is stored, which is commonly called "target deduplication."
Source versus target deduplication
When describing deduplication for backup architectures, it is common to hear two terms: source deduplication and
target deduplication.
Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a
file-system.[3][4] The file system will periodically scan new files, creating hashes and comparing them to the hashes of existing files. When files with the same hashes are found, the file copy is removed and the new file points to the old file. Unlike hard links, however, duplicated files are considered to be separate entities, and if one of the duplicated
files is later modified, then using a system called Copy-on-write a copy of that file or changed block is created. The
deduplication process is transparent to the users and backup applications. Backing up a deduplicated filesystem will
often cause duplication to occur, resulting in the backups being bigger than the source data.
Target deduplication is the process of removing duplicates of data in the secondary store. Generally this will be a
backup store such as a data repository or a virtual tape library. There are three different ways of performing the
deduplication process.
How deduplication occurs
There are many variations employed.
Chunking and deduplication overview
One of the most common forms of data deduplication implementations work by comparing chunks of data to detect
duplicates. For that to happen, each chunk of data is assigned an identification, calculated by the software, typically
using cryptographic hash functions. In many implementations, the assumption is made that if the identification is
identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle; other
implementations do not assume that two blocks of data with the same identifier are identical, but actually verify that
data with the same identification is identical.[5] If the software either assumes that a given identification already
exists in the deduplication namespace or actually verifies the identity of the two blocks of data, depending on the
implementation, then it will replace that duplicate chunk with a link.
Once the data has been deduplicated, upon read back of the file, wherever a link is found, the system simply replaces
that link with the referenced data chunk. The de-duplication process is intended to be transparent to end users and
applications.
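A minimal sketch of this chunk-and-hash approach is shown below. It uses fixed-size chunks and SHA-256 identifiers, and it can either trust the hash or verify byte-for-byte equality, mirroring the two implementation choices described above. It is an illustration, not any vendor's implementation:

import hashlib

class DedupStore:
    """Toy chunk store: identical chunks are kept once and referenced by hash."""

    def __init__(self, chunk_size=4096, verify=True):
        self.chunk_size = chunk_size
        self.verify = verify
        self.chunks = {}                                   # hash -> chunk bytes

    def write(self, data):
        """Store data, returning the list of chunk hashes that acts as the 'link' list."""
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            existing = self.chunks.get(digest)
            if existing is None:
                self.chunks[digest] = chunk                # first copy: store it
            elif self.verify and existing != chunk:
                raise RuntimeError("hash collision")       # vanishingly rare, but checked
            refs.append(digest)                            # duplicates become references
        return refs

    def read(self, refs):
        """Rehydrate the original data by following the chunk references."""
        return b"".join(self.chunks[r] for r in refs)

Writing the same one-megabyte attachment 100 times into such a store keeps a single copy of each chunk plus 100 lists of references, which is the effect described in the e-mail example earlier.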
Chunking methods
Between commercial deduplication implementations, technology varies primarily in chunking method and in
architecture. In some systems, chunks are defined by physical layer constraints (e.g. 4KB block size in WAFL). In
some systems only complete files are compared, which is called Single Instance Storage or SIS. The most intelligent
(but CPU-intensive) method of chunking is generally considered to be sliding-block. In sliding-block chunking, a window is
passed along the file stream to seek out more naturally occurring internal file boundaries.
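One common way to find such "naturally occurring" boundaries is a rolling hash computed over a sliding window, cutting a chunk whenever the hash value satisfies some condition. The sketch below uses a simple polynomial rolling hash; production systems typically use Rabin fingerprints or similar, so treat this purely as an illustration of the idea:

def sliding_block_chunks(data, window=48, mask=0x1FFF, min_size=2048):
    """Split data into chunks whose boundaries are chosen by a rolling hash."""
    base, mod = 257, (1 << 61) - 1
    power = pow(base, window - 1, mod)        # base**(window-1), for removing old bytes
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * power) % mod   # drop the byte leaving the window
        h = (h * base + byte) % mod                    # shift in the incoming byte
        if (i - start + 1) >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])           # boundary found: cut here
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                    # trailing partial chunk
    return chunks

Because boundaries depend on content rather than offsets, two files that share a long identical region tend to produce identical chunks for that region even if it is shifted, which fixed-size chunking would miss.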
Client backup deduplication
This is the process where the deduplication hash calculations are initially created on the source (client) machines.
Files that have identical hashes to files already in the target device are not sent; the target device just creates appropriate internal links to reference the duplicated data. The benefit of this is that it avoids data being unnecessarily sent across the network, thereby reducing traffic load.
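A sketch of the idea follows: the client hashes its chunks, asks the target which hashes it already holds, and uploads only the missing ones. The server object and its have_hashes/upload/commit_manifest calls are hypothetical stand-ins for whatever protocol a real backup product uses:

import hashlib

def client_backup(chunks, server):
    """Client-side deduplication: send only chunks the server does not already have."""
    digests = [hashlib.sha256(c).hexdigest() for c in chunks]
    missing = server.have_hashes(digests)        # hypothetical call: returns unknown hashes
    for chunk, digest in zip(chunks, digests):
        if digest in missing:
            server.upload(digest, chunk)         # new data crosses the network only once
    server.commit_manifest(digests)              # server links duplicates internally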
Primary storage vs. secondary storage deduplication
By definition, primary storage systems are designed for optimal performance, rather than lowest possible cost. The
design criteria for these systems is to increase performance, at the expense of other considerations. Moreover,
primary storage systems are much less tolerant of any operation that can negatively impact performance.
Also by definition, secondary storage systems contain primarily duplicate, or secondary copies of data. These copies
of data are typically not used for actual production operations and as a result are more tolerant of some performance
degradation, in exchange for increased efficiency.
To date, data deduplication has predominantly been used with secondary storage systems. The reasons for this are
two-fold. First, data deduplication requires overhead to discover and remove the duplicate data. In primary storage
systems, this overhead may impact performance. The second reason why deduplication is applied to secondary data,
is that secondary data tends to have more duplicate data. Backup applications in particular commonly generate
significant portions of duplicate data over time.
Data deduplication has been deployed successfully with primary storage in some cases where the system design does
not require significant overhead, or impact performance.
Drawbacks and concerns
Whenever data is transformed, concerns arise about potential loss of data. By definition, data deduplication systems
store data differently from how it was written. As a result, users are concerned with the integrity of their data. The
various methods of deduplicating data all employ slightly different techniques. However, the integrity of the data
will ultimately depend upon the design of the deduplicating system, and the quality of the implementation of the
algorithms. As the technology has matured over the past decade, the integrity of most of the major products has been
well proven.
One method for deduplicating data relies on the use of cryptographic hash functions to identify duplicate segments of
data. If two different pieces of information generate the same hash value, this is known as a collision. The
probability of a collision depends upon the hash function used, and although the probabilities are small, they are
always non zero.
Thus, the concern arises that data corruption can occur if a hash collision occurs, and additional means of
verification are not used to verify whether there is a difference in data, or not. Both in-line and post-process
architectures may offer bit-for-bit validation of original data for guaranteed data integrity.[6] The hash functions used include standards such as SHA-1, SHA-256 and others. These provide a far lower probability of data loss than the risk of an undetected/uncorrected hardware error in most cases, and the collision probability can be on the order of 0.00000000000000000000000000000000000000000000000013% (1.3 × 10⁻⁴⁹ %) per petabyte (1,000 terabytes) of data.[7]
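As a back-of-the-envelope illustration (using assumptions of my own, not the calculation behind the cited figure), the birthday bound for n chunks hashed with a b-bit function is roughly n²/2^(b+1). With 4 KB chunks and a 256-bit hash, one petabyte of data gives:

# Birthday-bound estimate of a chunk-hash collision within one petabyte of data.
# Assumptions (illustrative only): 4 KB chunks, 256-bit hash, decimal petabyte.
chunk_size = 4 * 1024                    # bytes per chunk
n = 1e15 / chunk_size                    # number of chunks in a petabyte
p = (n * n / 2) / 2**256                 # approximate collision probability
print("%.1e probability, about %.1e %% per petabyte" % (p, p * 100))

which is on the order of 10^-55, far below the probability of an undetected hardware error; the exact figure depends on the chunk size and hash width assumed.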
Some cite the computational resource intensity of the process as a drawback of data deduplication. However, this is
rarely an issue for stand-alone devices or appliances, as the computation is completely offloaded from other systems.
This can be an issue when the deduplication is embedded within devices providing other services.
To improve performance, many systems utilize weak and strong hashes. Weak hashes are much faster to calculate
but there is a greater risk of a hash collision. Systems that utilize weak hashes will subsequently calculate a strong
hash and will use it as the determining factor of whether it is actually the same data or not. Note that the system
overhead associated with calculating and looking up hash values is primarily a function of the deduplication
workflow. The reconstitution of files does not require this processing and any incremental performance penalty
associated with re-assembly of data chunks is unlikely to impact application performance.
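A sketch of this weak-then-strong pattern, assuming a fast 32-bit checksum as the weak hash and SHA-256 as the strong one (illustrative choices, not any particular product's design):

import hashlib
import zlib

def find_duplicate(chunk, index):
    """Look up a chunk using a weak hash as a prefilter and a strong hash to confirm.

    index maps weak checksums to dictionaries of {strong_digest: chunk_location}.
    """
    weak = zlib.adler32(chunk)                      # cheap to compute, collision-prone
    candidates = index.get(weak)
    if not candidates:
        return None                                 # weak-hash miss: definitely new data
    strong = hashlib.sha256(chunk).hexdigest()      # expensive hash only for candidates
    return candidates.get(strong)                   # confirmed duplicate, or None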
Another area of concern with deduplication is the related effect on snapshots, backup, and archival, especially where
deduplication is applied against primary storage (for example inside a NAS filer). Reading files out of a storage
device causes full rehydration of the files, so any secondary copy of the data set is likely to be larger than the
primary copy. In terms of snapshots, if a file is snapshotted prior to deduplication, the post-deduplication snapshot
will preserve the entire original file. This means that although storage capacity for primary file copies will shrink,
capacity required for snapshots may expand dramatically.
Another concern is the effect of compression and encryption. Although deduplication is a version of compression, it
works in tension with traditional compression. Deduplication achieves better efficiency against smaller data chunks,
whereas compression achieves better efficiency against larger chunks. The goal of encryption is to eliminate any
discernible patterns in the data. Thus encrypted data will have 0% gain from deduplication, even though the
underlying data may be redundant.
Scaling has also been a challenge for dedupe systems because the hash table or dedupe namespace needs to be shared
across storage devices. If there are multiple disk backup devices in an infrastructure with discrete dedupe
namespaces, then space efficiency is adversely affected. A namespace shared across devices - called Global Dedupe
- preserves space efficiency, but is technically challenging from a reliability and performance perspective.
Deduplication ultimately reduces redundancy. If this was not expected and planned for, this may ruin the underlying
reliability of the system. (Compare this, for example, to the LOCKSS storage architecture that achieves reliability
through multiple copies of data.)
References
[1] "In-line or post-process de-duplication? (updated 6-08)" (http:// www.backupcentral.com/ content/ view/ 134/ 47/ ). Backup Central. .
Retrieved 2009-10-16.
[2] "Inline vs. post-processing deduplication appliances" (http:// searchdatabackup.techtarget.com/ tip/ 0,289483,sid187_gci1315295,00. html).
Searchdatabackup.techtarget.com. . Retrieved 2009-10-16.
[3] "Windows Server 2008: Windows Storage Server 2008" (http:/ / www.microsoft.com/ windowsserver2008/ en/ us/ WSS08/ SIS. aspx).
Microsoft.com. . Retrieved 2009-10-16.
[4] "Products - Platform OS" (http:// www.netapp. com/ us/ products/ platform-os/dedupe.html). NetApp. . Retrieved 2009-10-16.
[5] An example of an implementation that checks for identity rather than assuming it is described in "US Patent application # 20090307251"
(http:// appft1.uspto. gov/ netacgi/ nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/ netahtml/ PTO/search-bool.html& r=1&f=G&
l=50& co1=AND&d=PG01& s1=shnelvar& OS=shnelvar& RS=shnelvar).
[6] http:// www. evaluatorgroup.com/ data-deduplication-why-when-where-and-how
[7] http:// www. exdupe. com/ collision. pdf
Doing More with Less by Jatinder Singh http:/ / www.itnext. in/ content/ doing-more-less.html
External links
• Biggar, Heidi(2007.12.11). WebCast: The Data Deduplication Effect (http:// www.infostor. com/ webcast/
display_webcast. cfm?ID=540)
• Fellows, Russ(Evaluator Group, Inc.) Data Deduplication, why when where and how? (http:// www.
evaluatorgroup. com/ data-deduplication-why-when-where-and-how)
• Using Latent Semantic Indexing for Data Deduplication (http:// www.tacoma. washington. edu/ tech/ docs/
research/ gradresearch/MSpiz. pdf).
• A Better Way to Store Data (http:/ / www. forbes.com/ 2009/ 08/ 08/
exagrid-storage-data-technology-cio-network-tape.html).
• What Is the Difference Between Data Deduplication, File Deduplication, and Data Compression? (http:/ / www.
eweek.com/ c/ a/ Knowledge-Center/
What-Is-the-Difference-Between-Data-Deduplication-File-Deduplication-and-Data-Compression/) - Database
from eWeek
• SNIA DDSR SIG (http:/ / www. snia. org/ forums/dmf/programs/data_protect_init/ ddsrsig/ ) Understanding
Data Deduplication Ratios (http:/ / www. snia. org/forums/dmf/knowledge/ white_papers_and_reports/
Understanding_Data_Deduplication_Ratios-20080718. pdf)
Data dictionary
A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized
repository of information about data such as meaning, relationships to other data, origin, usage, and format."[1] The
term may have one of several closely related meanings pertaining to databases and database management systems
(DBMS):
• a document describing a database or collection of databases
• an integral component of a DBMS that is required to determine its structure
• a piece of middleware that extends or supplants the native data dictionary of a DBMS
Documentation
Database users and application developers can benefit from an authoritative data dictionary document that catalogs
the organization, contents, and conventions of one or more databases.[2] This typically includes the names and descriptions of various tables and fields in each database, plus additional details, like the type and length of each data element. There is no universal standard as to the level of detail in such a document; it is primarily documentation about the data (metadata) rather than the data itself.
Middleware
In the construction of database applications, it can be useful to introduce an additional layer of data dictionary
software, i.e. middleware, which communicates with the underlying DBMS data dictionary. Such a "high-level" data
dictionary may offer additional features and a degree of flexibility that goes beyond the limitations of the native
"low-level" data dictionary, whose primary purpose is to support the basic functions of the DBMS, not the
requirements of a typical application. For example, a high-level data dictionary can provide alternative
entity-relationship models tailored to suit different applications that share a common database.[3] Extensions to the
data dictionary also can assist in query optimization against distributed databases.[4]
Software frameworks aimed at rapid application development sometimes include high-level data dictionary facilities,
which can substantially reduce the amount of programming required to build menus, forms, reports, and other
components of a database application, including the database itself. For example, PHPLens includes a PHP class
library to automate the creation of tables, indexes, and foreign key constraints portably for multiple databases.[5]
Another PHP-based data dictionary, part of the RADICORE toolkit, automatically generates program objects,
scripts, and SQL code for menus and forms with data validation and complex JOINs.[6]
For the ASP.NET
environment, Base One's data dictionary provides cross-DBMS facilities for automated database creation, data
validation, performance enhancement (caching and index utilization), application security, and extended data
types.[7]
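As a rough illustration of the idea, the sketch below is not the actual PHPLens/ADOdb, RADICORE, or Base One API; the table definition and per-DBMS type names are invented. It shows how a high-level data dictionary held as ordinary data can drive portable table creation across database products.

```python
# A minimal sketch (not any vendor's API) of a high-level data dictionary
# driving schema creation: table definitions are held as data, and portable
# DDL is generated from them for a target DBMS dialect.
DATA_DICTIONARY = {
    "customer": {
        "id":    {"type": "integer", "primary_key": True},
        "name":  {"type": "string", "length": 80},
        "email": {"type": "string", "length": 120},
    }
}

SQL_TYPES = {
    "sqlite":   {"integer": "INTEGER", "string": "TEXT"},
    "postgres": {"integer": "INTEGER", "string": "VARCHAR({length})"},
}

def create_table_sql(table: str, dialect: str) -> str:
    """Generate a CREATE TABLE statement for one dictionary entry and DBMS dialect."""
    cols = []
    for name, spec in DATA_DICTIONARY[table].items():
        col_type = SQL_TYPES[dialect][spec["type"]].format(length=spec.get("length", 0))
        pk = " PRIMARY KEY" if spec.get("primary_key") else ""
        cols.append(f"{name} {col_type}{pk}")
    return f"CREATE TABLE {table} ({', '.join(cols)});"

print(create_table_sql("customer", "sqlite"))
print(create_table_sql("customer", "postgres"))
```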
References
[1] ACM, IBM Dictionary of Computing (http://portal.acm.org/citation.cfm?id=541721), 10th edition, 1993
[2] TechTarget, SearchSOA, What is a data dictionary? (http://searchsoa.techtarget.com/sDefinition/0,,sid26_gci211896,00.html)
[3] U.S. Patent 4774661, Database management system with active data dictionary (http://www.freepatentsonline.com/4774661.html), 19 November 1985, AT&T
[4] U.S. Patent 4769772, Automated query optimization method using both global and parallel local optimizations for materialization access planning for distributed databases (http://www.freepatentsonline.com/4769772.html), 28 February 1985, Honeywell Bull
[5] PHPLens, ADOdb Data Dictionary Library for PHP (http://phplens.com/lens/adodb/docs-datadict.htm)
[6] RADICORE, What is a Data Dictionary? (http://www.radicore.org/viewarticle.php?article_id=5)
[7] Base One International Corp., Base One Data Dictionary (http://www.boic.com/b1ddic.htm)
External links
• Yourdon, Structured Analysis Wiki, Data Dictionaries (http://yourdon.com/strucanalysis/wiki/index.php?title=Chapter_10)
Data Domain (corporation)
Data Domain Corporation is an Information Technology company specializing in target-based deduplication solutions for disk based backup.[1]
In June 2009, EMC Corporation announced their intention to acquire Data Domain Corp, outbidding NetApp's
previous offer.[2] In July, the two companies reached a definitive agreement regarding the acquisition.
References
[1] "Data Domain, an EMC company." Data Domain. (http:/ / www.datadomain.com/ company/ )
[2] "EMC Tops NetApp’s Bid for Data Domain." 1 June 2009. New York Times Online. (http:/ / dealbook.blogs. nytimes.com/ 2009/ 06/ 01/
emc-tops-netapps-bid-for-data-domain/)
Data exchange
Data exchange is the process of taking data structured under a source schema and actually transforming it into data
structured under a target schema, so that the target data is an accurate representation of the source data. Data
exchange is similar to the related concept of data integration except that data is actually restructured (with possible
loss of content) in data exchange. There may be no way to transform an instance given all of the constraints. Conversely, there may be numerous ways to transform the instance (possibly infinitely many), in which case a "best" choice of solution must be identified and justified.
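A minimal sketch of such a transformation, with invented source and target schemas, is shown below; note that the mapping deliberately drops a source field, illustrating the possible loss of content mentioned above.

```python
# Illustrative sketch (invented schemas) of data exchange: an instance
# structured under a source schema is restructured under a target schema.
# The mapping drops the source's "middle_name" field, showing that content
# can be lost during the exchange.
from typing import Dict, List

source_instance: List[Dict] = [
    {"first_name": "Grace", "middle_name": "Brewster", "last_name": "Hopper", "dept": "R&D"},
]

def to_target_schema(row: Dict) -> Dict:
    """Schema mapping: source (first/middle/last name, dept) -> target (full_name, department)."""
    return {
        "full_name": f"{row['first_name']} {row['last_name']}",  # middle_name is lost
        "department": row["dept"],
    }

target_instance = [to_target_schema(r) for r in source_instance]
print(target_instance)
# [{'full_name': 'Grace Hopper', 'department': 'R&D'}]
```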
References
• R. Fagin, P. Kolaitis, R. Miller, and L. Popa. "Data exchange: semantics and query answering." Theoretical Computer Science, 336(1):89–124, 2005.
• P. Kolaitis. "Schema mappings, data exchange, and metadata management." Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 61–75, 2005.
Data extraction
Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources
for further data processing or data storage (data migration). The import into the intermediate extracting system is
thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in
the data workflow.[1]
Usually, the term data extraction is applied when (experimental) data is first imported into a computer from primary
sources, like measuring or recording devices. Today's electronic devices will usually present an electrical connector
(e.g. USB) through which 'raw data' can be streamed into a personal computer.
Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports,
spool files etc. Extracting data from these unstructured sources has grown into a considerable technical challenge: whereas historically data extraction has had to deal with changes in physical hardware formats, the majority of current data extraction deals with extracting data from these unstructured data sources and from different software formats. This growing process of data extraction from the web is referred to as Web scraping.
The act of adding structure to unstructured data takes a number of forms:
• Using text pattern matching, such as regular expressions, to identify small or large-scale structure, e.g. records in a report and their associated data from headers and footers (see the sketch below);
• Using a table-based approach to identify common sections within a limited domain, e.g. in emailed resumes, identifying skills, previous work experience, qualifications, etc. using a standard set of commonly used headings (these would differ from language to language), e.g. Education might be found under Education/Qualification/Courses;
• Using text analytics to attempt to understand the text and link it to other information
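The first of these approaches can be sketched as follows; the report layout, field names, and regular expression are invented for illustration and are not drawn from any particular product.

```python
# Illustrative sketch only: using a regular expression to pull records out of
# a semi-structured mainframe-style report. The report and fields are invented.
import re

report = """ACME SALES REPORT  PAGE 1
CUST: 00123  NAME: SMITH, JOHN    TOTAL: 1,250.00
CUST: 00456  NAME: DOE, JANE      TOTAL:   310.50
END OF REPORT"""

# One pattern per record line: customer id, name, and total amount.
record = re.compile(r"CUST:\s*(\d+)\s+NAME:\s*(.+?)\s+TOTAL:\s*([\d,]+\.\d{2})")

rows = [
    {"customer_id": m.group(1), "name": m.group(2), "total": m.group(3).replace(",", "")}
    for m in record.finditer(report)
]
print(rows)
# [{'customer_id': '00123', 'name': 'SMITH, JOHN', 'total': '1250.00'},
#  {'customer_id': '00456', 'name': 'DOE, JANE', 'total': '310.50'}]
```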
Notes
[1] Definition of data extraction. (http://www.extractingdata.com)
External links
• Data Extraction (http://www.etltools.org/extraction.html) as a part of the ETL process in a Data Warehousing environment
Data field
A data field is a location in which data is stored. The term commonly refers to a column in a database table or to a field in a data entry form or web form. The field may contain data to be entered as well as data to be displayed.
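As a simple illustration (the record type and field names below are invented), each attribute of a record corresponds to one data field, much as a column does in a table or an input box does on a form.

```python
# A minimal sketch showing data fields as the attributes of a record type:
# each attribute corresponds to one field, like a column in a table or an
# input field on a form. Names are invented for the example.
from dataclasses import dataclass

@dataclass
class CustomerRecord:
    customer_id: int   # field: unique identifier
    name: str          # field: value entered and displayed
    email: str         # field: value entered and displayed

row = CustomerRecord(customer_id=1, name="Jane Doe", email="jane@example.com")
print(row.name)   # access a single field of the record
```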
Data flow diagram
Data flow diagram example.[1]
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system, modelling its process aspects. Often they are a preliminary step used to create an overview of the system which can later be elaborated.[2] DFDs can also be used for the visualization of data processing (structured design).
A DFD shows what kinds of data will be input to and output from the system, where the data will come from and go to, and where the data will be stored. It does not show information about the timing of processes, or information about whether processes will operate in sequence or in parallel (which is shown on a flowchart).
Overview
Data flow diagram - Yourdon/DeMarco notation.
It is common practice to draw the context-level data flow diagram first, which shows the interaction between the system and external agents which act as data sources and data sinks. On the context diagram the system's interactions with the outside world are modelled purely in terms of data flows across the system boundary. The context diagram shows the entire system as a single process, and gives no clues as to its internal organization.
This context-level DFD is next "exploded", to produce a Level 0 DFD that
shows some of the detail of the system being modeled. The Level 0 DFD
shows how the system is divided into sub-systems (processes), each of
which deals with one or more of the data flows to or from an external agent,
and which together provide all of the functionality of the system as a whole.
It also identifies internal data stores that must be present in order for the
system to do its job, and shows the flow of data between the various parts
of the system.
Data flow diagrams were proposed by Larry Constantine, the original developer of structured design,[3] based on Martin and Estrin's "data flow
graph" model of computation.
Data flow diagrams (DFDs) are one of the three essential perspectives of the structured-systems analysis and design
method SSADM. The sponsor of a project and the end users will need to be briefed and consulted throughout all
stages of a system's evolution. With a data flow diagram, users are able to visualize how the system will operate, what the system will accomplish, and how the system will be implemented. The old system's data flow diagrams can be drawn up and compared with the new system's to identify where a more efficient system can be implemented. Data flow diagrams can be used to give the end user a physical idea of where the data they input ultimately has an effect upon the structure of the whole system, from order to dispatch to report. How any system is developed can be determined through a data flow diagram.
In the course of developing a set of levelled data flow diagrams, the analyst/designer is forced to address how the system may be decomposed into component sub-systems, and to identify the transaction data in the data model.
There are different notations to draw data flow diagrams (Yourdon & Coad and Gane & Sarson[4]), defining different visual representations for processes, data stores, data flow, and external entities.[5]
Notes
[1] John Azzolini (2000). Introduction to Systems Engineering Practices (http://ses.gsfc.nasa.gov/ses_data_2000/000712_Azzolini.ppt). July 2000.
[2] Bruza, P. D., Van der Weide, Th. P., "The Semantics of Data Flow Diagrams", University of Nijmegen, 1993.
[3] W. Stevens, G. Myers, L. Constantine, "Structured Design", IBM Systems Journal, 13 (2), 115-139, 1974.
[4] Chris Gane and Trish Sarson. Structured Systems Analysis: Tools and Techniques. McDonnell Douglas Systems Integration Company, 1977.
[5] How to draw Data Flow Diagrams (http://www.smartdraw.com/tutorials/software/dfd/tutorial_01.htm)
Further reading
• P. D. Bruza and Th. P. van der Weide. "The Semantics of Data Flow Diagrams" (http://citeseer.ist.psu.edu/271116.html).
• Scott W. Ambler. The Object Primer 3rd Edition: Agile Model Driven Development with UML 2 (http://www.agilemodeling.com/artifacts/dataFlowDiagram.htm)
External links
• Case study " Current physical dataflow diagram for Acme Fashion Supplies (http:/ / www.cilco. co. uk/
briefing-studies/acme-fashion-supplies-feasibility-study/ slides/ top-level-dfd.html)" ..and accompanying
elementary process descriptions
• " Yourdon's chapter on DFDs (http:/ / www. yourdon.com/ strucanalysis/ wiki/ index.php?title=Chapter_9)"
• " DFD Examples and Summary (http:// www. excelsoftware.com/ processmodel. html)"
Data governance
Data governance is an emerging discipline with an evolving definition. The discipline embodies a convergence of
data quality, data management, data policies, business process management, and risk management surrounding the
handling of data in an organization. Through data governance, organizations are looking to exercise positive control
over the processes and methods used by their data stewards and data custodians to handle data.
Data governance is a set of processes that ensures that important data assets are formally managed throughout the
enterprise. Data governance ensures that data can be trusted and that people can be made accountable for any adverse
event that happens because of low data quality. It is about putting people in charge of fixing and preventing issues
with data so that the enterprise can become more efficient. Data governance also describes an evolutionary process
for a company, altering the company’s way of thinking and setting up the processes to handle information so that it
may be utilized by the entire organization. It’s about using technology when necessary in many forms to help aid the
process. When companies desire, or are required, to gain control of their data, they empower their people, set up
processes and get help from technology to do it.[1]
There are some commonly cited vendor definitions for data governance. Data governance is a quality control
discipline for assessing, managing, using, improving, monitoring, maintaining, and protecting organizational
information.[2] It is a system of decision rights and accountabilities for information-related processes, executed
according to agreed-upon models which describe who can take what actions with what information, and when, under
what circumstances, using what methods.[3]
Overview
Data governance encompasses the people, processes, and information technology required to create a consistent and
proper handling of an organization's data across the business enterprise. Goals may be defined at all levels of the
enterprise and doing so may aid in acceptance of processes by those who will use them. Some goals include:
• Increasing consistency and confidence in decision making
• Decreasing the risk of regulatory fines
• Improving data security
• Maximizing the income generation potential of data
• Designating accountability for information quality
• Enabling better planning by supervisory staff
• Minimizing or eliminating re-work
• Optimizing staff effectiveness
• Establishing process performance baselines to enable improvement efforts
• Acknowledging and holding all gains
These goals are realized by the implementation of Data governance programs, or initiatives using Change
Management techniques.
Data governance drivers
While data governance initiatives can be driven by a desire to improve data quality, they are more often driven by
C-Level leaders responding to external regulations. Examples of these regulations include Sarbanes-Oxley, Basel I,
Basel II, HIPAA, and a number of data privacy regulations. To achieve compliance with these regulations, business
processes and controls require formal management processes to govern the data subject to these regulations.
Successful programs identify drivers meaningful to both supervisory and executive leadership.
Common themes among the external regulations center on the need to manage risk. The risks can be financial
misstatement, inadvertent release of sensitive data, or poor data quality for key decisions. Methods to manage these
risks vary from industry to industry. Examples of commonly referenced best practices and guidelines include
COBIT, ISO/IEC 38500, and others. The proliferation of regulations and standards creates challenges for data
governance professionals, particularly when multiple regulations overlap the data being managed. Organizations
often launch data governance initiatives to address these challenges.
Data governance initiatives
Data governance initiatives improve data quality by assigning a team responsible for data's accuracy, accessibility,
consistency, and completeness, among other metrics. This team usually consists of executive leadership, project
management, line-of-business managers, and data stewards. The team usually employs some form of methodology
for tracking and improving enterprise data, such as Six Sigma, and tools for data mapping, profiling, cleansing, and
monitoring data.
Data governance initiatives may be aimed at achieving a number of objectives, including offering better visibility to internal and external customers (such as supply chain management), achieving compliance with regulatory law, improving operations after rapid company growth or corporate mergers, and aiding the efficiency of enterprise knowledge workers by reducing confusion and error and increasing their scope of knowledge. Many data governance initiatives
are also inspired by past attempts to fix information quality at the departmental level, leading to incongruent and
redundant data quality processes. Most large companies have many applications and databases that can't easily share
information. Therefore, knowledge workers within large organizations often don't have access to the information
they need to best do their jobs. When they do have access to the data, the data quality may be poor. By setting up a
data governance practice or Corporate Data Authority, these problems can be mitigated.
The structure of a data governance initiative will vary not only with the size of the organization, but with the desired
objectives or the 'focus areas'[4] of the effort.
Implementation
Implementation of a Data Governance initiative may vary in scope as well as origin. Sometimes, an executive
mandate will arise to initiate an enterprise wide effort, sometimes the mandate will be to create a pilot project or
projects, limited in scope and objectives, aimed at either resolving existing issues or demonstrating value. Sometimes
an initiative will originate lower down in the organization’s hierarchy, and will be deployed in a limited scope to
demonstrate value to potential sponsors higher up in the organization.
Data governance tools
Leaders of successful data governance programs declared in December 2006 at the Data Governance Conference in Orlando, Fl, that data governance is between 80 and 95 percent communication.[5] That said, it is a given that many of the objectives of a Data Governance program must be accomplished with appropriate tools. Many vendors are now positioning their products as Data Governance tools; due to the different focus areas of various data governance initiatives, any given tool may or may not be appropriate. In addition, many tools that are not marketed as governance tools address governance needs.[6]
Data governance organizations
The IBM Data Governance Council[7]
The IBM Data Governance Council is an organization formed by IBM consisting of companies, institutions and technology solution providers with the stated objective to build consistency and quality control in governance, which will help companies better protect critical data.
The Data Governance and Stewardship Community of Practice (DGS-COP)[8]
The Data Governance and Stewardship Community of Practice is a vendor-neutral organization open to
practitioners, stakeholders and academics, as well as vendors and consultants. The DGS-COP offers a large
collection of data governance artifacts to members including case studies, metrics, dashboards, and maturity
models as well as on-line events.
Data Governance Conferences
Two major conferences are held annually, the Data Governance Conference, held in the USA[9], and the Data Governance Conference Europe[10], held in London, England.
Master Data Management & Data Governance Conferences[11]
Six major conferences are held annually, London, San Francisco, Sydney and Toronto in the spring, and
Madrid, Frankfurt, and New York City in the fall. 2009 was the 4th annual iteration with more than 2,000
attendees per year receiving their data governance and master data management updates via this 2-3 day event.
Data Governance Professionals Organization (DGPO)[12]
The Data Governance Professionals Organization (DGPO) is a non-profit, vendor neutral, association of
business, IT and data professionals dedicated to advancing the discipline of data governance. The objective of
the DGPO is to provide a forum that fosters discussion and networking for members and to encourage,
develop and advance the skills of members working in the data governance discipline.
References
[1] Sarsfield, Steve (2009). "The Data Governance Imperative", IT Governance.
[2] "IBM Data Governance webpage" (http://www-306.ibm.com/software/tivoli/governance/servicemanagement/data-governance.html). Retrieved 2008-07-09.
[3] "Data Governance Institute Data Governance Framework" (http://datagovernance.com/dgi_framework.pdf).
[4] "Data Governance Focus Areas" (http://datagovernance.com/fc_focus_areas_for_data_governance.html).
[5] Hopwood, Peter (2008-06). "Data Governance: One Size Does Not Fit All" (http://www.webcitation.org/5bGHaz1gA). DM Review Magazine. Archived from the original (http://www.dmreview.com/issues/2007_48/10001356-1.html) on 2008-10-02. Retrieved 2008-10-02. "At the inaugural Data Governance Conference in Orlando, Florida, in December 2006, leaders of successful data governance programs declared that in their experience, data governance is between 80 and 95 percent communication. Clearly, data governance is not a typical IT project."
[6] "DataGovernanceSoftware.com" (http://www.webcitation.org/5bGI3dfHV). The Data Governance Institute. Archived from the original (http://www.datagovernancesoftware.com) on 2008-10-02. Retrieved 2008-10-02.
[7] IBM Data Governance (http://www-306.ibm.com/software/tivoli/governance/servicemanagement/data-governance.html)
[8] The Data Governance & Stewardship Community of Practice (http://www.datastewardship.com)
[9] Data Governance Conference (http://dg-conference.com/)
[10] Data Governance Conference Europe (http://www.irmuk.co.uk/dg2010/)
[11] MDM SUMMIT Conference (http://www.tcdii.com/events/cdimdmsummitseries.html)
[12] Data Governance Professionals Organization (http://www.dgpo.org/)
Data independence
Data independence is the type of data transparency that matters for a centralized DBMS. It refers to the immunity of user applications to changes made in the definition and organization of data.
Physical data independence deals with hiding the details of the storage structure from user applications. The
application should not be involved with these issues, since there is no difference in the operation carried out against
the data.
Data independence and operation independence together give the feature of data abstraction. There are two levels of data independence.
First level
The logical structure of the data is known as the schema definition. In general, if a user application operates on a
subset of the attributes of a relation, it should not be affected later when new attributes are added to the same
relation. Logical data independence indicates that the conceptual schema can be changed without affecting the
existing schemas.
Second level
The physical structure of the data is referred to as "physical data description". Physical data independence deals with
hiding the details of the storage structure from user applications. The application should not be involved with these
issues since, conceptually, there is no difference in the operations carried out against the data. There are two types of
data independence:
1. Logical data independence: The ability to change the logical (conceptual) schema without changing the External
schema (User View) is called logical data independence. For example, the addition or removal of new entities,
attributes, or relationships to the conceptual schema should be possible without having to change existing external
schemas or having to rewrite existing application programs.
2. Physical data independence: The ability to change the physical schema without changing the logical schema is
called physical data independence. For example, a change to the internal schema, such as using different file
organization or storage structures, storage devices, or indexing strategy, should be possible without having to
change the conceptual or external schemas.
Data Independence Types
Data independence has two types: physical independence and logical independence. With knowledge of the three-schema architecture, the term data independence can be explained as follows: each higher level of the data architecture is immune to changes of the next lower level of the architecture.
Physical independence: The logical scheme stays unchanged even though the storage space or type of some data is changed for reasons of optimisation or reorganisation. (The ability to change the physical schema without changing the logical schema is called physical data independence.)
Logical independence: The external scheme may stay unchanged for most changes of the logical scheme. This is especially desirable as the application software does not need to be modified or newly translated. (The ability to change the logical schema without changing the external schema or application programs is called logical data independence.)
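The following sketch (using Python's built-in sqlite3 module; the table, view, and column names are invented) illustrates the idea at the level of SQL views: an application written against an external view keeps working unchanged when a new attribute is added to the underlying conceptual schema.

```python
# A minimal sketch of logical data independence using SQL views: the
# application queries an external view, so adding a new attribute to the
# conceptual schema does not require changing the application query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("INSERT INTO employee (name, salary) VALUES ('Ada', 100.0)")

# External schema: the application only ever sees this view.
conn.execute("CREATE VIEW employee_names AS SELECT id, name FROM employee")

def application_query(db):
    # The application program is written against the external view only.
    return db.execute("SELECT name FROM employee_names").fetchall()

print(application_query(conn))          # [('Ada',)]

# Conceptual schema change: a new attribute is added to the relation.
conn.execute("ALTER TABLE employee ADD COLUMN department TEXT")

# The external view and the application query continue to work unchanged.
print(application_query(conn))          # still [('Ada',)]
```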
Data integration
Data integration involves combining data residing in different sources and providing users with a unified view of these data.[1] This process becomes significant in a variety of situations both commercial (when two similar
companies need to merge their databases) and scientific (combining research results from different bioinformatics
repositories, for example). Data integration appears with increasing frequency as the volume and the need to share
existing data explode.[2] It has become the focus of extensive theoretical work, and numerous open problems remain
unsolved. In management circles, people frequently refer to data integration as "Enterprise Information Integration"
(EII).
History
Figure 1: Simple schematic for a data warehouse. The ETL process extracts information from the source databases, transforms it and then loads it into the data warehouse.
Figure 2: Simple schematic for a data-integration solution. A system designer constructs a mediated schema against which users can run queries. The virtual database interfaces with the source databases via wrapper code if required.
Issues with combining heterogeneous data
sources under a single query interface have
existed for some time. The rapid adoption of
databases after the 1960s naturally led to the
need to share or to merge existing repositories.
This merging can take place at several levels in
the database architecture.[3] One popular solution
involves data warehousing (see figure 1). The
warehouse system extracts, transforms, and
loads data from heterogeneous sources into a
single common queriable schema so data
becomes compatible with each other. This
approach offers a tightly coupled architecture
because the data is already physically reconciled
in a single repository at query-time, so it usually
takes little time to resolve queries. Problems
with tight coupling can arise with the "freshness"
of data, which means that information in the warehouse
is not always up-to-date. Therefore, when an
original data source gets updated, the warehouse
still retains outdated data and the ETL process
needs re-execution for synchronization.
Difficulties also arise in constructing data
warehouses when one has only a query interface
to summary data sources and no access to the
full data. This problem frequently emerges when
integrating several commercial query services
like travel or classified advertisement web
applications.
As of 2009 the trend in data integration has
favored loosening the coupling between data.
This may involve providing a unified
query-interface to access real time data over a mediated schema (see figure 2), from which information can be
retrieved directly from original databases. This approach may need to specify mappings between the mediated
schema and the schema of original sources, and transform a query into specialized queries to match the schema of
the original databases. Therefore, this middleware architecture is also termed "view-based query-answering"
because each data source is represented as a view over the (nonexistent) mediated schema. Formally, computer
scientists label such an approach "Local As View" (LAV) — where "Local" refers to the local sources/databases. An
alternate model of integration has the mediated schema functioning as a view over the sources. This approach, called
"Global As View" (GAV) — where "Global" refers to the global (mediated) schema — has attractions due to the
simplicity of answering queries by means of the mediated schema. However, one must reconstitute the view for the
mediated schema whenever a new source gets integrated and/or an existing source modifies its schema.
As of 2010 some of the work in data integration research concerns the semantic integration problem. This problem
addresses not the structuring of the architecture of the integration, but how to resolve semantic conflicts between
heterogeneous data sources. For example if two companies merge their databases, certain concepts and definitions in
their respective schemas like "earnings" inevitably have different meanings. In one database it may mean profits in
dollars (a floating-point number), while in the other it might represent the number of sales (an integer). A common
strategy for the resolution of such problems involves the use of ontologies which explicitly define schema terms and
thus help to resolve semantic conflicts. This approach represents ontology-based data integration.
Example
Consider a web application where a user can query a variety of information about cities (such as crime statistics,
weather, hotels, demographics, etc.). Traditionally, the information must exist in a single database with a single
schema. But any single enterprise would find information of this breadth somewhat difficult and expensive to
collect. Even if the resources exist to gather the data, it would likely duplicate data in existing crime databases,
weather websites, and census data.
A data-integration solution may address this problem by considering these external resources as materialized views
over a virtual mediated schema, resulting in "virtual data integration". This means application-developers construct a
virtual schema — the mediated schema — to best model the kinds of answers their users want. Next, they design
"wrappers" or adapters for each data source, such as the crime database and weather website. These adapters simply
transform the local query results (those returned by the respective websites or databases) into an easily processed
form for the data integration solution (see figure 2). When an application-user queries the mediated schema, the
data-integration solution transforms this query into appropriate queries over the respective data sources. Finally, the
virtual database combines the results of these queries into the answer to the user's query.
This solution offers the convenience of adding new sources by simply constructing an adapter or an application
software blade for them. It contrasts with ETL systems or with a single database solution, which require manual
integration of the entire new dataset into the system. The virtual ETL solutions leverage virtual mediated schema to
implement data harmonization; whereby the data is copied from the designated "master" source to the defined
targets, field by field. Advanced Data virtualization is also built on the concept of object-oriented modeling in order
to construct virtual mediated schema or virtual metadata repository, using hub and spoke architecture.
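A toy version of this wrapper-based architecture is sketched below; the two "sources", their local schemas, and the mediated schema are invented for illustration and stand in for the crime database and weather website of the example.

```python
# Illustrative sketch only: a toy "virtual data integration" layer in the
# spirit of the example above. The sources, wrapper functions, and mediated
# schema are invented; a real system would query live databases and websites
# through adapters.
from typing import Dict

# Two heterogeneous "sources" with their own local schemas.
crime_db = [{"town": "Springfield", "incidents_per_1000": 42}]
weather_site = [{"location": "Springfield", "temp_f": 75, "conditions": "sunny"}]

def crime_wrapper(city: str) -> Dict:
    """Adapter: maps the crime source's local schema onto the mediated schema."""
    for row in crime_db:
        if row["town"] == city:
            return {"city": city, "crime_rate": row["incidents_per_1000"]}
    return {}

def weather_wrapper(city: str) -> Dict:
    """Adapter: maps the weather source's local schema onto the mediated schema."""
    for row in weather_site:
        if row["location"] == city:
            return {"city": city, "weather": row["conditions"]}
    return {}

def mediated_query(city: str) -> Dict:
    """Answer a query over the virtual mediated schema by combining wrapper results."""
    answer: Dict = {}
    for wrapper in (crime_wrapper, weather_wrapper):  # adding a source = adding a wrapper
        answer.update(wrapper(city))
    return answer

print(mediated_query("Springfield"))
# {'city': 'Springfield', 'crime_rate': 42, 'weather': 'sunny'}
```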
Theory of data integration
The theory of data integration[1] forms a subset of database theory and formalizes the underlying concepts of the
problem in first-order logic. Applying the theories gives indications as to the feasibility and difficulty of data
integration. While its definitions may appear abstract, they have sufficient generality to accommodate all manner of
integration systems.
Definitions
Data integration systems are formally defined as a triple ⟨G, S, M⟩, where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the global schemas. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users pose queries over the data integration system, they pose queries over G and the mapping then asserts connections between the elements in the global schema and the source schemas.
A database over a schema is defined as a set of sets, one for each relation (in a relational database). The database corresponding to the source schema S would comprise the set of sets of tuples for each of the heterogeneous data sources and is called the source database. Note that this single source database may actually represent a collection of disconnected databases. The database corresponding to the virtual mediated schema G is called the global database. The global database must satisfy the mapping M with respect to the source database. The legality of this mapping depends on the nature of the correspondence between G and S. Two popular ways to model this correspondence exist: Global as View or GAV and Local as View or LAV.
Figure 3: Illustration of tuple space of the GAV and LAV mappings.[4] In GAV, the system is constrained to the set of tuples mapped by the mediators, while the set of tuples expressible over the sources may be much larger and richer. In LAV, the system is constrained to the set of tuples in the sources, while the set of tuples expressible over the global schema can be much larger. Therefore LAV systems must often deal with incomplete answers.
GAV systems model the global database as a set of views over S. In this case M associates to each element of G a query over S. Query processing becomes a straightforward operation due to the well-defined associations between G and S. The burden of complexity falls on implementing mediator code instructing the data integration system exactly how to retrieve elements from the source databases. If any new sources join the system, considerable effort may be necessary to update the mediator, thus the GAV approach appears preferable when the sources seem unlikely to change.
In a GAV approach to the example data integration system above, the system designer would first develop mediators
for each of the city information sources and then design the global schema around these mediators. For example,
consider if one of the sources served a weather website. The designer would likely then add a corresponding element
for weather to the global schema. Then the bulk of effort concentrates on writing the proper mediator code that will
transform predicates on weather into a query over the weather website. This effort can become complex if some
other source also relates to weather, because the designer may need to write code to properly combine the results
from the two sources.
On the other hand, in LAV, the source database is modeled as a set of views over G. In this case M associates to each element of S a query over G. Here the exact associations between G and S are no longer well-defined. As is illustrated in the next section, the burden of determining how to retrieve elements from the sources is placed on the query processor. The benefit of an LAV modeling is that new sources can be added with far less work than in a GAV system, thus the LAV approach should be favored in cases where the mediated schema is more stable and unlikely to change.[1]
In an LAV approach to the example data integration system above, the system designer designs the global schema
first and then simply inputs the schemas of the respective city information sources. Consider again if one of the
sources serves a weather website. The designer would add corresponding elements for weather to the global schema
only if none existed already. Then programmers write an adapter or wrapper for the website and add a schema
description of the website's results to the source schemas. The complexity of adding the new source moves from the
designer to the query processor.
Query processing
The theory of query processing in data integration systems is commonly expressed using conjunctive queries.[5] One can loosely think of a conjunctive query as a logical function applied to the relations of a database, such as "f(A, B) where A < B". If a tuple or set of tuples is substituted into the rule and satisfies it (makes it true), then we consider that tuple as part of the set of answers in the query. While formal languages like Datalog express these queries concisely and without ambiguity, common SQL queries count as conjunctive queries as well.
In terms of data integration, "query containment" represents an important property of conjunctive queries. A query Q1 contains another query Q2 (denoted Q1 ⊃ Q2) if the results of applying Q2 are a subset of the results of applying Q1 for any database. The two queries are said to be equivalent if the resulting sets are equal for any database. This is important because in both GAV and LAV systems, a user poses conjunctive queries over a virtual schema represented by a set of views, or "materialized" conjunctive queries. Integration seeks to rewrite the queries represented by the views to make their results equivalent or maximally contained by our user's query. This corresponds to the problem of answering queries using views (AQUV).[6]
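The containment relationship can be illustrated with a small sketch (toy relations invented for the example): Q2 adds a conjunct to Q1, so on any database its answers form a subset of Q1's answers; the code below checks this for one sample instance.

```python
# A minimal sketch of query containment for two conjunctive queries over a
# tiny relational database (toy data, not from the article):
#   Q1(x) :- R(x)            -- all x in R
#   Q2(x) :- R(x), S(x)      -- all x in both R and S
# Q2 adds a conjunct, so its answers are a subset of Q1's: Q1 contains Q2.
R = {1, 2, 3, 4}
S = {2, 4, 6}

def q1(r: set) -> set:
    return set(r)

def q2(r: set, s: set) -> set:
    return {x for x in r if x in s}

answers_q1 = q1(R)       # {1, 2, 3, 4}
answers_q2 = q2(R, S)    # {2, 4}

# Containment holds for any database; here we check one sample instance.
print(answers_q2.issubset(answers_q1))   # True
```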
In GAV systems, a system designer writes mediator code to define the query-rewriting. Each element in the user's
query corresponds to a substitution rule just as each element in the global schema corresponds to a query over the
source. Query processing simply expands the subgoals of the user's query according to the rule specified in the
mediator and thus the resulting query is likely to be equivalent. While the designer does the majority of the work
beforehand, some GAV systems such as Tsimmis[7] involve simplifying the mediator description process.
In LAV systems, queries undergo a more radical process of rewriting because no mediator exists to align the user's
query with a simple expansion strategy. The integration system must execute a search over the space of possible
queries in order to find the best rewrite. The resulting rewrite may not be an equivalent query but maximally
contained, and the resulting tuples may be incomplete. As of 2009 the MiniCon algorithm[6] is the leading query
rewriting algorithm for LAV data integration systems.
In general, the complexity of query rewriting is NP-complete.[6] If the space of rewrites is relatively small this does
not pose a problem — even for integration systems with hundreds of sources.
References
[1] Maurizio Lenzerini (2002). "Data Integration: A Theoretical Perspective" (http://www.dis.uniroma1.it/~lenzerin/homepagine/talks/TutorialPODS02.pdf). PODS 2002. pp. 233–246.
[2] Frederick Lane (2006). "IDC: World Created 161 Billion Gigs of Data in 2006" (http://www.toptechnews.com/story.xhtml?story_id=01300000E3D0&full_skip=1).
[3] Patrick Ziegler and Klaus R. Dittrich (2004). "Three Decades of Data Integration - All Problems Solved?" (http://www.ifi.unizh.ch/stff/pziegler/papers/ZieglerWCC2004.pdf). WCC 2004. pp. 3–12.
[4] Christoph Koch (2001). Data Integration against Multiple Evolving Autonomous Schemata (http://www.csd.uoc.gr/~hy562/Papers/thesis_final.pdf).
[5] Jeffrey D. Ullman (1997). "Information Integration Using Logical Views" (http://www-db.stanford.edu/pub/papers/integration-using-views.ps). ICDT 1997. pp. 19–40.
[6] Alon Y. Halevy (2001). "Answering queries using views: A survey" (http://www.cs.uwaterloo.ca/~david/cs740/answering-queries-using-views.pdf). The VLDB Journal. pp. 270–294.
[7] http://www-db.stanford.edu/tsimmis/
External links
• Large Collection of Data Integration Projects (http://www.ifi.unizh.ch/~pziegler/IntegrationProjects.html)
Data library
A data library refers to both the content and the services that foster use of collections of numeric, audio-visual,
textual or geospatial data sets[1] for secondary use in research. (See below to view the definition from the Online
Dictionary for Library and Information Science.) A data library is normally part of a larger institution (academic,
corporate, scientific, medical, governmental, etc.) established to serve the data users of that organisation. The data
library tends to house local data collections and provides access to them through various means (CD-/DVD-ROMs
or central server for download). A data library may also maintain subscriptions to licensed data resources for its
users to access. Whether a data library is also considered a data archive may depend on the extent of unique holdings
in the collection, whether long-term preservation services are offered, and whether it serves a broader community (as
national data archives do).
Importance of data libraries and data librarianship
In August 2001, the Association of Research Libraries (ARL)[2] published SPEC Kit 263: Numeric Data Products and Services[3], presenting results from a survey of ARL member institutions involved in collecting and providing
services for numeric data resources.
A list of university data libraries[4] and similar organisations can be found on this page of IASSIST members'
organisational websites.
Services offered by data libraries and data librarians
A data library provides support at the institutional level for the use of numerical and other types of datasets in research. Amongst the support activities typically available are:
• Reference Assistance — locating numeric or geospatial datasets containing measurable variables on a particular
topic or group of topics, in response to a user query.
• User Instruction — providing hands-on training to groups of users in locating data resources on particular topics,
how to download data and read it into spreadsheet, statistical, database, or GIS packages, how to interpret
codebooks and other documentation.
• Technical Assistance - including easing registration procedures, troubleshooting problems with the dataset, such
as errors in the documentation, reformatting data into something a user can work with, and helping with statistical
methodology.
• Collection Development & Management - acquire, maintain, and manage a collection of data files used for
secondary analysis by the local user community; purchase institutional data subscriptions; act as a site
representative to data providers and national data archives for the institution.
• Preservation and Data Sharing Services - act on a strategy of preservation of datasets in the collection, such as
media refreshment and file format migration; download and keep records on updated versions from a central
archive. Also, assist users in preparing original data for secondary use by others; either for deposit in a central
archive or institutional repository, or for less formal ways of sharing data. This may also involve marking up the
data into an appropriate XML standard, such as the Data Documentation Initiative[5], or adding other metadata to
facilitate online discovery.
References
• Clubb, J., Austin, E., and Geda, C., "Sharing research data in the social sciences." In Sharing Research Data, S.
Fienberg, M. Martin, and M. Straf, Eds. National Academy Press, Washington, D.C., 1985, 39-88.
• Geraci, D., Humphrey, C., and Jacobs, J., Data Basics. Canadian Library Association, Ottawa, ON, forthcoming.
• Martinez, Luis and Macdonald, Stuart, "Supporting local data users in the UK academic community"[6]. Ariadne, issue 44, July 2005.
• Olken, Frank and Frederic Gey, Social Science Data Library Manifesto[7], 2006-02-14 v31.
• See the IASSIST Bibliography of Selected Works[8] for articles tracing the history of data libraries and their relationship to the archivist profession, going back to the 1960s and '70s up to 1996.
• See IASSIST Quarterly[9] articles from 1993 to the present, focusing on data libraries, data archives, data support, and information technology for the social sciences.
External links
Associations
• IASSIST[10] (International Association for Social Science Information and Service Technology)
• DISC-UK[11] (Data Information Specialists Committee — United Kingdom)
• APDU[12] (Association of Public Data Users - USA)
• CAPDU[13] (Canadian Association of Public Data Users)
References
[1] http://lu.com/odlis/odlis_d.cfm
[2] http://www.arl.org
[3] http://www.arl.org/bm~doc/spec263web.pdf
[4] http://www.iassistdata.org/tools/membersites.html
[5] http://www.icpsr.umich.edu/DDI/
[6] http://www.ariadne.ac.uk/issue44/martinez/
[7] http://hpcrd.lbl.gov/staff/olken/ssdl/ssdl_manifesto.html
[8] http://www.iassistdata.org/publications/bibliography.html
[9] http://www.iassistdata.org/publications/iq/index.html
[10] http://www.iassistdata.org/
[11] http://datalib.ed.ac.uk/discuk/
[12] http://www.apdu.org/
[13] http://www.capdu.ca/
Data maintenance
Data maintenance is the adding, deleting, changing and updating of binary and high-level files, and of the real-world data associated with those files. Data can be maintained manually and/or through an automated program, but at origination and at the translation/delivery point it must be translated into a binary representation for storage. Data is usually edited at a slightly higher level, in a format relevant to the content of the data (such as text, images, or scientific or financial information). Data maintenance also covers the backing up, storage and general upkeep of all this data in the long term.
Data management
Data management comprises all the disciplines related to managing data as a valuable resource.
Overview
The official definition provided by DAMA International, the professional organization for those in the data
management profession, is: "Data Resource Management is the development and execution of architectures, policies,
practices and procedures that properly manage the full data lifecycle needs of an enterprise." This definition is fairly broad and encompasses a number of professions which may not have direct
technical contact with lower-level aspects of data management, such as relational database management.
Alternatively, the definition provided in the DAMA Data Management Body of Knowledge (DAMA-DMBOK) is:
"Data management is the development, execution and supervision of plans, policies, programs and practices that
control, protect, deliver and enhance the value of data and information assets."[1]
The concept of "Data Management" arose in the 1980s as technology moved from sequential processing (first cards,
then tape) to random access processing. Since it was now technically possible to store a single fact in a single place
and access that using random access disk, those suggesting that "Data Management" was more important than
"Process Management" used arguments such as "a customer's home address is stored in 75 (or some other large
number) places in our computer systems." During this period, random access processing was not competitively fast,
so those suggesting "Process Management" was more important than "Data Management" used batch processing
time as their primary argument. As applications moved more and more into real-time, interactive applications, it
became obvious to most practitioners that both management processes were important. If the data was not well
defined, the data would be mis-used in applications. If the process wasn't well defined, it was impossible to meet
user needs.
Topics in Data Management
Topics in Data Management, grouped by the DAMA DMBOK Framework,[2] include:
1. Data governance
• Data asset
• Data governance
• Data steward
2. Data Architecture, Analysis and Design
• Data analysis
• Data architecture
• Data modeling
3. Database Management
• Data maintenance
• Database administration
• Database management system
4. Data Security Management
• Data access
• Data erasure
• Data privacy
• Data security
5. Data Quality Management
• Data cleansing
• Data integrity
• Data quality
• Data quality assurance
6. Reference and Master Data Management
• Data integration
• Master data management
• Reference data
7. Data Warehousing and Business Intelligence Management
• Business intelligence
• Data mart
• Data mining
• Data movement (extract, transform and load)
• Data warehousing
8. Document, Record and Content Management
• Document management system
• Records management
9. Meta Data Management
• Meta-data management
• Metadata
• Metadata discovery
• Metadata publishing
• Metadata registry
10. Contact Data Management
• Business continuity planning
• Marketing operations
• Customer data integration
• Identity management
• Identity theft
• Data theft
• ERP software
• CRM software
• Address (geography)
• Postal code
• Email address
• Telephone number
Body Of Knowledge
The DAMA Guide to the Data Management Body of Knowledge (DAMA-DMBOK Guide) was produced under the guidance of a new DAMA-DMBOK Editorial Board. This publication has been available since April 5, 2009.
Usage
In modern management usage, one can easily discern a trend away from the term 'data' in composite expressions to
the term information or even knowledge when talking in non-technical context. Thus there exists not only data
management, but also information management and knowledge management. This is a fairly detrimental tendency in
that it obscures the fact that, on closer inspection, it is usually plain, traditional data that is being managed or somehow processed. The extremely relevant distinction between data and derived values can be seen in the information
ladder. While data can exist as such, 'information' and 'knowledge' are always in the "eye" (or rather the brain) of the
beholder and can only be measured in relative units.
Notes
[1] http://www.dama.org/files/public/DI_DAMA_DMBOK_Guide_Presentation_2007.pdf "DAMA-DMBOK Guide (Data Management Body of Knowledge) Introduction & Project Status"
[2] http://www.dama.org/i4a/pages/index.cfm?pageid=3364 "DAMA-DMBOK Functional Framework"
External links
• Data management (http://www.dmoz.org/Computers/Software/Master_Data_Management/Articles/) at the Open Directory Project
Data management plan
A data management plan is a formal document that outlines how you will handle your data both during your
research, and after the project is completed.[1] The goal of a data management plan is to consider the many aspects
of data management, metadata generation, data preservation, and analysis before the project begins; this ensures that
data are well-managed in the present, and prepared for preservation in the future.
Importance
Preparing a data management plan before data are collected ensures that data are in the correct format, organized
well, and better annotated.[2] This saves time in the long term because there is no need to re-organize, re-format, or
try to remember details about data. It also increases research efficiency since both the data collector and other
researchers will be able to understand and use well-annotated data in the future. One component of a good data
management plan is data archiving and preservation. By deciding on an archive ahead of time, the data collector can
format data during collection to make its future submission to a database easier. If data are preserved, they are more
relevant since they can be re-used by other researchers. It also allows the data collector to direct requests for data to
the database, rather than address requests individually. Data that are preserved have the potential to lead to new,
unanticipated discoveries, and they prevent duplication of scientific studies that have already been conducted. Data
archiving also provides insurance against loss by the data collector.
Funding agencies are beginning to require data management plans as part of the proposal and evaluation process.[3]
Major Components
Information about data & data format
• Include a description of data to be produced by the project. This might include (but is not limited to) data that are:
• Experimental
• Observational
• Raw or derived
• Physical collections
• Models
• Simulations
• Curriculum materials
• Software
• Images
• How will the data be acquired? When and where will they be acquired?
• After collection, how will the data be processed? Include information about
• Software used
• Algorithms
• Scientific workflows
• Describe the file formats that will be used, justify those formats, and describe the naming conventions used.
• Identify the quality assurance & quality control measures that will be taken during sample collection, analysis,
and processing.
• If existing data are used, what are their origins? How will the data collected be combined with existing data?
What is the relationship between the data collected and existing data?
• How will the data be managed in the short-term? Consider the following:
• Version control for files
• Backing up data and data products
• Security & protection of data and data products
• Who will be responsible for management
Metadata content and format
Metadata are the contextual details, including any information important for using data. This may include
descriptions of temporal and spatial details, instruments, parameters, units, files, etc. Metadata is commonly referred
to as “data about data”
[4]
. Consider the following:
• What metadata are needed? Include any details that make data meaningful.
• How will the metadata be created and/or captured? Examples include lab notebooks, GPS hand-held units,
Auto-saved files on instruments, etc.
• What format will be used for the metadata? Consider the standard metadata commonly used in the scientific
discipline that contains your work. There should be justification for the format chosen.
Policies for access, sharing, and re-use
• Describe any obligations that exist for sharing data collected. These may include obligations from funding
agencies, institutions, other professional organizations, and legal requirements.
• Include information about how data will be shared, including when the data will be accessible, how long the data
will be available, how access can be gained, and any rights that the data collector reserves for using data.
• Address any ethical or privacy issues with data sharing
• Address intellectual property & copyright issues. Who owns the copyright? What are the institutional, publisher,
and/or funding agency policies associated with intellectual property? Are there embargoes for political,
commercial, or patent reasons?
• Describe the intended future uses/users for the data
• Indicate how the data should be cited by others. How will the issue of persistent citation be addressed? For
example, if the data will be deposited in a public archive, will the dataset have a digital object identifier (doi)
assigned to it?
Long-term storage and data management
• Researchers should identify an appropriate archive for long-term preservation of their data. By identifying the
archive early in the project, the data can be formatted, transformed, and documented appropriately to meet the
requirements of the archive. Researchers should consult colleagues and professional societies in their discipline to
determine the most appropriate database, and include a backup archive in their data management plan in case their
first choice goes out of existence.
• Early in the project, the primary researcher should identify what data will be preserved in an archive. Usually,
preserving the data in its most raw form is desirable, although data derivatives and products can also be preserved.
• An individual should be identified as the primary contact person for archived data, and ensure that contact
information is always kept up-to-date in case there are requests for data or information about data.
Budget
Data management and preservation costs may be considerable, depending on the nature of the project. By
anticipating costs ahead of time, researchers ensure that the data will be properly managed and archived. Potential
expenses that should be considered are:
• Personnel time for data preparation, management, documentation, and preservation
• Hardware and/or software needed for data management, backing up, security, documentation, and preservation
• Costs associated with submitting the data to an archive
The data management plan should include how these costs will be paid.
NSF Data Management Plan
All grant proposals submitted to NSF must include a Data Management Plan that is no more than two pages [5]. This
is a supplement (not part of the 15-page proposal) and should describe how the proposal will conform to the Award
and Administration Guide policy (see below). It may include the following:
1. The types of data
2. The standards to be used for data and metadata format and content
3. Policies for access and sharing
4. Policies and provisions for re-use
5. Plans for archiving data
Policy summarized from the NSF Award and Administration Guide, Section 4 (Dissemination and Sharing of
Research Results) [6]:
1. Promptly publish with appropriate authorship
2. Share data, samples, physical collections, and supporting materials with others, within a reasonable time frame
3. Share software and inventions
4. Investigators can keep their legal rights over their intellectual property, but they still have to make their results,
data, and collections available to others
5. Policies will be implemented via
1. Proposal review
2. Award negotiations and conditions
3. Support/incentives
References
[1] http://www2.lib.virginia.edu/brown/data/plan.html
[2] http://libraries.mit.edu/guides/subjects/data-management/why.html
[3] http://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp
[4] Michener, WK and JW Brunt. 2000. Ecological Data: Design, Management and Processing. Blackwell Science, 180p.
[5] http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#dmp
[6] http://www.nsf.gov/bfa/dias/policy/dmp.jsp
External links
• DataONE http://www.dataone.org/plans
• University of Virginia Library http://www2.lib.virginia.edu/brown/data/plan.html
• Digital Curation Centre http://www.dcc.ac.uk/resources/data-management-plans
• University of Michigan Library http://www.lib.umich.edu/research-data-management-and-publishing-support/nsf-data-management-plans#directorate_guide
• NSF Grant Proposal Guidelines http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#dmp
• Inter-University Consortium for Political and Social Research http://www.icpsr.umich.edu/icpsrweb/ICPSR/dmp/index.jsp
• LTER Blog: How to write a data management plan http://lno.lternet.edu/node/269
Data mapping
Data mapping is the process of creating data element mappings between two distinct data models. Data mapping is
used as a first step for a wide variety of data integration tasks including:
• Data transformation or data mediation between a data source and a destination
• Identification of data relationships as part of data lineage analysis
• Discovery of hidden sensitive data, such as the last four digits of a social security number embedded in another
user ID, as part of a data masking or de-identification project
• Consolidation of multiple databases into a single database, and identification of redundant columns of data for
consolidation or elimination
For example, a company that would like to transmit and receive purchases and invoices with other companies might
use data mapping to create data maps from its own data to standardized ANSI ASC X12 messages for items such as
purchase orders and invoices.
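As a minimal sketch of the idea (the field names are hypothetical and no real X12 message layout is reproduced
here), a data map can be represented as a table that pairs each source data element with a destination element and
an optional conversion, which is then applied to each record:

    # A minimal sketch of a data map: each source field is paired with a
    # destination field and a conversion. Field names are hypothetical.
    FIELD_MAP = {
        "cust_name":   ("BuyerName", str.strip),
        "po_number":   ("PurchaseOrderNumber", str),
        "order_total": ("TotalAmount", lambda v: round(float(v), 2)),
    }

    def apply_map(source_record):
        """Transform a source record into the destination model using FIELD_MAP."""
        return {dest: convert(source_record[src])
                for src, (dest, convert) in FIELD_MAP.items()}

    print(apply_map({"cust_name": " Acme Corp ", "po_number": "4711", "order_total": "199.5"}))
    # {'BuyerName': 'Acme Corp', 'PurchaseOrderNumber': '4711', 'TotalAmount': 199.5}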
Standards
X12 standards are generic Electronic Data Interchange (EDI) standards designed to allow a company to exchange
data with any other company, regardless of industry. The standards are maintained by the Accredited Standards
Committee X12 (ASC X12), with the American National Standards Institute (ANSI) accredited to set standards for
EDI. The X12 standards are often called ANSI ASC X12 standards.
In the future, tools based on semantic web languages such as the Resource Description Framework (RDF), the Web
Ontology Language (OWL), and standardized metadata registries may make data mapping a more automatic process.
This process would be accelerated if each application published its metadata. Fully automated data mapping is a
very difficult problem (see Semantic translation).
Hand-coded, graphical manual
Data mappings can be done in a variety of ways using procedural code, creating XSLT transforms or by using
graphical mapping tools that automatically generate executable transformation programs. These are graphical tools
that allow a user to "draw" lines from fields in one set of data to fields in another. Some graphical data mapping tools
allow users to "Auto-connect" a source and a destination. This feature depends on the source and destination data
element names being the same. Transformation programs are automatically created in SQL, XSLT, the Java
programming language, or C++. These kinds of graphical tools are found in most ETL (Extract, Transform, Load)
tools as the primary means of entering data maps to support data movement.
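A minimal sketch of the "Auto-connect" behaviour described above, assuming nothing more than two lists of field
names: fields whose names are identical are paired automatically, and the remainder are left for the user to map by
hand.

    # Sketch of the "auto-connect" idea: pair up source and destination fields
    # whose names are identical, and report those left unmatched.
    def auto_connect(source_fields, destination_fields):
        matches = {name: name for name in source_fields if name in destination_fields}
        unmatched = [name for name in source_fields if name not in destination_fields]
        return matches, unmatched

    src = ["FirstName", "LastName", "DOB"]
    dst = ["FirstName", "LastName", "BirthDate"]
    print(auto_connect(src, dst))
    # ({'FirstName': 'FirstName', 'LastName': 'LastName'}, ['DOB'])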
Data-driven mapping
This is the newest approach in data mapping and involves simultaneously evaluating actual data values in two data
sources using heuristics and statistics to automatically discover complex mappings between two data sets. This
approach is used to find transformations between two data sets and will discover substrings, concatenations,
arithmetic, case statements as well as other kinds of transformation logic. This approach also discovers data
exceptions that do not follow the discovered transformation logic.
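A minimal sketch of the data-driven idea, using made-up column values: the candidate rule "target column C is the
concatenation of source columns A and B" is tested against the actual data, and rows that do not follow the
discovered rule are reported as exceptions. A real tool would try many candidate transformations and use stronger
statistics; the acceptance threshold below is purely illustrative.

    # Sketch of a data-driven check: does target column C look like a
    # concatenation of source columns A and B?  Rows that do not follow the
    # discovered rule are reported as exceptions.  Column data are illustrative.
    source = [{"A": "Jane", "B": "Doe"}, {"A": "Li", "B": "Wang"}, {"A": "Sam", "B": "Lee"}]
    target = [{"C": "Jane Doe"}, {"C": "Li Wang"}, {"C": "S. Lee"}]

    def matches_rule(src_row, tgt_row):
        return tgt_row["C"] == f'{src_row["A"]} {src_row["B"]}'

    hits = sum(matches_rule(s, t) for s, t in zip(source, target))
    if hits / len(source) >= 0.66:          # simple heuristic threshold
        exceptions = [t for s, t in zip(source, target) if not matches_rule(s, t)]
        print("Discovered rule: C = A + ' ' + B; exceptions:", exceptions)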
Semantic mapping
Semantic mapping is similar to the auto-connect feature of data mappers with the exception that a metadata registry
can be consulted to look up data element synonyms. For example, if the source system lists FirstName but the
destination lists PersonGivenName, the mappings will still be made if these data elements are listed as synonyms in
the metadata registry. Semantic mapping is only able to discover exact matches between columns of data and will
not discover any transformation logic or exceptions between columns.
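A minimal sketch of semantic mapping, assuming a hypothetical registry of synonym groups: FirstName is matched
to PersonGivenName because both appear in the same synonym group, even though the column names differ.

    # Sketch of semantic mapping: a (hypothetical) metadata registry listing
    # synonym groups is consulted so that FirstName maps to PersonGivenName
    # even though the column names differ.
    REGISTRY = [{"FirstName", "PersonGivenName", "GivenName"},
                {"LastName", "PersonSurName", "FamilyName"}]

    def semantic_match(source_field, destination_fields):
        for group in REGISTRY:
            if source_field in group:
                for candidate in destination_fields:
                    if candidate in group:
                        return candidate
        return None

    print(semantic_match("FirstName", ["PersonGivenName", "PersonSurName"]))
    # PersonGivenName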
References
• Bogdan Alexe, Laura Chiticariu, Renée J. Miller, Wang Chiew Tan: Muse: Mapping Understanding and deSign by Example [1]. ICDE 2008: 10-19
• Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, Cornelia Hedeler: Feedback-Based Annotation, Selection and Refinement of Schema Mappings for Dataspaces [2]. EDBT 2010: 573-584
• Laura Chiticariu, Wang Chiew Tan: Debugging Schema Mappings with Routes [3]. VLDB 2006: 79-90
• Ronald Fagin, Laura M. Haas, Mauricio A. Hernández, Renée J. Miller, Lucian Popa, Yannis Velegrakis: Clio: Schema Mapping Creation and Data Exchange [4]. Conceptual Modeling: Foundations and Applications 2009: 198-236
• Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, Lucian Popa: Data exchange: semantics and query answering [5]. Theor. Comput. Sci. 336(1): 89-124 (2005)
• Maurizio Lenzerini: Data Integration: A Theoretical Perspective [6]. PODS 2002: 233-246
• Renée J. Miller, Laura M. Haas, Mauricio A. Hernández: Schema Mapping as Query Discovery [7]. VLDB 2000: 77-88
References
[1] http://dx.doi.org/10.1109/ICDE.2008.4497409
[2] http://www.edbt.org/Proceedings/2010-Lausanne/edbt/papers/p0573-Belhajjame.pdf
[3] http://www.vldb.org/conf/2006/p79-chiticariu.pdf
[4] http://dx.doi.org/10.1007/978-3-642-02463-4_12
[5] http://dx.doi.org/10.1016/j.tcs.2004.10.033
[6] http://www.acm.org/sigs/sigmod/pods/proc02/papers/233-Lenzerini.pdf
[7] http://www.informatik.uni-trier.de/~ley/db/conf/vldb/MillerHH00.html
Data migration
Data migration is the process of transferring data between storage types, formats, or computer systems. Data
migration is usually performed programmatically to achieve an automated migration, freeing up human resources
from tedious tasks. It is required when organizations or individuals change computer systems or upgrade to new
systems, or when systems merge (such as when the organizations that use them undergo a merger or takeover).
To achieve an effective data migration procedure, data on the old system is mapped to the new system, providing a
design for data extraction and data loading. The design relates old data formats to the new system's formats and
requirements. Programmatic data migration may involve many phases but it minimally includes data extraction
where data is read from the old system and data loading where data is written to the new system.
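A minimal sketch of such a programmatic migration, assuming purely for illustration that the old system exports a
CSV file and the new system accepts a CSV load file with different field names; the mapping table relates the old
format to the new one.

    import csv

    # Minimal sketch: extract from the old system (a CSV export here, purely
    # for illustration), map old field names to the new system's format, and
    # load into the new system (a second CSV file). File names are hypothetical.
    OLD_TO_NEW = {"cust_no": "customer_id", "cust_nm": "customer_name"}

    def extract(path):
        with open(path, newline="") as fh:
            return list(csv.DictReader(fh))

    def transform(rows):
        return [{OLD_TO_NEW[k]: v for k, v in row.items() if k in OLD_TO_NEW}
                for row in rows]

    def load(rows, path):
        with open(path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(OLD_TO_NEW.values()))
            writer.writeheader()
            writer.writerows(rows)

    # Example usage (assumes the export file exists):
    # load(transform(extract("old_system_export.csv")), "new_system_load.csv")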
If a fixed input file specification has been defined for loading data onto the target system, this allows a pre-load
'data validation' step to be put in place, interrupting the standard E(T)L process. Such a data validation process
can be designed to interrogate the data to be transferred, to ensure that it meets the predefined criteria of the
target environment and the input file specification. An alternative strategy is to perform data validation on the fly
at the point of loading, reporting load rejection errors as the load progresses. However, if the extracted and
transformed data elements are highly interdependent, and the presence of all extracted data in the target system is
essential to system functionality, this strategy can have detrimental and not easily quantifiable effects.
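A minimal sketch of a pre-load validation step, assuming a hypothetical input file specification expressed as
per-field rules: each record is checked before loading, and rejects are reported rather than loaded.

    # Sketch of pre-load validation: each record is checked against the
    # (hypothetical) input file specification of the target system before any
    # load is attempted, and rejects are reported rather than loaded.
    SPEC = {
        "customer_id":   lambda v: v.isdigit(),
        "customer_name": lambda v: 0 < len(v) <= 100,
        "country_code":  lambda v: len(v) == 2 and v.isalpha(),
    }

    def validate(record):
        return [field for field, rule in SPEC.items()
                if field not in record or not rule(record[field])]

    records = [{"customer_id": "42", "customer_name": "Acme", "country_code": "US"},
               {"customer_id": "n/a", "customer_name": "", "country_code": "USA"}]

    for rec in records:
        errors = validate(rec)
        print("REJECT" if errors else "OK", rec, errors)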
After loading into the new system, results are subjected to data verification to determine whether data was accurately
translated, is complete, and supports processes in the new system. During verification, there may be a need for a
parallel run of both systems to identify areas of disparity and forestall erroneous data loss.
Automated and manual data cleaning is commonly performed in migration to improve data quality, eliminate
redundant or obsolete information, and match the requirements of the new system.
Data migration phases (design, extraction, cleansing, load, verification) for applications of moderate to high
complexity are commonly repeated several times before the new system is deployed.
Categories
Data is stored on various media in files or databases, and is generated and consumed by software applications which
in turn support business processes. The need to transfer and convert data can be driven by multiple business
requirements and the approach taken to the migration depends on those requirements. Four major migration
categories are proposed on this basis.
Storage migration
A business may choose to rationalize the physical media to take advantage of more efficient storage technologies.
This will result in having to move physical blocks of data from one tape or disk to another, often using virtualization
techniques. The data format and content itself will not usually be changed in the process, and the migration can
normally be achieved with minimal or no impact to the layers above.
Database migration
Similarly, it may be necessary to move from one database vendor to another, or to upgrade the version of database
software being used. The latter case is less likely to require a physical data migration, but this can happen with major
upgrades. In these cases a physical transformation process may be required since the underlying data format can
change significantly. This may or may not affect behaviour in the applications layer, depending largely on whether
the data manipulation language or protocol has changed – but modern applications are written to be agnostic to the
database technology so that a change from Sybase, MySQL, DB2 or SQL Server to Oracle should only require a
testing cycle to be confident that both functional and non-functional performance has not been adversely affected.
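A minimal sketch of what such database agnosticism looks like in application code, using Python's standard DB-API
with the sqlite3 module purely as an illustration: the query logic is written against the generic interface, so moving
to another vendor's DB-API driver would ideally change only the connection setup (SQL dialect differences
permitting).

    import sqlite3

    # Application logic written against Python's DB-API: swapping the driver
    # (sqlite3 here, purely for the example) for another vendor's DB-API driver
    # would ideally change only the connect() call, not this function.
    def fetch_customer_names(conn):
        cur = conn.cursor()
        cur.execute("SELECT name FROM customers ORDER BY name")
        return [row[0] for row in cur.fetchall()]

    # Set up a throwaway in-memory database so the sketch is self-contained.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT)")
    conn.executemany("INSERT INTO customers (name) VALUES (?)", [("Acme",), ("Globex",)])
    print(fetch_customer_names(conn))
    # ['Acme', 'Globex']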
Application migration
Changing application vendor – for instance a new CRM or ERP platform – will inevitably involve substantial
transformation as almost every application or suite operates on its own specific data model. Further, to allow the
application to be sold to the widest possible market, commercial off-the-shelf packages are generally configured for
each customer using metadata. Application programming interfaces (APIs) are supplied to protect the integrity of the
data they have to handle. Use of the API is normally a condition of the software warranty, although a waiver may be
allowed if the vendor's own or certified partner professional services and tools are used.
Business process migration
Business processes operate through a combination of human and application syste