
Business analytics

From Wikipedia, the free encyclopedia


Not to be confused with Business analysis.

Business analytics (BA) refers to the skills, technologies, and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning.[1]
Business analytics focuses on developing new insights and understanding of business
performance based on data and statistical methods. In contrast, business intelligence traditionally
focuses on using a consistent set of metrics to both measure past performance and guide business
planning, which is also based on data and statistical methods.
Business analytics makes extensive use of data, statistical and quantitative analysis, explanatory and predictive modeling,[2] and fact-based management to drive decision making. It is therefore closely related to management science. Analytics may be used as input for human decisions or may drive fully automated decisions. Business intelligence is querying, reporting, online analytical processing (OLAP), and "alerts."
In other words, querying, reporting, OLAP, and alert tools can answer questions such as what happened, how many, how often, where the problem is, and what actions are needed. Business analytics can answer questions like why is this happening, what if these trends continue, what will happen next (that is, predict), and what is the best that can happen (that is, optimize).[3]

Contents
1 Examples of application
2 Types of analytics
3 Basic domains within analytics
4 History
5 Challenges
6 Competing on analytics
7 See also
8 References
9 Further reading
Examples of application
Banks, such as Capital One, use data analysis (or analytics, as it is also called in the business setting) to differentiate among customers based on credit risk, usage and other characteristics, and then to match customer characteristics with appropriate product offerings. Harrah's, the gaming firm, uses analytics in its customer loyalty programs. E & J Gallo Winery quantitatively analyzes and predicts the appeal of its wines. Between 2002 and 2005, Deere & Company saved more than $1 billion by employing a new analytical tool to better optimize inventory.[3]

Types of analytics
Decisive analytics: supports human decisions with visual analytics that the user models to reflect reasoning.
Descriptive analytics: gains insight from historical data with reporting, scorecards, clustering, etc.
Predictive analytics: predictive modeling using statistical and machine learning techniques.
Prescriptive analytics: recommends decisions using optimization, simulation, etc.
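To make the distinction concrete, here is a minimal Python sketch; the sales figures, the naive trend forecast, and the safety-margin rule are illustrative assumptions only, not methods endorsed by the sources cited here.

```python
# Illustration of descriptive, predictive and prescriptive questions on
# hypothetical monthly sales data.
from statistics import mean

monthly_sales = [120, 132, 128, 141, 150, 158]  # invented figures

# Descriptive: what happened?
print("average monthly sales:", mean(monthly_sales))

# Predictive: what will happen next? (naive linear trend standing in for a
# real statistical or machine-learning model)
trend = (monthly_sales[-1] - monthly_sales[0]) / (len(monthly_sales) - 1)
forecast = monthly_sales[-1] + trend
print("forecast for next month:", forecast)

# Prescriptive: what should we do? (toy rule standing in for optimization or
# simulation: stock the forecast plus a safety margin)
safety_margin = 0.10
print("recommended stock level:", round(forecast * (1 + safety_margin)))
```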
Basic domains within analytics
Behavioral analytics
Cohort Analysis
Collections analytics
Contextual data modeling - supports the human reasoning that occurs after viewing
"executive dashboards" or any other visual analytics
Financial services analytics
Fraud analytics
Marketing analytics
Pricing analytics
Retail sales analytics
Risk & Credit analytics
Supply Chain analytics
Talent analytics
Telecommunications
Transportation analytics
History
Analytics have been used in business since management exercises were put into place by Frederick Winslow Taylor in the late 19th century. Henry Ford measured the time of each component in his newly established assembly line. But analytics began to command more attention in the late 1960s when computers were used in decision support systems. Since then, analytics have changed and formed with the development of enterprise resource planning (ERP) systems, data warehouses, and a large number of other software tools and processes.[3]

In later years, business analytics expanded rapidly with the widespread adoption of computers. This change has taken analytics to a new level and greatly broadened its possibilities. Given how far the field has come, many people would never guess that analytics began in the early 1900s with Ford himself.
Challenges
Business analytics depends on sufficient volumes of high quality data. The difficulty in ensuring data quality is integrating and reconciling data across different systems, and then deciding what subsets of data to make available.[3]

Previously, analytics was considered a type of after-the-fact method of forecasting consumer behavior by examining the number of units sold in the last quarter or the last year. This type of data warehousing required a lot more storage space than it did speed. Now business analytics is becoming a tool that can influence the outcome of customer interactions.[4] When a specific customer type is considering a purchase, an analytics-enabled enterprise can modify the sales pitch to appeal to that consumer. This means the systems storing all that data must respond extremely fast to provide the necessary data in real time.
Competing on analytics
Thomas Davenport, professor of information technology and management at Babson College, argues that businesses can optimize a distinct business capability via analytics and thus better compete. He identifies these characteristics of an organization that is apt to compete on analytics:[3]
One or more senior executives who strongly advocate fact-based decision making and,
specifically, analytics
Widespread use of not only descriptive statistics, but also predictive modeling and
complex optimization techniques
Substantial use of analytics across multiple business functions or processes
Movement toward an enterprise level approach to managing analytical tools, data, and
organizational skills and capabilities
Balanced scorecard
From Wikipedia, the free encyclopedia
The balanced scorecard (BSC) is a strategy performance management tool - a semi-standard structured report, supported by design methods and automation tools, that can be used by managers to keep track of the execution of activities by the staff within their control and to monitor the consequences arising from these actions.[1]

The critical characteristics that define a Balanced Scorecard are:[2]

its focus on the strategic agenda of the organization concerned
the selection of a small number of data items to monitor
a mix of financial and non-financial data items.
Contents
1 Use
2 History
3 Characteristics
4 Design
o 4.1 First Generation Balanced Scorecard
o 4.2 Second Generation Balanced Scorecard
o 4.3 Third Generation Balanced Scorecard
5 Popularity
6 Variants
7 Criticism
8 Software tools
9 See also
10 References
Use
Balanced Scorecard is an example of a closed-loop controller or cybernetic control applied to the management of the implementation of a strategy.[3] Closed-loop or cybernetic control is where actual performance is measured, the measured value is compared to an expected value, and, based on the difference between the two, corrective interventions are made as required. Such control requires three things to be effective - a choice of data to measure, the setting of an expected value for the data, and the ability to make a corrective intervention.[3]
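As a rough illustration of this closed-loop idea, the following Python sketch compares measured values with expected values and flags measures whose gap exceeds a tolerance; the measure names, figures, and tolerances are invented for illustration.

```python
# Minimal sketch of closed-loop (cybernetic) control: measure, compare with an
# expected value, and flag the need for a corrective intervention.
def review(measure_name, actual, expected, tolerance):
    gap = actual - expected
    if abs(gap) > tolerance:
        # In practice the "corrective intervention" is a management action,
        # not a line of code; here we simply flag the measure for attention.
        return f"{measure_name}: gap of {gap:+.1f} exceeds tolerance, intervene"
    return f"{measure_name}: within tolerance"

# Invented example figures:
print(review("on-time delivery (%)", actual=87.0, expected=95.0, tolerance=3.0))
print(review("training hours per employee", actual=11.5, expected=10.0, tolerance=2.0))
```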

Within the strategy management context, all three of these characteristic closed-loop control elements need to be derived from the organisation's strategy and also need to reflect the ability of the observer to both monitor performance and subsequently intervene - both of which may be constrained.[4]

Two of the ideas that underpin modern Balanced Scorecard designs concern facilitating the creation of such a control - through making it easier to select which data to observe, and ensuring that the choice of data is consistent with the ability of the observer to intervene.[5]

History
Organizations have used systems consisting of a mix of financial and non-financial measures to track progress for quite some time.[6] One such system was created by Art Schneiderman in 1987 at Analog Devices, a mid-sized semi-conductor company: the Analog Devices Balanced Scorecard.[7] Schneiderman's design was similar to what is now recognised as a "First Generation" Balanced Scorecard design.[5]

In 1990 Art Schneiderman participated in an unrelated research study led by Dr. Robert S. Kaplan in conjunction with US management consultancy Nolan-Norton, and during this study described his work on performance measurement.[7] Subsequently, Kaplan and David P. Norton included anonymous details of this balanced scorecard design in a 1992 article.[8] Kaplan and Norton's article wasn't the only paper on the topic published in early 1992,[9] but the 1992 Kaplan and Norton paper was a popular success, and was quickly followed by a second in 1993.[10] In 1996, the two authors published a book, The Balanced Scorecard.[11] These articles and the first book spread knowledge of the concept of balanced scorecard widely, and have led to Kaplan and Norton being seen as the creators of the concept.
While the "balanced scorecard" terminology was coined by Art Schneiderman, the roots of
performance management as an activity run deep in management literature and practice.
Management historians such as Alfred Chandler suggest the origins of performance management
can be seen in the emergence of the complex organisation - most notably during the 19th Century
in the USA.
[12]
More recent influences may include the pioneering work of General Electric on
performance measurement reporting in the 1950s and the work of French process engineers (who
created the tableau de bord literally, a "dashboard" of performance measures) in the early part
of the 20th century.
[6]
The tool also draws strongly on the ideas of the 'resource based view of the
firm'
[13]
proposed by Edith Penrose. However it should be noted that none of these influences is
explicitly linked to original descriptions of balanced scorecard by Schneiderman, Maisel, or
Kaplan & Norton.
Kaplan and Norton's first book[11] remains their most popular. The book reflects the earliest incarnations of balanced scorecards - effectively restating the concept as described in the second Harvard Business Review article.[10] Their second book, The Strategy Focused Organization,[14] echoed work by others (particularly a book published the year before by Olve et al. in Scandinavia[15]) on the value of visually documenting the links between measures by proposing the "Strategic Linkage Model" or strategy map.
As the title of Kaplan and Norton's second book[14] highlights, even by 2000 the focus of attention among thought-leaders was moving from the design of Balanced Scorecards themselves, towards the use of Balanced Scorecard as a focal point within a more comprehensive strategic management system. Subsequent writing on Balanced Scorecard by Kaplan & Norton has focused on uses of Balanced Scorecard rather than its design (e.g. "The Execution Premium" in 2008[16]); however, many others have continued to refine the device itself (e.g. Abernethy et al.[17]).
Characteristics
The defining characteristic of the balanced scorecard and its derivatives is the presentation of a mixture of financial and non-financial measures, each compared to a 'target' value, within a single concise report. The report is not meant to be a replacement for traditional financial or operational reports but a succinct summary that captures the information most relevant to those reading it. It is the method by which this 'most relevant' information is determined (i.e., the design processes used to select the content) that most differentiates the various versions of the tool in circulation. The balanced scorecard indirectly also provides a useful insight into an organisation's strategy - by requiring general strategic statements (e.g. mission, vision) to be precipitated into more specific / tangible forms.[18]

The first versions of balanced scorecard asserted that relevance should derive from the corporate
strategy, and proposed design methods that focused on choosing measures and targets associated
with the main activities required to implement the strategy. As the initial audience for this were
the readers of the Harvard Business Review, the proposal was translated into a form that made
sense to a typical reader of that journal - managers of US commercial businesses. Accordingly,
initial designs were encouraged to measure three categories of non-financial measure in addition
to financial outputs - those of "customer," "internal business processes" and "learning and
growth." These categories were not so relevant to non-profits or units within complex
organizations (which might have high degrees of internal specialization), and much of the early
literature on balanced scorecard focused on suggestions of alternative 'perspectives' that might
have more relevance to these groups.
Modern balanced scorecards have evolved since the initial ideas proposed in the late 1980s and early 1990s, and the modern performance management tools including Balanced Scorecard are significantly improved - being more flexible (to suit a wider range of organisational types) and more effective (as design methods have evolved to make them easier to design, and use).[19]

Design
Design of a balanced scorecard is about the identification of a small number of financial and non-financial measures and attaching targets to them, so that when they are reviewed it is possible to determine whether current performance 'meets expectations'. By alerting managers to areas where performance deviates from expectations, they can be encouraged to focus their attention on these areas, and hopefully as a result trigger improved performance within the part of the organization they lead.[3]

The original thinking behind a balanced scorecard was for it to be focused on information
relating to the implementation of a strategy, and over time there has been a blurring of the
boundaries between conventional strategic planning and control activities and those required to
design a Balanced Scorecard. This is illustrated well by the four steps required to design a
balanced scorecard included in Kaplan & Norton's writing on the subject in the late 1990s:
1. Translating the vision into operational goals;
2. Communicating the vision and linking it to individual performance;
3. Business planning; index setting;
4. Feedback and learning, and adjusting the strategy accordingly.
These steps go far beyond the simple task of identifying a small number of financial and non-
financial measures, but illustrate the requirement for whatever design process is used to fit within
broader thinking about how the resulting Balanced Scorecard will integrate with the wider
business management process.
Although it helps focus managers' attention on strategic issues and the management of the implementation of strategy, it is important to remember that the Balanced Scorecard itself has no role in the formation of strategy.[5] In fact, balanced scorecards can co-exist with strategic planning systems and other tools.[6]

First Generation Balanced Scorecard
The first generation of Balanced Scorecard designs used a "4 perspective" approach to identify what measures to use to track the implementation of strategy. The original four "perspectives" proposed[8] were:
Financial: encourages the identification of a few relevant high-level financial measures. In particular, designers were encouraged to choose measures that helped inform the answer to the question "How do we look to shareholders?" Examples: cash flow, sales growth, operating income, return on equity.[20]

Customer: encourages the identification of measures that answer the question "How do customers see us?" Examples: percent of sales from new products, on-time delivery, share of important customers' purchases, ranking by important customers.
Internal business processes: encourages the identification of measures that answer the
question "What must we excel at?" Examples: cycle time, unit cost, yield, new product
introductions.
Learning and growth: encourages the identification of measures that answer the question "How
can we continue to improve, create value and innovate?". Examples: time to develop new
generation of products, life cycle to product maturity, time to market versus competition.
The idea was that managers used these perspective headings to prompt the selection of a small number of measures that informed on that aspect of the organisation's strategic performance.[8]
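A minimal sketch of how such a first-generation scorecard might be organised as a data structure - measures grouped under the four original perspectives, each paired with a target - is shown below; the specific measures and targets are invented for illustration.

```python
# A small scorecard: measures grouped under the four original perspectives,
# each with an illustrative (invented) target value.
scorecard = {
    "Financial": {"sales growth (%)": 8.0, "return on equity (%)": 12.0},
    "Customer": {"on-time delivery (%)": 95.0, "sales from new products (%)": 20.0},
    "Internal business processes": {"cycle time (days)": 5.0, "yield (%)": 98.0},
    "Learning and growth": {"time to market vs. competition (ratio)": 1.0},
}

for perspective, measures in scorecard.items():
    print(perspective)
    for measure, target in measures.items():
        print(f"  {measure}: target {target}")
```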

The perspective headings show that Kaplan and Norton were thinking about the needs of non-divisional commercial organisations in their initial design. These headings are not very helpful to other kinds of organisations (e.g. multi-divisional or multi-national commercial organisations, governmental organisations, non-profits, non-governmental organisations, government agencies etc.), and much of what has been written on balanced scorecard since has, in one way or another, focused on the identification of alternative headings more suited to a broader range of organisations, and also suggested using either additional or fewer perspectives (e.g. Butler et al. (1997),[21] Ahn (2001),[22] Elefalke (2001),[23] Brignall (2002),[24] Irwin (2002),[25] Radnor et al. (2003)[26]).
These suggestions were notably triggered by a recognition that different but equivalent headings would yield alternative sets of measures, and this represents the major design challenge faced with this type of balanced scorecard design: justifying the choice of measures made. "Of all the measures you could have chosen, why did you choose these?" These issues contribute to dissatisfaction with early Balanced Scorecard designs, since if users are not confident that the measures within the Balanced Scorecard are well chosen, they will have less confidence in the information it provides.[27]

Although less common, these early-style balanced scorecards are still designed and used today.[1]

In short, first generation balanced scorecards are hard to design in a way that builds confidence that they are well designed. Because of this, many are abandoned soon after completion.[6]

Second Generation Balanced Scorecard
In the mid-1990s, an improved design method emerged.[15] In the new method, measures are selected based on a set of "strategic objectives" plotted on a "strategic linkage model" or "strategy map". With this modified approach, the strategic objectives are distributed across the four measurement perspectives, so as to "connect the dots" to form a visual presentation of strategy and measures.[28]

In this modified version of balanced scorecard design, managers select a few strategic objectives within each of the perspectives, and then define the cause-effect chain among these objectives by drawing links between them to create a "strategic linkage model". A balanced scorecard of strategic performance measures is then derived directly by selecting one or two measures for each strategic objective.[5] This type of approach provides greater contextual justification for the measures chosen, and is generally easier for managers to work through. This style of balanced scorecard has been commonly used since 1996 or so: it is significantly different in approach to the methods originally proposed, and so can be thought of as representing the "2nd generation" of design approach adopted for balanced scorecard since its introduction.
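A minimal sketch of this second-generation approach, with a handful of invented strategic objectives, cause-effect links between them, and one measure per objective (none of these names come from the sources cited here):

```python
# A toy "strategic linkage model": objectives per perspective, cause-effect
# links between them, and a measure attached to each objective.
objectives = {
    "Staff capability": {"perspective": "Learning and growth", "measures": ["training hours"]},
    "Process reliability": {"perspective": "Internal processes", "measures": ["defect rate"]},
    "Customer loyalty": {"perspective": "Customer", "measures": ["repeat purchase rate"]},
    "Revenue growth": {"perspective": "Financial", "measures": ["sales growth %"]},
}

# Cause-effect links: (driver objective, outcome objective)
links = [
    ("Staff capability", "Process reliability"),
    ("Process reliability", "Customer loyalty"),
    ("Customer loyalty", "Revenue growth"),
]

for driver, outcome in links:
    print(f"{driver}  -->  {outcome}")
for name, detail in objectives.items():
    print(f"{name} [{detail['perspective']}]: measures {detail['measures']}")
```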
Third Generation Balanced Scorecard
Main article: Third-generation balanced scorecard
In the late 1990s, the design approach had evolved yet again. One problem with the "second generation" design approach described above was that the plotting of causal links amongst twenty or so medium-term strategic goals was still a relatively abstract activity. In practice it ignored the fact that opportunities to intervene and to influence strategic goals are, and need to be, anchored in current and real management activity. Secondly, the need to "roll forward" and test the impact of these goals necessitated the creation of an additional design instrument: the Vision or Destination Statement. This device was a statement of what "strategic success", or the "strategic end-state", looked like. It was quickly realized that if a Destination Statement was created at the beginning of the design process, then it was easier to select strategic activity and outcome objectives to respond to it. Measures and targets could then be selected to track the achievement of these objectives. Design methods that incorporate a Destination Statement or equivalent (e.g. the results-based management method proposed by the UN in 2002) represent a tangibly different design approach to those that went before, and have been proposed as representing a "third generation" design method for balanced scorecards.[5]

Design methods for balanced scorecards continue to evolve and adapt to reflect the deficiencies in the currently used methods, and the particular needs of communities of interest (e.g. NGOs and government departments have found the third generation methods embedded in results-based management more useful than first or second generation design methods).[29]

This generation refined the second generation of balanced scorecards to give more relevance and functionality to strategic objectives. The major difference is the incorporation of Destination Statements. Other key components are strategic objectives, the strategic linkage model and perspectives, measures and initiatives.[5]

Popularity
In 1997, Kurtzman[30] found that 64 percent of the companies questioned were measuring performance from a number of perspectives in a similar way to the balanced scorecard. Balanced scorecards have been implemented by government agencies, military units, business units and corporations as a whole, non-profit organizations, and schools.
Balanced Scorecard has been widely adopted, and has been found to be the most popular performance management framework in a recent survey.[31]

Many examples of balanced scorecards can be found via web searches. However, adapting one organization's balanced scorecard to another is generally not advised by theorists, who believe that much of the benefit of the balanced scorecard comes from the design process itself.[6]

Indeed, it could be argued that many failures in the early days of balanced scorecard could be attributed to this problem, in that early balanced scorecards were often designed remotely by consultants.[32][33] Managers did not trust, and so failed to engage with and use, these measure suites created by people lacking knowledge of the organization and management responsibility.[19]

Variants
Since the balanced scorecard was popularized in the early 1990s, a large number of alternatives to the original 'four box' balanced scorecard promoted by Kaplan and Norton in their various articles and books have emerged. Most have very limited application, and are typically proposed either by academics as vehicles for promoting other agendas (such as green issues), e.g. Brignall (2002),[24] or by consultants as an attempt at differentiation to promote sales of books and/or consultancy (e.g. Bourne (2002);[34] Niven (2002)[35]).
Many of the structural variations proposed are broadly similar, and a research paper published in 2004[5] attempted to identify a pattern in these variations - noting three distinct types of variation. The variations appeared to be part of an evolution of the Balanced Scorecard concept, and so the paper refers to these distinct types as "generations". Broadly, the original 'measures in boxes' type design (as proposed by Kaplan & Norton) constitutes the 1st generation balanced scorecard design; balanced scorecard designs that include a 'strategy map' or 'strategic linkage model' (e.g. the Performance Prism,[36] later Kaplan & Norton designs,[16] and the Performance Driver model of Olve, Roy & Wetter (English translation 1999,[15] 1st published in Swedish 1997)) constitute the 2nd Generation of Balanced Scorecard design; and designs that augment the strategy map / strategic linkage model with a separate document describing the long-term outcomes sought from the strategy (the "destination statement" idea) comprise the 3rd generation balanced scorecard design.
Variants that feature adaptations of the structure of Balanced Scorecard to suit better a particular viewpoint or agenda are numerous. Examples of the focus of such adaptations include green issues,[24] decision support,[37] public sector management,[38] and health care management.[39] The performance management elements of the UN's Results Based Management system have strong design and structural similarities to those used in the 3rd Generation Balanced Scorecard design approach.[29]

Balanced Scorecard is also often linked to quality management tools and activities.[40] Although there are clear areas of cross-over and association, the two sets of tools are complementary rather than duplicative.[41]

A common use of balanced scorecard is to support the payments of incentives to individuals, even though it was not designed for this purpose and is not particularly suited to it.[2][42]

Criticism
The balanced scorecard has attracted criticism from a variety of sources. Most has come from the academic community, who dislike the empirical nature of the framework: Kaplan and Norton notoriously failed to include any citation of prior articles in their initial papers on the topic. Some of this criticism focuses on technical flaws in the methods and design of the original Balanced Scorecard proposed by Kaplan and Norton.[19][32][43] Other academics have simply focused on the lack of citation support.[44]

A second kind of criticism is that the balanced scorecard does not provide a bottom line score or a unified view with clear recommendations: it is simply a list of metrics (e.g. Jensen 2001[45]). These critics usually include in their criticism suggestions about how the 'unanswered' question postulated could be answered, but typically the unanswered questions relate to things outside the scope of balanced scorecard itself (such as developing strategies) (e.g. Brignall[24]).
A third kind of criticism is that the model fails to fully reflect the needs of stakeholders - putting bias on financial stakeholders over others. Early forms of Balanced Scorecard proposed by Kaplan & Norton focused on the needs of commercial organisations in the USA - where this focus on investment return was appropriate.[10] This focus was maintained through subsequent revisions.[46] Even now, over 20 years after they were first proposed, the four most common perspectives in Balanced Scorecard designs mirror the four proposed in the original Kaplan & Norton paper.[1] However, as noted earlier in this article, there have been many studies that suggest other perspectives might better reflect the priorities of organisations - particularly but not exclusively relating to the needs of organisations in the public and non-governmental sectors.[47]

More modern design approaches such as 3rd Generation Balanced Scorecard and the UN's Results Based Management methods explicitly consider the interests of wider stakeholder groups, and perhaps address this issue in its entirety.[29]

There are few empirical studies linking the use of balanced scorecards to better decision making or improved financial performance of companies, but some work has been done in these areas. However, broadcast surveys of usage have difficulties in this respect, due to the wide variations in definition of 'what a balanced scorecard is' noted above (making it hard to work out in a survey if you are comparing like with like). Single organization case studies suffer from the 'lack of a control' issue common to any study of organizational change - you don't know what the organization would have achieved if the change had not been made, so it is difficult to attribute changes observed over time to a single intervention (such as introducing a balanced scorecard). However, such studies as have been done have typically found balanced scorecard to be useful.[6][19]

Software tools
It is important to recognize that the balanced scorecard by definition is not a complex thing - typically no more than about 20 measures spread across a mix of financial and non-financial topics, and easily reported manually (on paper, or using simple office software).[46]

The processes of collecting, reporting, and distributing balanced scorecard information can be
labor-intensive and prone to procedural problems (for example, getting all relevant people to
return the information required by the required date). The simplest mechanism to use is to
delegate these activities to an individual, and many Balanced Scorecards are reported via ad-hoc
methods based around email, phone calls and office software.
In more complex organizations, where there are multiple balanced scorecards to report and/or a
need for co-ordination of results between balanced scorecards (for example, if one level of
reports relies on information collected and reported at a lower level) the use of individual
reporters is problematic. Where these conditions apply, organizations use balanced scorecard
reporting software to automate the production and distribution of these reports.
Recent surveys[1][48] have consistently found that roughly one third of organizations used office software to report their balanced scorecard, one third used software developed specifically for their own use, and one third used one of the many commercial packages available.
Dashboard
From Wikipedia, the free encyclopedia
This article is about a control panel placed in the front of the car. For other uses, see Dashboard
(disambiguation).

The dashboard of a Bentley Continental GTC car

A Suzuki Hayabusa motorcycle dash

Dashboard instruments displaying various car and engine conditions

Carriage dashboard
A dashboard (also called dash, instrument panel, or fascia) is a control panel placed in front of the driver of an automobile, housing instrumentation and controls for operation of the vehicle. The word originally applied to a barrier of wood or leather fixed at the front of a horse-drawn carriage or sleigh to protect the driver from mud or other debris "dashed up" (thrown up) by the horses' hooves.[1]

Contents
1 Dashboard items
2 Padding and safety
3 Fashion in instrumentation
4 Gallery
5 See also
6 References
Dashboard items
Items located on the dashboard at first included the steering wheel and the instrument cluster. The instrument cluster pictured to the right contains gauges such as a speedometer, tachometer, odometer and fuel gauge, and indicators such as gearshift position, seat belt warning light, parking-brake-engagement warning light[2] and an engine-malfunction light. There may also be indicators for low fuel, low oil pressure, low tire pressure and faults in the airbag (SRS) system.
Heating and ventilation controls and vents, lighting controls, audio equipment and automotive
navigation systems are also mounted on the dashboard.
The top of a dashboard may contain vents for the heating and air conditioning system and speakers for an audio system. A glove compartment is commonly located on the passenger's side. There may also be an ashtray and a cigarette lighter, which can provide a power outlet for other low-voltage appliances.[3]

Padding and safety
In 1937, Chrysler, Dodge, DeSoto, and Plymouth cars came with a safety dashboard that was flat, raised above knee height, and had all the controls mounted flush.[4]

Padded dashboards were advocated in the 1930s by car safety pioneer Claire L. Straith.[5] In 1947, the Tucker became the first car with a padded dashboard.[6]

One of the safety enhancements of the 1970s was the widespread adoption of padded
dashboards. The padding is commonly polyurethane foam, while the surface is commonly either
polyvinyl chloride (PVC) or leather in the case of luxury models.
In the early and mid-1990s, airbags became a standard feature of steering wheels and
dashboards.
Fashion in instrumentation

Stylised dashboard from a 1980s Lancia Beta
In the 1940s through the 1960s, American car manufacturers and their imitators designed
unusually-shaped instruments on a dashboard laden with chrome and transparent plastic, which
could be less readable, but was often thought to be more stylish. Sunlight could cause a bright
glare on the chrome, particularly for a convertible.
With the coming of the LED in consumer electronics, some manufacturers used instruments with
digital readouts to make their cars appear more up to date, but this has faded from practice. Some
cars use a head-up display to project the speed of the car onto the windscreen in imitation of
fighter aircraft, but in a far less complex display.
In recent years, spurred on by the growing aftermarket use of dash kits, many automakers have taken the initiative to add more stylistic elements to their dashboards. One prominent example of this is the Chevrolet Sonic, which offers both exterior (e.g., a custom graphics package) and interior cosmetic upgrades.[7] In addition to OEM dashboard trim and upgrades, a number of companies offer domed polyurethane or vinyl applique dash trim accent kits or "dash kits."[8]

Some of the major manufacturers of these kits are Sherwood Innovations, B&I Trims and
Rvinyl.com.
Manufacturers such as BMW, Honda, Toyota and Mercedes-Benz have included fuel-economy
gauges in some instrument clusters, showing fuel mileage in real time. The ammeter was the
gauge of choice for monitoring the state of the charging system until the 1970s. Later it was
replaced by the voltmeter. Today most family vehicles have warning lights instead of voltmeters
or oil pressure gauges in their dashboard instrument clusters, though sports cars often have
proper gauges for performance purposes and driver appeasement.
Data quality
From Wikipedia, the free encyclopedia
Data are of high quality if "they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency within data becomes paramount, regardless of fitness for use for any particular external purpose; e.g., a person's age and birth date may conflict within different parts of the same database. The first two views can often be in disagreement, even about the same set of data used for the same purpose. This article discusses the concept of data quality as it relates to business data processing, although of course other fields have their own data quality issues as well.
Contents
1 Definitions
2 History
3 Overview
4 Data Quality Assurance
5 Data quality control
6 Optimum use of data quality
7 Criticism of existing tools and processes
8 Professional associations
9 See also
10 References
11 Further reading
Definitions
This list is taken from the online book "Data Quality: High-impact Strategies".[1] See also the glossary of data quality terms.[2]

Degree of excellence exhibited by the data in relation to the portrayal of the actual scenario.
The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.[3]
The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data.[4]
The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria.[5]
Complete, standards based, consistent, accurate and time stamped.[6]

History
Before the rise of the inexpensive server, massive mainframe computers were used to maintain name and address data so that mail could be properly routed to its destination. The mainframes used business rules to correct common misspellings and typographical errors in name and address data, as well as to track customers who had moved, died, gone to prison, married, divorced, or experienced other life-changing events. Government agencies began to make postal data available to a few service companies to cross-reference customer data with the National Change of Address registry (NCOA). This technology saved large companies millions of dollars in comparison to manual correction of customer data. Large companies saved on postage, as bills and direct marketing materials made their way to the intended customer more accurately. Initially sold as a service, data quality moved inside the walls of corporations as low-cost and powerful server technology became available.
Companies with an emphasis on marketing often focus their quality efforts on name and address
information, but data quality is recognized as an important property of all types of data.
Principles of data quality can be applied to supply chain data, transactional data, and nearly
every other category of data found in the enterprise. For example, making supply chain data
conform to a certain standard has value to an organization by: 1) avoiding overstocking of
similar but slightly different stock; 2) avoiding false stock-out; 3) improving the understanding
of vendor purchases to negotiate volume discounts; and 4) avoiding logistics costs in stocking
and shipping parts across a large organization.
While name and address data has a clear standard as defined by local postal authorities, other
types of data have few recognized standards. There is a movement in the industry today to
standardize certain non-address data. The non-profit group GS1 is among the groups
spearheading this movement.
For companies with significant research efforts, data quality can include developing protocols for
research methods, reducing measurement error, bounds checking of the data, cross tabulation,
modeling and outlier detection, verifying data integrity, etc.
Overview
There are a number of theoretical frameworks for understanding data quality. A systems-
theoretical approach influenced by American pragmatism expands the definition of data quality
to include information quality, and emphasizes the inclusiveness of the fundamental dimensions
of accuracy and precision on the basis of the theory of science (Ivanov, 1972). One framework,
dubbed "Zero Defect Data" (Hansen, 1991) adapts the principles of statistical process control to
data quality. Another framework seeks to integrate the product perspective (conformance to
specifications) and the service perspective (meeting consumers' expectations) (Kahn et al. 2002).
Another framework is based in semiotics to evaluate the quality of the form, meaning and use of
the data (Price and Shanks, 2004). One highly theoretical approach analyzes the ontological
nature of information systems to define data quality rigorously (Wand and Wang, 1996).
A considerable amount of data quality research involves investigating and describing various
categories of desirable attributes (or dimensions) of data. These lists commonly include
accuracy, correctness, currency, completeness and relevance. Nearly 200 such terms have been
identified and there is little agreement in their nature (are these concepts, goals or criteria?), their
definitions or measures (Wang et al., 1993). Software engineers may recognise this as a similar
problem to "ilities".
MIT has a Total Data Quality Management program, led by Professor Richard Wang, which
produces a large number of publications and hosts a significant international conference in this
field (International Conference on Information Quality, ICIQ). This program grew out of the
work done by Hansen on the "Zero Defect Data" framework (Hansen, 1991).
In practice, data quality is a concern for professionals involved with a wide range of information
systems, ranging from data warehousing and business intelligence to customer relationship
management and supply chain management. One industry study estimated the total cost to the
US economy of data quality problems at over US$600 billion per annum (Eckerson, 2002).
Incorrect data (which includes invalid and outdated information) can originate from different data sources through data entry, or data migration and conversion projects.[7]

In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is incorrectly addressed.[8]

One reason contact data becomes stale very quickly is that, in the average database, more than 45 million Americans change their address every year.[9]

In fact, the problem is such a concern that companies are beginning to set up a data governance team whose sole role in the corporation is to be responsible for data quality. In some organizations, this data governance function has been established as part of a larger Regulatory Compliance function - a recognition of the importance of Data/Information Quality to organizations.
Problems with data quality don't only arise from incorrect data; inconsistent data is a problem as
well. Eliminating data shadow systems and centralizing data in a warehouse is one of the
initiatives a company can take to ensure data consistency.
Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their common data.[10]

The market is going some way to providing data quality assurance. A number of vendors make
tools for analysing and repairing poor quality data in situ, service providers can clean the data on
a contract basis and consultants can advise on fixing processes or systems to avoid data quality
problems in the first place. Most data quality tools offer a series of tools for improving data,
which may include some or all of the following:
1. Data profiling - initially assessing the data to understand its quality challenges
2. Data standardization - a business rules engine that ensures that data conforms to quality
rules
3. Geocoding - for name and address data. Corrects data to US and Worldwide postal
standards
4. Matching or Linking - a way to compare data so that similar, but slightly different records
can be aligned. Matching may use "fuzzy logic" to find duplicates in the data. It often
recognizes that 'Bob' and 'Robert' may be the same individual. It might be able to manage
'householding', or finding links between spouses at the same address, for example.
Finally, it often can build a 'best of breed' record, taking the best components from
multiple data sources and building a single super-record.
5. Monitoring - keeping track of data quality over time and reporting variations in the
quality of data. Software can also auto-correct the variations based on pre-defined
business rules.
6. Batch and Real time - Once the data is initially cleansed (batch), companies often want to
build the processes into enterprise applications to keep it clean.
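As a rough illustration of the matching/linking step (item 4 above), the following sketch uses only Python's standard library difflib module; the similarity threshold and the nickname table are illustrative assumptions rather than features of any particular data quality product.

```python
# Toy fuzzy matching: normalize common nicknames, then compare names with a
# simple similarity ratio.
from difflib import SequenceMatcher

NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}  # illustrative

def normalize(name):
    tokens = name.strip().lower().split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

def similar(a, b, threshold=0.8):  # threshold is an arbitrary illustration
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(similar("Bob Smith", "Robert Smith"))   # True  (nickname resolved, then matched)
print(similar("Jon Doe", "John Doe"))         # True  (minor spelling difference)
print(similar("Jane Doe", "John Doe"))        # False
```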
There are several well-known authors and self-styled experts, with Larry English perhaps the
most popular guru. In addition, the International Association for Information and Data Quality
(IAIDQ) was established in 2004 to provide a focal point for professionals and researchers in this
field.
ISO 8000 is the international standard for data quality.
Data Quality Assurance
Data quality assurance is the process of profiling the data to discover inconsistencies and other anomalies in the data, as well as performing data cleansing activities (e.g. removing outliers, missing data interpolation) to improve the data quality.
These activities can be undertaken as part of data warehousing or as part of the database
administration of an existing piece of applications software.
Data quality control

Data quality control is the process of controlling the usage of data with known quality measurement for an application or a process. This process is usually done after a Data Quality Assurance (QA) process, which consists of discovery of data inconsistency and correction.
The data QA process provides the following information to data quality control (QC):
Severity of inconsistency
Incompleteness
Accuracy
Precision
Missing / Unknown
The data QC process uses the information from the QA process and then decides whether to use the data for analysis or in an application or business process. For example, if a data QC process finds that the data contains too many errors or inconsistencies, it prevents that data from being used for its intended process. The usage of incorrect data might crucially impact output; for example, providing invalid measurements from several sensors to the automatic pilot feature on an aircraft could cause it to crash. Thus, establishing a data QC process protects the downstream usage of data and establishes safe information usage.
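A minimal sketch of such a QC gate, assuming the QA step has already produced a simple error rate and a missing-value count; the thresholds are arbitrary illustrations, not prescribed values.

```python
# Toy QC gate: decide whether a data set may be released to its intended process.
def qc_gate(error_rate, missing_count, max_error_rate=0.02, max_missing=100):
    if error_rate > max_error_rate:
        return False, f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}"
    if missing_count > max_missing:
        return False, f"{missing_count} missing values exceed limit of {max_missing}"
    return True, "data approved for use"

approved, reason = qc_gate(error_rate=0.035, missing_count=12)
print(approved, "-", reason)   # False - error rate 3.5% exceeds 2.0%
```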
Optimum use of data quality
Data quality (DQ) is a niche area required for the integrity of data management, covering gaps left by other data processes. It is one of the key functions that aid data governance by monitoring data to find exceptions undiscovered by current data management operations. Data quality checks may be defined at attribute level to have full control over remediation steps.
DQ checks and business rules may easily overlap if an organization is not attentive to its DQ scope. Business teams should understand the DQ scope thoroughly in order to avoid overlap. Data quality checks are redundant if business logic covers the same functionality and fulfills the same purpose as DQ. The DQ scope of an organization should be defined in the DQ strategy and well implemented. Some data quality checks may be translated into business rules after repeated instances of exceptions in the past.
Below are a few areas of data flows that may need perennial DQ checks:
Completeness and precision DQ checks on all data may be performed at the point of entry for each mandatory attribute from each source system. Some attribute values are created well after the initial creation of the transaction; in such cases, administering these checks becomes tricky and should be done immediately after the defined event of that attribute's source and the transaction's other core attribute conditions are met.
All data having attributes referring to Reference Data in the organization may be validated
against the set of well-defined valid values of Reference Data to discover new or discrepant
values through the validity DQ check. Results may be used to update Reference Data
administered under Master Data Management (MDM).
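A minimal sketch of such a validity check, using invented country-code reference data and customer records:

```python
# Validity DQ check: compare incoming attribute values with the well-defined
# valid values held as reference data, and surface anything unknown.
valid_country_codes = {"US", "GB", "DE", "FR", "IN"}   # illustrative reference data

incoming = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": "UK"},   # discrepant: the ISO code is GB
    {"customer_id": 3, "country": "BR"},   # new value not yet in reference data
]

discrepancies = [r for r in incoming if r["country"] not in valid_country_codes]
for record in discrepancies:
    print("validity exception:", record)
```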
All data sourced from a third party to an organization's internal teams may undergo an accuracy (DQ) check against the third party data. These DQ check results are valuable when administered on data that has made multiple hops after its point of entry but before that data becomes authorized or stored for enterprise intelligence.
All data columns that refer to Master Data may be validated with a consistency check. A DQ check administered on the data at the point of entry discovers new data for the MDM process, but a DQ check administered after the point of entry discovers failures (not exceptions) of consistency.
As data transforms, multiple timestamps (and the positions of those timestamps) are captured and may be compared against each other, and against an allowed leeway, to validate the data's value, decay, and operational significance against a defined SLA (service level agreement). This timeliness DQ check can be utilized to decrease the data value decay rate and optimize the policies of the data movement timeline.
In an organization, complex logic is usually segregated into simpler logic across multiple processes. Reasonableness DQ checks on such complex logic, yielding a logical result within a specific range of values or static interrelationships (aggregated business rules), may be validated to discover complicated but crucial business processes, outliers in the data, and its drift from BAU (business as usual) expectations, and may surface possible exceptions that eventually result in data issues. This check may be a simple generic aggregation rule applied to a large chunk of data, or it can be complicated logic on a group of attributes of a transaction pertaining to the core business of the organization. This DQ check requires a high degree of business knowledge and acumen. Discovery of reasonableness issues may aid policy and strategy changes by either business or data governance or both.
Conformity checks and integrity checks need not be covered in all business needs; they are strictly at the discretion of the database architecture.
There are many places in the data movement where DQ checks may not be required. For instance, a DQ check for completeness and precision on NOT NULL columns is redundant for data sourced from a database. Similarly, data should be validated for its accuracy with respect to time when the data is stitched across disparate sources. However, that is a business rule and should not be in the DQ scope.
Criticism of existing tools and processes
The main reasons cited for such criticism are:
Project costs: costs are typically in the hundreds of thousands of dollars
Time: lack of enough time to deal with large-scale data-cleansing software
Security: concerns over sharing information, giving an application access across systems,
and effects on legacy systems
Master data management
From Wikipedia, the free encyclopedia

In business, master data management (MDM) comprises the processes, governance, policies, standards and tools that consistently define and manage the critical data of an organization to provide a single point of reference.[1]

The data that is mastered may include:
reference data - the business objects for transactions, and the dimensions for analysis
analytical data - supports decision making[2][3]

In computing, an MDM tool can be used to support master data management by removing
duplicates, standardizing data (mass maintaining), and incorporating rules to eliminate incorrect
data from entering the system in order to create an authoritative source of master data. Master
data are the products, accounts and parties for which the business transactions are completed.
The root cause problem stems from business unit and product line segmentation, in which the
same customer will be serviced by different product lines, with redundant data being entered
about the customer (aka party in the role of customer) and account in order to process the
transaction. The redundancy of party and account data is compounded in the front to back office
life cycle, where the authoritative single source for the party, account and product data is needed
but is often once again redundantly entered or augmented.
MDM has the objective of providing processes for collecting, aggregating, matching,
consolidating, quality-assuring, persisting and distributing such data throughout an organization
to ensure consistency and control in the ongoing maintenance and application use of this
information.
The term recalls the concept of a master file from an earlier computing era.
Definition: Master data management (MDM) is a comprehensive method of enabling an
enterprise to link all of its critical data to one file, called a master file, that provides a common
point of reference. When properly done, MDM streamlines data sharing among personnel and
departments. In addition, MDM can facilitate computing in multiple system architectures,
platforms and applications.
Contents
1 Issues
2 Solutions
3 Transmission of Master Data
4 See also
5 References
6 External links
Issues
At a basic level, MDM seeks to ensure that an organization does not use multiple (potentially
inconsistent) versions of the same master data in different parts of its operations, which can
occur in large organizations. A common example of poor MDM is the scenario of a bank at
which a customer has taken out a mortgage and the bank begins to send mortgage solicitations to
that customer, ignoring the fact that the person already has a mortgage account relationship with
the bank. This happens because the customer information used by the marketing section within
the bank lacks integration with the customer information used by the customer services section
of the bank. Thus the two groups remain unaware that an existing customer is also considered a
sales lead. The process of record linkage is used to associate different records that correspond to
the same entity, in this case the same person.
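As a rough illustration of record linkage followed by consolidation, the sketch below links two invented customer records on a shared e-mail address and assembles a single "golden record" by preferring the most recently updated value for each field; the field names and the survivorship rule are illustrative assumptions, not a prescribed MDM design.

```python
# Two systems hold slightly different versions of the same customer.
marketing = {"email": "a.jones@example.com", "name": "A. Jones",
             "phone": None, "updated": "2013-05-01"}
service = {"email": "a.jones@example.com", "name": "Alice Jones",
           "phone": "555-0100", "updated": "2014-02-10"}

def golden_record(records):
    records = sorted(records, key=lambda r: r["updated"])  # oldest first
    merged = {}
    for rec in records:                    # newer records overwrite older values
        for field, value in rec.items():
            if value is not None:
                merged[field] = value
    return merged

# Link on the shared email address, then merge into one authoritative record.
if marketing["email"] == service["email"]:
    print(golden_record([marketing, service]))
```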
Other problems include (for example) issues with the quality of data, consistent classification
and identification of data, and data-reconciliation issues. Master data management of disparate
data systems requires data transformations as the data extracted from the disparate source data
system is transformed and loaded into the master data management hub. To synchronize the
disparate source master data, the managed master data extracted from the master data
management hub is again transformed and loaded into the disparate source data system as the
master data is updated. As with other Extract, Transform, Load-based data movement, these
processes are expensive and inefficient to develop and to maintain which greatly reduces the
return on investment for the master data management product.
One of the most common reasons some large corporations experience massive issues with MDM
is growth through mergers or acquisitions. Two organizations which merge will typically create
an entity with duplicate master data (since each likely had at least one master database of its own
prior to the merger). Ideally, database administrators resolve this problem through deduplication
of the master data as part of the merger. In practice, however, reconciling several master data
systems can present difficulties because of the dependencies that existing applications have on
the master databases. As a result, more often than not the two systems do not fully merge, but
remain separate, with a special reconciliation process defined that ensures consistency between
the data stored in the two systems. Over time, however, as further mergers and acquisitions
occur, the problem multiplies, more and more master databases appear, and data-reconciliation
processes become extremely complex, and consequently unmanageable and unreliable. Because
of this trend, one can find organizations with 10, 15, or even as many as 100 separate, poorly
integrated master databases, which can cause serious operational problems in the areas of
customer satisfaction, operational efficiency, decision-support, and regulatory compliance.
Solutions
Processes commonly seen in MDM include source identification, data collection, data transformation, normalization, rule administration, error detection and correction, data consolidation, data storage, data distribution, data classification, taxonomy services, item master creation, schema mapping, product codification, data enrichment and data governance.
The selection of entities considered for MDM depends somewhat on the nature of an
organization. In the common case of commercial enterprises, MDM may apply to such entities as
customer (customer data integration), product (product information management), employee, and
vendor. MDM processes identify the sources from which to collect descriptions of these entities.
In the course of transformation and normalization, administrators adapt descriptions to conform
to standard formats and data domains, making it possible to remove duplicate instances of any
entity. Such processes generally result in an organizational MDM repository, from which all
requests for a certain entity instance produce the same description, irrespective of the originating
sources and the requesting destination.
The tools include data networks, file systems, a data warehouse, data marts, an operational data
store, data mining, data analysis, data visualization, Data federation and data virtualization. One
of the newest tools, virtual master data management utilizes data virtualization and a persistent
metadata server to implement a multi-level automated MDM hierarchy.
Transmission of Master Data
There are several ways in which master data may be collated and distributed to other systems.[4] These include the following (a brief sketch in Python follows the list):
Data consolidation: The process of capturing master data from multiple sources and
integrating into a single hub (operational data store) for replication to other destination
systems.
Data federation: The process of providing a single virtual view of master data from one
or more sources to one or more destination systems.
Data propagation: The process of copying master data from one system to another,
typically through point-to-point interfaces in legacy systems.
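These three patterns can be pictured with a small, purely illustrative Python sketch; the source dictionaries and function names are assumptions made for the example, not part of any MDM product.

    # Two hypothetical source systems holding overlapping customer master data.
    crm = {"C1": {"name": "Ada Lovelace"}}
    erp = {"C2": {"name": "Grace Hopper"}}

    # Data consolidation: physically copy all sources into a single hub.
    hub = {}
    for source in (crm, erp):
        hub.update(source)

    # Data federation: present a single virtual view without copying;
    # records are looked up in the sources only when requested.
    def federated_lookup(customer_id):
        for source in (crm, erp):
            if customer_id in source:
                return source[customer_id]
        return None

    # Data propagation: push a change from one system to another,
    # as a point-to-point interface would.
    def propagate(customer_id, record, *targets):
        for target in targets:
            target[customer_id] = dict(record)

    propagate("C3", {"name": "Edsger Dijkstra"}, crm, erp)
    print(len(hub), federated_lookup("C2"), "C3" in erp)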
Data profiling
From Wikipedia, the free encyclopedia

Data profiling is the process of examining the data available in an existing data source (e.g. a
database or a file) and collecting statistics and information about that data. The purpose of these
statistics may be to:
1. Find out whether existing data can easily be used for other purposes
2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to
a category
3. Give metrics on data quality including whether the data conforms to particular standards or
patterns
4. Assess the risk involved in integrating data for new applications, including the challenges of joins
5. Assess whether metadata accurately describes the actual values in the source database
6. Understand data challenges early in any data-intensive project, so that late project surprises
are avoided. Finding data problems late in the project can lead to delays and cost overruns.
7. Have an enterprise view of all data, for uses such as master data management where key data is
needed, or data governance for improving data quality.
Contents
1 Data Profiling in Relation to Data Warehouse/Business Intelligence Development
o 1.1 Introduction
o 1.2 How to do Data Profiling
o 1.3 When to Conduct Data Profiling
o 1.4 Benefits of Data Profiling
2 See also
3 References
Data Profiling in Relation to Data Warehouse/Business
Intelligence Development
Introduction
Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data.[1] Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata.[2] Thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.[3] The result of the analysis is used both strategically, to determine the suitability of the candidate source systems and give the basis for an early go/no-go decision, and tactically, to identify problems for later solution design and to level sponsors' expectations.[1]
How to do Data Profiling
Data profiling utilizes different kinds of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variation, as well as other aggregates such as count and sum. Additional metadata information obtained during data profiling could be data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition.[2][4][5] The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representations, and duplicates.
Different analyses are performed for different structural levels. For example, single columns could be profiled individually to get an understanding of the frequency distribution of values and the type and use of each column. Embedded value dependencies can be exposed in cross-column analysis. Finally, overlapping value sets possibly representing foreign key relationships between entities can be explored in an inter-table analysis.[2] Normally, purpose-built tools are used for data profiling to ease the process.[1][2][4][5][6][7] The computational complexity increases when going from single-column, to single-table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.[3]
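As a rough illustration of single-column profiling, the sketch below computes a few of the statistics mentioned above for one in-memory column; the column values are invented, and real profiling tools work directly against the source systems.

    from collections import Counter
    from statistics import mean, stdev

    # Hypothetical extract of a single column from a source table.
    ages = [34, 41, None, 29, 41, "n/a", 62, 41, None]

    non_null = [v for v in ages if isinstance(v, (int, float))]
    profile = {
        "count": len(ages),
        "null_or_invalid": len(ages) - len(non_null),
        "min": min(non_null),
        "max": max(non_null),
        "mean": round(mean(non_null), 1),
        "std_dev": round(stdev(non_null), 1),
        "mode": Counter(non_null).most_common(1)[0][0],
        "distinct_values": len(set(non_null)),
    }
    print(profile)  # flags 3 problem values out of 9 rows, among other statistics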
When to Conduct Data Profiling
According to Kimball,[1] data profiling is performed several times and with varying intensity throughout the data warehouse development process. A light profiling assessment should be undertaken as soon as candidate source systems have been identified, right after the acquisition of the business requirements for the DW/BI. The purpose is to clarify at an early stage whether the right data is available at the right level of detail and whether anomalies can be handled subsequently. If this is not the case the project might have to be canceled.[1] More detailed profiling is done prior to the dimensional modeling process in order to see what it will require to convert data into the dimensional model, and it extends into the ETL system design process to establish what data to extract and which filters to apply.[1] An additional time to conduct data profiling is later in the data warehouse development process, after data has been loaded into staging, the data marts, etc. Doing so at these points in time helps assure that data cleaning and transformations have been done correctly according to requirements.
Benefits of Data Profiling
The benefits of data profiling are improved data quality, a shorter implementation cycle for major projects, and improved understanding of data for the users.[7] Discovering business knowledge embedded in the data itself is one of the significant benefits derived from data profiling.[3] Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.[7] Although data profiling is effective, remember to find a suitable balance and avoid slipping into analysis paralysis.
Section B
Data modeling
From Wikipedia, the free encyclopedia
(Redirected from Data modelling)

The data modeling process. The figure illustrates the way data models are developed and used today. A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an activity model. The data model will normally consist of entity types, attributes, relationships, integrity rules, and the definitions of those objects. This is then used as the start point for interface or database design.[1]
Data modeling in software engineering is the process of creating a data model for an
information system by applying formal data modeling techniques.
Contents
1 Overview
2 Data modeling topics
o 2.1 Data models
o 2.2 Conceptual, logical and physical schemas
o 2.3 Data modeling process
o 2.4 Modeling methodologies
o 2.5 Entity relationship diagrams
o 2.6 Generic data modeling
o 2.7 Semantic data modeling
3 See also
4 References
5 Further reading
6 External links
Overview
Data modeling is a process used to define and analyze data requirements needed to support the
business processes within the scope of corresponding information systems in organizations.
Therefore, the process of data modeling involves professional data modelers working closely
with business stakeholders, as well as potential users of the information system.
According to Hoberman, data modeling is the process of learning about the data, and the data model is the end result of the data modeling process.[2]
There are three different types of data models produced while progressing from requirements to the actual database to be used for the information system.[3] The data requirements are initially recorded as a conceptual data model, which is essentially a set of technology-independent specifications about the data and is used to discuss initial requirements with the business stakeholders. The conceptual model is then translated into a logical data model, which documents structures of the data that can be implemented in databases. Implementation of one conceptual data model may require multiple logical data models. The last step in data modeling is transforming the logical data model to a physical data model that organizes the data into tables, and accounts for access, performance and storage details. Data modeling defines not just data elements, but also their structures and the relationships between them.[4]
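The progression from conceptual to logical to physical model can be made concrete with a small, hypothetical "Customer places Order" example. The entity names, attributes, and SQLite DDL below are illustrative assumptions, not part of any standard.

    import sqlite3

    # Conceptual model (technology independent): entities and a relationship,
    # recorded here simply as data.
    conceptual = {
        "entities": ["Customer", "Order"],
        "relationships": [("Customer", "places", "Order")],
    }

    # Logical model: attributes and keys, still independent of a specific DBMS product.
    logical = {
        "Customer": {"customer_id": "key", "name": "text"},
        "Order": {"order_id": "key", "customer_id": "foreign key", "total": "decimal"},
    }

    # Physical model: concrete tables, types and constraints for one DBMS (SQLite here).
    ddl = """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE "order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       NUMERIC
    );
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    print([row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")])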
Data modeling techniques and methodologies are used to model data in a standard, consistent,
predictable manner in order to manage it as a resource. The use of data modeling standards is
strongly recommended for all projects requiring a standard means of defining and analyzing data
within an organization, e.g., using data modeling:
to assist business analysts, programmers, testers, manual writers, IT package selectors, engineers, managers, related organizations and clients to understand and use an agreed semi-formal model of the concepts of the organization and how they relate to one another
to manage data as a resource
for the integration of information systems
for designing databases/data warehouses (aka data repositories)
Data modeling may be performed during various types of projects and in multiple phases of
projects. Data models are progressive; there is no such thing as the final data model for a
business or application. Instead a data model should be considered a living document that will
change in response to a changing business. The data models should ideally be stored in a
repository so that they can be retrieved, expanded, and edited over time. Whitten et al. (2004) determined two types of data modeling:[5]
Strategic data modeling: This is part of the creation of an information systems strategy, which defines an overall vision and architecture for information systems. Information engineering is a methodology that embraces this approach.
Data modeling during systems analysis: In systems analysis logical data models are created as
part of the development of new databases.
Data modeling is also used as a technique for detailing business requirements for specific
databases. It is sometimes called database modeling because a data model is eventually
implemented in a database.
[5]

Data modeling topics
Data models
Main article: Data model

How data models deliver benefit.[1]
Data models provide a structure for data used within information systems by providing specific
definition and format. If a data model is used consistently across systems then compatibility of
data can be achieved. If the same data structures are used to store and access data then different
applications can share data seamlessly. The results of this are indicated in the diagram. However,
systems and interfaces often cost more than they should, to build, operate, and maintain. They
may also constrain the business rather than support it. This may occur when the quality of the
data models implemented in systems and interfaces is poor.
[1]

Business rules, specific to how things are done in a particular place, are often fixed in the
structure of a data model. This means that small changes in the way business is conducted lead
to large changes in computer systems and interfaces. So, business rules need to be implemented
in a flexible way that does not result in complicated dependencies, rather the data model should
be flexible enough so that changes in the business can be implemented within the data model in
a relatively quick and efficient way.
Entity types are often not identified, or are identified incorrectly. This can lead to replication of
data, data structure and functionality, together with the attendant costs of that duplication in
development and maintenance. Therefore, data definitions should be made as explicit and easy
to understand as possible to minimize misinterpretation and duplication.
Data models for different systems are arbitrarily different. The result of this is that complex
interfaces are required between systems that share data. These interfaces can account for
between 25% and 70% of the cost of current systems. Required interfaces should be considered
inherently while designing a data model, as a data model on its own would not be usable
without interfaces within different systems.
Data cannot be shared electronically with customers and suppliers, because the structure and
meaning of data has not been standardised. To obtain optimal value from an implemented data
model, it is very important to define standards that will ensure that data models will both meet
business needs and be consistent.
[1]

Conceptual, logical and physical schemas

The ANSI/SPARC three level architecture. This shows that a data model can be an external model (or
view), a conceptual model, or a physical model. This is not the only way to look at data models, but it is a
useful way, particularly when comparing models.
[1]

In 1975, ANSI described three kinds of data-model instance:[6]
Conceptual schema: describes the semantics of a domain (the scope of the model). For example,
it may be a model of the interest area of an organization or of an industry. This consists of entity
classes, representing kinds of things of significance in the domain, and relationships: assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of
facts or propositions that can be expressed using the model. In that sense, it defines the allowed
expressions in an artificial "language" with a scope that is limited by the scope of the model.
Simply described, a conceptual schema is the first step in organizing the data requirements.
Logical schema: describes the structure of some domain of information. This consists of
descriptions of (for example) tables, columns, object-oriented classes, and XML tags. The logical
schema and conceptual schema are sometimes implemented as one and the same.
[3]

Physical schema: describes the physical means used to store data. This is concerned with
partitions, CPUs, tablespaces, and the like.
According to ANSI, this approach allows the three perspectives to be relatively independent of
each other. Storage technology can change without affecting either the logical or the conceptual
schema. The table/column structure can change without (necessarily) affecting the conceptual
schema. In each case, of course, the structures must remain consistent across all schemas of the
same data model.
Data modeling process

Data modeling in the context of Business Process Integration.
[7]

In the context of business process integration (see figure), data modeling complements business
process modeling, and ultimately results in database generation.
[7]

The process of designing a database involves producing the previously described three types of
schemas: conceptual, logical, and physical. The database design documented in these schemas is converted through a data definition language, which can then be used to generate a
database. A fully attributed data model contains detailed attributes (descriptions) for every entity
within it. The term "database design" can describe many different parts of the design of an
overall database system. Principally, and most correctly, it can be thought of as the logical design
of the base data structures used to store the data. In the relational model these are the tables and
views. In an object database the entities and relationships map directly to object classes and
named relationships. However, the term "database design" could also be used to apply to the
overall process of designing, not just the base data structures, but also the forms and queries used
as part of the overall database application within the Database Management System or DBMS.
In the process, system interfaces account for 25% to 70% of the development and support costs
of current systems. The primary reason for this cost is that these systems do not share a common
data model. If data models are developed on a system by system basis, then not only is the same
analysis repeated in overlapping areas, but further analysis must be performed to create the
interfaces between them. Most systems within an organization contain the same basic data,
redeveloped for a specific purpose. Therefore, an efficiently designed basic data model can
minimize rework with minimal modifications for the purposes of different systems within the
organization.[1]
Modeling methodologies
Data models represent information areas of interest. While there are many ways to create data
models, according to Len Silverston (1997)[8] only two modeling methodologies stand out, top-down and bottom-up:
Bottom-up models or View Integration models are often the result of a reengineering effort.
They usually start with existing data structures: forms, fields on application screens, or reports.
These models are usually physical, application-specific, and incomplete from an enterprise
perspective. They may not promote data sharing, especially if they are built without reference
to other parts of the organization.
[8]

Top-down logical data models, on the other hand, are created in an abstract way by getting
information from people who know the subject area. A system may not implement all the
entities in a logical model, but the model serves as a reference point or template.
[8]

Sometimes models are created in a mixture of the two methods: by considering the data needs
and structure of an application and by consistently referencing a subject-area model.
Unfortunately, in many environments the distinction between a logical data model and a physical
data model is blurred. In addition, some CASE tools don't make a distinction between logical and physical data models.[8]
Entity relationship diagrams
Main article: Entity-relationship model

Example of an IDEF1X entity relationship diagram used to model IDEF1X itself. The name of the view is mm. The domain hierarchy and constraints are also given. The constraints are expressed as sentences in the formal theory of the meta model.[9]
There are several notations for data modeling. The actual model is frequently called "Entity
relationship model", because it depicts data in terms of the entities and relationships described in
the data.
[5]
An entity-relationship model (ERM) is an abstract conceptual representation of
structured data. Entity-relationship modeling is a relational schema database modeling method,
used in software engineering to produce a type of conceptual data model (or semantic data
model) of a system, often a relational database, and its requirements in a top-down fashion.
These models are being used in the first stage of information system design during the
requirements analysis to describe information needs or the type of information that is to be stored
in a database. The data modeling technique can be used to describe any ontology (i.e. an
overview and classifications of used terms and their relationships) for a certain universe of
discourse i.e. area of interest.
Several techniques have been developed for the design of data models. While these
methodologies guide data modelers in their work, two different people using the same
methodology will often come up with very different results. Most notable are:
Bachman diagrams
Barker's notation
Chen's Notation
Data Vault Modeling
Extended BackusNaur form
IDEF1X
Object-relational mapping
Object-Role Modeling
Relational Model
Relational Model/Tasmania
Generic data modeling
Main article: Generic data model

Example of a Generic data model.
[10]

Generic data models are generalizations of conventional data models. They define standardized
general relation types, together with the kinds of things that may be related by such a relation
type. The definition of generic data model is similar to the definition of a natural language. For
example, a generic data model may define relation types such as a 'classification relation', being
a binary relation between an individual thing and a kind of thing (a class) and a 'part-whole
relation', being a binary relation between two things, one with the role of part, the other with the
role of whole, regardless of the kind of things that are related.
Given an extensible list of classes, this allows any individual thing to be classified and part-whole relations to be specified for any individual object. By standardization of an extensible list of
relation types, a generic data model enables the expression of an unlimited number of kinds of
facts and will approach the capabilities of natural languages. Conventional data models, on the
other hand, have a fixed and limited domain scope, because the instantiation (usage) of such a
model only allows expressions of kinds of facts that are predefined in the model.
Semantic data modeling
Main article: Semantic data model
The logical data structure of a DBMS, whether hierarchical, network, or relational, cannot totally
satisfy the requirements for a conceptual definition of data because it is limited in scope and
biased toward the implementation strategy employed by the DBMS.

Semantic data models.[9]
Therefore, the need to define data from a conceptual view has led to the development of
semantic data modeling techniques. That is, techniques to define the meaning of data within the
context of its interrelationships with other data. As illustrated in the figure, the real world, in terms of resources, ideas, events, etc., is symbolically defined within physical data stores. A semantic data model is an abstraction which defines how the stored symbols relate to the real world. Thus, the model must be a true representation of the real world.[9]
A semantic data model can be used to serve many purposes, such as:
[9]

planning of data resources
building of shareable databases
evaluation of vendor software
integration of existing databases
The overall goal of semantic data models is to capture more meaning of data by integrating
relational concepts with more powerful abstraction concepts known from the Artificial
Intelligence field. The idea is to provide high-level modeling primitives as an integral part of a data model in order to facilitate the representation of real-world situations.[11]
Types of data models
Database model
Main article: Database model
A database model is a specification describing how a database is structured and used.
Several such models have been suggested. Common models include:
Flat model: This may not strictly qualify as a data model. The flat (or table) model consists of a
single, two-dimensional array of data elements, where all members of a given column are
assumed to be similar values, and all members of a row are assumed to be related to one
another.
Hierarchical model: In this model data is organized into a tree-like structure, implying a single
upward link in each record to describe the nesting, and a sort field to keep the records in a
particular order in each same-level list.
Network model: This model organizes data using two fundamental constructs, called records
and sets. Records contain fields, and sets define one-to-many relationships between records:
one owner, many members.
Relational model: is a database model based on first-order predicate logic. Its core idea is to
describe a database as a collection of predicates over a finite set of predicate variables,
describing constraints on the possible values and combinations of values.
Further common models include the concept-oriented model, the object-relational model, and the star schema:
Object-relational model: Similar to a relational database model, but objects, classes and
inheritance are directly supported in database schemas and in the query language.
Star schema is the simplest style of data warehouse schema. The star schema consists of a few
"fact tables" (possibly only one, justifying the name) referencing any number of "dimension
tables". The star schema is considered an important special case of the snowflake schema.
Techniques
Bachman diagram

Illustration of set type using a Bachman diagram
A Bachman diagram is a certain type of data structure diagram,[2] and is used to design the data with a network or relational "logical" model, separating the data model from the way the data is stored in the system. The model is named after database pioneer Charles Bachman, and is mostly used in computer software design.
In a relational model, a relation is the cohesion of attributes that are fully, and not transitively, functionally dependent on every key in that relation. The coupling between the relations is
based on accordant attributes. For every relation, a rectangle has to be drawn and every coupling
is illustrated by a line that connects the relations. On the edge of each line, arrows indicate the
cardinality. We have 1-to-n, 1-to-1 and n-to-n. The latter has to be avoided and must be replaced
by two 1-to-n couplings.
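Replacing an n-to-n coupling by two 1-to-n couplings corresponds, at the physical level, to introducing an associative (junction) table. Below is a minimal sketch with SQLite, using hypothetical Student and Course relations.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
    -- The n-to-n coupling between student and course is replaced by
    -- two 1-to-n couplings to an associative relation, enrollment.
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    """)
    conn.execute("INSERT INTO student VALUES (1, 'Ada')")
    conn.execute("INSERT INTO course  VALUES (10, 'Databases')")
    conn.execute("INSERT INTO enrollment VALUES (1, 10)")
    print(conn.execute("""
        SELECT s.name, c.title
        FROM student s JOIN enrollment e ON s.student_id = e.student_id
                       JOIN course c     ON c.course_id  = e.course_id
    """).fetchall())   # [('Ada', 'Databases')]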
See also
Barker's notation
From Wikipedia, the free encyclopedia
Barker's notation refers to the ERD notation developed by Richard Barker, Ian Palmer, Harry
Ellis et al. whilst working at the British consulting firm CACI around 1981. The notation was
adopted by Barker when he joined Oracle and is effectively defined in his book Entity
Relationship Modelling as part of the CASE Method series of books. This notation was and still
is used by the Oracle CASE modelling tools. It is a variation of the "crow's foot" style of data modelling that was favoured by many over the original Chen style of ERD modelling because of its readability and efficient use of drawing space.
The notation has features that represent the properties of relationships including cardinality and optionality (the crow's foot and dashing of lines), exclusion (the exclusion arc), recursion
(looping structures) and use of abstraction (nested boxes).
Object-relational mapping
From Wikipedia, the free encyclopedia

Not to be confused with Object-Role Modeling.
Object-relational mapping (ORM, O/RM, and O/R mapping) in computer science is a
programming technique for converting data between incompatible type systems in object-
oriented programming languages. This creates, in effect, a "virtual object database" that can be
used from within the programming language. There are both free and commercial packages
available that perform object-relational mapping, although some programmers opt to create their
own ORM tools.

In object-oriented programming, data management tasks act on object-oriented (OO) objects that
are almost always non-scalar values. For example, consider an address book entry that represents
a single person along with zero or more phone numbers and zero or more addresses. This could
be modeled in an object-oriented implementation by a "Person object" with attributes/fields to
hold each data item that the entry comprises: the person's name, a list of phone numbers, and a
list of addresses. The list of phone numbers would itself contain "PhoneNumber objects" and so
on. The address book entry is treated as a single object by the programming language (it can be
referenced by a single variable containing a pointer to the object, for instance). Various methods
can be associated with the object, such as a method to return the preferred phone number, the
home address, and so on.
However, many popular database products such as structured query language database
management systems (SQL DBMS) can only store and manipulate scalar values such as integers
and strings organized within tables. The programmer must either convert the object values into
groups of simpler values for storage in the database (and convert them back upon retrieval), or
only use simple scalar values within the program. Object-relational mapping is used to
implement the first approach.
[1]

The heart of the problem is translating the logical representation of the objects into an atomized
form that is capable of being stored in the database, while preserving the properties of the objects
and their relationships so that they can be reloaded as objects when needed. If this storage and
retrieval functionality is implemented, the objects are said to be persistent.
[1]

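A bare-bones illustration of this first approach, written by hand rather than with an ORM library, might decompose the Person object described above into scalar rows and rebuild it on retrieval. The class, table, and column names are assumptions made for the example.

    import sqlite3
    from dataclasses import dataclass, field

    @dataclass
    class Person:
        name: str
        phone_numbers: list = field(default_factory=list)  # non-scalar attribute

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone  (person_id INTEGER REFERENCES person(person_id), number TEXT);
    """)

    def save(person):
        """Map one object to several scalar rows (object -> relational)."""
        cur = conn.execute("INSERT INTO person (name) VALUES (?)", (person.name,))
        pid = cur.lastrowid
        conn.executemany("INSERT INTO phone VALUES (?, ?)",
                         [(pid, n) for n in person.phone_numbers])
        return pid

    def load(pid):
        """Rebuild the object from its rows (relational -> object)."""
        name = conn.execute("SELECT name FROM person WHERE person_id=?",
                            (pid,)).fetchone()[0]
        numbers = [r[0] for r in conn.execute(
            "SELECT number FROM phone WHERE person_id=?", (pid,))]
        return Person(name, numbers)

    pid = save(Person("Ada Lovelace", ["555-0100", "555-0101"]))
    print(load(pid))  # Person(name='Ada Lovelace', phone_numbers=['555-0100', '555-0101'])

ORM libraries automate exactly this translation, generating the SQL from mappings between classes and tables.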
Object-role modeling
From Wikipedia, the free encyclopedia
(Redirected from Object-Role Modeling)
Not to be confused with Object-relational mapping.

example of an ORM2 diagram
Object-role modeling (ORM) is used to model the semantics of a universe of discourse. ORM is
often used for data modeling and software engineering.
An object-role model uses graphical symbols that are based on first order predicate logic and set
theory to enable the modeler to create an unambiguous definition of an arbitrary universe of
discourse.
The term "object-role model" was coined in the 1970s and ORM based tools have been used for
more than 30 years principally for data modeling. More recently ORM has been used to model
business rules, XML-Schemas, data warehouses, requirements engineering and web forms.
[1]

Concepts

Overview of object-role model notation, Stephen M. Richard (1999).
[2]

Facts
Object-role models are based on elementary facts, and expressed in diagrams that can be
verbalised into natural language. A fact is a proposition such as "John Smith was hired on 5
January 1995" or "Mary Jones was hired on 3 March 2010".
With ORM, propositions such as these are abstracted into "fact types", for example "Person was hired on Date", and the individual propositions are regarded as sample data. The difference
between a "fact" and an "elementary fact" is that an elementary fact cannot be simplified without
loss of meaning. This "fact-based" approach facilitates modeling, transforming, and querying
information from any domain.
[3]

Attribute-free
ORM is attribute-free: unlike models in the entity relationship (ER) and Unified Modeling
Language (UML) methods, ORM treats all elementary facts as relationships and so treats
decisions for grouping facts into structures (e.g. attribute-based entity types, classes, relation
schemes, XML schemas) as implementation concerns irrelevant to semantics. By avoiding
attributes in the base model, ORM improves semantic stability and enables verbalization into
natural language.
Fact-based modeling
Fact-based modeling includes procedures for mapping facts to attribute-based structures, such as
those of ER or UML.
[3]

Fact-based textual representations are based on formal subsets of native languages. ORM
proponents argue that ORM models are easier to understand by people without a technical
education. For example, proponents argue that object-role models are easier to understand than
declarative languages such as Object Constraint Language (OCL) and other graphical languages
such as UML class models.
[3]
Fact-based graphical notations are more expressive than those of
ER and UML. An object-role model can be automatically mapped to relational and deductive
databases (such as datalog).
[4]

ORM 2 graphical notation
ORM 2 is the latest generation of object-role modeling. The main objectives for the ORM 2
graphical notation are:
[5]

More compact display of ORM models without compromising clarity
Improved internationalization (e.g. avoid English language symbols)
Simplified drawing rules to facilitate creation of a graphical editor
Extended use of views for selectively displaying/suppressing detail
Support for new features (e.g. role path delineation, closure aspects, modalities)
Design procedure

Example of the application of Object Role Modeling in a "Schema for Geologic Surface", Stephen M.
Richard (1999).
[2]

System development typically involves several stages such as: feasibility study; requirements
analysis; conceptual design of data and operations; logical design; external design; prototyping;
internal design and implementation; testing and validation; and maintenance. The seven steps of
the conceptual schema design procedure are:
[6]

1. Transform familiar information examples into elementary facts, and apply quality checks
2. Draw the fact types, and apply a population check
3. Check for entity types that should be combined, and note any arithmetic derivations
4. Add uniqueness constraints, and check arity of fact types
5. Add mandatory role constraints, and check for logical derivations
6. Add value, set comparison and subtyping constraints
7. Add other constraints and perform final checks
ORM's conceptual schema design procedure (CSDP) focuses on the analysis and design of data.



Data warehouse
From Wikipedia, the free encyclopedia


Data Warehouse Overview
In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a
system used for reporting and data analysis. Integrating data from one or more disparate sources
creates a central repository of data, a data warehouse (DW). Data warehouses store current and
historical data and are used for creating trending reports for senior management reporting such as
annual and quarterly comparisons.
The data stored in the warehouse is uploaded from the operational systems (such as marketing,
sales, etc., shown in the figure to the right). The data may pass through an operational data store
for additional operations before it is used in the DW for reporting.
Contents
1 Types of systems
2 Software tools
3 Benefits
4 Generic data warehouse environment
5 History
6 Information storage
o 6.1 Facts
o 6.2 Dimensional vs. normalized approach for storage of data
7 Top-down versus bottom-up design methodologies
o 7.1 Bottom-up design
o 7.2 Top-down design
o 7.3 Hybrid design
8 Data warehouses versus operational systems
9 Evolution in organization use
10 See also
11 References
12 Further reading
13 External links
Types of systems
Data mart
A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), such as sales, finance or marketing. Data marts are often built and controlled
by a single department within an organization. Given their single-subject focus, data marts
usually draw data from only a few sources. The sources could be internal operational systems, a
central data warehouse, or external data.
[1]

Online analytical processing (OLAP)
Is characterized by a relatively low volume of transactions. Queries are often very complex and
involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP
applications are widely used in data mining. OLAP databases store aggregated,
historical data in multi-dimensional schemas (usually star schemas). OLAP systems typically have
data latency of a few hours, as opposed to data marts, where latency is expected to be closer to
one day.
Online Transaction Processing (OLTP)
Is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). OLTP
systems emphasize very fast query processing and maintaining data integrity in multi-access
environments. For OLTP systems, effectiveness is measured by the number of transactions per
second. OLTP databases contain detailed and current data. The schema used to store
transactional databases is the entity model (usually 3NF).
[2]

Predictive analysis
Predictive analysis is about finding and quantifying hidden patterns in the data using complex
mathematical models that can be used to predict future outcomes. Predictive analysis is
different from OLAP in that OLAP focuses on historical data analysis and is reactive in nature,
while predictive analysis focuses on the future. These systems are also used for CRM (Customer
Relationship Management).
[3]

Software tools
The typical extract-transform-load (ETL)-based data warehouse uses staging, data integration,
and access layers to house its key functions. The staging layer or staging database stores raw data
extracted from each of the disparate source data systems. The integration layer integrates the
disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet
another database, often called the data warehouse database, where the data is arranged into
hierarchical groups often called dimensions and into facts and aggregate facts. The combination
of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve
data.
[4]

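A toy end-to-end sketch of the staging, integration, and access layers described above is given below; the source records, the cleaning rule, and the dimension/fact layout are all illustrative assumptions.

    # Staging layer: raw extracts, copied as-is from the source systems.
    staging = [
        {"customer": " Ada Lovelace ", "product": "Widget", "amount": "19.50"},
        {"customer": "Grace Hopper",   "product": "Widget", "amount": "5.25"},
    ]

    # Integration layer (ODS-like): cleaned, typed, conformed records.
    ods = [{"customer": r["customer"].strip(),
            "product": r["product"],
            "amount": float(r["amount"])} for r in staging]

    # Warehouse layer: a tiny star schema of one fact table and two dimensions.
    dim_customer, dim_product, fact_sales = {}, {}, []
    for r in ods:
        cust_key = dim_customer.setdefault(r["customer"], len(dim_customer) + 1)
        prod_key = dim_product.setdefault(r["product"], len(dim_product) + 1)
        fact_sales.append({"customer_key": cust_key,
                           "product_key": prod_key,
                           "amount": r["amount"]})

    # Access layer: a simple aggregate query over the facts.
    total_by_product = {}
    for f in fact_sales:
        total_by_product[f["product_key"]] = (
            total_by_product.get(f["product_key"], 0) + f["amount"])
    print(total_by_product)   # {1: 24.75}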
This definition of the data warehouse focuses on data storage. The main source of the data is
cleaned, transformed, cataloged and made available for use by managers and other business
professionals for data mining, online analytical processing, market research and decision
support.
[5]
However, the means to retrieve and analyze data, to extract, transform and load data,
and to manage the data dictionary are also considered essential components of a data
warehousing system. Many references to data warehousing use this broader context. Thus, an
expanded definition for data warehousing includes business intelligence tools, tools to extract,
transform and load data into the repository, and tools to manage and retrieve metadata.
Benefits
A data warehouse maintains a copy of information from the source transaction systems. This
architectural complexity provides the opportunity to:
Congregate data from multiple sources into a single database so a single query engine can be
used to present data.
Mitigate the problem of database isolation level lock contention in transaction processing
systems caused by attempts to run large, long running, analysis queries in transaction
processing databases.
Maintain data history, even if the source transaction systems do not.
Integrate data from multiple source systems, enabling a central view across the enterprise. This
benefit is always valuable, but particularly so when the organization has grown by merger.
Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad
data.
Present the organization's information consistently.
Provide a single common data model for all data of interest regardless of the data's source.
Restructure the data so that it makes sense to the business users.
Restructure the data so that it delivers excellent query performance, even for complex analytic
queries, without impacting the operational systems.
Add value to operational business applications, notably customer relationship management
(CRM) systems.
Make decision-support queries easier to write.
Generic data warehouse environment
The environment for data warehouses and marts includes the following:
Source systems that provide data to the warehouse or mart;
Data integration technology and processes that are needed to prepare the data for use;
Different architectures for storing data in an organization's data warehouse or data marts;
Different tools and applications for the variety of users;
Metadata, data quality, and governance processes must be in place to ensure that the
warehouse or mart meets its purposes.
In regards to the source systems listed above, Rainer states, "A common source for the data in data warehouses is the company's operational databases, which can be relational databases."[6]
Regarding data integration, Rainer states, "It is necessary to extract data from source systems, transform them, and load them into a data mart or warehouse."[6]
Rainer discusses storing data in an organization's data warehouse or data marts.[6]
Metadata are data about data. IT personnel need information about data sources; database, table, and column names; refresh schedules; and data usage measures.[6]
Today, the most successful companies are those that can respond quickly and flexibly to market changes and opportunities. A key to this response is the effective and efficient use of data and information by analysts and managers.[6] A data warehouse is a repository of historical data that are organized by subject to support decision makers in the organization.[6] Once data are stored in a data mart or warehouse, they can be accessed.
History
The concept of data warehousing dates back to the late 1980s[7] when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data
warehousing concept was intended to provide an architectural model for the flow of data from
operational systems to decision support environments. The concept attempted to address the
various problems associated with this flow, mainly the high costs associated with it. In the
absence of a data warehousing architecture, an enormous amount of redundancy was required to
support multiple decision support environments. In larger corporations it was typical for multiple
decision support environments to operate independently. Though each environment served
different users, they often required much of the same stored data. The process of gathering,
cleaning and integrating data from various sources, usually from long-term existing operational
systems (usually referred to as legacy systems), was typically in part replicated for each
environment. Moreover, the operational systems were frequently reexamined as new decision
support requirements emerged. Often new requirements necessitated gathering, cleaning and
integrating new data from "data marts" that were tailored for ready access by users.
Key developments in early years of data warehousing were:
1960s – General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[8]
1970s – ACNielsen and IRI provide dimensional data marts for retail sales.[8]
1970s – Bill Inmon begins to define and discuss the term Data Warehouse.[citation needed]
1975 – Sperry Univac introduces MAPPER (MAintain, Prepare, and Produce Executive Reports), a database management and reporting system that includes the world's first 4GL. It was the first platform designed for building Information Centers (a forerunner of contemporary enterprise data warehousing platforms).
1983 – Teradata introduces a database management system specifically designed for decision support.
1983 – Sperry Corporation's Martyn Richard Jones[9] defines the Sperry Information Center approach, which, while not being a true DW in the Inmon sense, did contain many of the characteristics of DW structures and processes as defined previously by Inmon, and later by Devlin. It was first used at the TSB England & Wales. A subset of this work found its way into the much later papers of Devlin and Murphy.
1984 – Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases Data Interpretation System (DIS). DIS was a hardware/software package and GUI for business users to create a database management and analytic system.
1988 – Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system", in which they introduce the term "business data warehouse".[10]
1990 – Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
1991 – Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data warehouse.
1992 – Bill Inmon publishes the book Building the Data Warehouse.[11]
1995 – The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
1996 – Ralph Kimball publishes the book The Data Warehouse Toolkit.[12]
2000 – Daniel Linstedt releases the Data Vault, enabling real-time auditable data warehouses.
In 2012, Bill Inmon developed and made public a technology known as textual disambiguation. Textual disambiguation applies context to raw text and reformats the raw text and context into a standard database format. Once raw text is passed through textual disambiguation, it can easily and efficiently be accessed and analyzed by standard business intelligence technology. Textual disambiguation is accomplished through the execution of textual ETL. It is useful wherever raw text is found, such as in documents, Hadoop, email, and so forth.
Information storage
Facts
A fact is a value or measurement which represents a fact about the managed entity or system. Facts, as reported by the reporting entity, are said to be at the raw level.
For example, if a BTS (base transceiver station) receives 1,000 requests for traffic channel allocation, allocates 820, and rejects the remaining 180, it would report three facts or measurements to a management system:
tch_req_total = 1000
tch_req_success = 820
tch_req_fail = 180
Facts at the raw level are further aggregated to higher levels in various dimensions to extract more service- or business-relevant information from them. These are called aggregates, summaries, or aggregated facts. For example, if there are three BTSs in a city, the facts above can be aggregated from the BTS level to the city level in the network dimension, as sketched below.
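A minimal sketch of that roll-up, assuming three hypothetical BTS measurements:

    # Raw facts reported by three BTSs in the same city (invented numbers).
    raw_facts = [
        {"bts": "BTS-1", "city": "Springfield", "tch_req_total": 1000, "tch_req_success": 820},
        {"bts": "BTS-2", "city": "Springfield", "tch_req_total": 1500, "tch_req_success": 1290},
        {"bts": "BTS-3", "city": "Springfield", "tch_req_total": 700,  "tch_req_success": 610},
    ]

    # Aggregate from BTS level to city level along the network dimension.
    city_facts = {}
    for fact in raw_facts:
        agg = city_facts.setdefault(fact["city"],
                                    {"tch_req_total": 0, "tch_req_success": 0})
        agg["tch_req_total"] += fact["tch_req_total"]
        agg["tch_req_success"] += fact["tch_req_success"]

    print(city_facts)
    # {'Springfield': {'tch_req_total': 3200, 'tch_req_success': 2720}}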
Dimensional vs. normalized approach for storage of data
There are three or more leading approaches to storing data in a data warehouse; the most important are the dimensional approach and the normalized approach.
The dimensional approach refers to Ralph Kimball's approach, in which it is stated that the data warehouse should be modeled using a dimensional model/star schema. The normalized approach, also called the 3NF model (third normal form), refers to Bill Inmon's approach, in which it is stated that the data warehouse should be modeled using an E-R model/normalized model.
In a dimensional approach, transaction data are partitioned into "facts", which are generally
numeric transaction data, and "dimensions", which are the reference information that gives
context to the facts. For example, a sales transaction can be broken up into facts such as the
number of products ordered and the price paid for the products, and into dimensions such as
order date, customer name, product number, order ship-to and bill-to locations, and salesperson
responsible for receiving the order.
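Expressed as a star schema, the sales example above might look like the following hypothetical SQLite sketch; the table and column names are illustrative only.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Dimension tables: reference information that gives context to the facts.
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, order_date TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_number TEXT);
    -- Fact table: numeric transaction data, keyed by the surrounding dimensions.
    CREATE TABLE fact_sales (
        date_key      INTEGER REFERENCES dim_date(date_key),
        customer_key  INTEGER REFERENCES dim_customer(customer_key),
        product_key   INTEGER REFERENCES dim_product(product_key),
        quantity      INTEGER,
        amount        NUMERIC
    );
    """)
    conn.execute("INSERT INTO dim_date VALUES (20240105, '2024-01-05')")
    conn.execute("INSERT INTO dim_customer VALUES (1, 'Ada Lovelace')")
    conn.execute("INSERT INTO dim_product VALUES (1, 'P-100')")
    conn.execute("INSERT INTO fact_sales VALUES (20240105, 1, 1, 3, 59.97)")
    print(conn.execute("""
        SELECT c.customer_name, d.order_date, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_customer c ON f.customer_key = c.customer_key
        JOIN dim_date d     ON f.date_key = d.date_key
        GROUP BY c.customer_name, d.order_date
    """).fetchall())   # [('Ada Lovelace', '2024-01-05', 59.97)]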
A key advantage of a dimensional approach is that the data warehouse is easier for the user to
understand and to use. Also, the retrieval of data from the data warehouse tends to operate very
quickly.
[citation needed]
Dimensional structures are easy to understand for business users, because the
structure is divided into measurements/facts and context/dimensions. Facts are related to the organization's business processes and operational system, whereas the dimensions surrounding them contain context about the measurement (Kimball, Ralph 2008).
The main disadvantages of the dimensional approach are the following:
1. In order to maintain the integrity of facts and dimensions, loading the data warehouse with data
from different operational systems is complicated.
2. It is difficult to modify the data warehouse structure if the organization adopting the
dimensional approach changes the way in which it does business.
In the normalized approach, the data in the data warehouse are stored following, to a degree,
database normalization rules. Tables are grouped together by subject areas that reflect general
data categories (e.g., data on customers, products, finance, etc.). The normalized structure
divides data into entities, which creates several tables in a relational database. When applied in
large enterprises the result is dozens of tables that are linked together by a web of joins.
Furthermore, each of the created entities is converted into separate physical tables when the
database is implemented (Kimball, Ralph 2008)
[citation needed]
. The main advantage of this approach
is that it is straightforward to add information into the database. Some disadvantages of this
approach are that, because of the number of tables involved, it can be difficult for users to join
data from different sources into meaningful information and to access the information without a
precise understanding of the sources of data and of the data structure of the data warehouse.
Both normalized and dimensional models can be represented in entity-relationship diagrams as
both contain joined relational tables. The difference between the two models is the degree of
normalization (also known as Normal Forms). These approaches are not mutually exclusive, and
there are other approaches. Dimensional approaches can involve normalizing data to a degree
(Kimball, Ralph 2008).
In Information-Driven Business,
[13]
Robert Hillard proposes an approach to comparing the two
approaches based on the information needs of the business problem. The technique shows that
normalized models hold far more information than their dimensional equivalents (even when the
same fields are used in both models) but this extra information comes at the cost of usability. The
technique measures information quantity in terms of information entropy and usability in terms
of the Small Worlds data transformation measure.
[14]

Top-down versus bottom-up design methodologies

Bottom-up design
Ralph Kimball[15] created an approach to data warehouse design known as bottom-up.[16] In the
bottom-up approach, data marts are first created to provide reporting and analytical capabilities
for specific business processes.
These data marts can eventually be integrated to create a comprehensive data warehouse. The
data warehouse bus architecture is primarily an implementation of "the bus", a collection of
conformed dimensions and conformed facts, which are dimensions that are shared (in a specific
way) between facts in two or more data marts.
Top-down design
Bill Inmon has defined a data warehouse as a centralized repository for the entire enterprise.
[17]

The top-down approach is designed using a normalized enterprise data model. "Atomic" data,
that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts
containing data needed for specific business processes or specific departments are created from
the data warehouse. In the Inmon vision, the data warehouse is at the center of the "Corporate
Information Factory" (CIF), which provides a logical framework for delivering business
intelligence (BI) and business management capabilities. Gartner released a research note confirming Inmon's definition in 2005[18] with additional clarity. They also added one attribute.
Hybrid design
Data warehouse (DW) solutions often resemble the hub and spokes architecture. Legacy systems
feeding the DW/BI solution often include customer relationship management (CRM) and
enterprise resource planning solutions (ERP), generating large amounts of data. To consolidate
these various data models, and facilitate the extract transform load (ETL) process, DW solutions
often make use of an operational data store (ODS). The information from the ODS is then parsed
into the actual DW. To reduce data redundancy, larger systems will often store the data in a
normalized way. Data marts for specific reports can then be built on top of the DW solution.
The DW database in a hybrid solution is kept on third normal form to eliminate data redundancy.
A normal relational database, however, is not efficient for business intelligence reports where
dimensional modelling is prevalent. Small data marts can shop for data from the consolidated
warehouse and use the filtered, specific data for the fact tables and dimensions required. The DW
effectively provides a single source of information from which the data marts can read, creating a
highly flexible solution from a BI point of view. The hybrid architecture allows a DW to be
replaced with a master data management solution where operational, not static information could
reside.
The Data Vault Modeling components follow hub and spokes architecture. This modeling style is
a hybrid design, consisting of the best practices from both 3rd normal form and star schema. The
Data Vault model is not a true 3rd normal form, and breaks some of the rules that 3NF dictates be followed. It is, however, a top-down architecture with a bottom-up design. The Data Vault
model is geared to be strictly a data warehouse. It is not geared to be end-user accessible, which
when built, still requires the use of a data mart or star schema based release area for business
purposes.
Data warehouses versus operational systems
Operational systems are optimized for preservation of data integrity and speed of recording of
business transactions through use of database normalization and an entity-relationship model.
Operational system designers generally follow the Codd rules of database normalization in order
to ensure data integrity. Codd defined five increasingly stringent rules of normalization. Fully
normalized database designs (that is, those satisfying all five Codd rules) often result in
information from a business transaction being stored in dozens to hundreds of tables. Relational
databases are efficient at managing the relationships between these tables. The databases have
very fast insert/update performance because only a small amount of data in those tables is
affected each time a transaction is processed. Finally, in order to improve performance, older
data are usually periodically purged from operational systems.
Data warehouses are optimized for analytic access patterns. Analytic access patterns generally
involve selecting specific fields and rarely if ever 'select *' as is more common in operational
databases. Because of these differences in access patterns, operational databases (loosely, OLTP)
benefit from the use of a row-oriented DBMS whereas analytics databases (loosely, OLAP)
benefit from the use of a column-oriented DBMS. Unlike operational systems which maintain a
snapshot of the business, data warehouses generally maintain an infinite history which is
implemented through ETL processes that periodically migrate data from the operational systems
over to the data warehouse.
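The row-oriented versus column-oriented distinction can be illustrated with a small in-memory sketch; the data and layouts are assumptions, and real DBMS engines add compression, indexing, and much more.

    # The same tiny table in two physical layouts.
    row_store = [                               # one record per row (OLTP-friendly)
        {"order_id": 1, "customer": "Ada",   "amount": 19.99},
        {"order_id": 2, "customer": "Grace", "amount": 5.00},
        {"order_id": 3, "customer": "Ada",   "amount": 7.50},
    ]
    column_store = {                            # one array per column (OLAP-friendly)
        "order_id": [1, 2, 3],
        "customer": ["Ada", "Grace", "Ada"],
        "amount":   [19.99, 5.00, 7.50],
    }

    # An analytic query ("total amount") touches only one column in the column
    # store, but every full record in the row store.
    print(sum(r["amount"] for r in row_store))   # scans whole rows
    print(sum(column_store["amount"]))           # scans a single column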
Evolution in organization use
These terms refer to the level of sophistication of a data warehouse:
Offline operational data warehouse
Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated, reporting-oriented data structure.
Offline data warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular basis and the data warehouse data are stored in a data structure designed to facilitate reporting.
On-time data warehouse
Online integrated data warehousing represents the real-time stage: data in the warehouse is updated for every transaction performed on the source data.
Integrated data warehouse
These data warehouses assemble data from different areas of business, so users can look up the information they need across other systems.
Data mart
From Wikipedia, the free encyclopedia
A data mart is the access layer of the data warehouse environment that is used to get data out to
the users. The data mart is a subset of the data warehouse that is usually oriented to a specific
business line or team. Data marts are small slices of the data warehouse. Whereas data
warehouses have an enterprise-wide depth, the information in data marts pertains to a single
department. In some deployments, each department or business unit is considered the owner of
its data mart including all the hardware, software and data.
[1]
This enables each department to
use, manipulate and develop their data any way they see fit; without altering information inside
other data marts or the data warehouse. In other deployments where conformed dimensions are
used, this business unit ownership will not hold true for shared dimensions like customer,
product, etc.
Organizations build data warehouses and data marts because the information in their databases is not organized in a way that makes it easy to find what they need. In addition, complicated queries might take a long time to answer because the underlying database systems are designed to process millions of transactions per day. Transactional databases are designed to be updated, whereas data warehouses or marts are read-only. Data warehouses are designed to access large groups of related records.
Data marts improve end-user response time by allowing users to have access to the specific type
of data they need to view most often by providing the data in a way that supports the collective
view of a group of users.
A data mart is basically a condensed and more focused version of a data warehouse that reflects
the regulations and process specifications of each business unit within an organization. Each data
mart is dedicated to a specific business function or region. This subset of data may span across
many or all of an enterprise's functional subject areas. It is common for multiple data marts to be
used in order to serve the needs of each individual business unit (different data marts can be used
to obtain specific information for various enterprise departments, such as accounting, marketing,
sales, etc.).
The related term spreadmart is a derogatory label describing the situation that occurs when one
or more business analysts develop a system of linked spreadsheets to perform a business
analysis, then grow it to a size and degree of complexity that makes it nearly impossible to
maintain.
Contents
1 Data Mart vs Data Warehouse
2 Design schemas
3 Reasons for creating a data mart
4 Dependent data mart
5 See also
6 References
7 Bibliography
8 External links
Data Mart vs Data Warehouse
Data Warehouse:
Holds multiple subject areas
Holds very detailed information
Works to integrate all data sources
Does not necessarily use a dimensional model but feeds dimensional models.
Data Mart:
Often holds only one subject area (for example, Finance or Sales)
May hold more summarized data (although many hold full detail)
Concentrates on integrating information from a given subject area or set of source
systems
Is built focused on a dimensional model using a star schema.
Design schemas
Star schema - a fairly popular design choice; enables a relational database to emulate the analytical functionality of a multidimensional database (a minimal sketch follows this list)
Snowflake schema
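The following sketch (Python with the standard sqlite3 module; all table, column, and value names are hypothetical) shows a tiny star schema: one fact table of sales surrounded by two dimension tables, queried with the kind of group-by join that emulates multidimensional analysis:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Dimension tables describe the "who/what/when" of each fact.
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, month TEXT);
        -- The fact table holds the measures plus foreign keys to the dimensions.
        CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
    """)
    con.executemany("INSERT INTO dim_product VALUES (?, ?)",
                    [(1, "Shirts"), (2, "Shoes")])
    con.executemany("INSERT INTO dim_date VALUES (?, ?)",
                    [(10, "2014-01"), (11, "2014-02")])
    con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                    [(1, 10, 120.0), (2, 10, 80.0), (1, 11, 200.0)])

    # "Slice and dice": aggregate the facts by dimension attributes, which is the
    # kind of multidimensional query a star schema is designed to answer.
    for row in con.execute("""
            SELECT p.category, d.month, SUM(f.amount)
            FROM fact_sales f
            JOIN dim_product p ON f.product_id = p.product_id
            JOIN dim_date d    ON f.date_id    = d.date_id
            GROUP BY p.category, d.month"""):
        print(row)

A snowflake schema would further normalize the dimension tables (for example, splitting product category into its own table), at the cost of extra joins.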
Reasons for creating a data mart
Easy access to frequently needed data
Creates collective view by a group of users
Improves end-user response time
Ease of creation
Lower cost than implementing a full data warehouse
Potential users are more clearly defined than in a full data warehouse
Contains only business essential data and is less cluttered.
Dependent data mart
According to the Inmon school of data warehousing, a dependent data mart is a logical subset
(view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following
reasons:
A need for a special data model or schema: e.g., to restructure the data for OLAP
Performance: to offload the data mart to a separate computer for greater efficiency or to
obviate the need to manage that workload on the centralized data warehouse.
Security: to separate an authorized data subset selectively
Expediency: to bypass the data governance and authorizations required to incorporate a
new application on the Enterprise Data Warehouse
Proving Ground: to demonstrate the viability and ROI (return on investment) potential of
an application prior to migrating it to the Enterprise Data Warehouse
Politics: a coping strategy for IT (Information Technology) in situations where a user
group has more influence than funding or is not a good citizen on the centralized data
warehouse.
Politics: a coping strategy for consumers of data in situations where a data warehouse
team is unable to create a usable data warehouse.
According to the Inmon school of data warehousing, tradeoffs inherent with data marts include
limited scalability, duplication of data, data inconsistency with other silos of information, and
inability to leverage enterprise sources of data.
The alternative school of data warehousing is that of Ralph Kimball. In his view, a data
warehouse is nothing more than the union of all the data marts. This view helps to reduce costs
and provides fast development, but can create an inconsistent data warehouse, especially in large
organizations. Therefore, Kimball's approach is more suitable for small-to-medium
corporations.
[2]
Data integration
From Wikipedia, the free encyclopedia
Data integration involves combining data residing in different sources and providing users with
a unified view of these data.
[1]
This process becomes significant in a variety of situations, which
include both commercial (when two similar companies need to merge their databases) and
scientific (combining research results from different bioinformatics repositories, for example)
domains. Data integration appears with increasing frequency as the volume and the need to share
existing data explodes.
[2]
It has become the focus of extensive theoretical work, and numerous
open problems remain unsolved. In management circles, people frequently refer to data
integration as "Enterprise Information Integration" (EII).
Contents
1 History
2 Example
3 Theory of data integration
o 3.1 Definitions
o 3.2 Query processing
4 Data Integration in the Life Sciences
5 See also
6 References
7 Further reading
History
Figure 1: Simple schematic for a data warehouse. The ETL process extracts information from the source databases, transforms it, and then loads it into the data warehouse.
Figure 2: Simple schematic for a data-integration solution. A system designer constructs a mediated schema against which users can run queries. The virtual database interfaces with the source databases via wrapper code if required.
Issues with combining heterogeneous data sources under a single query interface have existed for
some time. The rapid adoption of databases after the 1960s naturally led to the need to share or
to merge existing repositories. This merging can take place at several levels in the database
architecture.
One popular solution is implemented based on data warehousing (see figure 1). The warehouse
system extracts, transforms, and loads data from heterogeneous sources into a single view
schema so data becomes compatible with each other. This approach offers a tightly coupled
architecture because the data are already physically reconciled in a single queryable repository,
so it usually takes little time to resolve queries. However, problems lie in data freshness: the information in the warehouse is not always up to date. Updating an original data source can therefore leave the warehouse stale, so the ETL process must be re-executed for synchronization.
Difficulties also arise in constructing data warehouses when one has only a query interface to
summary data sources and no access to the full data. This problem frequently emerges when
integrating several commercial query services like travel or classified advertisement web
applications.
As of 2009 the trend in data integration has favored loosening the coupling between data sources [citation needed] and providing a unified query interface to access real-time data over a mediated schema (see figure 2), which allows information to be retrieved directly from the original databases. This approach relies on mappings between the mediated schema and the schemas of the original sources, which transform a query into specialized queries that match the schemas of the original databases.
Such mappings can be specified in two ways: as a mapping from entities in the mediated schema to
entities in the original sources (the "Global As View" (GAV) approach), or as a mapping from
entities in the original sources to the mediated schema (the "Local As View" (LAV) approach).
The latter approach requires more sophisticated inferences to resolve a query on the mediated
schema, but makes it easier to add new data sources to a (stable) mediated schema.
As of 2010 some of the work in data integration research concerns the semantic integration
problem. This problem addresses not the structuring of the architecture of the integration, but
how to resolve semantic conflicts between heterogeneous data sources. For example, if two
companies merge their databases, certain concepts and definitions in their respective schemas
like "earnings" inevitably have different meanings. In one database it may mean profits in dollars
(a floating-point number), while in the other it might represent the number of sales (an integer).
A common strategy for the resolution of such problems involves the use of ontologies which
explicitly define schema terms and thus help to resolve semantic conflicts. This approach
represents ontology-based data integration. On the other hand, the problem of combining
research results from different bioinformatics repositories requires bench-marking of the
similarities, computed from different data sources, on a single criterion such as positive
predictive value. This enables the data sources to be directly comparable and can be integrated
even when the natures of experiments are distinct.
[3]
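A minimal sketch of the ontology-based idea above, in Python: both source schemas are mapped onto shared, explicitly defined concepts before their records are compared, so that an ambiguous field like "earnings" is resolved into distinct concepts. All field and concept names here are hypothetical.

    ONTOLOGY = {
        "company_a": {"earnings": "profit_usd", "cust": "customer_name"},
        "company_b": {"earnings": "units_sold", "client": "customer_name"},
    }

    def to_shared_concepts(source, record):
        """Rename source-specific fields to the shared ontology concepts."""
        mapping = ONTOLOGY[source]
        return {mapping.get(field, field): value for field, value in record.items()}

    a = to_shared_concepts("company_a", {"earnings": 1200000.0, "cust": "Acme"})
    b = to_shared_concepts("company_b", {"earnings": 4500, "client": "Acme"})
    # After mapping, "earnings" is no longer ambiguous: one record carries
    # profit_usd, the other units_sold, and both share customer_name.
    print(a, b)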
As of 2011 it was determined that current data modeling methods were imparting data isolation
into every data architecture in the form of islands of disparate data and information silos each of
which represents a disparate system. This data isolation is an unintended artifact of the data
modeling methodology that results in the development of disparate data models.
[4]
Disparate data
models, when instantiated as databases, form disparate databases. Enhanced data model
methodologies have been developed to eliminate the data isolation artifact and to promote the
development of integrated data models.
[5]
[6]
One enhanced data modeling method recasts data
models by augmenting them with structural metadata in the form of standardized data entities.
As a result of recasting multiple data models, the set of recast data models will now share one or
more commonality relationships that relate the structural metadata now common to these data
models. Commonality relationships are a peer-to-peer type of entity relationships that relate the
standardized data entities of multiple data models. Multiple data models that contain the same
standard data entity may participate in the same commonality relationship. When integrated data
models are instantiated as databases and are properly populated from a common set of master
data, then these databases are integrated.
Example
Consider a web application where a user can query a variety of information about cities (such as
crime statistics, weather, hotels, demographics, etc.). Traditionally, the information must be
stored in a single database with a single schema. But any single enterprise would find
information of this breadth somewhat difficult and expensive to collect. Even if the resources
exist to gather the data, it would likely duplicate data in existing crime databases, weather
websites, and census data.
A data-integration solution may address this problem by considering these external resources as
materialized views over a virtual mediated schema, resulting in "virtual data integration". This means application developers construct a virtual schema (the mediated schema) to best model the kinds of answers their users want. Next, they design "wrappers" or adapters for each
data source, such as the crime database and weather website. These adapters simply transform
the local query results (those returned by the respective websites or databases) into an easily
processed form for the data integration solution (see figure 2). When an application-user queries
the mediated schema, the data-integration solution transforms this query into appropriate queries
over the respective data sources. Finally, the virtual database combines the results of these
queries into the answer to the user's query.
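A toy version of this flow in Python: the mediated schema exposes one logical record per city, and per-source wrappers translate each source's own result shape into that common shape before the results are combined. The source names, fields, and values are hypothetical stand-ins for real services.

    def crime_wrapper(city):
        # Pretend call to a crime database; rows come back in the source's own shape.
        raw = [{"town": city, "offences_per_1000": 42.0}]
        return [{"city": r["town"], "crime_rate": r["offences_per_1000"]} for r in raw]

    def weather_wrapper(city):
        # Pretend call to a weather website.
        raw = [{"location": city, "avg_temp_c": 11.5}]
        return [{"city": r["location"], "avg_temp_c": r["avg_temp_c"]} for r in raw]

    def query_mediated_schema(city):
        """Fan the user's query out to every wrapper and combine the results."""
        combined = {"city": city}
        for wrapper in (crime_wrapper, weather_wrapper):
            for row in wrapper(city):
                combined.update(row)
        return combined

    print(query_mediated_schema("Springfield"))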
This solution offers the convenience of adding new sources by simply constructing an adapter or
an application software blade for them. It contrasts with ETL systems or with a single database solution, which require manual integration of an entire new data set into the system. Virtual ETL solutions leverage a virtual mediated schema to implement data harmonization, whereby the data are copied from the designated "master" source to the defined targets, field by field. Advanced data virtualization is also built on the concept of object-oriented modeling in order to construct a virtual mediated schema or virtual metadata repository, using a hub-and-spoke architecture.
Each data source is disparate and as such is not designed to support reliable joins between data
sources. Therefore, data virtualization as well as data federation depends upon accidental data
commonality to support combining data and information from disparate data sets. Because of this
lack of data value commonality across data sources, the return set may be inaccurate, incomplete,
and impossible to validate.
One solution is to recast disparate databases to integrate these databases without the need for
ETL. The recast databases support commonality constraints where referential integrity may be
enforced between databases. The recast databases provide designed data access paths with data
value commonality across databases.
Theory of data integration
The theory of data integration
[1]
forms a subset of database theory and formalizes the underlying
concepts of the problem in first-order logic. Applying the theories gives indications as to the
feasibility and difficulty of data integration. While its definitions may appear abstract, they have
sufficient generality to accommodate all manner of integration systems.
[citation needed]
Definitions
Data integration systems are formally defined as a triple (G, S, M) where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the global schemas. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users pose queries over the data integration system, they pose queries over G and the mapping then asserts connections between the elements in the global schema and the source schemas.
A database over a schema is defined as a set of sets, one for each relation (in a relational database). The database corresponding to the source schema S would comprise the set of sets of tuples for each of the heterogeneous data sources and is called the source database. Note that this single source database may actually represent a collection of disconnected databases. The database corresponding to the virtual mediated schema G is called the global database. The global database must satisfy the mapping M with respect to the source database. The legality of this mapping depends on the nature of the correspondence between G and S. Two popular ways to model this correspondence exist: Global as View or GAV and Local as View or LAV.
Figure 3: Illustration of the tuple space of the GAV and LAV mappings. [7] In GAV, the system is constrained to the set of tuples mapped by the mediators, while the set of tuples expressible over the sources may be much larger and richer. In LAV, the system is constrained to the set of tuples in the sources, while the set of tuples expressible over the global schema can be much larger. Therefore LAV systems must often deal with incomplete answers.
GAV systems model the global database as a set of views over S. In this case M associates to each element of G a query over S. Query processing becomes a straightforward operation due to the well-defined associations between G and S. The burden of complexity falls on implementing mediator code instructing the data integration system exactly how to retrieve elements from the source databases. If any new sources join the system, considerable effort may be necessary to update the mediator, thus the GAV approach appears preferable when the sources seem unlikely to change.
In a GAV approach to the example data integration system above, the system designer would
first develop mediators for each of the city information sources and then design the global
schema around these mediators. For example, consider if one of the sources served a weather
website. The designer would likely then add a corresponding element for weather to the global
schema. Then the bulk of effort concentrates on writing the proper mediator code that will
transform predicates on weather into a query over the weather website. This effort can become
complex if some other source also relates to weather, because the designer may need to write
code to properly combine the results from the two sources.
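A minimal GAV-style sketch in Python, under the same hypothetical city-information scenario: each global relation is defined by hand-written mediator code that says exactly how to compute it from the sources, so answering a query over the global schema is simple expansion of that definition. The relation names are invented and the "sources" are just in-memory tables.

    SOURCES = {
        "weather_site": [("Springfield", 11.5), ("Shelbyville", 13.0)],
        "crime_db":     [("Springfield", 42.0)],
    }

    # The mapping M in GAV style: global relation -> mediator function over the sources.
    GLOBAL_AS_VIEW = {
        "city_weather": lambda src: [{"city": c, "avg_temp_c": t}
                                     for c, t in src["weather_site"]],
        "city_crime":   lambda src: [{"city": c, "crime_rate": r}
                                     for c, r in src["crime_db"]],
    }

    def answer(global_relation):
        """Query processing under GAV: expand the view definition over the sources."""
        return GLOBAL_AS_VIEW[global_relation](SOURCES)

    print(answer("city_weather"))

Adding a new source in this style means revisiting and rewriting the mediator functions, which is why GAV suits a stable set of sources.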
On the other hand, in LAV, the source database is modeled as a set of views over G. In this case M associates to each element of S a query over G. Here the exact associations between G and S are no longer well-defined. As is illustrated in the next section, the burden of determining how to retrieve elements from the sources is placed on the query processor. The benefit of LAV modeling is that new sources can be added with far less work than in a GAV system, thus the LAV approach should be favored in cases where the mediated schema is stable and unlikely to change. [1]
In an LAV approach to the example data integration system above, the system designer designs
the global schema first and then simply inputs the schemas of the respective city information
sources. Consider again if one of the sources serves a weather website. The designer would add
corresponding elements for weather to the global schema only if none existed already. Then
programmers write an adapter or wrapper for the website and add a schema description of the
website's results to the source schemas. The complexity of adding the new source moves from
the designer to the query processor.
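An LAV-style counterpart to the earlier sketch, in Python: each source is described only in terms of the global schema (here, simply by listing which global attributes it can supply), and a query processor, not a hand-written mediator, decides which sources to combine. The names are hypothetical and the "rewriting" is deliberately naive; it only illustrates where the work moves.

    SOURCE_DESCRIPTIONS = {
        "weather_site": {"provides": {"city", "avg_temp_c"}},
        "crime_db":     {"provides": {"city", "crime_rate"}},
    }

    def plan(query_attributes):
        """Pick a set of sources whose descriptions jointly cover the query."""
        chosen, covered = [], set()
        for name, desc in SOURCE_DESCRIPTIONS.items():
            if desc["provides"] & (query_attributes - covered):
                chosen.append(name)
                covered |= desc["provides"]
        # If coverage is incomplete, the answer may itself be incomplete (see Figure 3).
        return chosen if query_attributes <= covered else None

    # Adding a new source only requires a new description, not new mediator code.
    SOURCE_DESCRIPTIONS["hotel_api"] = {"provides": {"city", "avg_room_price"}}
    print(plan({"city", "avg_temp_c", "crime_rate"}))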
Query processing
The theory of query processing in data integration systems is commonly expressed using
conjunctive queries and Datalog, a purely declarative logic programming language.
[8]
One can loosely think of a conjunctive query as a logical function applied to the relations of a database, such as "f(x, y) where x < y". If a tuple or set of tuples is substituted into the rule and satisfies it (makes it true), then we consider that tuple as part of the set of answers in the query. While formal languages like Datalog express these queries concisely and without ambiguity, common SQL queries count as conjunctive queries as well.
In terms of data integration, "query containment" represents an important property of conjunctive queries. A query Q1 contains another query Q2 (denoted Q1 ⊇ Q2) if the results of applying Q2 are a subset of the results of applying Q1 for any database. The two queries are said to be equivalent if the resulting sets are equal for any database. This is important because in both GAV and LAV systems, a user poses conjunctive queries over a virtual schema represented by a set of views, or "materialized" conjunctive queries. Integration seeks to rewrite the queries represented by the views to make their results equivalent or maximally contained by the user's query. This corresponds to the problem of answering queries using views (AQUV).
[9]
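A small Python illustration of containment on a single sample relation. Note that true containment must hold for every database; checking one instance, as here, can only refute containment, never prove it, so this is purely illustrative.

    R = [(1, 2), (2, 1), (3, 4)]               # a sample relation of pairs

    def q1(db):
        return {t for t in db}                 # Q1: all tuples of R

    def q2(db):
        return {t for t in db if t[0] < t[1]}  # Q2: tuples of R with x < y

    # On this database Q2's answers are a subset of Q1's, consistent with Q1 containing Q2.
    print(q2(R) <= q1(R))                      # True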
In GAV systems, a system designer writes mediator code to define the query-rewriting. Each
element in the user's query corresponds to a substitution rule just as each element in the global
schema corresponds to a query over the source. Query processing simply expands the subgoals
of the user's query according to the rule specified in the mediator and thus the resulting query is
likely to be equivalent. While the designer does the majority of the work beforehand, some GAV
systems such as Tsimmis involve simplifying the mediator description process.
In LAV systems, queries undergo a more radical process of rewriting because no mediator exists
to align the user's query with a simple expansion strategy. The integration system must execute a
search over the space of possible queries in order to find the best rewrite. The resulting rewrite
may not be an equivalent query but maximally contained, and the resulting tuples may be
incomplete. As of 2009 the MiniCon algorithm
[9]
is the leading query rewriting algorithm for
LAV data integration systems.
In general, the complexity of query rewriting is NP-complete.
[9]
If the space of rewrites is
relatively small this does not pose a problem even for integration systems with hundreds of
sources.
Data Integration in the Life Sciences
Large-scale questions in science, such as global warming, invasive species spread, and resource
depletion, are increasingly requiring the collection of disparate data sets for meta-analysis. This
type of data integration is especially challenging for ecological and environmental data because
metadata standards are not agreed upon and there are many different data types produced in these
fields. National Science Foundation initiatives such as Datanet are intended to make data
integration easier for scientists by providing cyberinfrastructure and setting standards. The five
funded Datanet initiatives are DataONE,
[10]
led by William Michener at the University of New
Mexico; The Data Conservancy,
[11]
led by Sayeed Choudhury of Johns Hopkins University;
SEAD: Sustainable Environment through Actionable Data,
[12]
led by Margaret Hedstrom of the
University of Michigan; the DataNet Federation Consortium,
[13]
led by Reagan Moore of the
University of North Carolina; and Terra Populus,
[14]
led by Steven Ruggles of the University of
Minnesota. The Research Data Alliance,
[15]
has more recently explored creating global data
integration frameworks.
Online transaction processing
From Wikipedia, the free encyclopedia
(Redirected from OLTP)
Online transaction processing, or OLTP, is a class of information systems that facilitate and
manage transaction-oriented applications, typically for data entry and retrieval transaction
processing. The term is somewhat ambiguous; some understand a "transaction" in the context of
computer or database transactions, while others (such as the Transaction Processing Performance
Council) define it in terms of business or commercial transactions.
[1]
OLTP has also been used to
refer to processing in which the system responds immediately to user requests. An automated
teller machine (ATM) for a bank is an example of a commercial transaction processing
application. Online transaction processing applications are high throughput and insert or update-
intensive in database management. These applications are used concurrently by hundreds of
users. The key goals of OLTP applications are availability, speed, concurrency and
recoverability.
[2]
Reduced paper trails and faster, more accurate forecasts of revenues and expenses are both examples of how OLTP makes things simpler for businesses. However, like many modern online information technology solutions, some systems require offline maintenance, which further affects the cost-benefit analysis of an online transaction processing system.
Contents
1 What Is an OLTP System?
2 Online Transaction Processing Systems Design
3 Contrasted to
4 See also
5 References
6 External links
What Is an OLTP System?
An OLTP system is a popular data processing system in today's enterprises. Some examples of
OLTP systems include order entry, retail sales, and financial transaction systems.
[3]
Online transaction processing systems increasingly require support for transactions that span a network and may include more than one company. For this reason, modern online transaction processing software uses client/server processing and brokering software that allows transactions to run on different computer platforms in a network.
In large applications, efficient OLTP may depend on sophisticated transaction management
software (such as CICS) and/or database optimization tactics to facilitate the processing of large
numbers of concurrent updates to an OLTP-oriented database.
For even more demanding decentralized database systems, OLTP brokering programs can
distribute transaction processing among multiple computers on a network. OLTP is often
integrated into service-oriented architecture (SOA) and Web services.
Online Transaction Processing (OLTP) involves gathering input information, processing the
information and updating existing information to reflect the gathered and processed information.
As of today, most organizations use a database management system to support OLTP. OLTP is typically carried out in a client-server system.
Online transaction processing is concerned with concurrency and atomicity. Concurrency controls guarantee that two users accessing the same data in the database system cannot change that data at the same time; one user has to wait until the other user has finished processing before changing that piece of data. Atomicity controls guarantee that all the steps in a transaction are completed successfully as a group: if any step in the transaction fails, all other steps must fail also.
[4]
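A minimal sketch of atomicity in an OLTP-style transfer, using Python's standard sqlite3 module. Either both balance updates become visible together at commit, or, on any error, both are rolled back. The table and column names are hypothetical.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
    con.commit()

    def transfer(con, src, dst, amount):
        try:
            con.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            con.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            con.commit()          # both steps take effect together
        except sqlite3.Error:
            con.rollback()        # on failure, neither step takes effect

    transfer(con, 1, 2, 25.0)
    print(con.execute("SELECT id, balance FROM accounts").fetchall())

In a real OLTP system the database engine additionally provides the concurrency controls described above, so many such transfers can run at the same time without interfering with one another.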
Online Transaction Processing Systems Design
To build an OLTP system, a designer must ensure that a large number of concurrent users does not interfere with the system's performance. To increase the performance of an OLTP system, a designer should avoid the excessive use of indexes and clusters.
The following elements are crucial for the performance of OLTP systems:
[5]
Rollback segments
Rollback segments are the portions of database that record the actions of transactions in
the event that a transaction is rolled back. Rollback segments provide read consistency,
roll back transactions, and recover the database.
[6]
Clusters
A cluster is a schema that contains one or more tables that have one or more columns in
common. Clustering tables in a database improves the performance of join operations.
[7]
Discrete transactions
During a discrete transaction, all changes to the data are deferred until the transaction commits. This can improve the performance of short, non-distributed transactions.
[8]
Block (data storage) size
The data block size should be a multiple of the operating system's block size within the
maximum limit to avoid unnecessary I/O.
[9]
Buffer cache size
To avoid unnecessary resource consumption, tune SQL statements to use the database
buffer cache.
[10]
Dynamic allocation of space to tables and rollback segments
Transaction processing monitors and the multi-threaded server
A transaction processing monitor is used for coordination of services. It is like an
operating system and does the coordination at a high level of granularity and can span
multiple computing devices.
[11]
Partition (database)
Partitioning increases performance for sites that have regular transactions while still maintaining availability and security.
[12]
Database tuning
With database tuning, an OLTP system can maximize its performance as efficiently and rapidly as possible.
Batch processing
From Wikipedia, the free encyclopedia
Batch processing is the execution of a series of programs ("jobs") on a computer without
manual intervention.
Jobs are set up so they can be run to completion without human interaction. All input parameters
are predefined through scripts, command-line arguments, control files, or job control language.
This is in contrast to "online" or interactive programs which prompt the user for such input. A
program takes a set of data files as input, processes the data, and produces a set of output data
files. This operating environment is termed as "batch processing" because the input data are
collected into batches or sets of records and each batch is processed as a unit. The output is
another batch that can be reused for computation.
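A minimal batch job in the sense described above, sketched in Python: no user interaction, all parameters come from the command line, one batch of input files goes in and one batch of output files comes out. The directory layout and the "amount" column are hypothetical assumptions.

    import csv, sys
    from pathlib import Path

    def process_batch(input_dir, output_dir):
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        for path in sorted(Path(input_dir).glob("*.csv")):
            # Each input file is assumed to have an "amount" column.
            with open(path, newline="") as f:
                total = sum(float(row["amount"]) for row in csv.DictReader(f))
            # Each input file yields one summary record in the output batch.
            with open(out / (path.stem + ".summary.csv"), "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["file", "total_amount"])
                writer.writerow([path.name, total])

    if __name__ == "__main__":
        # e.g. python batch_job.py incoming/ processed/
        process_batch(sys.argv[1], sys.argv[2])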
Data mining
From Wikipedia, the free encyclopedia
Not to be confused with analytics, information extraction, or data analysis.
Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD),
[1]
an interdisciplinary subfield of computer science,
[2][3][4]
is the computational process of
discovering patterns in large data sets involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems.
[2]
The overall goal of the data
mining process is to extract information from a data set and transform it into an understandable
structure for further use.
[2]
Aside from the raw analysis step, it involves database and data
management aspects, data pre-processing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of discovered structures, visualization, and
online updating.
[2]
The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of data itself. [5] It also is a buzzword, [6] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The popular
book "Data mining: Practical machine learning tools and techniques with Java"
[7]
(which covers
mostly machine learning material) was originally to be named just "Practical machine learning",
and the term "data mining" was only added for marketing reasons.
[8]
Often the more general terms "(large-scale) data analysis" or "analytics", or, when referring to actual methods, artificial intelligence and machine learning, are more appropriate.
The actual data mining task is the automatic or semi-automatic analysis of large quantities of
data to extract previously unknown interesting patterns such as groups of data records (cluster
analysis), unusual records (anomaly detection) and dependencies (association rule mining). This
usually involves using database techniques such as spatial indices. These patterns can then be
seen as a kind of summary of the input data, and may be used in further analysis or, for example,
in machine learning and predictive analytics. For example, the data mining step might identify
multiple groups in the data, which can then be used to obtain more accurate prediction results by
a decision support system. Neither the data collection, data preparation, nor result interpretation
and reporting are part of the data mining step, but do belong to the overall KDD process as
additional steps.
The related terms data dredging, data fishing, and data snooping refer to the use of data mining
methods to sample parts of a larger population data set that are (or may be) too small for reliable
statistical inferences to be made about the validity of any patterns discovered. These methods
can, however, be used in creating new hypotheses to test against the larger data populations.
Contents
1 Etymology
2 Background
o 2.1 Research and evolution
3 Process
o 3.1 Pre-processing
o 3.2 Data mining
o 3.3 Results validation
4 Standards
5 Notable uses
o 5.1 Games
o 5.2 Business
o 5.3 Science and engineering
o 5.4 Human rights
o 5.5 Medical data mining
o 5.6 Spatial data mining
o 5.7 Temporal data mining
o 5.8 Sensor data mining
o 5.9 Visual data mining
o 5.10 Music data mining
o 5.11 Surveillance
o 5.12 Pattern mining
o 5.13 Subject-based data mining
o 5.14 Knowledge grid
6 Privacy concerns and ethics
o 6.1 Situation in the United States
o 6.2 Situation in Europe
7 Software
o 7.1 Free open-source data mining software and applications
o 7.2 Commercial data-mining software and applications
o 7.3 Marketplace surveys
8 See also
9 References
10 Further reading
11 External links
Etymology
In the 1960s, statisticians used terms like "Data Fishing" or "Data Dredging" to refer to what
they considered the bad practice of analyzing data without an a-priori hypothesis. The term "Data
Mining" appeared around 1990 in the database community. For a short time in 1980s, a phrase
"database mining", was used, but since it was trademarked by HNC, a San Diego-based
company (now merged into FICO), to pitch their Database Mining Workstation;
[9]
researchers
consequently turned to "data mining". Other terms used include Data Archaeology, Information
Harvesting, Information Discovery, Knowledge Extraction, etc. Gregory Piatetsky-Shapiro
coined the term "Knowledge Discovery in Databases" for the first workshop on the same topic
(KDD-1989) and this term became more popular in the AI and machine learning communities. However, the term data mining became more popular in the business and press communities. [10]
Currently, Data Mining and Knowledge Discovery are used interchangeably. Since about 2007 the term "Predictive Analytics", and since 2011 "Data Science", have also been used to describe this field.
Background
The manual extraction of patterns from data has occurred for centuries. Early methods of
identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The
proliferation, ubiquity and increasing power of computer technology have dramatically increased
data collection, storage, and manipulation ability. As data sets have grown in size and
complexity, direct "hands-on" data analysis has increasingly been augmented with indirect,
automated data processing, aided by other discoveries in computer science, such as neural
networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s),
and support vector machines (1990s). Data mining is the process of applying these methods with
the intention of uncovering hidden patterns
[11]
in large data sets. It bridges the gap from applied
statistics and artificial intelligence (which usually provide the mathematical background) to
database management by exploiting the way data is stored and indexed in databases to execute
the actual learning and discovery algorithms more efficiently, allowing such methods to be
applied to ever larger data sets.
Research and evolution
The premier professional body in the field is the Association for Computing Machinery's (ACM)
Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD).
[12][13]
Since
1989 this ACM SIG has hosted an annual international conference and published its
proceedings,
[14]
and since 1999 it has published a biannual academic journal titled "SIGKDD
Explorations".
[15]
Computer science conferences on data mining include:
CIKM Conference ACM Conference on Information and Knowledge Management
DMIN Conference International Conference on Data Mining
DMKD Conference Research Issues on Data Mining and Knowledge Discovery
ECDM Conference European Conference on Data Mining
ECML-PKDD Conference European Conference on Machine Learning and Principles
and Practice of Knowledge Discovery in Databases
EDM Conference International Conference on Educational Data Mining
ICDM Conference IEEE International Conference on Data Mining
KDD Conference ACM SIGKDD Conference on Knowledge Discovery and Data
Mining
MLDM Conference Machine Learning and Data Mining in Pattern Recognition
PAKDD Conference The annual Pacific-Asia Conference on Knowledge Discovery and
Data Mining
PAW Conference Predictive Analytics World
SDM Conference SIAM International Conference on Data Mining (SIAM)
SSTD Symposium Symposium on Spatial and Temporal Databases
WSDM Conference ACM Conference on Web Search and Data Mining
Data mining topics are also present on many data management/database conferences such as the
ICDE Conference, SIGMOD Conference and International Conference on Very Large Data
Bases
Process
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.
[1]
There are, however, many variations on this theme, such as the Cross Industry Standard Process for Data Mining (CRISP-DM) which defines six phases:
(1) Business Understanding
(2) Data Understanding
(3) Data Preparation
(4) Modeling
(5) Evaluation
(6) Deployment
or a simplified process such as (1) pre-processing, (2) data mining, and (3) results validation.
Polls conducted in 2002, 2004, and 2007 show that the CRISP-DM methodology is the leading
methodology used by data miners.
[16][17][18]
The only other data mining standard named in these
polls was SEMMA. However, 3-4 times as many people reported using CRISP-DM. Several
teams of researchers have published reviews of data mining process models,
[19][20]
and Azevedo
and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.
[21]
Pre-processing
Before data mining algorithms can be used, a target data set must be assembled. As data mining
can only uncover patterns actually present in the data, the target data set must be large enough to
contain these patterns while remaining concise enough to be mined within an acceptable time
limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to
analyze the multivariate data sets before data mining. The target set is then cleaned. Data
cleaning removes the observations containing noise and those with missing data.
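A small pre-processing sketch in Python (assuming the pandas library is available; the column names and values are hypothetical): assemble a target data set, then drop records with missing values and obviously implausible noise before mining.

    import pandas as pd

    raw = pd.DataFrame({
        "age":    [34, None, 29, 150, 41],          # 150 is an implausible outlier
        "income": [52000, 48000, None, 61000, 58000],
    })

    target = raw.dropna()                            # remove observations with missing data
    target = target[target["age"].between(0, 120)]   # remove noisy/implausible values
    print(target)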
Data mining
Data mining involves six common classes of tasks:
[1]
Anomaly detection (Outlier/change/deviation detection) The identification of unusual
data records, that might be interesting or data errors that require further investigation.
Association rule learning (Dependency modeling) Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
Clustering is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data (a minimal sketch follows this list).
Classification is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
Regression attempts to find a function which models the data with the least error.
Summarization providing a more compact representation of the data set, including
visualization and report generation.
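The following minimal clustering example (Python, assuming scikit-learn is available; the data points are invented) groups unlabeled two-dimensional records into two clusters without using any known structure, as described in the clustering task above:

    from sklearn.cluster import KMeans

    X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],     # one apparent group
         [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]     # another apparent group

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # e.g. [0 0 0 1 1 1] (cluster numbering may be swapped)

Classification, regression, and the other tasks in the list follow the same pattern of fitting a model to records, but with labeled training data and different model families.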
Results validation
Data mining can unintentionally be misused, and can then produce results which appear to be significant but which do not actually predict future behavior, cannot be reproduced on a new sample of data, and bear little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, and thus a train/test split (when applicable at all) may not be sufficient to prevent this from happening.
[citation needed]
The final step of knowledge discovery from data is to verify that the patterns produced by the
data mining algorithms occur in the wider data set. Not all patterns found by the data mining
algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in
the training set which are not present in the general data set. This is called overfitting. To
overcome this, the evaluation uses a test set of data on which the data mining algorithm was not
trained. The learned patterns are applied to this test set, and the resulting output is compared to
the desired output. For example, a data mining algorithm trying to distinguish "spam" from
"legitimate" emails would be trained on a training set of sample e-mails. Once trained, the
learned patterns would be applied to the test set of e-mails on which it had not been trained. The
accuracy of the patterns can then be measured from how many e-mails they correctly classify. A
number of statistical methods may be used to evaluate the algorithm, such as ROC curves.
If the learned patterns do not meet the desired standards, it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the
desired standards, then the final step is to interpret the learned patterns and turn them into
knowledge.
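A sketch of the validation step above in Python (assuming scikit-learn is available): hold out a test set the model never saw during training and measure how well the learned patterns carry over. The tiny synthetic data set stands in for e-mail features and spam/legitimate labels and is purely illustrative.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X = [[i, i % 3] for i in range(100)]           # hypothetical e-mail features
    y = [1 if i >= 50 else 0 for i in range(100)]  # 1 = "spam", 0 = "legitimate"

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    # Accuracy on unseen e-mails, not on the training set, is what matters here.
    print(accuracy_score(y_test, model.predict(X_test)))

ROC curves and other statistical measures mentioned above can be computed from the same held-out predictions.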
Standards
There have been some efforts to define standards for the data mining process, for example the
1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004
Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-
DM 2.0 and JDM 2.0) was active in 2006, but has stalled since. JDM 2.0 was withdrawn without
reaching a final draft.
For exchanging the extracted models in particular for use in predictive analytics the key
standard is the Predictive Model Markup Language (PMML), which is an XML-based language
developed by the Data Mining Group (DMG) and supported as exchange format by many data
mining applications. As the name suggests, it only covers prediction models, a particular data
mining task of high importance to business applications. However, extensions to cover (for
example) subspace clustering have been proposed independently of the DMG.
[22]
Notable uses
Games
Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened: the extraction of human-usable strategies from these oracles.
Current pattern recognition approaches do not seem to fully acquire the high level of abstraction
required to be applied successfully. Instead, extensive experimentation with the tablebases
combined with an intensive study of tablebase-answers to well designed problems, and with
knowledge of prior art (i.e., pre-tablebase knowledge) is used to yield insightful patterns.
Berlekamp (in dots-and-boxes, etc.) and John Nunn (in chess endgames) are notable examples of
researchers doing this work, though they were not and are not involved in tablebase
generation.
Business
In business, data mining is the analysis of historical business activities, stored as static data in
data warehouse databases. The goal is to reveal hidden patterns and trends. Data mining software
uses advanced pattern recognition algorithms to sift through large amounts of data to assist in
discovering previously unknown strategic business information. Examples of what businesses
use data mining for include performing market analysis to identify new product bundles, finding
the root cause of manufacturing problems, to prevent customer attrition and acquire new
customers, cross-sell to existing customers, and profile customers with more accuracy.
[23]
In today's world raw data is being collected by companies at an exploding rate. For
example, Walmart processes over 20 million point-of-sale transactions every day. This
information is stored in a centralized database, but would be useless without some type of
data mining software to analyze it. If Walmart analyzed their point-of-sale data with data
mining techniques they would be able to determine sales trends, develop marketing
campaigns, and more accurately predict customer loyalty.
[24]
Every time a credit card or a store loyalty card is used, or a warranty card is filled in, data is collected about the user's behavior. Many people find the amount of
information stored about us from companies, such as Google, Facebook, and Amazon,
disturbing and are concerned about privacy. Although there is the potential for our
personal data to be used in harmful, or unwanted, ways it is also being used to make our
lives better. For example, Ford and Audi hope to one day collect information about
customer driving patterns so they can recommend safer routes and warn drivers about
dangerous road conditions.
[25]
Data mining in customer relationship management applications can contribute
significantly to the bottom line.
[citation needed]
Rather than randomly contacting a prospect or
customer through a call center or sending mail, a company can concentrate its efforts on
prospects that are predicted to have a high likelihood of responding to an offer. More
sophisticated methods may be used to optimize resources across campaigns so that one
may predict to which channel and to which offer an individual is most likely to respond
(across all potential offers). Additionally, sophisticated applications could be used to
automate mailing. Once the results from data mining (potential prospect/customer and
channel/offer) are determined, this "sophisticated application" can either automatically
send an e-mail or a regular mail. Finally, in cases where many people will take an action
without an offer, "uplift modeling" can be used to determine which people have the
greatest increase in response if given an offer. Uplift modeling thereby enables marketers
to focus mailings and offers on persuadable people, and not to send offers to people who
will buy the product without an offer. Data clustering can also be used to automatically
discover the segments or groups within a customer data set.
Businesses employing data mining may see a return on investment, but also they
recognize that the number of predictive models can quickly become very large. For
example, rather than using one model to predict how many customers will churn, a
business may choose to build a separate model for each region and customer type. In
situations where a large number of models need to be maintained, some businesses turn
to more automated data mining methodologies.
Data mining can be helpful to human resources (HR) departments in identifying the
characteristics of their most successful employees. Information obtained such as
universities attended by highly successful employees can help HR focus recruiting
efforts accordingly. Additionally, Strategic Enterprise Management applications help a
company translate corporate-level goals, such as profit and margin share targets, into
operational decisions, such as production plans and workforce levels.
[26]
Market basket analysis relates to data-mining use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favor silk shirts over cotton ones. Although some explanations of relationships may be difficult, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction-based, and logical or inexact rules may also be present within a database.
Market basket analysis has been used to identify the purchase patterns of the Alpha
Consumer. Analyzing the data collected on this type of user has allowed companies to
predict future buying trends and forecast supply demands.
[citation needed]
Data mining is a highly effective tool in the catalog marketing industry.
[citation needed]
Catalogers have a rich database of history of their customer transactions for millions of
customers dating back a number of years. Data mining tools can identify patterns among
customers and help identify the most likely customers to respond to upcoming mailing
campaigns.
Data mining for business applications can be integrated into a complex modeling and
decision making process.
[27]
Reactive business intelligence (RBI) advocates a "holistic"
approach that integrates data mining, modeling, and interactive visualization into an end-
to-end discovery and continuous innovation process powered by human and automated
learning.
[28]
In the area of decision making, the RBI approach has been used to mine knowledge that
is progressively acquired from the decision maker, and then self-tune the decision method
accordingly.
[29]
The relation between the quality of a data mining system and the amount
of investment that the decision maker is willing to make was formalized by providing an
economic perspective on the value of extracted knowledge in terms of its payoff to the
organization.
[27]
This decision-theoretic classification framework
[27]
was applied to a real-
world semiconductor wafer manufacturing line, where decision rules for effectively
monitoring and controlling the semiconductor wafer fabrication line were developed.
[30]
An example of data mining related to an integrated-circuit (IC) production line is
described in the paper "Mining IC Test Data to Optimize VLSI Testing."
[31]
In this paper,
the application of data mining and decision analysis to the problem of die-level functional
testing is described. Experiments mentioned demonstrate the ability to apply a system of
mining historical die-test data to create a probabilistic model of patterns of die failure.
These patterns are then utilized to decide, in real time, which die to test next and when to
stop testing. This system has been shown, based on experiments with historical test data,
to have the potential to improve profits on mature IC products. Other examples
[32][33]
of
the application of data mining methodologies in semiconductor manufacturing
environments suggest that data mining methodologies may be particularly useful when
data is scarce, and the various physical and chemical parameters that affect the process
exhibit highly complex interactions. Another implication is that on-line monitoring of the
semiconductor manufacturing process using data mining may be highly effective.
Science and engineering
In recent years, data mining has been used widely in the areas of science and engineering, such
as bioinformatics, genetics, medicine, education and electrical power engineering.
In the study of human genetics, sequence mining helps address the important goal of
understanding the mapping relationship between the inter-individual variations in human
DNA sequence and the variability in disease susceptibility. In simple terms, it aims to
find out how the changes in an individual's DNA sequence affect the risk of developing
common diseases such as cancer, which is of great importance to improving methods of
diagnosing, preventing, and treating these diseases. One data mining method that is used
to perform this task is known as multifactor dimensionality reduction.
[34]
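A greatly simplified sketch of the multifactor dimensionality reduction idea is given below: genotype combinations at a pair of SNPs are pooled into "high-risk" and "low-risk" cells according to the case/control ratio observed in each cell, reducing the two-locus interaction to a single binary attribute. The genotype data are synthetic, and the cross-validation and permutation testing used by the real method are omitted.

```python
from collections import defaultdict
import random

random.seed(1)

# Synthetic genotypes (0/1/2 copies of the minor allele) at two SNPs,
# plus a case/control label with a built-in interaction effect.
samples = []
for _ in range(2000):
    g1, g2 = random.choice([0, 1, 2]), random.choice([0, 1, 2])
    p_case = 0.7 if (g1 >= 1 and g2 >= 1) else 0.3  # interaction, not main effects
    samples.append((g1, g2, 1 if random.random() < p_case else 0))

# Step 1: count cases and controls in each two-locus genotype cell.
cell_counts = defaultdict(lambda: [0, 0])  # (g1, g2) -> [cases, controls]
for g1, g2, label in samples:
    cell_counts[(g1, g2)][0 if label else 1] += 1

# Step 2: label a cell "high risk" if its case:control ratio exceeds the overall ratio.
total_cases = sum(c for c, _ in cell_counts.values())
total_controls = sum(c for _, c in cell_counts.values())
overall_ratio = total_cases / total_controls
high_risk = {cell for cell, (cases, controls) in cell_counts.items()
             if controls == 0 or cases / controls > overall_ratio}

# Step 3: the pooled high/low-risk attribute classifies each sample.
correct = sum(1 for g1, g2, label in samples
              if (1 if (g1, g2) in high_risk else 0) == label)
print(f"two-locus model classifies {correct / len(samples):.1%} of samples")
```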

In the area of electrical power engineering, data mining methods have been widely used
for condition monitoring of high voltage electrical equipment. The purpose of condition
monitoring is to obtain valuable information on, for example, the status of the insulation
(or other important safety-related parameters). Data clustering techniques such as the
self-organizing map (SOM), have been applied to vibration monitoring and analysis of
transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be
observed that each tap change operation generates a signal that contains information
about the condition of the tap changer contacts and the drive mechanisms. Obviously,
different tap positions will generate different signals. However, there was considerable
variability amongst normal condition signals for exactly the same tap position. SOM has
been applied to detect abnormal conditions and to hypothesize about the nature of the
abnormalities.
[35]
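To make the clustering approach concrete, the sketch below trains a small self-organizing map from scratch on feature vectors assumed to have been extracted from normal tap-change vibration signals, and flags a new signal as potentially abnormal when its quantization error (distance to the best-matching map unit) exceeds a threshold estimated from the normal data. The grid size, feature dimension, and synthetic data are all assumptions; real work would typically rely on an established SOM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(5, 5), epochs=20, lr0=0.5, sigma0=1.5):
    """Train a small self-organizing map on rows of `data` (n_samples, n_features)."""
    gx, gy = grid
    n_features = data.shape[1]
    weights = rng.normal(size=(gx, gy, n_features))
    coords = np.stack(np.meshgrid(np.arange(gx), np.arange(gy), indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # decaying learning rate and neighbourhood radius
            frac = step / n_steps
            lr = lr0 * (1 - frac)
            sigma = sigma0 * (1 - frac) + 1e-3
            # best-matching unit for this sample
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighbourhood pull towards the sample
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)
            step += 1
    return weights

def quantization_error(weights, x):
    """Distance from a sample to its best-matching map unit."""
    return np.min(np.linalg.norm(weights - x, axis=-1))

# Hypothetical 8-dimensional vibration features from normal tap-change operations.
normal_signals = rng.normal(loc=0.0, scale=1.0, size=(200, 8))
som = train_som(normal_signals)

threshold = np.percentile(
    [quantization_error(som, x) for x in normal_signals], 99)

new_signal = rng.normal(loc=3.0, scale=1.0, size=8)  # unusually energetic signal
if quantization_error(som, new_signal) > threshold:
    print("possible abnormal tap-changer condition")
```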

Data mining methods have been applied to dissolved gas analysis (DGA) in power
transformers. DGA, as a diagnostics for power transformers, has been available for many
years. Methods such as SOM have been applied to analyze the generated data and to determine
trends which are not obvious to the standard DGA ratio methods (such as Duval
Triangle).
[35]

In educational research, data mining has been used to study the factors leading
students to choose to engage in behaviors which reduce their learning,
[36]
and to
understand factors influencing university student retention.
[37]
A similar example of
social application of data mining is its use in expertise finding systems, whereby
descriptors of human expertise are extracted, normalized, and classified so as to facilitate
the finding of experts, particularly in scientific and technical fields. In this way, data
mining can facilitate institutional memory.
Other applications include the mining of biomedical data facilitated by domain ontologies,
[38]
the mining of
clinical trial data,
[39]
and traffic analysis using SOM.
[40]

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998,
used data mining methods to routinely screen for reporting patterns indicative of
emerging drug safety issues in the WHO global database of 4.6 million suspected adverse
drug reaction incidents.
[41]
Recently, similar methodology has been developed to mine
large collections of electronic health records for temporal patterns associating drug
prescriptions to medical diagnoses.
[42]

Data mining has been applied to software artifacts within the realm of software
engineering: Mining Software Repositories.
Human rights
Data mining of government records, particularly records of the justice system (i.e., courts and
prisons), enables the discovery of systemic human rights violations in connection with the
generation and publication of invalid or fraudulent legal records by various government agencies.
[43][44]

Medical data mining
In 2011, in the case of Sorrell v. IMS Health, Inc., the Supreme Court of the United States
ruled that pharmacies may share information with outside companies. The practice was held to be
protected under the 1st Amendment of the Constitution as a form of "freedom of speech."
[45]

However, the passage of the Health Information Technology for Economic and Clinical Health
Act (HITECH Act) helped to initiate the adoption of the electronic health record (EHR) and
supporting technology in the United States.
[46]
The HITECH Act was signed into law on
February 17, 2009 as part of the American Recovery and Reinvestment Act (ARRA) and helped
to open the door to medical data mining.
[47]
Prior to the signing of this law, an estimated
20% of United States-based physicians were utilizing electronic patient records.
[46]
Søren Brunak
notes that the patient record becomes as information-rich as possible and thereby maximizes
the data mining opportunities.
[46]
Hence, electronic patient records further expand the
possibilities for medical data mining, thereby opening the door to a vast source of medical
data for analysis.
Spatial data mining
Spatial data mining is the application of data mining methods to spatial data. The end objective
of spatial data mining is to find patterns in data with respect to geography. So far, data mining
and Geographic Information Systems (GIS) have existed as two separate technologies, each with
its own methods, traditions, and approaches to visualization and data analysis. In particular, most
contemporary GIS have only very basic spatial analysis functionality. The immense explosion in
geographically referenced data occasioned by developments in IT, digital mapping, remote
sensing, and the global diffusion of GIS emphasizes the importance of developing data-driven
inductive approaches to geographical analysis and modeling.
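One minimal example of the data-driven, inductive spatial analysis referred to above is density-based clustering of geo-referenced point events, such as reported disease cases, which groups points that lie close together and leaves isolated points as noise. The coordinates below are synthetic, scikit-learn's DBSCAN is assumed to be available, and a real analysis would use an appropriate map projection or a haversine distance.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)

# Synthetic point events in projected coordinates (metres): two hot spots plus noise.
hotspot_a = rng.normal(loc=(1000.0, 1000.0), scale=50.0, size=(60, 2))
hotspot_b = rng.normal(loc=(4000.0, 2500.0), scale=80.0, size=(40, 2))
background = rng.uniform(low=0.0, high=5000.0, size=(50, 2))
points = np.vstack([hotspot_a, hotspot_b, background])

# Points within 150 m of at least 5 neighbours form a cluster; label -1 means noise.
labels = DBSCAN(eps=150.0, min_samples=5).fit_predict(points)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"found {n_clusters} spatial clusters, {np.sum(labels == -1)} isolated points")
```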
Data mining offers great potential benefits for GIS-based applied decision-making. Recently, the
task of integrating these two technologies has become of critical importance, especially as
various public and private sector organizations possessing huge databases with thematic and
geographically referenced data begin to realize the huge potential of the information contained
therein. Among those organizations are:
offices requiring analysis or dissemination of geo-referenced statistical data
public health services searching for explanations of disease clustering
environmental agencies assessing the impact of changing land-use patterns on climate
change
geo-marketing companies doing customer segmentation based on spatial location.
Challenges in Spatial mining: Geospatial data repositories tend to be very large. Moreover,
existing GIS datasets are often splintered into feature and attribute components that are
conventionally archived in hybrid data management systems. Algorithmic requirements differ
substantially for relational (attribute) data management and for topological (feature) data
management.
[48]
Related to this is the range and diversity of geographic data formats, which
present unique challenges. The digital geographic data revolution is creating new types of data
formats beyond the traditional "vector" and "raster" formats. Geographic data repositories
increasingly include ill-structured data, such as imagery and geo-referenced multi-media.
[49]

There are several critical research challenges in geographic knowledge discovery and data
mining. Miller and Han
[50]
offer the following list of emerging research topics in the field:
Developing and supporting geographic data warehouses (GDWs): Spatial properties
are often reduced to simple aspatial attributes in mainstream data warehouses. Creating
an integrated GDW requires solving issues of spatial and temporal data interoperability
including differences in semantics, referencing systems, geometry, accuracy, and
position.
Better spatio-temporal representations in geographic knowledge discovery: Current
geographic knowledge discovery (GKD) methods generally use very simple
representations of geographic objects and spatial relationships. Geographic data mining
methods should recognize more complex geographic objects (e.g., lines and polygons)
and relationships (e.g., non-Euclidean distances, direction, connectivity, and interaction
through attributed geographic space such as terrain). Furthermore, the time dimension
needs to be more fully integrated into these geographic representations and relationships.
Geographic knowledge discovery using diverse data types: GKD methods should be
developed that can handle diverse data types beyond the traditional raster and vector
models, including imagery and geo-referenced multimedia, as well as dynamic data types
(video streams, animation).
Temporal data mining
Data may contain attributes generated and recorded at different times. In this case, finding
meaningful relationships in the data may require considering the temporal order of the attributes.
A temporal relationship may indicate a causal relationship, or simply an association.
[citation needed]

Sensor data mining
Wireless sensor networks can be used for facilitating the collection of data for spatial data
mining for a variety of applications such as air pollution monitoring.
[51]
A characteristic of such
networks is that nearby sensor nodes monitoring an environmental feature typically register
similar values. This kind of data redundancy due to the spatial correlation between sensor
observations inspires the techniques for in-network data aggregation and mining. By measuring
the spatial correlation between data sampled by different sensors, a wide class of specialized
algorithms can be developed to make spatial data mining more efficient.
[52]
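The redundancy described above can be illustrated with a short sketch: readings from sensors that are strongly correlated are summarized by a single aggregated stream before transmission, reducing network traffic. The sensor layout, the synthetic signals, and the 0.9 correlation threshold are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(200)

# Synthetic hourly readings from four sensors; sensors 0-2 sit in the same area
# and track a shared environmental signal, sensor 3 is far away and unrelated.
shared = np.sin(t / 10.0)
readings = np.vstack([
    shared + rng.normal(scale=0.1, size=200),
    shared + rng.normal(scale=0.1, size=200),
    shared + rng.normal(scale=0.1, size=200),
    np.cos(t / 3.0) + rng.normal(scale=0.1, size=200),
])

corr = np.corrcoef(readings)

# Greedily group sensors whose pairwise correlation exceeds a threshold,
# then report one aggregated (mean) stream per group instead of one per sensor.
threshold = 0.9
groups, assigned = [], set()
for i in range(len(readings)):
    if i in assigned:
        continue
    group = [j for j in range(len(readings))
             if j not in assigned and corr[i, j] > threshold]
    assigned.update(group)
    groups.append(group)

aggregated = [readings[g].mean(axis=0) for g in groups]
print(f"{len(readings)} sensors reduced to {len(aggregated)} aggregated streams: {groups}")
```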

Visual data mining
In the process of turning from analog into digital, large data sets have been generated,
collected, and stored, within which statistical patterns, trends, and hidden information can be
discovered in order to build predictive models. Studies suggest visual data mining is faster and
much more intuitive than traditional data mining.
[53][54][55]
See also Computer vision.
Music data mining
Data mining techniques, and in particular co-occurrence analysis, have been used to discover
relevant similarities among music corpora (radio lists, CD databases) for purposes including
classifying music into genres in a more objective manner.
[56]
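Co-occurrence analysis of the kind mentioned above can be sketched very simply: tracks that appear together in many playlists or radio lists are treated as similar. The playlists below are invented, and the cosine-style normalization is one plausible (assumed) choice of similarity measure rather than the specific method of the cited work.

```python
from itertools import combinations
from collections import Counter
import math

# Hypothetical radio playlists (each a set of track identifiers).
playlists = [
    {"track_a", "track_b", "track_c"},
    {"track_a", "track_b", "track_d"},
    {"track_c", "track_e"},
    {"track_a", "track_b"},
]

# Count how often each pair of tracks appears in the same playlist.
pair_counts = Counter()
track_counts = Counter()
for playlist in playlists:
    track_counts.update(playlist)
    for a, b in combinations(sorted(playlist), 2):
        pair_counts[(a, b)] += 1

def similarity(a, b):
    """Cosine-style similarity: co-occurrences normalized by individual frequencies."""
    co = pair_counts[tuple(sorted((a, b)))]
    return co / math.sqrt(track_counts[a] * track_counts[b])

print(f"sim(track_a, track_b) = {similarity('track_a', 'track_b'):.2f}")
print(f"sim(track_a, track_e) = {similarity('track_a', 'track_e'):.2f}")
```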

Surveillance
Data mining has been used by the U.S. government. Programs include the Total Information
Awareness (TIA) program, Secure Flight (formerly known as Computer-Assisted Passenger
Prescreening System (CAPPS II)), Analysis, Dissemination, Visualization, Insight, Semantic
Enhancement (ADVISE),
[57]
and the Multi-state Anti-Terrorism Information Exchange
(MATRIX).
[58]
These programs have been discontinued due to controversy over whether they
violate the 4th Amendment to the United States Constitution, although many programs that were
formed under them continue to be funded by different organizations or under different names.
[59]

In the context of combating terrorism, two particularly plausible methods of data mining are
"pattern mining" and "subject-based data mining".
Pattern mining
"Pattern mining" is a data mining method that involves finding existing patterns in data. In this
context, patterns often mean association rules. The original motivation for searching for association
rules came from the desire to analyze supermarket transaction data, that is, to examine customer
behavior in terms of the purchased products. For example, an association rule "beer ⇒ potato
chips (80%)" states that four out of five customers who bought beer also bought potato chips.
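The 80% figure in such a rule is its confidence: the share of transactions containing beer that also contain potato chips. The sketch below computes support and confidence for this rule from a handful of made-up transactions; it illustrates the arithmetic only and is not a full association-rule miner such as Apriori.

```python
# Toy transaction database (each set lists the items in one purchase).
transactions = [
    {"beer", "potato chips", "bread"},
    {"beer", "potato chips"},
    {"beer", "potato chips", "diapers"},
    {"beer", "milk"},
    {"beer", "potato chips", "milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of all transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): support of both sides over support of the left side."""
    return support(antecedent | consequent) / support(antecedent)

rule_conf = confidence({"beer"}, {"potato chips"})
print(f"beer -> potato chips: confidence {rule_conf:.0%}, "
      f"support {support({'beer', 'potato chips'}):.0%}")
```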
In the context of pattern mining as a tool to identify terrorist activity, the National Research
Council provides the following definition: "Pattern-based data mining looks for patterns
(including anomalous data patterns) that might be associated with terrorist activity – these
patterns might be regarded as small signals in a large ocean of noise."
[60][61][62]
Pattern mining
includes new areas such as Music Information Retrieval (MIR), where patterns seen in both
the temporal and non-temporal domains are imported into classical knowledge discovery search
methods.
Subject-based data mining
"Subject-based data mining" is a data mining method involving the search for associations
between individuals in data. In the context of combating terrorism, the National Research
Council provides the following definition: "Subject-based data mining uses an initiating
individual or other datum that is considered, based on other information, to be of high interest,
and the goal is to determine what other persons or financial transactions or movements, etc., are
related to that initiating datum."
[61]
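A toy version of subject-based expansion is sketched below: starting from one initiating identifier, the search repeatedly follows shared records (here, hypothetical transactions and bookings) to collect related entities up to a fixed depth. The record structure and entity names are invented for the example.

```python
from collections import deque

# Hypothetical records linking people through shared transactions or movements.
records = [
    {"person_1", "person_2"},   # shared financial transaction
    {"person_2", "person_3"},   # shared travel booking
    {"person_4", "person_5"},   # unrelated pair
]

def related_entities(seed, max_depth=2):
    """Breadth-first expansion from an initiating individual through shared records."""
    found, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        entity, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for record in records:
            if entity in record:
                for other in record - {entity}:
                    if other not in found:
                        found.add(other)
                        frontier.append((other, depth + 1))
    return found - {seed}

print(related_entities("person_1"))  # {'person_2', 'person_3'}
```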

Knowledge grid
Knowledge discovery "On the Grid" generally refers to conducting knowledge discovery in an
open environment using grid computing concepts, allowing users to integrate data from various
online data sources, as well make use of remote resources, for executing their data mining tasks.
The earliest example was the Discovery Net,
[63][64]
developed at Imperial College London, which
won the "Most Innovative Data-Intensive Application Award" at the ACM SC02
(Supercomputing 2002) conference and exhibition, based on a demonstration of a fully
interactive distributed knowledge discovery application for a bioinformatics application. Other
examples include work conducted by researchers at the University of Calabria, who developed a
Knowledge Grid architecture for distributed knowledge discovery, based on grid
computing.
[65][66]

Privacy concerns and ethics
While the term "data mining" itself has no ethical implications, it is often associated with the
mining of information in relation to people's behavior (ethical and otherwise).
[67]

The ways in which data mining can be used can in some cases and contexts raise questions
regarding privacy, legality, and ethics.
[68]
In particular, data mining government or commercial
data sets for national security or law enforcement purposes, such as in the Total Information
Awareness Program or in ADVISE, has raised privacy concerns.
[69][70]

Data mining requires data preparation which can uncover information or patterns which may
compromise confidentiality and privacy obligations. A common way for this to occur is through
data aggregation. Data aggregation involves combining data together (possibly from various
sources) in a way that facilitates analysis (but that also might make identification of private,
individual-level data deducible or otherwise apparent).
[71]
This is not data mining per se, but a
result of the preparation of data before and for the purposes of the analysis. The threat to an
individual's privacy comes into play when the data, once compiled, cause the data miner, or
anyone who has access to the newly compiled data set, to be able to identify specific individuals,
especially when the data were originally anonymous.
[72][73][74]

It is recommended that an individual is made aware of the following before data are collected:
[71]

the purpose of the data collection and any (known) data mining projects;
how the data will be used;
who will be able to mine the data and use the data and their derivatives;
the status of security surrounding access to the data;
how collected data can be updated.
Data may also be modified so as to become anonymous, so that individuals may not readily be
identified.
[71]
However, even "de-identified"/"anonymized" data sets can potentially contain
enough information to allow identification of individuals, as occurred when journalists were able
to find several individuals based on a set of search histories that were inadvertently released by
AOL.
[75]
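The re-identification risk described above can be made concrete with a k-anonymity check: if some combination of quasi-identifiers (such as ZIP code, birth year, and gender) is shared by only one record, that record remains effectively identifiable even after names are removed. The records and the choice of quasi-identifiers below are assumptions made for illustration.

```python
from collections import Counter

# "De-identified" records: names removed, but quasi-identifiers retained.
records = [
    {"zip": "47677", "birth_year": 1965, "gender": "F", "diagnosis": "flu"},
    {"zip": "47677", "birth_year": 1965, "gender": "F", "diagnosis": "asthma"},
    {"zip": "47678", "birth_year": 1982, "gender": "M", "diagnosis": "diabetes"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

def k_anonymity(rows):
    """Smallest group size over all quasi-identifier combinations (k=1 means some
    record is unique on its quasi-identifiers and therefore easy to re-identify)."""
    groups = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in rows)
    return min(groups.values())

k = k_anonymity(records)
print(f"data set is {k}-anonymous"
      + (" - unique records may be re-identifiable" if k == 1 else ""))
```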

Situation in the United States
In the United States, privacy concerns have been addressed to some extent by the US
Congress via the passage of regulatory controls such as the Health Insurance Portability and
Accountability Act (HIPAA). HIPAA requires individuals to give their "informed consent"
regarding information they provide and its intended present and future uses. According to an
article in Biotech Business Week, "'[i]n practice, HIPAA may not offer any greater protection
than the longstanding regulations in the research arena,' says the AAHC. More importantly, the
rule's goal of protection through informed consent is undermined by the complexity of consent
forms that are required of patients and participants, which approach a level of
incomprehensibility to average individuals."
[76]
This underscores the necessity for data
anonymity in data aggregation and mining practices.
U.S. information privacy legislation such as HIPAA and the Family Educational Rights and
Privacy Act (FERPA) applies only to the specific areas that each such law addresses. Use of data
mining by the majority of businesses in the U.S. is not controlled by any legislation.
Situation in Europe
Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights
of the consumers. However, the U.S.-E.U. Safe Harbor Principles currently effectively expose
European users to privacy exploitation by U.S. companies. As a consequence of Edward
Snowden's global surveillance disclosures, there has been increased discussion about revoking this
agreement, as, in particular, the data would be fully exposed to the National Security Agency, and
attempts to reach an agreement have failed.
[citation needed]

Software
See also Category:Data mining and machine learning software.
Free open-source data mining software and applications
Carrot2: Text and search results clustering framework.
Chemicalize.org: A chemical structure miner and web search engine.
ELKI: A university research project with advanced cluster analysis and outlier detection
methods written in the Java language.
GATE: a natural language processing and language engineering tool.
KNIME: The Konstanz Information Miner, a user friendly and comprehensive data
analytics framework.
ML-Flex: A software package that enables users to integrate with third-party machine-
learning packages written in any programming language, execute classification analyses
in parallel across multiple computing nodes, and produce HTML reports of classification
results.
MLPACK library: a collection of ready-to-use machine learning algorithms written in the
C++ language.
Massive Online Analysis (MOA): a real-time big data stream mining tool with concept drift
handling, written in the Java programming language.
NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and
statistical natural language processing (NLP) for the Python language.
OpenNN: Open neural networks library.
Orange: A component-based data mining and machine learning software suite written in
the Python language.
R: A programming language and software environment for statistical computing, data
mining, and graphics. It is part of the GNU Project.
RapidMiner: An environment for machine learning and data mining experiments.
SCaViS: Java cross-platform data analysis framework developed at Argonne National
Laboratory.
SenticNet API: A semantic and affective resource for opinion mining and sentiment
analysis.
Tanagra: A visualisation-oriented data mining software, also for teaching.
Torch: An open source deep learning library for the Lua programming language and
scientific computing framework with wide support for machine learning algorithms.
SPMF: A data mining framework and application written in Java with implementations of
a variety of algorithms.
UIMA: The UIMA (Unstructured Information Management Architecture) is a component
framework for analyzing unstructured content such as text, audio and video originally
developed by IBM.
Weka: A suite of machine learning software applications written in the Java
programming language.
Commercial data-mining software and applications
Angoss KnowledgeSTUDIO: data mining tool provided by Angoss.
Clarabridge: enterprise class text analytics solution.
HP Vertica Analytics Platform: data mining software provided by HP.
IBM SPSS Modeler: data mining software provided by IBM.
KXEN Modeler: data mining tool provided by KXEN.
LIONsolver: an integrated software application for data mining, business intelligence,
and modeling that implements the Learning and Intelligent OptimizatioN (LION)
approach.
Microsoft Analysis Services: data mining software provided by Microsoft.
NetOwl: suite of multilingual text and entity analytics products that enable data mining.
Neural Designer: data mining software provided by Intelnics.
Oracle Data Mining: data mining software by Oracle.
QIWare: data mining software by Forte Wares.
SAS Enterprise Miner: data mining software provided by the SAS Institute.
STATISTICA Data Miner: data mining software provided by StatSoft.
Marketplace surveys
Several researchers and organizations have conducted reviews of data mining tools and surveys
of data miners. These identify some of the strengths and weaknesses of the software packages.
They also provide an overview of the behaviors, preferences and views of data miners. Some of
these reports include:
2011 Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
[77]

Rexer Analytics Data Miner Surveys (2007–2013)
[78]

Forrester Research 2010 Predictive Analytics and Data Mining Solutions report
[79]

Gartner 2008 "Magic Quadrant" report
[80]

Robert A. Nisbet's 2006 Three Part Series of articles "Data Mining Tools: Which One is
Best For CRM?"
[81]

Haughton et al.'s 2003 Review of Data Mining Software Packages in The American
Statistician
[82]

Goebel & Gruenwald 1999 "A Survey of Data Mining and Knowledge Discovery Software
Tools" in SIGKDD Explorations
[83]

See also
