
Analysis of data is a process of inspecting, cleaning, transforming, and
modeling data with the goal of discovering useful information, suggesting
conclusions, and supporting decision-making. Data analysis has multiple facets and
approaches, encompassing diverse techniques under a variety of names, in
different business, science, and social science domains.
Data mining is a particular data analysis technique that focuses on modeling and
knowledge discovery for predictive rather than purely descriptive
purposes. Business intelligence covers data analysis that relies heavily on
aggregation, focusing on business information. In statistical applications, some
people divide data analysis into descriptive statistics, exploratory data
analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering
new features in the data and CDA on confirming or falsifying existing
hypotheses. Predictive analytics focuses on application of statistical models for
predictive forecasting or classification, while text analytics applies statistical,
linguistic, and structural techniques to extract and classify information from textual
sources, a species of unstructured data. All are varieties of data analysis.
Data integration is a precursor to data analysis, and data analysis is closely linked
to data visualization and data dissemination. The term data analysis is sometimes
used as a synonym for data modeling.
The process of data analysis

[Figure: Data science process flowchart]


Data analysis is a process for obtaining raw data and converting it into information
useful for decision-making by users. Data is collected and analyzed to answer
questions, test hypotheses or disprove theories. [1]

There are several phases that can be distinguished. The phases are iterative, in that
feedback from later phases may result in additional work in earlier phases. [2]
Data requirements
The data necessary as inputs to the analysis are specified based upon the
requirements of those directing the analysis or customers who will use the finished
product of the analysis. The general type of entity upon which the data will be
collected is referred to as an experimental unit (e.g., a person or population of
people). Specific variables regarding a population (e.g., age and income) may be
specified and obtained. Data may be numerical or categorical (i.e., a text label for
numbers).[2]
Data collection
Data is collected from a variety of sources. The requirements may be communicated
by analysts to custodians of the data, such as information technology personnel
within an organization. The data may also be collected from sensors in the
environment, such as traffic cameras, satellites, recording devices, etc. It may also
be obtained through interviews, downloads from online sources, or reading
documentation.[2]
Data processing

The phases of the intelligence cycle used to convert raw information into actionable
intelligence or knowledge are conceptually similar to the phases in data analysis.
Data initially obtained must be processed or organized for analysis. For instance,
this may involve placing data into rows and columns in a table format for further
analysis, such as within a spreadsheet or statistical software. [2]
Data cleaning

Once processed and organized, the data may be incomplete, contain duplicates, or
contain errors. The need for data cleaning will arise from problems in the way that
data is entered and stored. Data cleaning is the process of preventing and
correcting these errors. Common tasks include record matching, deduplication, and
column segmentation.[3] Such data problems can also be identified through a variety
of analytical techniques. For example, with financial information, the totals for
particular variables may be compared against separately published numbers
believed to be reliable. Unusual amounts above or below pre-determined thresholds
may also be reviewed. There are several types of data cleaning that depend on the
type of data. Quantitative data methods for outlier detection can be used to get rid
of likely incorrectly entered data. Textual data spellcheckers can be used to lessen
the amount of mistyped words, but it is harder to tell if the words themselves are
correct.[4]
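As a concrete illustration, the deduplication and threshold-review steps described above can be sketched in a few lines of Python. The record IDs, amounts, and the two-standard-deviation cutoff are all hypothetical choices for this example, not a prescribed method:

```python
from statistics import mean, stdev

# Hypothetical transaction records; the 9999.0 entry is a likely data-entry error.
records = [
    ("A001", 120.0), ("A002", 95.5), ("A001", 120.0),  # second A001 is a duplicate
    ("A003", 110.0), ("A004", 9999.0), ("A005", 101.5),
    ("A006", 98.0), ("A007", 105.0), ("A008", 112.0), ("A009", 99.0),
]

# Deduplication: keep only the first occurrence of each record ID.
seen, deduped = set(), []
for rec_id, amount in records:
    if rec_id not in seen:
        seen.add(rec_id)
        deduped.append((rec_id, amount))

# Outlier screening: flag amounts more than two standard deviations from the mean
# so a human can review them, rather than silently dropping them.
amounts = [a for _, a in deduped]
mu, sigma = mean(amounts), stdev(amounts)
flagged = [(r, a) for r, a in deduped if abs(a - mu) > 2 * sigma]

print(len(deduped))  # 9 records remain after removing the duplicate
print(flagged)       # only the 9999.0 entry is flagged for review
```

Note that a z-score rule like this is fragile on very small samples, where a single extreme value inflates the standard deviation; robust alternatives such as median-based rules are often preferred.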
Exploratory data analysis
Once the data is cleaned, it can be analyzed. Analysts may apply a variety of
techniques referred to as exploratory data analysis to begin understanding the
messages contained in the data. The process of exploration may result in additional
data cleaning or additional requests for data, so these activities may be iterative in
nature. Descriptive statistics such as the average or median may be generated to
help understand the data. Data visualization may also be used to examine the data
in graphical format, to obtain additional insight regarding the messages within the
data.[2]
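For example, the average and median mentioned above can be generated with Python's standard library; the sales figures below are hypothetical:

```python
from statistics import mean, median

# Hypothetical monthly sales figures, including one extreme month.
sales = [210, 195, 250, 220, 1000, 205, 215]

print(mean(sales))    # the average is pulled upward by the 1000 outlier
print(median(sales))  # the median is more robust to that outlier
```

Comparing the two summaries is itself a small act of exploration: a large gap between mean and median hints at skew or outliers worth investigating further.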
Modeling and algorithms
Mathematical formulas or models called algorithms may be applied to the data to
identify relationships among the variables, such as correlation or causation. In
general terms, models may be developed to evaluate a particular variable in the
data based on other variable(s) in the data, with some residual error depending on
model accuracy (i.e., Data = Model + Error).[1]
Inferential statistics includes techniques to measure relationships between
particular variables. For example, regression analysis may be used to model
whether a change in advertising (independent variable X) explains the variation in
sales (dependent variable Y). In mathematical terms, Y (sales) is a function of X
(advertising). It may be described as Y = aX + b + error, where the model is
designed such that a and b minimize the error when the model predicts Y for a
given range of values of X. Analysts may attempt to build models that are
descriptive of the data to simplify analysis and communicate results. [1]
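The model Y = aX + b described above can be fit with ordinary least squares, which chooses a and b to minimize the squared error. A minimal sketch, using made-up advertising and sales figures (the data were generated from roughly Y = 2X + 10 plus a little noise):

```python
from statistics import mean

# Hypothetical advertising spend (X) and sales (Y).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [12.1, 13.9, 16.2, 17.8, 20.1]

# Ordinary least squares: pick a and b to minimize
# sum((y - (a*x + b))**2) over all observations.
x_bar, y_bar = mean(X), mean(Y)
a = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)
b = y_bar - a * x_bar

print(round(a, 3), round(b, 3))  # slope near 2, intercept near 10
```

The residuals y - (a*x + b) are the "Error" term of Data = Model + Error; examining them is how an analyst judges model accuracy.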
Data product
A data product is a computer application that takes data inputs and generates
outputs, feeding them back into the environment. It may be based on a model or
algorithm. An example is an application that analyzes data about customer
purchasing history and recommends other purchases the customer might enjoy.[2]
Communication
Once the data is analyzed, it may be reported in many formats to the users of the
analysis to support their requirements. The users may have feedback, which results
in additional analysis. As such, much of the analytical cycle is iterative. [2] When
determining how to communicate the results, the analyst may consider data
visualization techniques to help clearly and efficiently communicate the message to
the audience.
Quantitative messages
Main article: Data visualization

[Figure: A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over time.]

[Figure: A scatterplot illustrating correlation between two variables (inflation and unemployment) measured at points in time.]
Author Stephen Few described eight types of quantitative messages that users may
attempt to understand or communicate from a set of data and the associated
graphs used to help communicate the message. Customers specifying requirements
and analysts performing the data analysis may consider these messages during the
course of the process.

1. Time-series: A single variable is captured over a period of time, such as the
unemployment rate over a 10-year period. A line chart may be used to
demonstrate the trend.
2. Ranking: Categorical subdivisions are ranked in ascending or descending
order, such as a ranking of sales performance (the measure) by sales persons
(the category, with each sales person a categorical subdivision) during a
single period. A bar chart may be used to show the comparison across the
sales persons.
3. Part-to-whole: Categorical subdivisions are measured as a ratio to the whole
(i.e., a percentage out of 100%). A pie chart or bar chart can show the
comparison of ratios, such as the market share represented by competitors in
a market.
4. Deviation: Categorical subdivisions are compared against a reference, such as
a comparison of actual vs. budget expenses for several departments of a
business for a given time period. A bar chart can show comparison of the
actual versus the reference amount.
5. Frequency distribution: Shows the number of observations of a particular
variable for a given interval, such as the number of years in which the stock
market return is between intervals such as 0-10%, 11-20%, etc. A histogram,
a type of bar chart, may be used for this analysis.
6. Correlation: Comparison between observations represented by two variables
(X,Y) to determine if they tend to move in the same or opposite directions.
For example, plotting unemployment (X) and inflation (Y) for a sample of
months. A scatter plot is typically used for this message.
7. Nominal comparison: Comparing categorical subdivisions in no particular
order, such as the sales volume by product code. A bar chart may be used for
this comparison.
8. Geographic or geospatial: Comparison of a variable across a map or layout,
such as the unemployment rate by state or the number of persons on the
various floors of a building. A cartogram is a typical graphic used.[5][6]
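As an illustration of message type 5 above, the data behind a histogram can be produced by binning values and counting observations per bin. The annual returns below are hypothetical:

```python
from collections import Counter

# Hypothetical annual stock-market returns, in percent.
returns = [4.5, 12.0, -3.2, 8.9, 15.5, 7.1, 22.0, 11.4, 2.3, 18.7]

def bin_label(r):
    """Assign a return to a 10-point interval such as '0 to 10%'."""
    lower = (r // 10) * 10  # floor to the nearest multiple of 10
    return f"{int(lower)} to {int(lower + 10)}%"

# Count observations per interval: the frequency distribution.
histogram = Counter(bin_label(r) for r in returns)
print(sorted(histogram.items()))
```

Each (interval, count) pair corresponds to one bar of the histogram described above.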
Analytical activities of data users
Users may have particular data points of interest within a data set, as opposed to
the general messaging outlined above. Such low-level user analytic activities are
presented below, with a general description, a pro forma abstract, and examples for
each task. The taxonomy can also be organized by three poles of activities:
retrieving values, finding data points, and arranging data points.[7][8][9]

1. Retrieve value: Given a set of specific cases, find attributes of those cases.
(What are the values of attributes {X, Y, Z, ...} in the data cases {A, B, C, ...}?)
Examples: What is the mileage per gallon of the Audi TT? How long is the movie
Gone with the Wind?
2. Filter: Given some concrete conditions on attribute values, find data cases
satisfying those conditions. (Which data cases satisfy conditions {A, B, C, ...}?)
Examples: What Kellogg's cereals have high fiber? What comedies have won awards?
Which funds underperformed the SP500?
3. Compute derived value: Given a set of data cases, compute an aggregate numeric
representation of those data cases. (What is the value of aggregation function F
over a given set S of data cases?) Examples: What is the average calorie content of
Post cereals? What is the gross income of all stores combined? How many
manufacturers of cars are there?
4. Find extremum: Find data cases possessing an extreme value of an attribute over
its range within the data set. (What are the top/bottom N data cases with respect to
attribute A?) Examples: What is the car with the highest MPG? What director/film
has won the most awards? What Robin Williams film has the most recent release
date?
5. Sort: Given a set of data cases, rank them according to some ordinal metric.
(What is the sorted order of a set S of data cases according to their value of
attribute A?) Examples: Order the cars by weight. Rank the cereals by calories.
6. Determine range: Given a set of data cases and an attribute of interest, find the
span of values within the set. (What is the range of values of attribute A in a set S
of data cases?) Examples: What is the range of film lengths? What is the range of
car horsepowers? What actresses are in the data set?
7. Characterize distribution: Given a set of data cases and a quantitative attribute
of interest, characterize the distribution of that attribute's values over the set.
(What is the distribution of values of attribute A in a set S of data cases?)
Examples: What is the distribution of carbohydrates in cereals? What is the age
distribution of shoppers?
8. Find anomalies: Identify any anomalies within a given set of data cases with
respect to a given relationship or expectation, e.g. statistical outliers. (Which data
cases in a set S of data cases have unexpected/exceptional values?) Examples: Are
there exceptions to the relationship between horsepower and acceleration? Are
there any outliers in protein?
9. Cluster: Given a set of data cases, find clusters of similar attribute values.
(Which data cases in a set S of data cases are similar in value for attributes
{X, Y, Z, ...}?) Examples: Are there groups of cereals with similar
fat/calories/sugar? Is there a cluster of typical film lengths?
10. Correlate: Given a set of data cases and two attributes, determine useful
relationships between the values of those attributes. (What is the correlation
between attributes X and Y over a given set S of data cases?) Examples: Is there a
correlation between carbohydrates and fat? Is there a correlation between country
of origin and MPG? Do different genders have a preferred payment method? Is
there a trend of increasing film length over the years?
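Several of the tasks above (retrieve value, filter, compute derived value, find extremum, sort, determine range) map directly onto ordinary operations over a data set. A minimal Python sketch, using a hypothetical car data set:

```python
# Hypothetical car records used to illustrate a few of the tasks above.
cars = [
    {"model": "Alpha", "mpg": 30, "weight": 2800},
    {"model": "Beta",  "mpg": 22, "weight": 3500},
    {"model": "Gamma", "mpg": 35, "weight": 2400},
    {"model": "Delta", "mpg": 18, "weight": 4100},
]

# 1. Retrieve value: attributes of a specific case.
alpha = next(c for c in cars if c["model"] == "Alpha")

# 2. Filter: cases satisfying a condition.
efficient = [c["model"] for c in cars if c["mpg"] >= 25]

# 3. Compute derived value: an aggregate over the set.
avg_mpg = sum(c["mpg"] for c in cars) / len(cars)

# 4. Find extremum: the case with the highest MPG.
best = max(cars, key=lambda c: c["mpg"])

# 5. Sort: order the cars by weight.
by_weight = sorted(cars, key=lambda c: c["weight"])

# 6. Determine range: the span of MPG values.
mpg_range = (min(c["mpg"] for c in cars), max(c["mpg"] for c in cars))

print(alpha["mpg"], efficient, avg_mpg, best["model"],
      by_weight[0]["model"], mpg_range)
```

The remaining tasks (characterize distribution, find anomalies, cluster, correlate) build on these same primitives with statistical machinery layered on top.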

Barriers to effective analysis


Barriers to effective analysis may exist among the analysts performing the data
analysis or among the audience. Distinguishing fact from opinion, cognitive biases,
and innumeracy are all challenges to sound data analysis.
Confusing fact and opinion
"You are entitled to your own opinion, but you are not entitled to your own facts."
(Daniel Patrick Moynihan)
Effective analysis requires obtaining relevant facts to answer questions, support a
conclusion or formal opinion, or test hypotheses. Facts by definition are irrefutable,
meaning that any person involved in the analysis should be able to agree upon
them. For example, in August 2010, the Congressional Budget Office (CBO)
estimated that extending the Bush tax cuts of 2001 and 2003 for the 2011-2020
time period would add approximately $3.3 trillion to the national debt. [10] Everyone
should be able to agree that indeed this is what CBO reported; they can all examine
the report. This makes it a fact. Whether persons agree or disagree with the CBO is
their own opinion.
As another example, the auditor of a public company must arrive at a formal
opinion on whether financial statements of publicly traded corporations are "fairly
stated, in all material respects." This requires extensive analysis of factual data and
evidence to support their opinion. When making the leap from facts to opinions,
there is always the possibility that the opinion is erroneous.
Cognitive biases
There are a variety of cognitive biases that can adversely affect analysis. For
example, confirmation bias is the tendency to search for or interpret information in
a way that confirms one's preconceptions. In addition, individuals may discredit
information that does not support their views. Analysts may be trained specifically
to be aware of these biases and how to overcome them.
Innumeracy

Effective analysts are generally adept with a variety of numerical techniques.
However, audiences may not have such literacy with numbers or numeracy; they
are said to be innumerate. Persons communicating the data may also be attempting
to mislead or misinform, deliberately using bad numerical techniques.[11]
For example, whether a number is rising or falling may not be the key factor. More
important may be the number relative to another number, such as the size of
government revenue or spending relative to the size of the economy (GDP) or the
amount of cost relative to revenue in corporate financial statements. This numerical
technique is referred to as common-sizing. There are many such techniques
employed by analysts, whether adjusting for inflation (i.e., comparing real vs.
nominal data) or considering population increases, demographics, etc. Analysts
apply a variety of techniques to address the various quantitative messages
described in the section above.
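A minimal sketch of common-sizing in Python. The spending and GDP figures are illustrative round numbers, not official statistics:

```python
# Hypothetical figures illustrating common-sizing: judging a number as a
# ratio of a reference quantity rather than in isolation ($ billions).
spending = {"2000": 1789, "2010": 3457}   # government spending
gdp      = {"2000": 10285, "2010": 14964} # nominal GDP

# Spending roughly doubled in absolute terms...
growth = spending["2010"] / spending["2000"]

# ...but the common-sized view (spending as a share of GDP) gives a very
# different sense of scale, since the economy grew over the same period.
share = {year: spending[year] / gdp[year] for year in spending}

print(round(growth, 2))
print({y: round(s, 3) for y, s in share.items()})
```

The same pattern applies to the other adjustments mentioned above, such as dividing nominal figures by a price index to compare real values across years.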
Analysts may also analyze data under different assumptions or scenarios. For
example, when analysts perform financial statement analysis, they will often recast
the financial statements under different assumptions to help arrive at an estimate
of future cash flow, which they then discount to present value based on some
interest rate, to determine the valuation of the company or its stock. Similarly, the
CBO analyzes the effects of various policy options on the government's revenue,
outlays and deficits, creating alternative future scenarios for key measures.

Data analytics (DA) is the science of examining raw data with the purpose of
drawing conclusions about that information. Data analytics is used in many
industries to allow companies and organizations to make better business
decisions, and in the sciences to verify or disprove existing models or theories.
Data analytics is distinguished from data mining by the scope, purpose and
focus of the analysis. Data miners sort through huge data sets using
sophisticated software to identify undiscovered patterns and establish hidden
relationships. Data analytics focuses on inference, the process of deriving a
conclusion based solely on what is already known by the researcher.
The science is generally divided into exploratory data analysis (EDA), where
new features in the data are discovered, and confirmatory data analysis
(CDA), where existing hypotheses are proven true or false. Qualitative data
analysis (QDA) is used in the social sciences to draw conclusions from
non-numerical data like words, photographs or video. In information technology,
the term has a special meaning in the context of IT audits, when the controls
for an organization's information systems, operations and processes are
examined. Data analysis is used to determine whether the systems in place
effectively protect data, operate efficiently and succeed in accomplishing an
organization's overall goals.
The term "analytics" has been used by many business intelligence (BI)
software vendors as a buzzword to describe quite different functions. Data
analytics is used to describe everything from online analytical processing
(OLAP) to CRM analytics in call centers. Banks and credit card companies,
for instance, analyze withdrawal and spending patterns to prevent fraud
or identity theft. E-commerce companies examine Web site traffic or navigation
patterns to determine which customers are more or less likely to buy a product
or service based upon prior purchases or viewing trends. Modern data
analytics often use information dashboards supported by real-time data
streams. So-called real-time analytics involves dynamic analysis and
reporting, based on data entered into a system less than one minute before
the actual time of use.
Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate,
condense and recap, and evaluate data. According to Shamoo and Resnik (2003), various analytic procedures
provide a way of drawing inductive inferences from data and distinguishing the signal (the phenomenon of interest)
from the noise (statistical fluctuations) present in the data.
While data analysis in qualitative research can include statistical procedures, many times analysis becomes an
ongoing iterative process where data is continuously collected and analyzed almost simultaneously. Indeed,
researchers generally analyze for patterns in observations through the entire data collection phase (Savenye,
Robinson, 2004). The form of the analysis is determined by the specific qualitative approach taken (field study,
ethnography, content analysis, oral history, biography, unobtrusive research) and the form of the data (field notes,
documents, audiotape, videotape).
An essential component of ensuring data integrity is the accurate and appropriate analysis of research findings.
Improper statistical analyses distort scientific findings, mislead casual readers (Shepard, 2002), and may negatively
influence the public perception of research. Integrity issues are just as relevant to the analysis of non-statistical
data.

Considerations/issues in data analysis


There are a number of issues that researchers should be cognizant of with respect to data analysis. These include:

Having the necessary skills to analyze

Concurrently selecting data collection methods and appropriate analysis

Drawing unbiased inference

Inappropriate subgroup analysis

Following acceptable norms for disciplines

Determining statistical significance

Lack of clearly defined and objective outcome measurements

Providing honest and accurate analysis

Manner of presenting data

Environmental/contextual issues

Data recording method

Partitioning text when analyzing qualitative data

Training of staff conducting analyses

Reliability and Validity

Extent of analysis

Having necessary skills to analyze


A tacit assumption of investigators is that they have received training sufficient to demonstrate a high standard of
research practice. Unintentional "scientific misconduct" is likely the result of poor instruction and follow-up. A number
of studies suggest this may be the case more often than believed (Nowak, 1994; Silverman, Manson, 2003). For
example, Sica found that adequate training of physicians in medical schools in the proper design, implementation and
evaluation of clinical trials is abysmally small (Sica, cited in Nowak, 1994). Indeed, a single course in biostatistics is
the most that is usually offered (Christopher Williams, cited in Nowak, 1994).
A common practice of investigators is to defer the selection of analytic procedure to a research team statistician.
Ideally, investigators should have substantially more than a basic understanding of the rationale for selecting one
method of analysis over another. This can allow investigators to better supervise staff who conduct the data
analysis process and make informed decisions.

Concurrently selecting data collection methods and appropriate analysis


While methods of analysis may differ by scientific discipline, the optimal stage for determining appropriate analytic
procedures occurs early in the research process and should not be an afterthought. According to Smeeton and Goda
(2003), "Statistical advice should be obtained at the stage of initial planning of an investigation so that, for example,
the method of sampling and design of questionnaire are appropriate."

Drawing unbiased inference


The chief aim of analysis is to distinguish whether an observed event reflects a true effect or a false one. Any bias
occurring in the collection of the data, or in the selection of the method of analysis, will increase the likelihood of
drawing a biased inference. Bias can occur when recruitment of study participants falls below the minimum number
required to demonstrate statistical power, or when the follow-up period needed to demonstrate an effect is not
maintained (Altman, 2001).
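To illustrate why under-recruitment undermines statistical power, the standard normal-approximation formula for the per-group sample size needed to compare two proportions can be sketched as follows. The response rates, significance level, and power target below are hypothetical choices, and this is a rough approximation rather than a substitute for proper statistical planning:

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect a difference between two
    proportions, using the standard normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return ceil(n)

# Detecting a 50% vs 60% response rate requires hundreds of participants
# per arm; halving the effect size roughly quadruples the requirement.
print(sample_size_two_proportions(0.50, 0.60))
print(sample_size_two_proportions(0.50, 0.55))
```

A study recruited well below these numbers is unlikely to detect the effect even if it is real, which is exactly the bias risk described above.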
