
Introduction to

IBM Watson Analytics


Data Loading and Data Quality

December 16, 2014


Document version 2.0

This document applies to IBM Watson Analytics.


Licensed Materials - Property of IBM
Copyright IBM Corporation 2014.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP
Schedule Contract with IBM Corp.

Contents
IBM Watson Analytics needs your data to start helping you drive insights!
Data loading and file characteristics
Loading data files
Data file sizes and types
Data file structure
Microsoft Excel file restrictions
CSV file restrictions
Data quality
Data quality improvements
Review the Data Quality Report
Add to the breadth and depth of the data
Use your domain knowledge to determine if the results are making sense
Viewing and changing the properties of a field in Predict
Changing the role of a field in Predict
Changing the measurement level of a field in Predict
Browsers currently supported in Watson Analytics

IBM Watson Analytics needs your data to start helping you drive insights!

When you load data into IBM Watson Analytics, the service automatically starts analyzing and
interrogating it for interestingness and quality. It then determines what you might want to
analyze. This document helps you understand the ins and outs of using your data with
Watson Analytics.

Data loading and file characteristics


You can load comma-separated values (.csv) and Microsoft Excel spreadsheet (.xls, .xlsx) files
into IBM Watson Analytics. Currently, Watson Analytics supports structured data files.
Additionally, your data files must meet size and structure requirements.

Loading data files


You must load a data set to IBM Watson Analytics before you can analyze your data.

1. Log in to IBM Watson Analytics.


2. On the Welcome page, click Add.

3. In the Add your data area, add your data set. You can add .csv and Microsoft Excel
spreadsheet files.

Important: If your data is filtered in a Microsoft Excel spreadsheet, the filtered-out data is only
hidden in the spreadsheet, and the full original data set is imported into Watson Analytics.
You can use the filtering options available in the Explore and Assemble capabilities to
filter your data set. Even if you filter the data in an exploration or view, the full data set is
still available if you create a new exploration or view.
After your file loads, it appears on the Welcome page as a data set. Choose a data set to create a
prediction or exploration based on it.

Data file sizes and types


Each individual file that you upload must be smaller than 50 MB with a maximum of 100,000
rows and 50 columns. If you upload a file that exceeds these limits, you will receive an error
message and the file is not loaded.
The overall capacity for all data sets and other assets in your account is 500 MB.
Supported file types are:

Microsoft Excel 97-2003 spreadsheet files (.xls).
Microsoft Excel 2007 and later spreadsheet files (.xlsx).
Comma-separated values files (.csv).
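
The limits above can be checked before you upload a file. The following Python sketch is an illustration only and is not part of Watson Analytics. The function names are hypothetical, 1 MB is assumed to be 1024 * 1024 bytes, and the header row is assumed not to count toward the 100,000-row limit; the document does not state either detail.

```python
import csv
import io
import os

MAX_BYTES = 50 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_ROWS = 100_000            # assumption: the header row is not counted
MAX_COLUMNS = 50
SUPPORTED_EXTENSIONS = {".csv", ".xls", ".xlsx"}


def check_limits(filename, size_bytes, row_count, column_count):
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes >= MAX_BYTES:
        problems.append("file is not smaller than 50 MB")
    if row_count > MAX_ROWS:
        problems.append(f"{row_count} rows exceed the limit of {MAX_ROWS}")
    if column_count > MAX_COLUMNS:
        problems.append(f"{column_count} columns exceed the limit of {MAX_COLUMNS}")
    return problems


def count_csv_shape(text):
    """Return (data rows excluding the header, columns in the header) for CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return 0, 0
    return len(rows) - 1, len(rows[0])
```

For a file on disk, os.path.getsize(path) gives the size in bytes to pass as size_bytes.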

Data file structure


Data files must meet the following structural requirements:

Headers: Because IBM Watson Analytics relies on natural language and matches elements from
the question you ask to elements in the data, files with descriptive column headers are preferred.
Watson Analytics assumes that the first row of your file contains headers.
List Files: List files work best. List files are tabular data, with columns and rows. In Watson
Analytics, we refer to columns as fields and to rows as records. The first row is a header. Watson
Analytics does not currently work with nested headings or row headings.
The following example of a list file works well in Watson Analytics:

The following example of a nested file does not work in Watson Analytics because it contains
row headings and nested headings.

Additionally, data files must have the following characteristics:

You cannot have empty columns inserted before the data.
You must have a header for every column. The number of columns in the header row is
assumed by Watson Analytics to be the number of columns of data. For example, if the
first six columns have headers but there are eight columns of data, the last two columns
of data are ignored.
You can have empty rows above the data. Empty rows preceding the data are ignored.
You cannot have textual rows above the header row. For example, if you have a title or
description of what the data is about above the header row, the file is not read
correctly.
You cannot have textual rows following the data. For example, a row following the data
that says "This information came from" is considered to be part of the data.
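
Some of these structural rules can be checked mechanically. The sketch below is a hypothetical helper, not a Watson Analytics function. It assumes that the rows have already been read into lists of strings, and it treats the first non-empty row as the header, as the rules above describe.

```python
def check_structure(rows):
    """Return a list of structural problems for rows given as lists of cell strings."""

    def is_blank(cell):
        return not cell.strip()

    # Empty rows above the data are allowed, so skip them.
    start = 0
    while start < len(rows) and all(is_blank(cell) for cell in rows[start]):
        start += 1
    if start == len(rows):
        return ["no header row found"]

    header, data = rows[start], rows[start + 1:]
    # The header width ends at the last header cell that contains text.
    width = max(i + 1 for i, cell in enumerate(header) if not is_blank(cell))

    problems = []
    for i in range(width):
        if is_blank(header[i]):
            problems.append(f"column {i + 1} has no header")
    if any(not is_blank(cell) for row in data for cell in row[width:]):
        problems.append(
            f"data beyond column {width} has no header and would be ignored"
        )
    return problems
```

This check cannot tell a title row above the header, or a note row below the data, from real content. Those cases still need a manual look.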

Microsoft Excel file restrictions


Specific conditions apply to Microsoft Excel files:

.xlsx files saved in OpenOffice are currently not supported.
Password-protected Microsoft Excel files are not supported.
Only the first sheet in a Microsoft Excel file is imported, and remaining sheets are
ignored.

CSV file restrictions


Specific conditions apply to comma-separated values (.csv) files:

The file extension must be .csv.


Delimiter symbols must separate the fields. Comma, tab, semicolon, and pipe (|) are
supported.
Quote characters escape literal values. Single quotes and double quotes are supported.
Record separators separate rows. Newline (\n), carriage return (\r) and carriage return
followed by newline (\r\n) are supported.
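
To see which of the supported delimiters a file uses, you can try each one and keep the delimiter that splits every record into the same number of fields. The sketch below uses Python's standard csv module. The function name and the selection rule are assumptions made for illustration and do not describe how Watson Analytics itself detects delimiters.

```python
import csv
import io

# The four delimiters listed above: comma, tab, semicolon, and pipe.
CANDIDATE_DELIMITERS = [",", "\t", ";", "|"]


def guess_delimiter(text, quotechar='"'):
    """Return the delimiter that gives a consistent field count, or None."""
    best, best_width = None, 1
    for delimiter in CANDIDATE_DELIMITERS:
        # newline="" leaves \n, \r, and \r\n record separators to the csv reader.
        reader = csv.reader(
            io.StringIO(text, newline=""),
            delimiter=delimiter,
            quotechar=quotechar,
        )
        widths = {len(row) for row in reader if row}
        if len(widths) == 1:
            width = widths.pop()
            if width > best_width:
                best, best_width = delimiter, width
    return best
```

Pass quotechar="'" for files that use single quotes around literal values.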

Data quality
When a data set is loaded, Watson Analytics creates a data quality report, which includes an
overall average data quality score. The data quality score indicates how ready the data is for
analysis and does not necessarily indicate whether Watson Analytics will provide good predictive
or explorative results. In other words, a low data quality score indicates only that your data is
less ready for analysis; Watson Analytics might still provide useful insights and answers about
your data. The most problematic fields that cause the average data quality score to be low are
usually excluded from analysis. Additionally, some data preparation steps are taken when Watson
Analytics creates a prediction.
Watson Analytics computes the data quality score based on the original data, before any
cleansing or transformation has occurred. The score is an average of the data quality score for
every field in the data set, as determined by missing values, constant values, imbalance,
influential categories, outliers, and skewness. Skewness is a measure of the asymmetry of a
distribution. Symmetry describes how values are distributed on either side of the central value.
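
This document does not say which skewness formula Watson Analytics uses. As an illustration only, the following Python sketch computes the standard moment coefficient of skewness (the third central moment divided by the 1.5 power of the second central moment). A result near 0 indicates a symmetric distribution, a positive result indicates a longer tail of high values, and a negative result indicates a longer tail of low values.

```python
def skewness(values):
    """Moment coefficient of skewness for a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    if second_moment == 0:
        # Every value is identical, so asymmetry cannot be measured.
        raise ValueError("skewness is undefined for a constant field")
    return third_moment / second_moment ** 1.5
```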
There are some things you can do to your data that can help improve the data quality score before
you load the data and the score is calculated.
Before loading your data set, clean your data as much as possible, in the following ways:

Eliminate blank rows.

Exclude summary rows and columns.

Avoid column headings and row headings in the same cell.

Avoid look up tables.

Avoid subtotals and aggregations.
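
Part of this cleanup can be scripted before you load the file. The sketch below is a hypothetical example that works on rows already read into lists of strings. Removing blank rows is mechanical. Finding summary rows is only a heuristic here: the labels in SUMMARY_LABELS are assumed examples, so review any flagged row before deleting it.

```python
# Assumed example labels; adjust them to match your own spreadsheets.
SUMMARY_LABELS = {"total", "subtotal", "grand total"}


def drop_blank_rows(rows):
    """Return the rows that contain text in at least one cell."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def find_summary_rows(rows):
    """Return the indexes of rows whose first cell looks like a summary label."""
    return [
        index
        for index, row in enumerate(rows)
        if row and row[0].strip().lower() in SUMMARY_LABELS
    ]
```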

You can see the score associated with each data set in the list of assets on the Welcome page. In
the following example, 68 is the score assigned to the IBM Sales Sample data set and represents
the data's readiness for analysis.
A score of 68 indicates a data set of medium quality. The lower the score, the higher the
number of outliers, missing values, and other issues associated with some of the fields in
the data set. Again, a poor data quality score indicates only how ready the data is for
analysis and does not indicate the quality of the answers you will get for your queries.

You can access the Data Quality Report in the menu on the Main Insight page in the Predict
capability.

Data quality improvements


In general, there are three things that you can do to improve the quality of your data:

Review the Data Quality Report.


Add to the breadth and depth of the data.
Use your domain knowledge to determine if the results are making sense.

Review the Data Quality Report


In the Predict capability, review the Data Quality Report. It highlights areas where the source data
needs to be cleaned. You can access the Data Quality Report from the menu in Predict.

For example, while looking at the Analysis Details of your prediction, you might see that some
input fields are omitted. Use the Data Quality Report to determine why they were removed and,
perhaps more importantly, whether you should include them.

Watson Analytics might exclude a field from use for various reasons. Use your domain
knowledge to determine whether an excluded field should be included.

Too many categories in the field: If a field contains 50 or more categories, Watson
Analytics ignores it and does not include it in subsequent analyses, even if you
set the field role to Input.
Constant or near-constant fields: If a single value accounts for more than 95% of the
valid values in a field, Watson Analytics sets its field role to None.
However, if you set the field role to Input or Target, Watson Analytics uses it in
subsequent analyses.
For example, let's say that you have a Churn field that is extremely unbalanced:
only 4% of the people in the data are in the churned category. In this case, Watson
Analytics excludes Churn from analysis. However, you know it is an important target
field, so you set its role to Target.

Missing values: Watson Analytics ignores a field when more than 25% of its values
are missing. However, it uses the field if you set its role to Input or
Target. Currently, Watson Analytics does not impute missing values for such a field,
so records with missing values for the field are excluded from subsequent analyses.
Alternatively, you can change the default threshold from 25% to another value in the
dropdown box in the Data Quality Report. Watson Analytics imputes missing values
for fields whose share of missing values is below the threshold and uses the
imputed values in subsequent analyses.
For example, let's say that you have an Age field with 30% of the values missing.
By default, Watson Analytics excludes it because more than 25% of the values are
missing. However, you know that the Age field is an interesting input field that
might explain the new program preference in a viewer survey. So, you might decide
to include it to see how it affects the predictive results.
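
The three default exclusion rules above can be applied to your own data before loading it, to anticipate which fields might be left out. The following Python sketch is an assumption-based illustration and not the Watson Analytics implementation. It represents missing values as None, and it applies the category rule only when the caller marks the field as categorical, because a numeric field such as Age naturally has many distinct values.

```python
from collections import Counter


def default_exclusion_reasons(values, categorical=True, missing_threshold=0.25):
    """Return the reasons a field might be excluded under the default rules."""
    if not values:
        return ["no values"]
    reasons = []
    valid = [v for v in values if v is not None]

    # Rule: more than 25% (by default) of the values are missing.
    if 1 - len(valid) / len(values) > missing_threshold:
        reasons.append("missing values")

    if valid:
        counts = Counter(valid)
        # Rule: 50 or more categories in the field.
        if categorical and len(counts) >= 50:
            reasons.append("too many categories")
        # Rule: a single value accounts for more than 95% of the valid values.
        if counts.most_common(1)[0][1] / len(valid) > 0.95:
            reasons.append("constant or near-constant")
    return reasons
```

An Age field with 30% missing values is reported with the reason "missing values", which matches the example above. Raising missing_threshold to 0.4 clears it.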

Add to the breadth and depth of the data


Adding more rows and columns to the data will often improve the quality of the data. The more
data that IBM Watson Analytics has available to choose from, the more accurate its predictive
and explorative results will be.
Make sure that you follow the appropriate data structures and cleansing procedures before you
add the new data in.

Use your domain knowledge to determine if the results are making sense
You will always need to bring your domain knowledge with you to the analysis part of your
prediction or exploration. IBM Watson Analytics provides you with recommended analytical
starting points and predictive models based on the data you provide it. However, you must
determine what to do with the analysis and recommendations in order to create an appropriate
response.
For example, let's say you are an HR professional trying to analyze employee attrition. In this
case, Watson Analytics may initially determine that whether an employee had an exit interview is
a near-perfect predictor of whether they have left the company. However, with your domain
knowledge, you know that exit interviews are not a useful predictor of future attrition. In this
situation, you could choose to change the role of the Exit Interview input field from Input to None
and exclude it completely from the analysis.
Similarly, while Watson Analytics does its best to determine what questions you want to answer
with your data, there is no substitute for your own expertise. For example, if you are examining
payments received from customer accounts, Watson Analytics may initially determine that you
want to be able to predict the amount on the invoice. However, in fact you want to predict
whether a customer will pay the invoice by the due date. You can change the Targets identified
by Watson Analytics in order to influence how it interprets the data.

Viewing and changing the properties of a field in Predict


In the Predict capability, you can view and change the properties of a field in the Field Properties
area to specify its role in a prediction. Additionally, you can specify the measurement level for a
field. You can also view the interestingness of a field. You can access the Field Properties area
from the menu in Predict.

Changing the role of a field in Predict


You can change the role of one or more input fields in the Predict capability by selecting a new
role for the data.

A field can have one of several roles:

Input: Most fields are input fields. Input fields are fields whose values might
influence another field. For example, if you were conducting a study to analyze the
effect of salary on overall happiness, salary is an input field.
Target: Although input fields are the most common, target fields are the most
important. Target fields are the fields whose outcome you are interested in predicting.
Target fields are influenced by input fields. You cannot have more than five targets in
a workbook.
Record ID: Record ID fields are not used in the analysis. These fields are used for
labeling but do not provide any analytical substance.
None: Fields with a role of None are those fields that are not used in a prediction.
These fields might have too much missing data or might be fields that you choose to
exclude as they include standard data, such as counts that are identical through the
length of the column. A field might have a role of None because it was excluded
automatically by Watson Analytics. Alternatively, you might decide to exclude a
field because of your domain knowledge that it is not important for the analysis that
you want to perform.
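
The roles and the five-target limit can be pictured as a small data model. The sketch below is purely illustrative: Watson Analytics sets roles through the Field Properties area, and the Role class and set_role function are hypothetical names introduced here, not a product API.

```python
from enum import Enum


class Role(Enum):
    INPUT = "Input"
    TARGET = "Target"
    RECORD_ID = "Record ID"
    NONE = "None"


MAX_TARGETS = 5  # a workbook cannot have more than five targets


def set_role(roles, field, new_role):
    """Return a copy of the field-to-role mapping with one role changed."""
    updated = {**roles, field: new_role}
    target_count = sum(1 for role in updated.values() if role is Role.TARGET)
    if target_count > MAX_TARGETS:
        raise ValueError(f"more than {MAX_TARGETS} target fields")
    return updated
```

For example, set_role(roles, "Exit Interview", Role.NONE) models excluding a field from a prediction on the basis of domain knowledge.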

Changing the measurement level of a field in Predict


You can change the measurement level of a field in the Predict capability to improve the accuracy
of your prediction.


A field can have one of several measurement levels:

Nominal: A nominal field is a field with a limited number of distinct values that have
no inherent order or ranking. Examples of nominal fields include department, region,
postal code, and religious affiliation.
Ordinal: An ordinal field is a field with a limited number of distinct values that have
an inherent order or ranking. Examples of ordinal fields include attitude scores that
represent the degree of satisfaction or confidence and preference rating scores. Like
continuous fields, ordinal fields can be measured numerically. However, unlike
continuous fields, distance comparisons between values are not appropriate.
Continuous: A continuous field is measured numerically so that distance
comparisons between values are appropriate. Examples of continuous fields include
age in years and income in thousands of dollars.

Browsers currently supported in Watson Analytics


The following table lists the browsers, browser operating systems, and browser versions that are
currently supported by IBM Watson Analytics.

Browser              Browser OS    Versions
Chrome *             Windows       37 and later
Mozilla Firefox      Windows       31 and later, 31 ESR
Internet Explorer    Windows       11
Safari               MacOS         6 and 7

* Chrome is the recommended browser for use with IBM Watson Analytics.

