
3.

Data quality depends on the intended use of the data. Two different users may assess the quality of the same database very differently. For example, a marketing analyst may need to access a database for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall 80% of the addresses are accurate. The marketing analyst considers this an accurate, large customer database for target-marketing purposes, whereas a sales manager would find the same data inaccurate.
There are many possible reasons for inaccurate data: the data collection instruments used may be faulty, or human or computer errors may have occurred at data entry. Users may also purposely submit incorrect values for mandatory fields when they do not wish to disclose personal information. For example, entering the 1st of January as the birth date means that several users would incorrectly receive birthday wishes or offers on that day; however, the rest of the database may still be usable for a marketing or sales campaign. Incorrect data may also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields.
Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, e.g. customer information for sales transaction data. Other data may not have been included simply because they were not considered important at the time of entry. Relevant data may also fail to be recorded due to a misunderstanding or because of equipment malfunctions. In such cases, the data may not be of value to one team but still be valuable to another: the missing sales attributes may make the data inaccurate for the sales manager, while the recorded transactions remain valuable and accurate for the accounting team.

Two other dimensions that can be used to assess the quality of data are timeliness and believability.
Timeliness: data must be available within a time frame that allows them to be useful in the decision-making process.
Believability: data values must lie within the range of possible results in order to be useful for decision making.
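As a quick illustration of the believability criterion, a record can be flagged whenever a value falls outside its plausible range. The following is a minimal Python sketch; the field names and ranges are illustrative assumptions, not part of the exercise.

    # Minimal sketch of a believability (range) check; the field names and
    # plausible ranges below are assumptions chosen for illustration.
    PLAUSIBLE_RANGES = {
        "age": (0, 120),           # assumed plausible range for a customer's age
        "monthly_spend": (0, 1e6)  # assumed plausible range for monthly spend
    }

    def flag_unbelievable(record):
        """Return the fields whose values fall outside their plausible range."""
        return [field for field, (lo, hi) in PLAUSIBLE_RANGES.items()
                if field in record and not lo <= record[field] <= hi]

    print(flag_unbelievable({"age": 230, "monthly_spend": 150.0}))  # ['age']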

3.2
When we are faced with tuples that have missing values, one solution is simply to ignore the tuple. This is usually done when the class label is missing. The method is most effective when the tuple contains several attributes with missing values; it is very ineffective when the percentage of missing values per attribute varies considerably.
Manually filling in the missing value: in general, this approach is time-consuming and may not be feasible for large data sets with many missing values, especially when the value to be filled in is not easily determined.
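As an illustration of the first strategy, here is a minimal Python sketch, assuming pandas and a small hypothetical table with a class_label column:

    import pandas as pd

    # Hypothetical dataset with a missing class label and a missing attribute.
    df = pd.DataFrame({
        "income":      [45000, 52000, None, 61000],
        "class_label": ["yes", None, "no", "yes"],
    })

    # "Ignoring the tuple": drop rows whose class label is missing.
    cleaned = df.dropna(subset=["class_label"])
    print(cleaned)

Manual filling, by contrast, has no such one-step shortcut, which is why it scales so poorly to large data sets.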

3.3
(a)
The steps to perform smoothing by bin means with a depth of 3 are as follows:
Step a: sort the data
Step b: partition the data into equal-frequency (equi-depth) bins of depth 3
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70

Step c: calculate the arithmetic mean of each bin
Bin 1: 14 2/3 Bin 2: 18 1/3 Bin 3: 21
Bin 4: 24 Bin 5: 26 2/3 Bin 6: 33 2/3
Bin 7: 35 Bin 8: 40 1/3 Bin 9: 56

Step d: replace each value by the arithmetic mean calculated for its bin
Bin 1: 14 2/3, 14 2/3, 14 2/3 Bin 2: 18 1/3, 18 1/3, 18 1/3 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26 2/3, 26 2/3, 26 2/3 Bin 6: 33 2/3, 33 2/3, 33 2/3
Bin 7: 35, 35, 35 Bin 8: 40 1/3, 40 1/3, 40 1/3 Bin 9: 56, 56, 56
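The same computation can be expressed as a short Python sketch; the data values are those of the exercise, and Fraction is used so the means stay exact:

    from fractions import Fraction

    # Sketch of smoothing by equal-frequency (equi-depth) bin means, depth 3,
    # on the sorted exercise data; Fraction keeps the means exact (e.g. 44/3).
    data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
            30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

    depth = 3
    smoothed = []
    for i in range(0, len(data), depth):
        bin_values = data[i:i + depth]
        mean = Fraction(sum(bin_values), len(bin_values))  # arithmetic mean
        smoothed.extend([mean] * len(bin_values))          # replace all values

    print(smoothed[:3])  # [Fraction(44, 3), ...] three times, i.e. 14 2/3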

(b) Outliers are values that do not fall within the set of clusters. They can be detected using clustering, where similar values are organized into groups (clusters); values that fall outside all of the clusters may be considered outliers. Alternatively, a computer can check the data against a predetermined data distribution and flag possible outliers automatically. The flagged candidates are then verified by a person, which requires far less effort than manually sorting through the entire initial data set.
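A minimal sketch of the clustering-based approach, assuming scikit-learn's KMeans; the choice of three clusters and the distance cutoff are assumptions made for illustration, and the flagged values would still go to a person for verification:

    import numpy as np
    from sklearn.cluster import KMeans

    # Sketch of clustering-based outlier detection on the exercise data;
    # the cluster count and cutoff rule are assumptions, not given above.
    data = np.array([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
                     30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70],
                    dtype=float).reshape(-1, 1)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
    centers = km.cluster_centers_[km.labels_]   # each point's own centroid
    dist = np.abs(data - centers).ravel()       # distance to that centroid

    cutoff = dist.mean() + 2 * dist.std()       # assumed flagging rule
    print(data.ravel()[dist > cutoff])          # candidates for human review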

(c) Data smoothing can also be accomplished with smoothing by bin medians or smoothing by bin boundaries. Binning may likewise be implemented with equi-width bins, where the interval range of values in each bin is kept constant. Alternatively, regression techniques such as linear or multiple regression can smooth the data by fitting them to a function. Concept hierarchies that use classification techniques can also smooth data by rolling up lower-level concepts to higher-level concepts.
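As an illustration, smoothing by bin boundaries can be sketched as follows; the helper name is hypothetical, and only the first bin of the exercise data is shown:

    # Sketch of smoothing by bin boundaries for a single equi-depth bin: each
    # value is replaced by whichever of the bin's minimum or maximum is closer.
    def smooth_by_boundaries(bin_values):
        lo, hi = min(bin_values), max(bin_values)
        return [lo if v - lo <= hi - v else hi for v in bin_values]

    print(smooth_by_boundaries([13, 15, 16]))  # Bin 1 becomes [13, 16, 16]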

3.4
Issues that must be considered during data integration include:

Schema integration: attributes and entities from different data sources must be matched up to the equivalent real-world entities. This is also known as the entity identification problem.

Dealing with redundant data: duplication may occur at the tuple level. A derived attribute may also be redundant, and inconsistent attribute naming can likewise lead to redundancies in the resulting data set.

Detection and resolution of data value conflicts: differences in representation, encoding, or scaling may cause the attribute values of the same real-world entity to differ across the data sources being integrated.
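As an illustration of the redundancy issue, correlation analysis can flag a derived numeric attribute during integration; the figures and the 0.9 threshold below are illustrative assumptions:

    import numpy as np

    # Sketch: flagging a redundant (derived) numeric attribute during
    # integration via the correlation coefficient; the data are made up.
    annual_income  = np.array([45.0, 52.0, 61.0, 38.0, 70.0])
    monthly_income = annual_income / 12.0   # attribute derived in one source

    r = np.corrcoef(annual_income, monthly_income)[0, 1]
    if abs(r) > 0.9:                        # assumed redundancy threshold
        print(f"correlation {r:.2f}: attributes are likely redundant")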
