Jessica Roper
Code can also parse and store values in a format that is not compatible with the databases used to store and process the data, such as YAML (YAML Ain't Markup Language), which is not a valid data type in some databases and is instead stored as a string. Because this format is intended to work much like a hash with key-and-value pairs, searching it with the database's query language can be difficult.
Also, code design can inadvertently produce a table that holds data for many different, unrelated models (such as categories, address, name, and other profile information) that is also self-referential. For example, the dataset in Table 1-1 is self-referential, wherein each row has a parent ID representing the type or category of the row. The value of the parent ID refers to the ID column of the same table. In Table 1-1, all information around a User Profile is stored in the same table, including labels for profile values, resulting in some values representing labels whereas others represent final values for those labels. The data in Table 1-1 shows that Mexico is a Country, part of the User Profile, because the parent ID of Mexico is 11, the ID for Country, and so on. I've seen this kind of example in the real world, and this format can be difficult to query. I believe this relationship was mostly the result of poor design. My guess is that, at the time, the idea was to keep all "profile-like" things in one table and, as a result, relationships between different parts of the profile also needed to be stored in the same place.
Table 1-1. Self-referential data example (source: Jessica Roper and Brian Johnson)

ID | Parent ID | Value
16 | 11        | Mexico
11 | 9         | Country
9  | NULL      | User Profile
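To make the querying difficulty concrete, here is a minimal Python sketch (the rows mirror Table 1-1; the helper is illustrative, not from the original system) that resolves a value's category by walking the parent chain. In SQL, the same lookup requires one self-join per level of nesting, or a recursive query.

# Resolving a value's category in a self-referential table requires
# walking the parent chain row by row.
rows = [
    {"id": 16, "parent_id": 11, "value": "Mexico"},
    {"id": 11, "parent_id": 9, "value": "Country"},
    {"id": 9, "parent_id": None, "value": "User Profile"},
]

by_id = {row["id"]: row for row in rows}

def parent_chain(row_id):
    """Return the chain of values from a row up to its root."""
    chain = []
    current = by_id.get(row_id)
    while current is not None:
        chain.append(current["value"])
        current = by_id.get(current["parent_id"])
    return chain

print(parent_chain(16))  # ['Mexico', 'Country', 'User Profile']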
Data quality is important for a lot of reasons, chiefly that it's difficult to draw valid conclusions from partial or inaccurate data. With a dataset that is too small, skewed, inaccurate, or incomplete, it's easy to draw invalid conclusions. Organizations that make data quality a priority are said to be "data driven"; to be a data-driven company means priorities, features, products used, staffing, and areas of focus are all determined by data rather than intuition or personal experience. The company's success is also measured by data. Other things that might be measured include ad impression inventory, user …

Figure 1-1. Ad service log example (source: Jessica Roper and Brian Johnson)

Figure 1-2. Data layers (source: Jessica Roper and Brian Johnson)
When errors are found, understanding the sample size affected will also help you to determine the severity and priority of the errors. I generally try to ensure that the amount of data ignored in each step makes up less of the dataset than the allowable error so that the combined error will still be within the acceptable rate. For example, if we allow an error rate of one-tenth of a percent in each of 10 products, we can assume that the total error rate is still around or less than 1 percent, which is the overall acceptable error rate.
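The arithmetic behind that budget is simple enough to sketch in Python; the rates here are the ones from the example above.

# Worst case assumes the per-step errors never overlap, so the combined
# error is at most the sum of the per-step rates: 10 * 0.1% = 1%.
per_step_rate = 0.001   # allowable error per product/step
steps = 10
overall_budget = 0.01   # overall acceptable error rate

combined_upper_bound = per_step_rate * steps
print(f"worst-case combined error: {combined_upper_bound:.1%}")  # 1.0%
# This stays within the 1 percent budget, so each step's tolerance is safe.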
After a problem is identified, the next goal is to find examples of failures and identify patterns. Some patterns to look for are failures on the same day of the week, at the same time of day, or from only a small set of applications or users. For example, we once found a pattern in which a web page's load time increased significantly every Monday at the same time during the evening. After further digging, this information led us to find that a database backup was locking large tables and causing slow page loads. To account for this, we added extra servers and dedicated one of them to backups so that performance would be retained even during backups. Any data available that can be grouped and counted to check for patterns in the problematic data can be helpful. This can assist in identifying the source of issues and in better estimating the impact the errors could have.
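As a rough illustration of "group and count," the Python sketch below buckets page-load samples by weekday and hour and flags buckets far above the overall median; the fields, numbers, and threshold are invented for the example, not from the actual incident.

from collections import defaultdict
from statistics import median

# (weekday, hour, load_time_ms) samples; in practice these come from logs.
samples = [
    ("Mon", 21, 3400), ("Mon", 21, 3900), ("Tue", 21, 420),
    ("Wed", 21, 450), ("Mon", 10, 410), ("Fri", 21, 430),
]

buckets = defaultdict(list)
for weekday, hour, load_ms in samples:
    buckets[(weekday, hour)].append(load_ms)

overall = median(ms for _, _, ms in samples)
for key, values in buckets.items():
    avg = sum(values) / len(values)
    if avg > 3 * overall:  # threshold is illustrative
        print(f"anomalous bucket {key}: avg {avg:.0f} ms vs median {overall:.0f} ms")

A Monday-evening bucket standing far above every other bucket is exactly the kind of pattern that pointed us to the backup job.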
Examples are key to providing developers, or whoever is doing data processing, with insight into the cause of the problem and providing a solution or way to account for the issue. A bug that cannot be reproduced is very difficult to fix or understand. Determining patterns will also help you to identify how much data might be affected and how much time should be spent investigating the problem.
The more that data is consumed, the more valuable it will be and the more likely the system will evolve to improve accuracy. It will require time and effort to coordinate between those who consume data and those who produce it, and some executive direction will be necessary to ensure that coordination. When managers or executives utilize data, collection and accuracy will be easier to make a priority for other teams helping to create and maintain that data. This does not mean that collection and accuracy aren't important before higher-level adoption of data, but it can be a much longer and more difficult process to coordinate between teams and maintain data.
One effective way to influence this coordination among team members is consistently showing metrics in a view that's relevant to the audience. For example, to show the value of data to a developer, you can compare usage of products before and after a new feature is added to show how much of an effect that feature has had on usage. Another way to use data is to connect unrelated products or data sources together to be used in a new way.
As an example, a few years ago at Spiceworks, each product (and even different features within some products) had individual definitions for categories. After weeks of work to consolidate the categories and create a new way to manage and maintain them that was more versatile, it took additional effort and coordination to educate others on the team about the new system, and it took work with individual teams to help enable and encourage them to apply the new system. The key to getting others to adopt the new system was showing value to them. In this case, my goal was to show value by making it easier to connect different products, such as how-tos and groups for our online forums.
There were only a few adopters in the beginning, but each new adopter helped to push others to use the same definitions and processes for categorization, very slowly, but always in the same consistent direction. As we blended more data together, a unified categorization grew in priority, making it more heavily adopted and used. Now the new system is widely used, fulfilling the potential we saw when building it initially several years ago. It took time for team members to see the value in the new system and to ultimately adopt it, but as soon as the tipping point was crossed, the work already put in made final adoption swift and easy in comparison.
Figure 1-4. UDP workflow (source: Jessica Roper and Brian Johnson)
We included tests such as creating page views from several IP addresses, from different locations, and from every product and critical URL path. We also used grep (a Unix command used to search files for occurrences of a string of characters that matches a specified pattern) to search the logs for more URLs, locations, and IP addresses to smoke-test for data completeness and understand how different fields were recorded. During this investigation, we saw that several fields had values that looked like IP addresses. Some of them actually had multiple listings, so we needed to investigate further to learn what each of those fields meant. In this case, it required further research into how event headers are set and what the nonsimple results meant.
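The same kind of smoke test can be scripted. Here is a small Python sketch (the log line and pattern are illustrative, not our actual log format) that surfaces fields carrying one or several IP-shaped values, the situation we had to investigate.

import re

# Matches tokens shaped like IPv4 addresses anywhere in a line.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def fields_that_look_like_ips(log_line):
    """Return every IP-shaped token in a log line, so fields that
    unexpectedly carry IP addresses (or several of them) stand out."""
    return IP_RE.findall(log_line)

line = '10.0.0.5 - GET /app/page "x-forwarded-for: 10.0.0.5, 192.168.1.9"'
print(fields_that_look_like_ips(line))
# ['10.0.0.5', '10.0.0.5', '192.168.1.9'] -> multiple listings to investigate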
When you cannot test or validate the raw data formally to understand the flow of the system, try to understand as much about the raw data as possible, including the meaning behind different components and how everything connects together. One way to do this is to observe the data relationships and determine outlier cases by understanding boundaries. For example, if we had been unable to observe how the logs were populated, my search would have included looking at counts for different fields.
Even with source data testing, we manually dug through the logs to investigate what each field might mean and defined a "normal" request. We also looked at the counts of IP addresses to see which might have had overly high counts or any other anomalies. We looked for cases in which fields were null to understand why and when this occurred and to understand how different fields seemed to correlate. This raw data testing was a big part of becoming familiar with the data and defining parameters and expectations.
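A minimal sketch of that familiarization pass, assuming a simplified record layout (the records and field names are invented), counts values and nulls per field so that overly common values and null patterns stand out:

from collections import Counter

records = [
    {"ip": "10.0.0.5", "user_id": "u1", "url": "/a"},
    {"ip": "10.0.0.5", "user_id": None, "url": "/b"},
    {"ip": "172.16.0.2", "user_id": "u2", "url": "/a"},
]

for field in ("ip", "user_id", "url"):
    values = [r[field] for r in records]
    nulls = sum(v is None for v in values)
    # Most common non-null values reveal suspiciously dominant entries.
    top = Counter(v for v in values if v is not None).most_common(3)
    print(f"{field}: {nulls} nulls, most common: {top}")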
When digging in, you should try to identify and understand the relationship between components; in other words, determine which relationships are one-to-one versus one-to-many or many-to-many between different tables and models. One relationship that was important to keep in mind during the UDP was the many-to-many relationship between application installations and users. Users can have multiple installations and multiple users use each installation, so overlap must be accounted for in testing. Edge-case parameters of the datasets are also important to understand, such as determining the valid minimum and maximum values for different fields.
Finally, in this step you want to understand what is common for the dataset, essentially identifying the "normal" cases. When investigating the page views data for products, the normal case had to be defined for each product separately. For example, in most cases users were expected to be logged in, but in a couple of our products, being a visitor was actually more common; so, the normal use case for those products was quite different.
Tests also included verifying that log files did in fact exist (there should be several for each day) as well as checking that the number of log entries matched the total rows in the initial table.
For this test, two senior developers created a complex grep regular-expression command to search the logs. They then had both searches compared and reviewed by other senior staff members, working together to define the best clause. One key part of the exercise included determining which rows were included in one expression's results but not in the others, and then using those to determine the best patterns to employ.
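A simple way to run that comparison in Python rather than grep (the patterns and log lines here are placeholders, not the expressions we actually used) is to collect each expression's matches as sets and diff them:

import re

pattern_a = re.compile(r"GET /app/\S+")      # broader candidate
pattern_b = re.compile(r"GET /app/[a-z]+$")  # stricter candidate

log_lines = ["GET /app/page1", "GET /app/page", "GET /static/img"]

matches_a = {line for line in log_lines if pattern_a.search(line)}
matches_b = {line for line in log_lines if pattern_b.search(line)}

# Rows matched by one expression but not the other are the ones to inspect.
print("only in A:", matches_a - matches_b)  # {'GET /app/page1'}
print("only in B:", matches_b - matches_a)  # set()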
We also dug into any rows the grep found that were not in the table, and vice versa. The key goal in this first set of tests was to ensure that the source data and final processing matched before filtering out invalid and outlier data.
For this project, tests included comparing totals such as unique user identification counts, validating that the null data found was acceptable, checking for duplicates, and verifying proper formats of fields such as IP addresses and the unique application IDs that are assigned to each product installation. This required an understanding of which data pieces are valid when null, how much data with null fields could be tolerated, and what is considered valid formatting for each field.
In the advertising service logs, some fields should never be null, such as IP address, timestamp, and URL. This set of tests identified cases that excluded IP addresses completely by incorrectly parsing the IP set. This prompted us to rework the IP parsing logic to get better-quality data. Some fields, such as user ID, were valid in some cases when null; only some applications require users to be logged in, so we were able to use that information as a guide. We validated that user ID was present in all products and URLs that are accessible only by logged-in users. We also ensured that visitor traffic came only from applications for which login was not required.
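A sketch of these null and format checks, with illustrative field names and an assumed rule for when user ID may be null (the real rules depend on each product):

import re

IP_RE = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")
NEVER_NULL = ("ip", "timestamp", "url")

def validate(row, login_required):
    errors = []
    # Fields that should never be null in the ad service logs.
    for field in NEVER_NULL:
        if row.get(field) is None:
            errors.append(f"{field} is null")
    if row.get("ip") and not IP_RE.match(row["ip"]):
        errors.append(f"malformed ip: {row['ip']!r}")
    # user_id may be null only for products that do not require login.
    if login_required and row.get("user_id") is None:
        errors.append("user_id null on login-required product")
    return errors

row = {"ip": "10.0.0", "timestamp": "2016-12-01T10:00", "url": "/a", "user_id": None}
print(validate(row, login_required=True))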
You should carry out each test in isolation to ensure that your test
results are not corrupted by other values being checked. This helps
you to find smaller edge cases, such as small sections of null data.
For our purposes at Spiceworks, by individually testing for each classification of data and each column of data, we found small edge cases for which most of the data seemed normal but had only one field that was invalid in a small but significant way. All relationships existed for the field, but when examined at the individual level, we saw the field was invalid in some cases, which led us to find some of the issues described later, such as invalid IP addresses and a significant but small number of development installations that needed to be excluded because they skewed the data with duplicate and invalid results.
Once the raw logs were parsed into table format, more information was appended and the tables were summarized for easier analysis. One of the goals for these more processed tables, which included any categorizations and final filtering of invalid data such as that from development environments, was to indicate total page views of a product per day by user. We verified that rows were unique across user ID, page categorization, and date.
Usually at this point, we had the data in some sort of database, which allowed us to do this check by simply writing a query that selected and counted the columns that should be unique and ensured that the resulting counts all equal 1. As Figure 1-5 illustrates, you also can run this check in Excel by using the Remove Duplicates action, which will return the number of duplicate rows that are removed. The goal is for zero rows to be removed, showing that the data is unique.
Figure 1-5. Using Excel to test for duplicates (source: Jessica Roper)
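The same check is easy to script. In SQL it is a GROUP BY over the key columns with HAVING COUNT(*) > 1; the Python sketch below (field values invented) counts key occurrences directly:

from collections import Counter

# (user ID, page categorization, date) should uniquely identify a row.
rows = [
    ("u1", "how-to", "2016-12-01"),
    ("u1", "forum", "2016-12-01"),
    ("u1", "how-to", "2016-12-01"),  # duplicate key
]

counts = Counter(rows)
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)  # {('u1', 'how-to', '2016-12-01'): 2} -> should be empty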
The goal for this level of testing is to validate that aggregate and appended data has as much integrity as the initial dataset. Here are some questions that you should ask during this process:

• If values are split into columns, do the columns add up to the total?
• Are any values negative that should not be, such as a calculated "other" count?
• Is all the data from the source found in the final results?
• Is there any data that should be filtered out but is still present?
• Do appended counts match totals of the original source?
As an example of the last point, when dealing with the Spiceworks advertising service, there are a handful of sources and services that can make a request for an ad and cause a log entry to be added. Some different kinds of requests included new organic pageviews, ads refreshing automatically, and requests from pages with ad-block software. One test we included checked that the total requests equaled the sum of the different request types (see the sketch below). As we built this report and continued to evolve our understanding of the data and the requirements for the final results, the reportable tables and tests also evolved. The test process itself helped us to define some of this when outliers or unexpected patterns were found.
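That totals test reduces to a one-line assertion; the column names and numbers below are illustrative, not our actual report schema:

# The request-type breakdown should sum exactly to the reported total.
report_row = {
    "total_requests": 1000,
    "organic_pageviews": 700,
    "ad_refreshes": 250,
    "ad_blocked": 50,
}

breakdown = ("organic_pageviews", "ad_refreshes", "ad_blocked")
assert report_row["total_requests"] == sum(report_row[c] for c in breakdown), \
    "breakdown does not sum to total"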
… end users are neither expected to interact directly with the application nor are they considered official users of it. We wanted to define what assumptions were testable in the new dataset from the ad service. For example, at least one user should be active on every installation; otherwise no page views could be generated. Also, there should not be more active users than the total number of users associated with an installation.
After we defined all the expectations for the data, we built several tests. One tested that each installation had fewer active users than the total associated with it. More important, however, we also tested that total active users, trended over time, were consistent with trends for total active installations and total users registered to the application (Figure 1-6). We expected the trend to be consistent between the two values and to follow the same patterns of usage for time of day, day of week, and so on. The key to this phase is having as much understanding as possible of the data, its boundaries, how it is represented, and how it is produced so that you know what to expect from the test results. Trends usually will match generally, but in my experience, it's rare for them to match exactly.
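A minimal sketch of that trend comparison, with invented weekly counts and an illustrative tolerance, flags weeks where the two series move in different directions or by very different amounts:

active_users = [100, 110, 125, 140, 90]
active_installs = [40, 44, 50, 56, 57]

def weekly_changes(series):
    """Relative change between consecutive points."""
    return [(b - a) / a for a, b in zip(series, series[1:])]

for week, (u, i) in enumerate(zip(weekly_changes(active_users),
                                  weekly_changes(active_installs)), start=1):
    # Expect the same direction and roughly similar magnitude of change.
    if (u > 0) != (i > 0) or abs(u - i) > 0.10:
        print(f"week {week}: user trend {u:+.1%} diverges from install trend {i:+.1%}")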
Figure 1-7. Hypothetical example of page view counts over time vetting
(source: Jessica Roper and Brian Johnson)
For example, in the UDP, we looked at how page views by product change from month to month and compared that to the growth of the most recent month. We verified that the month-to-month change we saw as new data arrived was stable over time, and that dips or spikes were seen only when expected (e.g., when a new product was launched). We used the average over several months to account for anomalies caused by months with several holidays during the week. We wanted to identify thresholds and data-existence expectations.
During this testing, we found several issues, including a failing log copy process and products that stopped sending up data to the system. This test verified that each product was present in the final dataset. Using this data, we were able to identify a problem with ad server tracking in one of our products before it caused major problems. This kind of issue was previously difficult to detect without time-series analysis.
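A sketch of such a month-over-month monitor (the counts and the 25 percent threshold are invented for illustration) compares the latest month to a trailing average:

# Compare the most recent month to the average of prior months and flag
# dips or spikes beyond a threshold; averaging smooths holiday-heavy months.
monthly_pageviews = {"product_a": [980, 1010, 1000, 420]}  # last month dipped

for product, counts in monthly_pageviews.items():
    *history, latest = counts
    baseline = sum(history) / len(history)
    change = (latest - baseline) / baseline
    if abs(change) > 0.25:
        print(f"{product}: {change:+.1%} vs trailing average -- investigate")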
We knew the number of active installations for different products and the total users associated with each of those installations, but we could not determine which users were actually active before the new data source was created. To validate the new data and these active user counts, we ensured that the total number of users we saw making page views in each product was higher than the total number of installations, but lower than the total associated users, because not all users in an installation would be active.
One product had tracking in place for three months while in beta, but when the final product was deployed, the tracking was removed. Our monitors discovered this issue by detecting a drop in total page views for that product category, allowing us to dig in and correct the issue before it had a large impact.
There are other monitors we also added that do not focus heavily on trends over time. Rather, they ensured that we would see the expected number of total categories and that the directory containing all the files being processed had the minimum number of expected files, each with the minimum expected size (see the sketch below). This was determined to be critical because we found one issue in which some log files were not properly copied for parsing, and therefore significant portions of data were missing for a day. Missing even only a few hours of data can have large effects on different product results, depending on what part of the day is missing from our data. These monitors helped us to ensure that data copy processes and sources were updated correctly and provided high-level trackers to make sure the system is maintained.
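A minimal version of that file-level monitor, with a hypothetical path and illustrative thresholds, needs only the standard library:

from pathlib import Path

LOG_DIR = Path("/var/logs/ad-service")  # hypothetical location
MIN_FILES = 24          # e.g., one file per hour
MIN_SIZE_BYTES = 1024   # illustrative floor for a non-empty log

files = list(LOG_DIR.glob("*.log"))
problems = []
if len(files) < MIN_FILES:
    problems.append(f"only {len(files)} files, expected >= {MIN_FILES}")
problems += [f"{f.name} too small" for f in files
             if f.stat().st_size < MIN_SIZE_BYTES]

if problems:
    print("log copy check failed:", *problems, sep="\n  ")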
As with other testing, the monitors can change over time. In fact, we did not start out with a monitor to ensure that all the files being processed were present and the correct sizes. The monitor was added when we discovered data missing after running a very long process. When new data or a new data process is created, it is important to treat it skeptically until no new issues or questions have been found for a reasonable amount of time. What counts as reasonable is usually related to how the processed data is consumed and used.
Much of the data I work with at Spiceworks is produced and analyzed monthly, so we monitor the system closely, and heavily by hand, until the process has run fully and successfully for several months. This included working closely with our analysts as they worked with the data to find any potential issues or remaining edge cases in the data. Anytime we found a new issue or unexpected change, a new monitor was added. Monitors were also updated over time to be more tolerant of acceptable changes. Many of these monitors were less about the system (there are different kinds of tests for that) and more about data integrity and ensuring reliability.
Finally, another way to monitor the system is to provide end users with a dead-easy way to raise an issue the moment an inaccuracy is discovered, and, even better, let them fix it. If you can provide a tool …
Implementing Automation
At each layer of testing, automation can help ensure long-term reliability of the data and quickly identify problems during development and process updates. This can include unit tests, trend alerts, or anything in between. These are valuable for products that are being changed frequently or require heavy monitoring.
In the UDP, we automated almost all of the tests around transformations and aggregations, which allowed for shorter test cycles while iterating through the process and provided long-term stability monitoring of the parsing process in case anything changes in the future or a new system needs to be tested.
Not all tests need to be automated or created as monitors. To determine which tests should be automated, I try to focus on three areas:

• Overall totals that indicate system health and accuracy
• Edge cases that have a large effect on the data
• How much effect code changes can have on the data
There are four general levels of testing, and each of these levels generally describes how the tests are implemented:

Unit
    Unit tests focus on single complete components in isolation.

Integration
    Integration tests focus on two components working together to build a new or combined dataset.

System
    System tests verify the infrastructure and overall process itself as a whole.

Acceptance
    Acceptance tests validate data as reasonable before publishing or appending datasets.
In the UDP, because having complete sets of logs was critical, a separate system-level test was created to run before the rest of the process to ensure that data for each day and hour could be identified in the log files. This approach further ensured that critical and difficult-to-find errors would not go unnoticed. Other tests we focused on were between transformations of the data, such as comparing initially parsed logs as well as aggregate counts of users and total page views. Some tests, such as categorization verification, were done only manually, because most changes to the process should not affect this data and any change in categorization would require more manual testing either way. Different tests require different kinds of automation; for example, we created an automated test to validate the final reporting tables, which included a column for total impressions as well as the breakdown by type of impression, based on that impression being caused by a new page view versus an ad refresh, and so on. This test was implemented as a unit test to ensure that at a low level the total was equal to the sum of the page view types.
Another unit test included creating samples for the log parsing logic, including edge cases as well as both common and invalid examples. These were fed through the parsing logic after each change to it as we discovered new elements of the data. One integration test included in the automation suite ensured that country data from the third-party geographical dataset was valid and present. The automated tests for data integrity and reliability using monitors and trends were done at the acceptance level, after processing, to ensure valid data that followed the expected patterns before publishing it. Usually when automated tests are needed, there will be some at every level.
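A sketch of the sample-based parser test, using a toy parser as a stand-in for the real parsing logic (the samples and expectations are illustrative):

import unittest

def parse_line(line):
    """Toy parser: expects 'ip<space>url'; returns None for invalid lines."""
    parts = line.split()
    return {"ip": parts[0], "url": parts[1]} if len(parts) == 2 else None

class ParserSamples(unittest.TestCase):
    def test_common_case(self):
        self.assertEqual(parse_line("10.0.0.5 /app"),
                         {"ip": "10.0.0.5", "url": "/app"})

    def test_invalid_line_rejected(self):
        self.assertIsNone(parse_line("garbage"))

if __name__ == "__main__":
    unittest.main()

Rerunning a suite like this after each change to the parsing logic is what made it safe to keep evolving the parser as we discovered new elements of the data.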
It is helpful to document test suites and coverage, even if they are not automated immediately or at all. This makes it easy to review tests and coverage, and it allows new or inexperienced testers, developers, and so on, to assist in automation and manual testing. Usually, I just record tests as they are manually created and executed. This helps to document edge cases and other expectations and attributes of the data.
As needed, when critical tests were identified, we worked to automate those tests to allow for faster iterations when working with the data. Because almost all code changes required some regression testing, covering critical and high-level tests automatically provided easy smoke testing for the system and gave some confidence in the continued integrity of the data when changes were made.
Conclusion
Having confidence in data accuracy and integrity can be a daunting task, but it can be accomplished without having a Ph.D. or background in data analysis. Although you cannot use some of these strategies in every scenario or project, they should provide a guide for how you think about data verification, analysis, and automation, as well as give you the tools and ways to think about data to be able to provide confidence that the data you're using is trustworthy. It is important that you become familiar with the data at each layer and create tests between each transformation to ensure consistency in the data. Becoming familiar with the data will allow you to understand what edge cases to look for as well as what trends and outliers to expect. It will usually be necessary to work with other teams and groups to improve and validate data accuracy (a quick drink never hurts to build rapport). Some ways to make this collaboration easier are to understand the focus of those you are collaborating with and to show how the data can be valuable for those teams to use themselves. Finally, you can ensure and monitor reliability through automation of process tests and acceptance tests that verify trends and boundaries and also allow the data collection processes to be converted and iterated on easily.
Further Reading
1. Peters, M. (2013). "How Do You Know If Your Data Is Accurate?" Retrieved December 12, 2016, from http://bit.ly/2gJz84p.
2. Polovets, L. (2011). "Data Testing Challenge." Retrieved December 12, 2016, from http://bit.ly/2hfakCF.
3. Chen, W. (2010). "How to Measure Data Accuracy?" Retrieved December 12, 2016, from http://bit.ly/2gj2wxp.
4. Chen, W. (2010). "What's the Root Cause of Bad Data?" Retrieved December 12, 2016, from http://bit.ly/2hnkm7x.
5. Jain, K. (2013). "Being paranoid about data accuracy!" Retrieved December 12, 2016, from http://bit.ly/2hbS0Kh.