-Values are missing! (Do I know what null or zero means? Was the data never collected, or was there none to report?) Solution: Clarify with the source.
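A minimal pandas sketch of the distinction (the column name and values are invented): count true nulls separately from zeros before deciding what either one means.

import pandas as pd

# Invented example: a column where 0 and NaN could mean different things.
df = pd.DataFrame({"accidents": [3, 0, None, 7, None, 0]})

print("missing (never collected?):", df["accidents"].isna().sum())  # 2
print("zero (none to report?):", (df["accidents"] == 0).sum())      # 2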
-Human-entered data. Example from the book: dog registration data for Cook County, Illinois gave pet owners a free-text box for the dog's breed, which produced 250 different spellings of "chihuahua". Another example is race information on paternity tests: there are options for Caucasian and Black, yet very often people circle "other" and write in white, Irish, or Haitian. Imagine getting the aggregate of this data and trying to sort through it to account for subjectivity, provenance, etc.
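A rough sketch of what you're up against with free-text fields, assuming pandas and made-up breed entries: even trivial normalization (trim, lowercase) collapses some variants, but the rest still need judgment calls.

import pandas as pd

# Made-up entries from a hypothetical free-text breed box.
breeds = pd.Series(["Chihuahua", "chihuahua ", "CHIHUAHUA", "Chiuaua", "Labrador"])

# First pass: strip whitespace and lowercase before counting variants.
cleaned = breeds.str.strip().str.lower()
print(cleaned.value_counts())  # 'chihuahua' x3, 'chiuaua' x1, 'labrador' x1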
-The spelling is inconsistent. Don't check names alone; look at words with a single definitive spelling, like city names. Inconsistent spelling is a sign that the data was edited by hand and is more likely to contain mistakes. Proceed with the dataset cautiously, and mark any corrections or changes you make. This is good practice for data provenance: knowing where your data came from and how it may have been chained with other datasets over time.
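One hedged way to surface inconsistent spellings, using only the standard library (the city names and the 0.8 cutoff are arbitrary choices): flag near-matches for review rather than silently rewriting them, and log every change you accept.

import difflib

# Hypothetical hand-entered city names; "Chicago" is the known-good spelling.
cities = ["Chicago", "Chicgo", "Cicero", "Chicago ", "Evanston"]
canonical = "Chicago"

for raw in cities:
    candidate = raw.strip()
    if candidate != canonical and difflib.get_close_matches(candidate, [canonical], cutoff=0.8):
        print(f"possible typo: {raw!r} -> {canonical!r}  (record this correction)")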
-Be aware of the data's provenance and any biases that may come with it. (If it was a survey, what was its scope? What instruments were used to measure something, and what is their margin of error? How was the data entered? Who is releasing the data?)
-Be aware of dating and naming conventions. (American vs. Japanese naming conventions; is the date written 10/9/15 or 9/10/15? Be aware of this before you publish!)
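A small pandas sketch of why the date format has to be made explicit (the string is the ambiguous example above): the same text parses to two different days depending on the assumed convention.

import pandas as pd

# "10/9/15" is October 9 under the US convention, 10 September under most others.
us_style = pd.to_datetime("10/9/15", format="%m/%d/%y")
eu_style = pd.to_datetime("10/9/15", format="%d/%m/%y")
print(us_style.date(), eu_style.date())  # 2015-10-09 2015-09-10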
-How are the categories of a set defined? (In a list of ethnicities or nationalities, what counts as "other"? How are race and ethnicity, or sex and gender, defined? Is their meaning ambiguous?) Beware of datasets where data may be artificially excluded or arbitrary (ex. rape statistics and the changing definitions of rape over time and across states, or the cost in USD of different commodities at different points in history).
-Coarse data: data that is too aggregated, or at too macro a level, to be useful for your specific context. (You wanted traffic accidents for each month of 1976; they gave you the total number of accidents in 1976.) There is no way to get more specific data on your own in this case; you need to go to the source and ask for it. Depending on record keeping and the organization's willingness, you may or may not get it. Don't try to average or estimate the data in this case, e.g. dividing the number of accidents by 12 and using that as a per-month average. This paints a misleading picture of what actually occurred and implies you know something you actually don't. (What if all the accidents happened in January?)
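A tiny numeric illustration (the monthly figures are invented) of why dividing an annual total by 12 manufactures a pattern that may never have existed:

# Suppose every accident actually happened in January.
monthly = [120] + [0] * 11       # the real distribution, which you don't have
annual_total = sum(monthly)      # 120 -- the only number the source gave you

fake_average = annual_total / 12
print(fake_average)              # 10.0 "per month" -- a steady rate that never occurred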
-Granular data: data that is too specific. If it is a small or medium-sized dataset, you can use the pivot table feature of Excel or Google Sheets to aggregate it. If the data is unusual or exceptionally large, you might need to write a custom tool to handle the task.
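For context, a pandas groupby/pivot does the same aggregation as a spreadsheet pivot table; the column names here are made up.

import pandas as pd

# Hypothetical per-incident records that are too granular on their own.
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "injuries": [1, 0, 2, 1, 0],
})

# Aggregate up to one row per month, like a pivot table would.
summary = df.pivot_table(index="month", values="injuries", aggfunc="sum")
print(summary)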
-BEWARE THESE VALUES!
If you see these numbers, they can be indicative of common computing/system/program errors. Not going to go in-depth for this presentation, but there are relevant links in the GitHub repo; a quick way to scan a dataset for them is sketched after this list.
-65,535
-2,147,483,647
-4,294,967,295
-555-3485 (phone numbers)
-long sequences of 9s or 0s

Dates:
-1970-01-01T00:00:00Z
-1969-12-31T23:59:59Z
-January 1st, 1900
-January 1st, 1904
Locations:
-0°00'00.0"N+0°00'00.0"E, or simply 0°N 0°E
-US zip code 12345 (Schenectady, New York)
-US zip code 90210 (Beverly Hills, CA)
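A quick, hedged way to scan for these sentinel values and dates with pandas (the data and the exact list of suspect values are placeholders; tune them to your dataset):

import pandas as pd

# Values that often mean overflow, an uninitialized field, or a placeholder.
SUSPECT_NUMBERS = {65535, 2147483647, 4294967295, 999999999}
UNIX_EPOCH = pd.Timestamp("1970-01-01", tz="UTC")

df = pd.DataFrame({
    "reading": [412, 65535, 388, 2147483647],
    "recorded": pd.to_datetime(["2015-03-01", "1970-01-01", "2015-03-02", "2015-03-03"], utc=True),
})

print(df[df["reading"].isin(SUSPECT_NUMBERS)])   # rows with suspect numeric values
print(df[df["recorded"] == UNIX_EPOCH])          # rows stamped exactly at the Unix epoch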
-Your spreadsheet has 65,536 rows. Older versions of Excel had this as the row limit; newer versions allow 1,048,576 rows. If you receive a dataset with exactly 65,536 rows, confirm with the source that it isn't truncated.
-Line endings are garbled/weird. This often occurs when a document is saved on one operating system and opened on another. Text files mark the end of each line with hidden control characters, and different systems use different conventions for which ones to place there. A common fix is to open the file in any generic text editor and resave it.
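If resaving in an editor isn't convenient, a minimal Python fix (file names are placeholders) is to normalize the line-ending characters directly:

# Convert Windows (\r\n) and old Mac (\r) line endings to Unix (\n).
with open("data.csv", "rb") as f:
    raw = f.read()

fixed = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

with open("data_fixed.csv", "wb") as f:
    f.write(fixed)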
-Data that needs to be extracted from a PDF. A lot of government data (that isn't scanned) is available in .pdf format. If there is textual data or a table you'd like to extract, good tools are Tabula or Adobe Acrobat Pro's "export table to Excel" feature.
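If you'd rather script it, the tabula-py wrapper around Tabula can pull tables straight into DataFrames; this sketch assumes tabula-py is installed (it needs Java) and "report.pdf" is a placeholder filename.

import tabula

# Returns a list of DataFrames, one per table Tabula detects.
tables = tabula.read_pdf("report.pdf", pages="all")
for i, table in enumerate(tables):
    table.to_csv(f"table_{i}.csv", index=False)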
-The methodology by which the data was collected was faulty:
-The sample was not random: the survey failed to cover a representative cross-section of the population. Solution: don't use it.
-The margin of error is either larger than 10% (rule of thumb) or completely unknown. Be wary of the former and avoid the latter at all costs.
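As a reference point, the standard margin-of-error formula for a survey proportion under simple random sampling (the sample size here is invented) shows how sample size drives that 10% rule of thumb:

import math

n, p, z = 400, 0.5, 1.96          # 400 respondents (made up), worst-case p, 95% confidence
moe = z * math.sqrt(p * (1 - p) / n)
print(f"{moe:.1%}")               # about 4.9% -- comfortably under the 10% rule of thumb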
-The timeframe has either been manipulated or carelessly misrepresented to skew context. The book uses the example of a nonexistent 2015 national crime wave: the study highlighted spikes in violent crime in select cities compared to the last few years. The limited dataset exaggerated those spikes, while crime was much higher across the country ten years prior, and nearly double twenty years before. Solution: if you have a limited dataset, try starting several years/months/days into the data so that the whole set isn't thrown off by a single addition.
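A toy comparison (all counts invented) of how the chosen window changes the story, in the spirit of the 2015 "crime wave" example:

crime = {1995: 1800, 2005: 1400, 2013: 950, 2014: 900, 2015: 980}

print("vs 2014:", (crime[2015] - crime[2014]) / crime[2014])  # +8.9% -- looks like a wave
print("vs 1995:", (crime[2015] - crime[1995]) / crime[1995])  # -45.6% -- long-term decline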

-Another way data can be manipulated is by the frame of reference. If I'd like to make literacy in my school district look good, I could compare it to one of the bottom school districts in the country; if I wanted to make it look bad, I could compare it to one of the top-scoring ones. Either way, when measuring a change in data you have to ask what it is being compared against, and whether that is an appropriate choice. A good way around this is to compare something against several varied starting points and see how the results change. This is also something to be wary of when you're researching a politicized topic; it is very easy to succumb to confirmation bias and go with the first result that agrees with your presuppositions.
-Was the collection process of the data transparent? It can be difficult to know exactly how data was collected, but if you find data with an unlikely amount of precision (factory emissions accurate to seven decimal places) or that appears to measure something impossible (percentages of global public opinion), chances are it is an estimate rather than empirical data. Your best bet is to find one or several experts to weigh in on the matter.
-Inexplicable outliers in a dataset usually indicate something went wrong in the collection or production process. When using a dataset, take a look at the smallest and largest values and make sure the range between the two is reasonable.
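A quick min/max sanity check in pandas (the ages are invented, with one impossible value) is often enough to surface these outliers:

import pandas as pd

ages = pd.Series([34, 29, 41, 388, 52])
print(ages.min(), ages.max())   # 29 388 -- the max is not a plausible age
print(ages.describe())          # percentiles show how isolated the outlier is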
