Numbers:
-4,294,967,295 (the maximum 32-bit unsigned integer; a common overflow or error value)
-555-3485 (555 numbers are reserved as fictional placeholder phone numbers)
-Long sequences of 9s or 0s (common placeholder codes for missing data)
Dates:
-1970-01-01T00:00:00Z (the Unix epoch, i.e. a timestamp of zero)
-1969-12-31T23:59:59Z (one second before the Unix epoch, i.e. a timestamp of -1)
-January 1st, 1900 (the epoch of Excel's default date system)
-January 1st, 1904 (the epoch of Excel's Macintosh date system)
Locations:
-0°00'00.0"N 0°00'00.0"E, or simply 0N 0E ("Null Island", where failed geocoding often places records)
-US zip code 12345 (Schenectady, New York)
-US zip code 90210 (Beverly Hills, California)
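The placeholder values above can be screened for automatically. A minimal sketch in Python; the sentinel list is illustrative, not exhaustive:

```python
def suspicious_values(values):
    """Return the subset of `values` that look like common placeholders."""
    sentinels = {
        "4294967295",                 # 2**32 - 1, the max 32-bit unsigned integer
        "1970-01-01T00:00:00Z",       # Unix epoch
        "1969-12-31T23:59:59Z",       # one second before the epoch
        "1900-01-01", "1904-01-01",   # Excel date-system epochs
        "0N 0E",                      # Null Island
    }
    flagged = []
    for v in values:
        s = str(v).strip()
        if not s:
            continue
        all_nines = len(s) >= 4 and set(s) <= {"9"}
        all_zeros = len(s) >= 4 and set(s) <= {"0"}
        if s in sentinels or all_nines or all_zeros:
            flagged.append(v)
        elif "555-" in s:             # fictional 555 phone numbers
            flagged.append(v)
    return flagged
```

This only catches exact or pattern matches; it is a first pass, not a substitute for eyeballing the data.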
-Your spreadsheet has 65,536 rows. Older versions of Excel used this as the row limit; newer versions allow
1,048,576 rows. If you receive a dataset with exactly 65,536 rows, confirm with the source that it isn't truncated.
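This truncation check is easy to automate. A sketch (the CSV path is hypothetical; the limits are Excel's documented row maximums, 2**16 and 2**20):

```python
# Sketch: warn when a CSV's line count equals a known Excel row limit,
# which suggests the file may have been truncated on export.
import csv

EXCEL_ROW_LIMITS = {65_536, 1_048_576}  # legacy and current Excel limits

def possibly_truncated(path):
    """True if the file's row count exactly matches an Excel row limit."""
    with open(path, newline="") as f:
        rows = sum(1 for _ in csv.reader(f))
    return rows in EXCEL_ROW_LIMITS
```

A match doesn't prove truncation, only that it is worth asking the source.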
-Line endings are garbled. This often occurs when a document is saved on one operating system
and opened on another: text files mark the end of each line with hidden control characters, and
different systems use different ones (Windows uses CRLF, Unix-like systems use LF). A common fix is
to open the file in any generic text editor and resave it.
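The resave trick can also be done programmatically. A sketch that normalizes Windows (CRLF) and old-Mac (CR) endings to Unix-style LF:

```python
def normalize_line_endings(text):
    """Convert \r\n and lone \r line endings to \n."""
    # Replace CRLF first so the lone-CR pass doesn't double-convert it.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```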
-Data needs to be extracted from a PDF. A lot of government data (that isn't scanned) is available in
.pdf format. If there is textual data or a table you'd like to extract, good tools are Tabula or Adobe
Acrobat Pro's 'export table to Excel' feature.
-The methodology by which the data was collected was faulty:
-The sample was not random: the survey failed to cover a representative cross-section of the
population. Solution: don't use it.
-The margin of error is either larger than 10% (a rule of thumb) or completely unknown. Be
wary of the former and avoid the latter at all costs.
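For a simple random sample, the margin of error at 95% confidence can be approximated from the sample size, which makes the 10% rule of thumb easy to check. A sketch, assuming a proportion near 50% (the worst case):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    n: sample size; p: assumed proportion; z: z-score (1.96 ~ 95% confidence).
    """
    return z * math.sqrt(p * (1 - p) / n)
```

By this approximation, a sample of roughly 96 respondents already brings the margin of error down to about 10%, and around 1,000 respondents to about 3%.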
-The timeframe has either been manipulated or carelessly misrepresented to skew context. The book uses
the example of a nonexistent 2015 national crime wave: the study highlighted spikes in violent crime
in select cities compared to the last few years, and the limited dataset exaggerated those spikes,
whereas crime was much higher across the country ten years prior, and nearly double twenty years
before. Solution: if you have a limited dataset, try starting several years/months/days into the data
so the whole set isn't thrown off by a single addition.
-Another way data can be manipulated is by the frame of reference. If I wanted to make literacy rates
at my school look good, I could compare them to one of the bottom school districts in the country; if I
wanted to make them look bad, I could compare them to one of the top-scoring districts. Either way,
when measuring a change in data, one has to ask what it is being compared against, and whether that is
an appropriate choice. A good way around this is to compare something against several varied starting
points and see how the results change. This is also something to be wary of when you're researching a
politicized topic; it is very easy to succumb to confirmation bias and go with the first result that
agrees with your presuppositions.
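The "several varied starting points" advice can be made concrete: compute the same change against each candidate baseline and see how the story shifts. A sketch with made-up literacy scores (all numbers here are illustrative assumptions):

```python
def percent_change(value, baseline):
    """Percent change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline

# Hypothetical scores: the same school looks very different
# depending on which baseline it is compared against.
school = 72.0
baselines = {
    "bottom district": 48.0,
    "national average": 70.0,
    "top district": 90.0,
}
comparison = {name: round(percent_change(school, b), 1)
              for name, b in baselines.items()}
```

Here the same score reads as +50% against the bottom district, +2.9% against the national average, and -20% against the top district, which is exactly why the baseline choice matters.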
-Was the collection process of the data transparent? It can be difficult to know exactly how data was
collected, but if you find data with an unlikely amount of precision (factory emissions measured to
seven decimal places) or that appears to measure something impossible (percentages of global
public opinion), chances are it is an estimate rather than empirical data. Your best bet is to find one
or several experts to weigh in on the matter.
-Inexplicable outliers in a dataset usually indicate that something went wrong in the collection or
production process. When using a dataset, look at the smallest and largest values and make sure
the range between them is reasonable.
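A minimal range sanity check along these lines; the plausible bounds are assumptions you would set per dataset (here, a hypothetical column of human ages):

```python
def range_check(values, low, high):
    """Return (min, max, ok): ok is True when both extremes fall in [low, high]."""
    lo, hi = min(values), max(values)
    return lo, hi, low <= lo and hi <= high
```

A failing check doesn't tell you what went wrong, only which extreme value to go investigate.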