Data Handling Best Practices
By Brian Verdine


Over the course of my graduate career I noticed that many of the same data handling issues came up again and again as inexperienced researchers cycled into the lab and experienced researchers left. I created this document as a training exercise for incoming graduate students to help them avoid some of the common pitfalls of data handling. I hope it will be educational for anyone doing research that requires handling and analyzing data sets.

I am VERY open to input on this list and would like to continue to flesh this out. It is by no means
comprehensive or complete. This list was developed to serve psychology research labs, and may be
more or less applicable in other fields of study, although I did try to avoid suggestions that might not be
generally agreed upon. If you do have a disagreement or an addition, please comment and I will
continue to edit/revise this posting.

Unfortunately, storing data in Excel and exporting it into SPSS for analysis has become the combination most commonly used in psychology and many other fields. These programs are widespread, relatively cheap, and easy to train people to use with at least moderate proficiency. However, entering and manipulating data in Excel and then importing it into SPSS is also error-prone. Consider other options if they are available to you.


General Data Handling

Data is precious and irreplaceable
o Back it up and store data in multiple locations.
o Every time you open a datasheet and do something in it, there is the potential that you
could make a mistake that might inadvertently misalign your data with the correct
variable/participant.
At best this might invalidate the data of one participant.
At worst, this can be catastrophic to the hope of finding any interesting results
from the dataset. It also may be very difficult to even detect, let alone correct.
BE CAREFUL!!!!!!! RESPECT THE DATA!!!!!!
o Copying and pasting is often necessary but is typically not the safest way to move data.
If you find yourself repeatedly copying/pasting, there is probably a way to
automate the process that will be both faster and more accurate (see the
sketch below).
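For example, instead of copying and pasting data from Excel into SPSS by hand, the transfer can be scripted so that it is documented and repeatable. A minimal sketch in SPSS syntax, assuming a hypothetical file path and sheet name:

  * Import the Excel datasheet via syntax so the transfer is documented and repeatable.
  GET DATA
    /TYPE=XLSX
    /FILE='C:\MyStudy\MyStudy_Data.xlsx'
    /SHEET=NAME 'Data'
    /READNAMES=ON.
  * Immediately save an SPSS copy so the original Excel file stays untouched.
  SAVE OUTFILE='C:\MyStudy\MyStudy_Data.sav'.

Re-running these lines after the Excel sheet is corrected re-creates the SPSS file without any manual copying.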
At the beginning of a study, choose a short, unique name for the study and for each of its
conditions. Stick to these names when labeling datasheets and any other piece of data
created by the study.
o Include these names at the start of every filename so that if a file ends up outside of a
folder or mixed into the data from another study, you can immediately identify where it
belongs and re-organize it.
You have no idea who else might use your data in the future, or how you yourself might end up
using it. Do not take shortcuts in documenting data sources, variable names and their meanings, or in organizing a dataset and its
data files. These shortcuts only save time in the short term, and bad documentation can turn an easy
follow-up project into an impossible task.
Any time you are on a computer and you say to yourself, "there has to be a better way to do
this," there probably is a better way. Take some time, Google a few things, and teach yourself.
It may take more time to solve the problem the first time, but every time you use that
knowledge later you reap the benefits.


Data Sheet Management

Using Excel formulas to calculate things is dangerous (see below). Unless you immediately NEED
the values you want to calculate, do not set up Excel sheets to do these calculations. It just adds
unnecessary columns and potential for error. Do calculations in a statistical package using
syntax, so that you can run the calculation for the variable, document how you calculated it, and
recalculate it as needed with the press of a button (see the sketch after this item).
o Why is this important?
Calculations in Excel and other programs are fragile. In Excel, formula
cells that reference another cell behave differently depending on whether you copy/paste or
cut/paste the cells being referenced. People in the lab who did not set up
the database may not understand this distinction, or may not even know that a
cell is being referenced. This often results in calculated cells accumulating
errors as people move data around in the sheet.
Depending on the version of Excel, how the sheet was initially set up, and the
number of calculations in the sheet, calculated values may not update as expected when
data is changed or added.
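A minimal sketch of what this looks like in SPSS syntax (the variable names here are hypothetical placeholders):

  * Compute the total score from the raw item scores; the calculation is documented right here.
  COMPUTE TotalScore = Item1 + Item2 + Item3.
  VARIABLE LABELS TotalScore 'Sum of Item1 to Item3, computed in syntax from the raw items'.
  EXECUTE.

If a raw item is later corrected, re-running these lines recalculates TotalScore; nothing in the datasheet has to be fixed cell by cell.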
Excel is really NOT a great tool for managing large datasets, especially in large team
environments, because of how easy it is to accidentally change data you did not intend
to touch.
o Imagine what would happen if someone sorted your database by participant number
but failed to select the rest of your variables.
o There are other software programs and online solutions (e.g., REDCap) that address this
issue and ease many of the other problems associated with entering, formatting,
storing, exporting, and analyzing data. Use these when setting up a database for your
studies (it is always harder to change later).
o Did I mention that REDCap is a great tool? (http://project-redcap.org/)


Database Setup
Document any and all variables:
o If using Excel, part of the column heading for any data sheet should always be dedicated
to describing what type of coding was applied to the data and what the numbers in the
spreadsheet mean, even if you think it is painfully obvious. Do not put this in another file
or another location, since the data is worthless without proper documentation. I would
also recommend NOT using Excel comments to do this. If you put the descriptions in a row
above the variable names, it will be easier to import them into other statistical packages
(e.g., to get these detailed variable labels into SPSS from Excel you can simply copy the
row, paste it transposed into a column, and copy/paste it into the Variable View tab of SPSS). It is
not easy to copy/paste data out of comments.
o When analyzing data in SPSS or other statistical packages, take the time to enter
variable labels and value labels as you are setting up the database (see the sketch at
the end of this list). It may not seem like it, but it will probably save time in the long run
(especially if you are working hard, as you should be, to ensure that the statistics you run are reproducible).
o Don't think you're going to know what "SESvarRaw" means 2 years from now
(or even 2 weeks from now); document it.
o Name variables consistently when using the same variables across different studies.
This allows you to re-use syntax or to easily combine/compare data across studies.
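As a sketch of the documentation syntax referred to above (the variable names, labels, and codes are hypothetical):

  * Document what each variable is and what its numeric codes mean.
  VARIABLE LABELS
    SESvarRaw 'Socioeconomic status, raw composite from the parent questionnaire'
    Cond 'Experimental condition'.
  VALUE LABELS
    Cond 1 'Control' 2 'Training'.

Because these labels live in the syntax file, they can be re-applied to a fresh export of the data in one run.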
When you are setting up your datasheet, think about how you are going to analyze the
data and enter the data accordingly. You do not want to have to recode a lot of data before
putting it into a statistical package (e.g., do not type in "yes"/"no"; use 1s and 0s if possible).
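If "yes"/"no" has already been typed in, the cleanup it creates looks something like this in SPSS syntax (the variable names are hypothetical):

  * Recode the string yes/no responses into a numeric 0/1 variable, keeping the original column.
  RECODE Consent ('no'=0) ('yes'=1) INTO ConsentNum.
  VALUE LABELS ConsentNum 0 'No' 1 'Yes'.
  EXECUTE.

Entering 1s and 0s in the first place makes this step unnecessary (and avoids problems like "Yes" vs. "yes", since string matching is case-sensitive).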
Use unique subject numbers for every participant, preferably using a scheme that is unlikely to
accidentally be repeated (e.g., chronological numbering is easy to mess up if someone doesn't
write down a participant, but initials and the date of testing are much less likely to repeat or be
assigned incorrectly).
It is almost never a good idea to mix numerical data with string data (e.g., letters or words)
within the same variable column, since most statistical packages can't analyze the data without
it being coded into numbers.
0 should always be used to signify the null and should therefore always correspond to a
response of "no" for any yes/no question (with 1 being equal to "yes").
When coding data, unless an answer on a scale literally means the null (or "not at all"), it is
usually better to start numbering/coding the scale at 1 instead of 0, to avoid any confusion
about the meaning of the baseline of the scale.
Sex is most often quantified with 0s and 1s.
o The convention is generally 0 = female, 1 = male, and you can clarify what the data
means by using "Male?" as the header for the data in the spreadsheet instead of "Sex"
or "Gender".
Since we just established that 0 stands for "no", answering 0 to the question
"Male?" means the participant must be female.
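In an SPSS datasheet the same convention might look like the following (the variable is hypothetical, and is named Male rather than Male? because SPSS variable names cannot contain question marks):

  VALUE LABELS Male 0 'No (female)' 1 'Yes (male)'.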
Agree on a convention for labeling missing data if it is necessary to label it:
o It is typically OK to leave missing data as blanks in a datasheet, unless it is important to
know why the data is missing so that it can be recovered/replaced/analyzed differently.
Missing data should be labeled with something that can easily be
found/replaced using software (i.e., use something unique and unlikely to
appear in the same format as part of another word in the datasheet).
NEVER use 0s for missing data, since 0s almost always have a different meaning
in a dataset.
o SPSS has the ability to specify codes for missing values, which makes it low-cost to
track them and exclude them from analyses if necessary.
As of this writing, I think the codes have to be numeric if the data is numeric, so
choose a number that has more digits than other data points
in the data set and that is very unlikely to appear in the data (e.g.,
999999).
You can specify multiple missing-value codes too, so you can retain more
information about why data points are missing if that is important (i.e., 999999 can be set
to mean something different from 999998).
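A sketch of how those codes are declared in SPSS syntax (the variable name and the reasons are hypothetical):

  * Declare the missing-data codes so SPSS excludes them from analyses.
  MISSING VALUES ReadingScore (999999, 999998).
  * Label the codes so the reason each value is missing stays documented.
  VALUE LABELS ReadingScore 999999 'Missing - experimenter error' 999998 'Missing - child refused'.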


Respect the Data
(Almost) NEVER recode continuous data (e.g., age in years) into categorical data (e.g., old vs.
young). If you do, be sure to retain the original data and clearly mark what both sets of data
mean (see the sketch after this item).
o Coding continuous data loses information, and losing information is almost always a bad
idea.
o Lost information is often difficult or impossible to reconstruct, so always retain your
original data.
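If a categorical version really is needed, create it as a new, clearly labeled variable and keep the continuous one. A minimal sketch in SPSS syntax (the cut-off and variable names are hypothetical):

  * Derive an age-group variable WITHOUT overwriting the continuous AgeYears variable.
  RECODE AgeYears (LOWEST THRU 4.99=1) (5 THRU HIGHEST=2) INTO AgeGroup.
  VALUE LABELS AgeGroup 1 'Younger (under 5 years)' 2 'Older (5 years and up)'.
  EXECUTE.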
Deleting variables from a datasheet is (almost) NEVER a good idea unless the data is incorrect or
derived from raw data that you are retaining.
o Using syntax to call variables during analysis makes the size of a datasheet mostly
irrelevant and reduces the cost of having to sort through a long list of unnecessary
variables.
Don't delete variables from a main database to make analysis easier. Make
analysis easier by using syntax, which has the added benefit of documenting your
analyses and making them replicable.
Syntax also gives you a shortcut for working with a smaller data sheet, since you
can make a copy of the main datasheet, delete the variables you don't need,
write the syntax using that subset of the variables, and then apply that syntax to
the original (main) datasheet.
The one major benefit of SPSS is that novice users can use the
user interface to set up an analysis and then view the syntax that it used
to create the analysis (see the example below). By copying/pasting that syntax into your syntax
file you can document everything you have done, and by replacing
variable names within chunks of the syntax you can add to or repeat
similar analyses very easily. Because SPSS will show you the syntax it is
using when you create an analysis in the user interface, the cost of
entry for learning to use syntax in SPSS is incredibly low. Seriously, if
you are using SPSS and you learn one thing from this document, please
start using syntax unless you REALLY like wasting your time. There is no
good excuse to ignore this advice. "Not being good at computers" is not a
good excuse; if you can learn the statistical knowledge necessary to
use SPSS properly, you are more than smart enough to slowly
learn to master syntax. I promise!
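To make the pasted-syntax workflow concrete: running Frequencies from the Analyze menu and clicking Paste instead of OK produces syntax along these lines (the variable names are hypothetical), which can then be saved, re-run, and edited to cover other variables:

  FREQUENCIES VARIABLES=Cond Male AgeYears
    /ORDER=ANALYSIS.
  * The same command, copied and edited to repeat the check on other variables.
  FREQUENCIES VARIABLES=TotalScore ReadingScore
    /ORDER=ANALYSIS.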
