Good data practices

A John Michael Raj


Lecturer
Department of Biostatistics
St. John’s Medical College
Bangalore
Data Collection
– What you need to know: numbers or stories

– Where the data reside: environment, files, people

– Resources and time available

– Complexity of the data to be collected

– Frequency of data collection

– Intended forms of data analysis

Data Collection Methods

Primary data collection

Secondary data collection

Secondary Data
• Use available data, but need to know

– how the measures were defined

– how the data were collected and cleaned

– the extent of missing data

– how accuracy of the data was ensured

Primary Data
• Original data:
– be sensitive to burden on others
– pre-test, pre-test, pre-test
– establish procedures and follow them (protocol)
– maintain accurate records of definitions and coding
– verify accuracy of coding, data input

Case report form
Official clinical data-recording document or tool used in a clinical study

Paper or RDC/RDE (Remote Data Capture, Remote Data Entry)

CRF
Collects relevant data in a specific format, in accordance with the protocol

Allows for efficient and complete data processing, analysis and reporting

Facilitates the exchange of data across projects and organizations, especially through standardization

Developing CRF
– Collect data with all users in mind

– Collect data outlined in the protocol

– Be clear and concise with your data questions

– Avoid duplication

– Request minimal free text responses

Developing CRF (contd.)
– Provide units to ensure comparable values

– Provide instructions to reduce misinterpretations

– Provide “choices” for each question

• allows for computer summarization

– Use “None” and “Not done”

Sample CRF

DATA ENTRY
Processing of questionnaires or CRFs before data entry
• Questionnaires should be labeled with a unique case number (an ID)

• A code book describes the name, meaning, and coding of each variable and its categories
– With numerical information, don’t make any calculations before data entry (e.g. BMI)
– Record dates; don’t calculate age before data entry
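
The reason for entering only raw values is that derived ones can always be recomputed later, consistently and without arithmetic slips. A minimal Python sketch, assuming a record that holds weight, height, and two dates; the function names and the `dob` field are illustrative and not part of any particular package:

```python
from datetime import date

def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index from the raw weight (kg) and height (cm) as entered."""
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 1)

def age_in_years(date_of_birth: date, date_of_assessment: date) -> int:
    """Completed years between two recorded dates."""
    had_birthday = (date_of_assessment.month, date_of_assessment.day) >= (
        date_of_birth.month, date_of_birth.day)
    return date_of_assessment.year - date_of_birth.year - (0 if had_birthday else 1)

# Hypothetical record holding only raw, as-collected values
record = {"wt": 70.0, "ht": 175.0, "dob": date(1980, 8, 25), "doa": date(2012, 8, 24)}
print(bmi(record["wt"], record["ht"]))
print(age_in_years(record["dob"], record["doa"]))
```

Because the raw height, weight, and dates are stored, a mistake in either formula can be corrected after the fact without going back to the paper forms.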
Code Book
• The name assigned to the variable
• What the variable represents (i.e., its label)
• How the variable was measured (e.g. nominal,
ordinal, scale)
• For scale variables: The variable's units of
measurement
• For categorical variables: If coded numerically,
the numeric codes and what they represent
VARIABLE NAME | VARIABLE LABEL                      | VALUE LABEL                                           | TYPE    | LEGAL RANGE | MEASURE
idno          | Identification number               |                                                       | String  |             |
doa           | Date of assessment                  |                                                       | Date    |             | Date
age           | Age (years)                         |                                                       | Numeric | 20-60       | Continuous
sex           | Gender                              | 1-male, 2-female                                      | Numeric | 1, 2        | Nominal
ms            | Marital status                      | 1-unmarried, 2-married, 3-widow, 4-separated/divorced | Numeric | 1-4         | Nominal
smk           | Smoking                             | 0-no, 1-yes                                           | Numeric | 0-1         | Nominal
smknum        | Number of cigarettes smoked per day |                                                       | Numeric |             | Discrete
alc           | Alcohol consumption                 | 0-no, 1-yes                                           | Numeric | 0-1         | Nominal
gravida       | Gravida                             |                                                       | Numeric |             | Ordinal
ht            | Height (cm)                         |                                                       | Numeric | 100-200     | Continuous
wt            | Weight (kg)                         |                                                       | Numeric | 30-100      | Continuous
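
A code book is also useful in machine-readable form, because entered records can then be checked against it. The sketch below is one possible layout and only an assumption: it copies a few rows of the table into a Python dictionary and reports values that are undefined codes or fall outside the legal range.

```python
# A subset of the code book above; the dictionary layout is an illustrative assumption.
CODEBOOK = {
    "age": {"label": "Age (years)", "range": (20, 60)},
    "sex": {"label": "Gender", "values": {1: "male", 2: "female"}},
    "smk": {"label": "Smoking", "values": {0: "no", 1: "yes"}},
    "ht": {"label": "Height (cm)", "range": (100, 200)},
    "wt": {"label": "Weight (kg)", "range": (30, 100)},
}

def check_record(record: dict) -> list[str]:
    """Return a list of problems found when comparing a record with the code book."""
    problems = []
    for name, value in record.items():
        spec = CODEBOOK.get(name)
        if spec is None:
            problems.append(f"{name}: not in code book")
        elif value is None:
            continue  # blank cell = missing value, accepted in this sketch
        elif "values" in spec and value not in spec["values"]:
            problems.append(f"{name}: {value!r} is not a defined code")
        elif "range" in spec and not spec["range"][0] <= value <= spec["range"][1]:
            problems.append(f"{name}: {value!r} outside legal range {spec['range']}")
    return problems

print(check_record({"age": 45, "sex": 3, "ht": 300, "wt": None}))
```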
DATABASE MANAGEMENT SOFTWARE
• DBMS
• OpenClinica
• Oracle Clinical
• SAS
• SPSS
• EpiData
• Microsoft Access, Excel
DATA ENTRY USING MICROSOFT EXCEL
Errors in data entry
• Transposition (e.g. 39 becomes 93)

• Copying errors (e.g. the digit 0 copied as the letter “o”)

• Consistency errors: two or more responses are contradictory (e.g. sex = male, pregnancy = yes)

• Range errors: answers outside of probable or possible values (e.g. height = 3 metres)
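
Three of these error types can be screened for automatically. The sketch below is a hedged illustration in Python; the field names (`height_cm`, `sex`, `pregnant`) and the 100-200 cm limits are assumptions chosen for the example.

```python
def find_entry_errors(row: dict) -> list[str]:
    """Screen one entered row for copying, range, and consistency errors."""
    errors = []

    # Copying error: a letter such as "o" typed in place of the digit 0
    height_raw = str(row.get("height_cm", "")).strip()
    height = None
    if height_raw:
        try:
            height = float(height_raw)
        except ValueError:
            errors.append(f"height_cm: {height_raw!r} is not a number")

    # Range error: value outside the plausible limits (assumed 100-200 cm)
    if height is not None and not 100 <= height <= 200:
        errors.append(f"height_cm: {height} outside 100-200")

    # Consistency error: two answers that contradict each other
    if row.get("sex") == "male" and row.get("pregnant") == "yes":
        errors.append("sex/pregnant: male recorded as pregnant")

    return errors

print(find_entry_errors({"height_cm": "17o", "sex": "female", "pregnant": "no"}))
print(find_entry_errors({"height_cm": "300", "sex": "male", "pregnant": "yes"}))
```

A transposition such as 39 entered as 93 can stay inside the legal range, so checks like these will not catch it; comparing two independent entries of the same forms is the usual safeguard.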

Requirements for successful data preparation in Excel
• Place all variable names in the first row only

• The variable name in each column should be
– unique, simple, 1-word name
– 8 characters or less with no spaces, beginning with a letter

• Variables are either numeric or character

• Do not combine character data with numeric data in the same column
– Do not put “NA”, “will get”, “<20”, or “?” in a numeric column
– Do not use a dot to represent missing data in a numeric column
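
The naming rules above can be checked mechanically before a sheet is handed over for analysis. A minimal Python sketch follows; the function name is hypothetical, and treating names that differ only in case as duplicates is an assumption.

```python
import re

# Letter first, then letters, digits, or underscores; at most 8 characters; no spaces
NAME_RULE = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,7}")

def check_variable_names(header: list[str]) -> list[str]:
    """Return a list of problems found in the first-row variable names."""
    problems = []
    seen = set()
    for name in header:
        if not NAME_RULE.fullmatch(name):
            problems.append(f"{name!r}: use 1-8 characters, no spaces, starting with a letter")
        if name.lower() in seen:
            problems.append(f"{name!r}: duplicate name")
        seen.add(name.lower())
    return problems

print(check_variable_names(["idno", "date of birth", "2ndvisit", "age", "Age"]))
```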

Requirements for successful data preparation in Excel (contd.)
• Missing numeric data should have blank cells

• Be sure Excel stores your numeric data as numbers and not as text

• Delete ALL extraneous columns and rows (e.g. summary statistics, notes, coding key)

• Check your date formats. A date may look right in Excel, but it will be imported according to its internal representation, which may not be in the right format
– You can use DD/MM/YYYY
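
One way to see these requirements in action is to parse cell contents strictly, so that anything that is not a number or a DD/MM/YYYY date is rejected rather than silently imported. The Python sketch below is illustrative only; the helper names are hypothetical.

```python
from datetime import date, datetime
from typing import Optional

def parse_number(cell: str) -> Optional[float]:
    """Blank cell -> None (missing); text such as 'NA' or '<20' raises ValueError."""
    cell = cell.strip()
    return None if cell == "" else float(cell)

def parse_ddmmyyyy(cell: str) -> Optional[date]:
    """Blank cell -> None; anything that is not a valid DD/MM/YYYY date raises ValueError."""
    cell = cell.strip()
    return None if cell == "" else datetime.strptime(cell, "%d/%m/%Y").date()

print(parse_number(" 72.5 "))
print(parse_ddmmyyyy("24/08/2012"))

for bad in ["NA", "<20", "."]:
    try:
        parse_number(bad)
    except ValueError:
        print(f"{bad!r} rejected in a numeric column")
```

A date typed as month/day/year, such as 08/24/2012, fails this parse because 24 is not a valid month. An ambiguous value such as 05/06/2012 cannot be caught this way, which is why the format has to be agreed before entry starts.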

Recommendations for efficient data management and
analysis

• One row per case

• Don’t waste columns combining other columns (e.g. height, weight, BMI)

• Keep variable names short and unique. Start with a letter and use only letters, numbers, and underscores. No spaces. Lowercase is OK

• Be completely and utterly consistent (e.g. M, m, F, f would be read as 4 genders)

• For yes/no variables, it is helpful to use 1 = yes and 0 = no

• Missing character and numeric data should have blank cells
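
The consistency point can be made concrete with a short Python sketch. The column contents are hypothetical, and the 1-male, 2-female codes are assumed for the example.

```python
raw_sex = ["M", "m", "F", "f", "M", "F"]  # hypothetical column, as typed by different people

# Statistical software treats each distinct spelling as its own category
print(sorted(set(raw_sex)))

SEX_CODES = {"m": 1, "f": 2}  # assumed coding: 1-male, 2-female

def recode_sex(value: str):
    """Map any spelling to one numeric code; blank stays missing (None)."""
    value = value.strip().lower()
    if value == "":
        return None
    return SEX_CODES[value]  # raises KeyError for an unexpected entry

print([recode_sex(v) for v in raw_sex])
```

Recoding afterwards works, but entering one agreed code from the start avoids the cleaning step altogether.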


Recommendations for efficient data management and analysis (contd.)

• Enter cases and controls or treatment groups in the same spreadsheet and under one variable (column)
(TREATED: 0=no, 1=yes; GROUP: 1=Drug A, 2=Drug B)

• Create a simple guide (or code key) using a word processor to explain variable labels, value coding, and how missing values were entered. Be consistent

• Think about the analysis before collecting any data

• Have a biostatistician review the coding before data entry and again after the first 10 patients have been entered

Use of EpiData

Questionnaire design and entry

Introduction to EpiData

– Data entry and documentation

– Free program

– Based on EpiInfo

– Windows format

– No limit on the number of observations (tested with > 100,000)
What is EpiData

• EpiData is a program for data entry and documentation of data

• EpiData is limited in its ability to analyze data

• EpiData creates a database as a .REC file (and implements the EpiInfo version 6 file structure)

• The software is FREE!
- Available at http://www.epidata.dk

EpiData (I)

– Creating questionnaire

– Controlled data entry

– Documenting and printing data

– Comparing 2 data files

– Importing and exporting data

– Simple analysis
EpiData (II)

• Simple surveys – one questionnaire

• Complicated surveys – a few questionnaires

If the questionnaires share an ID, the data can be merged
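
The idea behind merging on an ID can be sketched in a few lines of Python. This illustrates the principle only, not how EpiData performs the merge; the field names and records are hypothetical.

```python
def merge_by_id(first: list[dict], second: list[dict], key: str = "idno") -> list[dict]:
    """Join records from two questionnaires that share the same ID field."""
    second_by_id = {row[key]: row for row in second}
    merged = []
    for row in first:
        match = second_by_id.get(row[key], {})  # no match: keep the first record as it is
        merged.append({**row, **match})
    return merged

baseline = [{"idno": "A01", "age": 34}, {"idno": "A02", "age": 51}]
followup = [{"idno": "A01", "wt": 62}]
print(merge_by_id(baseline, followup))
```

The merge is only as reliable as the ID: a mistyped or duplicated ID attaches follow-up data to the wrong person, which is why the unique case number matters so much.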

EpiData files

• .QES file
- Questionnaire

• .REC file
- Actual data

• .CHK file
- Any defined checks

• Other notes or log files


EpiData workflow

1. Define Data
2. Make Data File
3. Set up Checks
4. Enter Data
5. Document
6. Export Data

Creating questionnaire (1)
• Define data
• Can either open .QES file or create one

Creating questionnaire (2)

• Type in Microsoft Word

• Cut and paste from Word documents

• Preview questionnaire
-(click Make data file ► preview data form)

Structure of questionnaire

Three sections:

- Field name (variable)

- Text describing field

- Input definition (data format: number/letters/date)

Field name (variable)

• No more than 8 characters

• Begin with a letter

• No spaces or punctuation marks

• No underscore

Field Name (II)
• Automatic:
• If no variable name is specified, EpiData automatically generates the variable name based on the question

• It uses the first 10 letters of the question

Automatic field names

• Text in curly brackets { } used in preference

• Common words skipped (what, the, and, etc.)

• If the question starts with a number, “N” is inserted before the number

Automatic field names examples

Question:                    Field name:

Did you {eat ice cream}      EATICECREAM
What is your name?           ISYOURNAME
2.Age                        N2AGE
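
The naming rules can be approximated in Python as follows. This is a rough sketch written to reproduce the three examples above, not EpiData's actual implementation. The skip list contains only the words named on the slide, and leaving text from curly brackets untruncated is an assumption made so that the first example comes out as shown.

```python
import re

COMMON_WORDS = {"what", "the", "and"}  # only the words named above; the full list is not given

def auto_field_name(question: str, max_len: int = 10) -> str:
    """Approximate an automatic field name for a questionnaire line."""
    braced = re.findall(r"\{(.*?)\}", question)
    if braced:
        # Text in curly brackets is used in preference to the rest of the question.
        # Assumption: it is kept whole, which reproduces the EATICECREAM example.
        return re.sub(r"[^A-Za-z0-9]", "", "".join(braced)).upper()
    words = re.findall(r"[A-Za-z0-9]+", question)
    kept = [w for w in words if w.lower() not in COMMON_WORDS]
    name = "".join(kept).upper()
    if name[:1].isdigit():
        name = "N" + name  # a leading number gets an "N" in front
    return name[:max_len]  # first 10 letters

for q in ["Did you {eat ice cream}", "What is your name?", "2.Age"]:
    print(q, "->", auto_field_name(q))
```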

Variable type
• Define variable types using “Pick list” or “code
writer”

• Choose type of variable:


- Numeric
- Text
- Date
- Boolean (yes/no)
- Soundex
- Autonumber (ID no.)

Numeric & Text variables
Numeric:
• Numerical information

• Hold integers (whole numbers) or numbers with a decimal point

• Length (max. 14 digits and 12 decimals)

• <#>, <##.#>

Text:
• Information of text and/or numbers

• Holding information (e.g. names, addresses)

• No mathematical operations

• Length (80 characters)

• <_>
Other variables
• Boolean variables:
- These are logical variables; there are only two possible answers: yes / no
- <Y>
• Date variables:
- Hold information on dates
- <DD/MM/YYYY>
- System variable: today's date (the date of data entry)
- <Today-dmy>
• Soundex:
- Coding of words (anonymous, e.g. A-123)
- Code to limit orthographic errors (e.g. Rome and Roma)
- <S >
• Auto identification number (system variable)
- Counts the records entered
- <IDNUM>

Make data file from questionnaire

• To create a data file
– Enter questionnaire file name (trial.qes)
– Enter data file name (trial.rec)
– OK
– Creates .REC file

• To view the data file
– Preview data form
– Enter OK

Checks (I)
• Once the data file is created:
Checks ► choose Trial.REC file
• Why do we need checks?
– Reduce input errors by declaring ranges
– Help with data entry
– Save time by skipping entry of fields that do not apply
– Add value labels
– Prevent duplicates
• Types of checks
– Basic
– Advanced

Checks (Basic)
• Range, legal
– 1-3, 9 (9 for missing)
– First the range, then individual numbers

• Jumps
– Jumps are used to skip fields that are not applicable

• Must Enter
– Data must be entered in field

• Repeat
– Show data from previous record

• Value Label
– Click “+” to add label
– Add text to explain label values
– Look at the screen forms as shown
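
The logic behind the "range, legal" and "must enter" checks can be illustrated in Python. This is not the syntax of an EpiData .CHK file; it is only a sketch of what such a check decides, with hypothetical function names.

```python
def parse_legal(spec: str) -> set[int]:
    """Turn a specification such as '1-3, 9' into the set of allowed values."""
    allowed = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            allowed.update(range(low, high + 1))
        elif part:
            allowed.add(int(part))
    return allowed

def check_field(value, legal: set[int], must_enter: bool = False) -> bool:
    """True if the value passes the 'must enter' and 'range, legal' checks."""
    if value is None:
        return not must_enter
    return value in legal

legal = parse_legal("1-3, 9")  # range 1-3 plus 9 for missing
print(sorted(legal))
print(check_field(2, legal), check_field(5, legal), check_field(None, legal, must_enter=True))
```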

Checks (Advanced)

• Advanced
– To prevent duplication of the unique ID (identification) number:

Checks ► Edit ► Key Unique 1

• Conditional jumps can also be used, via an if… then… endif structure
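
Both advanced checks express simple rules, shown here in Python rather than in check-file syntax. The field names follow the code book (`idno`, `smk`, `smknum`, `alc`); the entry order and the functions themselves are assumptions made for illustration.

```python
from collections import Counter

def duplicate_ids(records: list[dict], key: str = "idno") -> list:
    """IDs that occur more than once, i.e. what a unique key is meant to prevent."""
    counts = Counter(r[key] for r in records)
    return sorted(i for i, n in counts.items() if n > 1)

def fields_to_ask(record: dict) -> list[str]:
    """Conditional jump: if the person does not smoke, skip the cigarette count."""
    fields = ["idno", "smk"]
    if record.get("smk") == 1:
        fields.append("smknum")
    fields.append("alc")
    return fields

print(duplicate_ids([{"idno": 1}, {"idno": 2}, {"idno": 1}]))
print(fields_to_ask({"idno": 3, "smk": 0}))
print(fields_to_ask({"idno": 4, "smk": 1}))
```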

Data Entry (I)
• Enter data ► Choose trial.REC file
• Dates:
- For 24/08/2012, type 240812 or 24/8/12
• Value labels:
- Press F9 to view the value labels during data entry
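
The short date forms can be mimicked in Python to show how typed digits expand into a full date. This is an illustration, not EpiData's own parser; in particular, the century chosen for a two-digit year follows Python's rule and may differ from EpiData's.

```python
from datetime import date, datetime

def expand_typed_date(typed: str) -> date:
    """Expand '240812' or '24/8/12' (day, month, 2-digit year) into a full date."""
    fmt = "%d/%m/%y" if "/" in typed else "%d%m%y"
    return datetime.strptime(typed.strip(), fmt).date()

for typed in ["240812", "24/8/12"]:
    print(typed, "->", expand_typed_date(typed).strftime("%d/%m/%Y"))
```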

Data Entry (II)
• Record navigation:

• Delete records:
- Click cross to delete
- Record marked for deletion, but can be recovered

Document Tools

• File Structure
• Data entry notes (.NOT file)
- Use to write comments during data entry
- e.g. difficult-to-read handwriting
• View Data
• List Data
• Codebook
- Basic descriptive statistics on all
variables
• Validate duplicate files
- Check consistency after double entry
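
Validating double entry amounts to a field-by-field comparison of two independently entered files. A minimal Python sketch follows, assuming each file has been loaded as a mapping from record ID to field values; the function name and the data are hypothetical.

```python
def compare_entries(first: dict, second: dict) -> list[tuple]:
    """List (record ID, field, first value, second value) wherever two entries disagree."""
    differences = []
    for rec_id in sorted(set(first) | set(second)):
        a, b = first.get(rec_id), second.get(rec_id)
        if a is None or b is None:
            # The record was entered in only one of the two files
            differences.append((rec_id, "<record>", a is not None, b is not None))
            continue
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                differences.append((rec_id, field, a.get(field), b.get(field)))
    return differences

entry_1 = {"A01": {"age": 39, "sex": 1}, "A02": {"age": 51, "sex": 2}}
entry_2 = {"A01": {"age": 93, "sex": 1}, "A02": {"age": 51, "sex": 2}}
print(compare_entries(entry_1, entry_2))
```

Each reported difference is resolved by going back to the paper form, which is how a transposition such as 39 and 93 gets caught.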

Export to standard statistical
software programs

• Click “Export data” button

• Choose the statistical software program

• Options include Excel, Stata, SPSS

References

• Lauritsen JM, Bruus M. EpiData (version 3.1). A comprehensive tool for validated entry and documentation of data. The EpiData Association, Odense, Denmark, 2004.

• Lauritsen JM, Bruus M. EpiTour – An Introduction to data entry and documentation of data by use of EpiData. The EpiData Association, Odense, Denmark, 2005.

www.epidata.com
