STATISTICS
It is the science which deals with the methods for the collection, presentation,
analysis and interpretation of data.
Origin and development of Statistics
As early as 3500 B.C., statistics had been used in Egypt to record the number of sheep
or cattle owned and the number of people living in a particular city.
In 3800 B.C., the Babylonian government used statistics to measure the number of men under
the king’s rule and the vast territory that he occupied.
In 700 B.C., the Romans used statistics by conducting registrations to record the
population for the purpose of collecting taxes.
IMPORTANT PERSONS IN STATISTICS
John Graunt (1620 - 1674)
He studied records called “bills of mortality” that included information about the numbers and causes of
deaths in the city of London.
De Moivre (1667 - 1754)
De Moivre discovered the equation of the normal distribution in 1733.
Adolphe Quetelet (1796 - 1874)
He is known as the “Father of Modern Statistics.”
Karl Pearson (1857 - 1936)
He developed the statistical theory of regression and correlation.
Ronald Fisher (1890 - 1962)
He developed the F-test used in inferential statistics and in experimental design.
George Gallup (1901 - 1984)
He was instrumental in making statistical polling a common tool in political campaigns.
Psychology
Determines attitudinal patterns and the causes and effects of misbehavior.
Business and Economics
Validate or test a claim or inferences about a group of people, objects or a series of events.
Medicine
It collects information about patients and diseases to make decisions about the use of new
drug treatments.
Meteorology
It finds patterns in the weather and makes predictions about what future weather will be like.
TYPES OF SAMPLES
Random Sample
It is the most commonly used sampling technique. It is a procedure in which every element
of the population is given an equal chance of being selected as a member of the sample.
Convenience Sample
It is a sample that is chosen because it is easy for the researcher to obtain.
Cluster Sample
It is a sample that consists of the items in a group, such as a neighborhood or a household;
the group may be chosen at random.
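The three sample types above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered members, the `households` dictionary, and the function names are assumptions made for the example, not part of the notes.

```python
import random

# Hypothetical population: 100 numbered members (assumed for illustration).
population = list(range(1, 101))

# Hypothetical clusters: households and the members living in each one.
households = {"H1": [1, 2, 3], "H2": [4, 5], "H3": [6, 7, 8, 9], "H4": [10]}


def random_sample(items, n, seed=None):
    """Random sample: every element has an equal chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(items, n)


def convenience_sample(items, n):
    """Convenience sample: take whatever is easiest to reach (here, the first n)."""
    return items[:n]


def cluster_sample(clusters, k, seed=None):
    """Cluster sample: choose k whole groups at random and keep every member."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k)
    return [member for name in chosen for member in clusters[name]]


print(random_sample(population, 5, seed=1))
print(convenience_sample(population, 5))
print(cluster_sample(households, 2, seed=1))
```

The convenience sample always returns the same easy-to-reach members, which is why it is more likely to be unrepresentative than the random sample.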
DATA COLLECTION
Data can be defined as the value of a variable (e.g. numbers, images, words, figures, facts or
ideas).
It is the lowest unit of information from which other measurements and analyses can be done.
Data is one of the most important and vital aspects of any research study.
Factors to be Considered Before Collection of Data
Object and scope of the enquiry.
Sources of information.
Quantitative expression.
Techniques of data collection.
Unit of collection.
METHODS OF PRIMARY DATA COLLECTION
1. Questionnaire Method
2. Interview Method
3. Focus Group Discussion (FGD)
4. Participatory rural Appraisal/ Assessment (PRA)
5. Rapid Rural Appraisal/ Assessment (RRA)
6. Observation Method
7. Survey Method
8. Case Study method
9. Diaries Method
10. Principal Component Analysis (PCA)
11. Activity Sampling Technique
12. Memo Motion Study
13. Process Analysis
14. Link Analysis
15. Time and Motion Study
16. Experimental Method
17. Statistical Method
SOURCES OF DATA
1. External sources
Primary data
Secondary data
2. Internal sources
Primary Data
Data that has been collected from first-hand experience is known as primary data. It is
more reliable and authentic, and has not been published anywhere.
Primary data has not been changed or altered by human beings; therefore, its validity is
greater than that of secondary data.
Methods Of Collecting Primary Data
Direct Personal Investigation (i.e. interview method)
Investigation through observation
Investigation through mailed questionnaire
Investigation through local reporters’ questionnaire
Indirect oral investigation (i.e. through enumerators)
Secondary Data
Secondary data are those that have already been collected by others.
These are usually found in journals, periodicals, research publications, official records, etc.
Secondary data may be available in published or unpublished form. When it is not
possible to collect the data by the primary method, the investigator goes for the secondary method.
These data were collected for some purpose other than the problem at hand.
Methods Of Collecting Secondary Data
1. Published Sources
International
Government
Municipal corporation
Institutional/ commercial
2. Unpublished sources
SECONDARY DATA
MERITS | DEMERITS
Quick and cheap source of data | May not fulfill our specific research needs
Wider geographical area | Poor accuracy
Longer orientation period | Data are not up to date
Leads to finding primary data | Poor accessibility in some cases
DIFFERENCE BETWEEN PRIMARY AND SECONDARY DATA
Primary data | Secondary data
Real-time data | Past data
Sure about sources of data | Not sure about sources of data
Helps to give results/findings | Helps in refining the problem
Costly and time-consuming process | Cheap and not time-consuming process
Avoids bias in response data | Cannot know whether the data are biased or not
More flexible | Less flexible
TYPES OF STUDIES
Agenda
• Dependent Variable
• Independent Variable
• Intervening/Mediating Variable
• Organismic Variable
• Control/Constant Variable
• Interval variable
• Ratio variable
• Nominal/Categorical variable
• Ordinal variable
• Dummy variables
• Preference variable
• Multiple response variable
• Extraneous Variable
VARIABLE
• Any characteristic which is subject to change and can have more than one value such as
age, intelligence, motivation, gender, etc.
Dependent Variable
• Variable affected by the independent variable
• It responds to the independent variable.
Independent Variable
• Variable that is presumed to influence another variable
• It is the presumed cause, whereas the dependent variable is the presumed effect.
Example 1
You are interested in “How stress affects mental state of human beings?”
Independent variable: Stress
Dependent variable: Mental state of human beings
You can directly manipulate stress levels in your human subjects and measure how those
stress levels change mental state.
Example 2
Promotion affects employees’ motivation
Independent variable: Promotion
Dependent variable: Employees’ motivation
Other names for the dependent variable: Explained, Predictand, Regressand, Response, Outcome, Controlled
Other names for the independent variable: Explanatory, Predictor, Regressor, Stimulus, Covariate, Control
Intervening/Mediating Variable
It is a variable whose existence is inferred but it cannot be measured.
Example 1
Determining the effect of video clips on learning ability of students of M.Phil.
The association between video clips and learning ability needs to be explained.
Other variables intervene, such as anxiety, fatigue, motivation, improper diet, etc.
It is caused by the independent variable and is itself a cause of the dependent variable.
Example 2
Higher education typically leads to higher income
Higher education: independent variable
Higher income: dependent variable
Better occupation: intervening variable
It is causally affected by education and itself affects income.
Organismic Variable
Any characteristic of the research participant/individual under study that can be used for
classification
Such as personal characteristics of gender, height, weight, age, etc. in behavioral sciences.
Control/Constant variable
It is a variable that is NOT allowed to change unpredictably during an experiment.
As they are ideally expected to remain the same, they are also called constant variables.
Example
An example of a constant variable is the voltage from a power supply.
If you are examining “How electricity affects experimental subjects” you should keep the
voltage constant; otherwise the energy supplied will change as the voltage changes.
Interval Variable
Interval variables have a numerical value
These have order and equal intervals.
They allow us not only to rank-order the items that are measured but also to quantify and
compare the magnitudes of differences between them.
Example
Suppose we have a variable such as monthly income that is measured in rupees, and we have
three people who make
• Rs. 10,000
• Rs. 15,000 and
• Rs. 20,000
The second person makes Rs. 5,000 more than the first and Rs. 5,000 less than the third, so
the intervals between them are equal.
Ratio Variable
A ratio variable is similar to an interval variable with one difference: the ratio makes
sense.
Example
• Let’s say respondents were being surveyed about their stress levels on a scale of 0-10.
• A respondent with a stress level of 10 should have twice the stress experienced as a
respondent who selected a stress level of 5.
Age, height, and weight are also good examples of ratio variables. Someone who is 6’0” tall is
twice as tall as someone who is 3’0” tall.
Nominal/Categorical Variable
• They can be measured only in terms of whether the individual items belong to certain
distinct categories
• We cannot quantify or even rank/order the categories:
• Nominal data has no order
• One cannot perform arithmetic (+, -, /, *) or order comparisons (>, <) on nominal
data; one can only check whether two items fall in the same category (=).
Example
Gender (dichotomous variable):
1. Male
2. Female
Marital Status:
1. Unmarried
2. Married
3. Divorcee
4. Widower
Ordinal Variable
An ordinal variable is a nominal variable, but its different states are ordered in a meaningful
sequence.
Ordinal data has order but the intervals between scale points may be uneven.
Because of lack of equal distances, arithmetic operations are impossible, but logical
operations can be performed on the ordinal data.
A typical example of an ordinal variable is the socio-economic status of families.
We know 'upper middle' is higher than 'middle' but we cannot say 'how much higher'.
Example
A questionnaire on the time involvement of scientists in the 'perception and identification of
research problems'.
The respondents were asked to indicate their involvement by selecting one of the following
codes:
1 = Very low or nil
2 = Low
3 = Medium
4 = Great
5 = Very great
Here, the variable 'Time Involvement' is an ordinal variable with 5 states.
Dummy Variable
A qualitative variable can be transformed into one or more quantitative variables, called dummy
variables.
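As a rough sketch of how this transformation can be done, the Python function below turns each category into a 0/1 variable. The function name `dummy_encode`, the `is_` prefix, and the choice to drop the first category as a baseline are assumptions for illustration; the notes do not prescribe a particular coding scheme.

```python
def dummy_encode(values, drop_first=True):
    """Turn a qualitative variable into 0/1 dummy variables.

    With drop_first=True the first category (in sorted order) becomes the
    baseline and is represented by all dummies being 0. This is a common
    convention, assumed here for illustration.
    """
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]


# Marital status categories are taken from the nominal-variable example.
marital_status = ["Married", "Unmarried", "Divorcee", "Married"]
for status, row in zip(marital_status, dummy_encode(marital_status)):
    print(status, row)
```

A variable with k categories needs only k - 1 dummy variables when one category is used as the baseline.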
Preference Variable
Preference variables are specific discrete variables whose values are either in a decreasing
or increasing order.
For example,
In a survey, a respondent may be asked to indicate the importance of the following FIVE sources
of information in his research and development work, by using the code [1] for the most important
source and [5] for the least important source:
1. Literature published in the country
2. Literature published abroad
3. Scientific abstracts
4. Unpublished reports, material, etc.
5. Discussions with colleagues within the research unit
Multiple Response Variable
Multiple response variables are those which can assume more than one value
Example
A typical example is a survey questionnaire about the use of computers in research.
The respondents were asked to indicate the purpose(s) for which they use computers in their
research work. The respondents could score more than one category.
1. Statistical analysis
2. Lab automation/ process control
3. Data base management, storage and retrieval
4. Modeling and simulation
5. Scientific and engineering calculations
6. Computer aided design (CAD)
Extraneous Variable
Extraneous variables are undesirable variables that influence the relationship between the
variables an experimenter is examining.
Example
An educational psychologist has developed a new learning strategy and is interested in
examining the effectiveness of this strategy.
The experimenter randomly assigns students into two groups. All of the students study text
materials on a biology topic for thirty minutes. One group uses the new strategy and the other
uses a strategy of their choice.
Then all students complete a test over the materials.
Extraneous variable: pre-knowledge of the biology topic
Random Variable
Statistics and Probability
Competencies:
Illustrates a random variable (discrete and continuous);
Illustrates a random variable (dependent and independent);
Distinguishes between a discrete and continuous random variable and dependent
and independent variable; and
Finds the possible values of a random variable.
Example 1:
Factors Affecting Academic Performance of Grade 11 Students
Independent variable: factors such as I.Q., study habits, etc.
Dependent variable: academic performance of Grade 11 students
Example 2:
Mr. S set up an experiment to see how the mass of a ball affects the distance it rolls off a ramp.
Independent variable: mass of a ball
Dependent variable: distance it rolls off a ramp
Example 3:
Eating breakfast in the morning increases the ability to learn in school.
Independent variable: eating breakfast
Dependent variable: ability to learn
GRAPHICAL REPRESENTATION
• The visual display of statistical data in the form of points, lines, areas and other
geometrical forms and symbols is, in the most general terms, known as graphical
representation.
BAR GRAPH
• A bar graph is a chart that uses either horizontal or vertical bars to show comparisons
among categories.
LINE DIAGRAM
A line diagram is a graph that shows information that is connected in some way (such as change over time).
A line graph represents data as dots, and joining these dots forms a line in the graph.
PIE DIAGRAM
Pie diagram is a circular diagram where the whole circle represents a total and the
components of the total are represented by sectors of the pie diagram.
Pie diagram is also called sector diagram.
Example (Pie Chart)
The chart shows the percentage of usage of different browsers in Europe. In this
chart, 37.9% of people in Europe use Firefox, 15.5% use Chrome, and so on.
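Since the whole circle represents the total, each sector angle is the component's share of the total multiplied by 360 degrees. The sketch below applies this to the two figures quoted in the example. The "Other" share is an assumed remainder added only so that the shares sum to 100%, and the function name is hypothetical.

```python
def sector_angles(shares):
    """Convert shares of a total into pie-sector angles in degrees."""
    total = sum(shares.values())
    return {name: 360 * value / total for name, value in shares.items()}


# Firefox and Chrome figures come from the example above; "Other" is an
# assumed remainder, not a figure from the original chart.
browser_share = {"Firefox": 37.9, "Chrome": 15.5, "Other": 46.6}

for name, angle in sector_angles(browser_share).items():
    print(f"{name}: {angle:.1f} degrees")
```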
PICTOGRAM
A pictogram is a popular device for portraying the statistical data by means of pictures or
small symbols.
HISTOGRAM
A histogram consists of a set of adjacent rectangles whose bases are marked off by class
boundaries (not class limits) on the X- axis and whose heights are proportional to the
frequencies associated with respective classes.
Example (Histogram)
SAMPLING DESIGN AND SAMPLING DISTRIBUTIONS
Target Population
The target population is the collection of elements or objects that possess the information
sought by the researcher and about which inferences are to be made.
TERMINOLOGY
ELEMENT
is the object about which or from which the information is desired, e.g., the respondent
SAMPLING UNIT
is an element, or a unit containing the element, that is available for selection at some stage
of the sampling process
EXTENT
refers to the geographical boundaries
TIME
is the time period under consideration
STATISTICAL ERRORS
A statistical error is the difference between the value of a sample statistic of interest and
the value of the corresponding population parameter.
Types of Errors
1. Random Sampling Error
The difference between the sample result and the result of a census conducted using
identical procedures
These errors are due to chance fluctuations
2. Systematic Error
Systematic (nonsampling) errors result from nonsampling factors, primarily the
nature of a study’s design and the correctness of execution
These are not due to chance fluctuations
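The difference between the two error types can be illustrated with a small simulation. Everything in this sketch is assumed for illustration: the simulated population, the sample size of 100, and the constant `bias` of 2.0 that stands in for a systematic error such as a miscalibrated instrument.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of 10,000 measurements (assumed for illustration).
population = [rng.gauss(50, 10) for _ in range(10_000)]
parameter = statistics.mean(population)  # the population parameter


def sample_mean_error(sample_size, bias=0.0):
    """Sample statistic minus population parameter for one random sample.

    `bias` models a constant systematic (nonsampling) error.
    """
    sample = rng.sample(population, sample_size)
    return (statistics.mean(sample) + bias) - parameter


random_errors = [sample_mean_error(100) for _ in range(200)]
biased_errors = [sample_mean_error(100, bias=2.0) for _ in range(200)]

# Random sampling errors fluctuate around zero across repeated samples.
print(round(statistics.mean(random_errors), 2))
# A systematic error does not average out, however many samples are drawn.
print(round(statistics.mean(biased_errors), 2))
```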
Classification of Sampling Techniques
Sampling Techniques
Non probability Sampling Techniques
– Convenience Sampling
– Judgmental Sampling
– Quota Sampling
– Snowball Sampling
2. Systematic Sampling
• A sampling procedure in which a starting point is selected by a random process and
then every nth number on the list is selected.
3. Stratified Sampling
• A probability sampling procedure in which simple random subsamples that are
more or less equal on some characteristic are drawn from within each stratum of
the population.
5. Cluster Sampling
• An economically efficient sampling technique in which the primary sampling unit
is not the individual element in the population but a cluster of elements; clusters are
selected randomly.
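Systematic and stratified selection can be sketched as follows. The sampling frame of 100 numbered units, the "urban" and "rural" strata, and the proportional allocation rule are assumptions for illustration only.

```python
import random


def systematic_sample(items, interval, seed=None):
    """Pick a random starting point, then take every `interval`-th item."""
    rng = random.Random(seed)
    start = rng.randrange(interval)
    return items[start::interval]


def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random subsample of the same fraction from each stratum."""
    rng = random.Random(seed)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical sampling frame and strata (assumed for illustration).
frame = list(range(1, 101))
strata = {"urban": list(range(1, 61)), "rural": list(range(61, 101))}

print(systematic_sample(frame, 10, seed=3))
print(stratified_sample(strata, 0.1, seed=3))
```

With a fraction of 0.1, the stratified sample takes 6 urban and 4 rural units, so each stratum is represented in proportion to its size.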
NONPROBABILITY SAMPLES
Judgement
Description: an expert or experienced researcher selects the sample to fulfill a purpose, such as ensuring that all members have a certain characteristic
Cost and degree of use: moderate cost, average use
Advantages: useful for certain types of forecasting; sample guaranteed to meet a specific objective
Disadvantages: bias due to expert’s beliefs may make sample unrepresentative; projecting data beyond sample is risky
Quota
Description: the researcher classifies the population by pertinent properties, determines the desired proportion to sample from each class, and fixes quotas for each interviewer
Cost and degree of use: moderate cost, very extensively used
Advantages: introduces some stratification of population; requires no list of population
Disadvantages: introduces bias in researcher’s classification of subjects; nonrandom selection within classes means error from population cannot be estimated; projecting data beyond sample is risky
Snowball
Description: initial respondents are selected by probability samples; additional respondents are obtained by referral from initial respondents
Cost and degree of use: low cost, used in special situations
Advantages: useful in locating members of rare populations
Disadvantages: high bias because sample units are not independent; projecting data beyond sample is risky
PROBABILITY SAMPLES
Simple Random
Description: the researcher assigns each member of the sampling frame a number, then selects sample units by a random method
Cost and degree of use: high cost, moderately used in practice (most common in random digit dialing and with computerized sampling frames)
Advantages: only minimal advance knowledge of population needed; easy to analyze data and compute error
Disadvantages: requires sampling frame to work from; does not use knowledge of population that researcher may have; larger errors for the same sample size; respondents may be widely dispersed, hence cost may be higher
Systematic
Description: the researcher uses natural ordering or the order of the sampling frame, selects an arbitrary starting point, then selects items at a preselected interval
Cost and degree of use: moderate cost, moderately used
Advantages: simple to draw; easy to check
Disadvantages: if sampling interval is related to periodic ordering of the population, may introduce increased variability
Stratified
Description: the researcher divides the population into groups and randomly selects subsamples from each group. Variations include proportional, disproportional and optimal allocation of subsample sizes
Cost and degree of use: high cost, moderately used
Advantages: ensures representation of all groups in sample; characteristics of each stratum can be estimated and comparisons made; reduces variability for same sample size
Disadvantages: requires accurate information on proportion in each stratum; if stratified lists are not already available, they can be costly to prepare
Cluster
Description: the researcher selects sampling units at random, then does a complete observation of all units or draws a probability sample in the group
Cost and degree of use: low cost, frequently used
Advantages: if clusters are geographically defined, yields lowest field cost; requires listing of all clusters, but of individuals only within clusters; can estimate characteristics of clusters as well as of population
Disadvantages: larger error for comparable size than with other probability samples; researcher must be able to assign population members to a unique cluster or else duplication or omission of individuals will result
Multistage
Description: progressively smaller areas are selected in each stage by some combination of the first four techniques
Cost and degree of use: high cost, frequently used, especially in nationwide surveys
Advantages: depends on techniques combined
Disadvantages: depends on techniques combined
Internet Sampling
Advantages
• Allows researchers to reach a large sample rapidly
• Sample size requirements can be met quickly
• Easier to carry out
• Less costly
Disadvantages
• Lack of computer ownership and internet access
• Unrepresentative of all target populations
TERMINOLOGY
Control Group - A group assigned to the experiment, but not for the purpose of being exposed
to the treatment. Performance of this group serves as a baseline.
Treatment Group - The Group in an experiment which receives the specified treatment.
Factor - This term is used when an experiment involves more than one variable. These
variables are often identified as factors.
Level - Refers to the degree or intensity of a factor.
Randomness - Refers to the property of completely chance events that are not predictable.
Replication - The repetition of the treatment under consideration.
Blocks - Refers to the categories of subjects within a treatment group.
EXPERIMENTAL ERROR
Experimental error is the variation in the responses among experimental units which are
assigned the same treatment and are observed under the same experimental conditions. It is
measured by SSE (or MSE). Ideally, we would like experimental error to be zero.
This is impossible for one or more of the following reasons:
• There are inherent differences in the experimental units before they receive treatments.
• There is variation in the devices that record the measurements.
• There is variation in applying or setting the treatments.
• There are extraneous factors other than the treatments which affect the response.
COMPLETE DESIGNS
COMPLETELY RANDOMIZED DESIGN (CRD)
• The COMPLETELY RANDOMIZED DESIGN is the simplest design, in which the treatments
are assigned to the experimental units completely at random. This allows every
experimental unit to have an equal probability of receiving a treatment.
• For CRD, any difference among experimental units receiving the same treatment is
considered as experimental error.
CHARACTERISTICS OF THE CRD
• CRD is the simplest design to use.
• CRD is appropriate only for experiments with homogeneous experimental units, such as
laboratory experiments, where environmental effects are relatively easy to control.
• The CRD is best suited for experiments with a small number of treatments.
• For field experiments, where there is generally large variation among experimental plots
in such environmental factors as soil, the CRD is rarely used.
• Every experimental unit has the same probability of receiving any treatment
• Treatments are assigned to experimental units completely at random
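A minimal sketch of this randomization, assuming equal replication (the same number of units per treatment): the unit labels, the seed, and the function name are hypothetical and chosen only for illustration.

```python
import random


def completely_randomized_design(units, treatments, seed=None):
    """Assign treatments to units completely at random with equal replication."""
    if len(units) % len(treatments) != 0:
        raise ValueError("this sketch assumes equal replication")
    reps = len(units) // len(treatments)
    # One label per unit, e.g. A repeated 5 times, then B, then C.
    labels = [t for t in treatments for _ in range(reps)]
    rng = random.Random(seed)
    rng.shuffle(labels)  # every arrangement of labels is equally likely
    return dict(zip(units, labels))


# Hypothetical experimental units (assumed for illustration).
units = [f"unit{i}" for i in range(1, 16)]
layout = completely_randomized_design(units, ["A", "B", "C"], seed=7)
for unit, treatment in layout.items():
    print(unit, treatment)
```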
EXAMPLE OF CRD
• In order to determine whether there is significant difference in the durability of 3 makes of
computers, samples of size 5 are selected from each make and the frequency of repair
during the first year is observed. The results are as follows:
Makes
A B C
5 8 7
6 10 3
8 11 5
9 12 4
7 4 1
HYPOTHESIS
H0: The three makes of computers do not differ significantly in durability.
H1: At least one of the makes of computers differs significantly in durability.
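One way to test these hypotheses is a one-way analysis of variance on the repair frequencies tabulated above. The sketch below computes the sums of squares and the F ratio by hand; the function name and the returned dictionary keys are assumptions for illustration, and the notes themselves do not show the computation.

```python
def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and the F ratio for a CRD."""
    all_values = [x for g in groups.values() for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = {name: sum(g) / len(g) for name, g in groups.items()}

    # Between-treatment variation.
    ss_treat = sum(len(g) * (means[name] - grand_mean) ** 2
                   for name, g in groups.items())
    # Within-treatment variation: the experimental error (SSE).
    sse = sum((x - means[name]) ** 2 for name, g in groups.items() for x in g)

    df_treat = len(groups) - 1
    df_error = len(all_values) - len(groups)
    mse = sse / df_error
    f_ratio = (ss_treat / df_treat) / mse
    return {"ss_treat": ss_treat, "sse": sse, "df_treat": df_treat,
            "df_error": df_error, "mse": mse, "f": f_ratio}


# Repair frequencies from the table above.
makes = {"A": [5, 6, 8, 9, 7], "B": [8, 10, 11, 12, 4], "C": [7, 3, 5, 4, 1]}
result = one_way_anova(makes)
print(result)
```

For these data the F ratio works out to about 5.43 on 2 and 12 degrees of freedom. The tabulated 5% critical value for those degrees of freedom is about 3.89, so at that level H0 would be rejected.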
ADVANTAGES
• Very flexible design (i.e. number of treatments and replicates is only limited by the
available number of experimental units).
• Statistical analysis is simple compared to other designs.
• Loss of information due to missing data is small compared to other designs due to
the larger number of degrees of freedom for the error source of variation.
• Provides maximum number of degrees of freedom.
DISADVANTAGES
• If experimental units are not homogeneous and you fail to minimize this variation
using blocking, there may be a loss of precision.
• Usually the least efficient design unless experimental units are homogeneous.
• Not suited for a large number of treatments.
RANDOMISED BLOCK DESIGN (RBD)
• Any experimental design in which the randomization of treatments is restricted to groups
of experimental units within a predefined block of units assumed to be internally
homogeneous is called a randomized block design.
• Divides the group of experimental units into n homogeneous groups of equal or unequal
sizes.
• These homogeneous groups are called blocks.
• The treatments are then randomly assigned to the experimental units in each block - one
treatment to a unit in each block.
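The restricted randomization can be sketched as a separate shuffle of the treatments inside each block. The plot names, block names, and function name below are hypothetical, and the sketch assumes each block has exactly one unit per treatment.

```python
import random


def randomized_block_design(blocks, treatments, seed=None):
    """Randomize the treatments independently within each block."""
    rng = random.Random(seed)
    layout = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("this sketch assumes one unit per treatment in each block")
        order = list(treatments)
        rng.shuffle(order)  # a fresh random order for every block
        layout[block] = dict(zip(units, order))
    return layout


# Hypothetical blocks of homogeneous plots (assumed for illustration).
blocks = {
    "block1": ["p1", "p2", "p3"],
    "block2": ["p4", "p5", "p6"],
    "block3": ["p7", "p8", "p9"],
}
layout = randomized_block_design(blocks, ["A", "B", "C"], seed=11)
for block, assignment in layout.items():
    print(block, assignment)
```

Unlike the completely randomized design, every treatment is guaranteed to appear exactly once in every block.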
CHARACTERISTICS OF RBD
• A randomized block experiment is assumed to be a two-factor experiment; the factors are
blocks and treatments.
• The blocks of experimental units are uniform.
• There is one observation per cell. It is assumed that there is no interaction between blocks
and treatments.
• The degrees of freedom for the interaction are used to estimate error.
• Treatments are randomly assigned to each experimental unit of a block.
ADVANTAGES
• Complete flexibility: it can have any number of treatments and blocks.
• Provides more accurate results than the completely randomized design due to
grouping.
• Relatively easy statistical analysis even with missing data.
• Some treatments may be replicated more times than others.
• Whole treatments or entire replicates may be deleted from the analysis.
DISADVANTAGES
• Not suitable for large numbers of treatments because blocks become too large, and
there is possibility of heterogeneity among the experimental units of the blocks
• Interactions between block and treatment effects increase error.
• Serious problem with the analysis if a block factor by treatment interaction effect
actually exists and no replication within blocks has been included. (solution: use
replication within blocks when possible).
LATIN SQUARE DESIGN (LSD)
• A Latin square is a square array of objects (letters A, B, C, …) such that each object
appears once and only once in each row and each column.
• Example - 4 x 4 Latin Square.
A B C D
B C D A
C D A B
D A B C
• The Latin Square Design is for a situation in which there are two extraneous sources
of variation. If the rows and columns of a square are thought of as levels of the two
extraneous variables, then in a Latin square each treatment appears exactly once in each
row and column.
• With the Latin Square design we are able to control variation in two directions.
CHARACTERISTICS OF LSD
• In LSD we have three factors: Treatments, Rows and Columns
• The number of treatments = the number of rows = the number of columns = t (say).
• The row-column treatments are represented by cells in a t x t array.
• The treatments are assigned to row-column combinations using a Latin-square
arrangement, that is, each row contains every treatment and each column contains
every treatment.
• Every treatment occurs once in each row and column.
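A t x t Latin square can be built by cyclically shifting the treatment labels one place per row, which reproduces the 4 x 4 example given earlier. The function names are assumptions for illustration. In practice the rows, columns and treatment labels of such a standard square are usually randomized before use, which this sketch does not do.

```python
def cyclic_latin_square(treatments):
    """Build a t x t Latin square by shifting the treatments one place per row."""
    t = len(treatments)
    return [[treatments[(row + col) % t] for col in range(t)] for row in range(t)]


def is_latin_square(square):
    """Check that every treatment occurs exactly once in each row and column."""
    t = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == t and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(t)} == symbols for c in range(t))
    return len(symbols) == t and rows_ok and cols_ok


square = cyclic_latin_square("ABCD")
for row in square:
    print(" ".join(row))
print(is_latin_square(square))
```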
HYPOTHESIS
H0A: There is no significant difference between burners.
H1A: At least one of the burners is significantly different.
H0B: There is no significant difference between the days.
H1B: At least one of the days is significantly different.
H0C: There is no significant difference between engines.
H1C: At least one of the engines is significantly different.
ADVANTAGES
We can control variation in two directions. It means LSD is more efficient than
CRD and RBD.
Being a 3-way design, it is economical compared with the corresponding complete 3-way
design. Instead of r^3 experimental units, only r^2 experimental units are
sufficient.
The analysis remains relatively simple even with missing data.
DISADVANTAGES
The number of treatments is limited to the number of replicates, which seldom
exceeds 10.
If we have fewer than 5 treatments, the df for controlling random variation is
relatively large and the df for error is small.
The number of treatments must equal the number of replicates.
The experimental error is likely to increase with the size of the square.
Evaluation of interactions between rows and columns, rows and treatments &
columns and treatments is not possible separately.
FACTORIAL EXPERIMENT
• Factorial designs include two or more factors, each having more than one level or
treatment. Participants typically are randomized to a combination that includes one
treatment or level from each factor.
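The treatment combinations of a factorial design are the Cartesian product of the factor levels. The sketch below lists them for a 2 x 3 design. The factor names and levels are hypothetical and serve only to illustrate the idea.

```python
from itertools import product


def factorial_combinations(factors):
    """List every combination of one level from each factor."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


# Hypothetical factors and levels (assumed for illustration).
factors = {"dose": ["low", "high"], "schedule": ["daily", "weekly", "monthly"]}
combos = factorial_combinations(factors)

for combo in combos:
    print(combo)
print(len(combos), "treatment combinations")
```

The number of combinations is the product of the numbers of levels, here 2 x 3 = 6, and participants would be randomized across these combinations.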
NESTED DESIGNS
• In certain multifactor experiments, the levels of one factor are similar but not identical for
different levels of another factor (each level is unique to that particular level of the other
factor). This is called a hierarchical or nested design.