Data Management

DATA MANAGEMENT 1
GMATH: Mathematics in the Modern World

Chapter 4: DATA MANAGEMENT
STATISTICS AND DATA
Statistics is the science that deals with the collection, presentation, analysis, and interpretation of data. The two
divisions of statistics are descriptive and inferential statistics. Descriptive statistics deals with the gathering,
classification, presentation, and analysis of data without generalizing the results for the entire population.
Inferential statistics is concerned with generalizing sample results to the whole population. It demands
inductive reasoning, a higher degree of critical judgment, and more advanced mathematical methods.
Functions of Statistics
a. Organizes data for presentation and better understanding
b. Estimates quantities and measurements
c. Facilitates information dissemination
d. Helps in establishing differences
e. Explains the relationship between variables of interest
f. Tests assertions and claims
g. Predicts and forecasts future outcomes
Data are the raw materials of research or any statistical investigation. They arise when measurements are
made and/or observations are recorded. In general, data can be categorized as quantitative or qualitative.
Quantitative data take numerical values for which descriptions such as means, standard deviations, and other
parameters or statistics are meaningful.
Qualitative data, such as the eye and hair colors of an individual, cannot be manipulated arithmetically. They
are labels that indicate the category or class into which an individual, object, or process falls.
Data can also be categorized according to source as primary or secondary. Primary data are
information gathered directly from an original source or based on direct, first-hand
experience. Secondary data are information taken from published or unpublished materials that have
been previously gathered by other individuals, researchers, or agencies.
Steps in Statistical Data Analysis
1. Definition of the problem (Knowing the problem)

2. Data Gathering (Data Collection)
3. Data Presentation
4. Data Analysis
5. Interpretation
SCALES OF MEASUREMENT
Measurements are classified, according to the precision of the measurement procedure, as
nominal, ordinal, interval, or ratio.
In the nominal scale, a name, label, or category is assigned to classify each element observed with respect to
the property of interest. Gender is one variable measured through the nominal scale. Male and female are the
two categories, which do not follow an order or rank. Other examples are civil status (single, married,
widowed, separated) and quality of products (defective or good).
In an ordinal or ranking scale, the elements or categories are arranged in some meaningful kind of order or
rank, which corresponds to their relative position or “size”. Examples are preference and quality. Preference
may have categories such as most preferred, next preferred, and least preferred. These categories follow
an order or rank, with least preferred being the lowest and most preferred being the highest. Quality may be
rated as poor, fair, good, very good, or outstanding. Other examples are birth order (eldest, …, youngest)
and size (large, medium, small). Note that the difference between ranks is not meaningful.
In an interval scale, the elements can be differentiated and ordered, and the arithmetic difference between
elements is meaningful. This scale of measurement is more informative than either the nominal or ordinal scale,
since the fact that the distance between elements can be determined implies that there is a fixed unit of
measurement and a zero point (origin), even though the latter is arbitrary. Examples are temperature, I.Q.,
grade (in numerical form), time, blood pressure, and calendar dates.
The highest level of measurement is the ratio scale. Here, there is not only an order property, a unit of
measurement, and a meaningful difference between elements, but there is also a fixed origin (which is zero) as
opposed to an arbitrary origin. Examples are height, weight, length, salary, number of bacteria, tensile strength.
POPULATION and SAMPLE
Population or study population is the totality of all objects, individuals, or entities whose properties or
characteristics are the subject of a research or statistical inquiry. A study population can be finite or infinite.
A population is said to be finite if it is possible to count its individual members. Sometimes it is not possible to
count the units or members in a population. Such populations are described as infinite.
The students of a school, a set of books, a group of patients, the employees of an organization, a herd of cattle, and a set of bags
of cement are examples of finite populations. Infinite populations include tourists (registered and unregistered)
in a certain location, rats in an open area, stones in a riverbank, turtles in a pond, and micro-organisms of the
same species inhabiting a given area.
The whole population usually cannot be studied because of time and budget constraints. This suggests
considering only a small portion of the population in the investigation. A sample is a representative part of a
population. A characteristic of a population that is the subject of a statistical inquiry or research is
called a parameter. A statistic, on the other hand, is a characteristic of a sample. A statistic is used to estimate,
describe, or represent a parameter.
Sampling is the process of selecting units, such as people, organizations, or objects, from a population of interest in
order to study them and fairly generalize the results back to the population from which the sample was taken.
In short, sampling is the process of getting information from only part of a larger group.

Sample Size Determination
The number of respondents or subjects that form a sample is called the sample size. Cochran (1977)
presented a set of formulas that can be used to determine the sample size.
In estimating a population mean, the following formulas can be used.
1) For a finite and known population size, N:
\[ n \ge \frac{(Z_{\alpha/2})^2 s^2 N}{(Z_{\alpha/2})^2 s^2 + N e^2} \]
where:
n is the sample size
$Z_{\alpha/2}$ is the two-tailed z-score corresponding to the level of significance, $\alpha$
s is the known standard deviation
e is the margin of error
2) For an infinite or unknown population size, N:
\[ n \ge \frac{(Z_{\alpha/2})^2 s^2}{e^2} \]
In estimating a population proportion, the following formulas can be used.
1) For a finite and known population size, N:
\[ n \ge \frac{(Z_{\alpha/2})^2 p q N}{(Z_{\alpha/2})^2 p q + (N - 1) e^2} \]
where:
p is the past estimate of the population proportion
q = 1 – p
2) For an infinite or unknown population size, N:
\[ n \ge \frac{(Z_{\alpha/2})^2 p q}{e^2} \]

Notes:
i. The level of significance, 𝛼, can take any of the standard values namely, 0.01, 0.05, and 0.10.
ii. The following table presents the values of $Z_{\alpha/2}$ corresponding to the standard values of $\alpha$:
$\alpha$     $Z_{\alpha/2}$
0.01    2.575
0.05    1.96
0.10    1.645
iii. The standard deviation, s, can be estimated from a pilot data set or the value can be adopted from a
previous study that considered the same or similar population.
iv. In the same manner as s, p can be the past estimate of the population proportion or can be computed from
a pilot data set.
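The four sample size formulas above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the z-scores are the ones tabulated in Note ii, the result is rounded up because the formulas give a lower bound on n, and the inputs in the last three lines are made-up values rather than data from these notes.

```python
import math

# Two-tailed z-scores Z_{alpha/2} from the table in Note ii
Z_ALPHA_2 = {0.01: 2.575, 0.05: 1.96, 0.10: 1.645}


def sample_size_mean(alpha, s, e, N=None):
    """Sample size for estimating a mean; N=None means infinite/unknown N."""
    z2s2 = Z_ALPHA_2[alpha] ** 2 * s ** 2
    n = z2s2 / e ** 2 if N is None else z2s2 * N / (z2s2 + N * e ** 2)
    return math.ceil(n)  # round up so that n is at least the computed bound


def sample_size_proportion(alpha, p, e, N=None):
    """Sample size for estimating a proportion; N=None means infinite/unknown N."""
    z2pq = Z_ALPHA_2[alpha] ** 2 * p * (1 - p)
    n = z2pq / e ** 2 if N is None else z2pq * N / (z2pq + (N - 1) * e ** 2)
    return math.ceil(n)


# Hypothetical inputs chosen only to exercise the functions
print(sample_size_mean(0.05, s=15, e=2))
print(sample_size_proportion(0.05, p=0.5, e=0.05))
print(sample_size_proportion(0.05, p=0.5, e=0.05, N=10_000))
```

Passing a population size switches each function from the infinite-population formula to the finite-population one, which gives a somewhat smaller sample size.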
Yamane’s Formula (Simplified Formula for Proportions)
If the behavior of the population is not certain or the researcher is not familiar with the population’s
behavior, Yaro Yamen’s formula (1980) or Taro Yamane’s formula (1967) may be used. The formula is:
\[ n = \frac{N}{1 + N e^2} \]
where:
n is the sample size
N is the population size
e is the level of precision (margin of error)
Example. From a population of 10,000 individuals in a certain town, what sample size is needed in order to get
an accurate result for a certain study using a margin of error of (a) 1%, (b) 2.5%, and (c) 5%?
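One way to work this example is to evaluate Yamane's formula directly. The sketch below assumes the result is rounded up to the next whole respondent; the function name is hypothetical.

```python
import math


def yamane_sample_size(N, e):
    """Yamane's formula n = N / (1 + N e^2), rounded up to a whole unit."""
    return math.ceil(N / (1 + N * e ** 2))


for e in (0.01, 0.025, 0.05):
    print(f"e = {e:.1%}: n = {yamane_sample_size(10_000, e)}")
```

The required sample size drops quickly as the tolerated margin of error grows: 5,000 respondents at 1%, 1,380 at 2.5%, and 385 at 5%.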
SAMPLING TECHNIQUES
Sampling is the process of getting information from only part of a larger group. The two types of sampling are
random sampling and nonrandom sampling. Nonrandom sampling uses some criteria for choosing the sample
whereas random sampling does not.
A. Random Sampling Techniques
Simple Random Sampling

Simple random sampling is the most basic and well-known type of random sampling technique. In simple
random sampling, every case in the population being sampled has an equal chance of being chosen. It is
an equal probability sampling method (EPSEM). EPSEMs are important because they produce
representative samples.

Basic Steps:
1. Construct the sampling frame
2. Determine the sample size
3. Employ any of the following selection procedures:
a. Drawing lots
b. Lottery
c. Using a device such as a calculator or computer to generate random numbers
d. Table of random numbers
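Procedure (c), generating random numbers by computer, might look like the sketch below. The sampling frame of 100 numbered units, the sample size of 10, and the seed are all hypothetical values chosen for illustration.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n units from the sampling frame without replacement."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)


frame = list(range(1, 101))  # hypothetical frame: units numbered 1 to 100
sample = simple_random_sample(frame, 10, seed=2024)
print(sorted(sample))
```

Every unit in the frame has the same chance of being drawn, which is the defining property of simple random sampling.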
Systematic Sampling
This method consists of randomly selecting one unit and choosing additional elements at equal intervals until
the desired sample size is achieved.
Basic Steps:
1. Construct the sampling frame
2. Determine the sample size
3. Determine the sampling interval, k:
\[ k = \frac{N}{n} \]
4. Identify the random start, r, using any of the selection procedures under simple random sampling (SRS), where
\[ 1 \le r \le k \]
The random start identifies the first sampling unit.
5. Commencing with the random start, select every kth item until the desired sample size is reached.
Example. From a population size of 300 items, 30 are to be selected randomly using systematic random
sampling. Which elements or units in the population are to be taken for the sample?
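For this example the sampling interval is k = 300/30 = 10, so the sample consists of the random start r and every 10th unit after it. The sketch below assumes N is a multiple of n and uses r = 7 purely as an illustrative random start; the function name is hypothetical.

```python
import random


def systematic_sample(N, n, r=None, seed=None):
    """Return the 1-based positions selected by systematic sampling."""
    k = N // n  # sampling interval (assumes N is a multiple of n)
    if r is None:
        r = random.Random(seed).randint(1, k)  # random start, 1 <= r <= k
    return [r + j * k for j in range(n)]


units = systematic_sample(300, 30, r=7)
print(units)
```

With r = 7 the selected units are the 7th, 17th, 27th, and so on up to the 297th. A different random start shifts the whole list but keeps the spacing of 10.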
Stratified Random Sampling

Stratified random sampling involves dividing the potential samples into two or more mutually exclusive
groups based on categories of interest in the research. The purpose is to organize the potential samples into
homogeneous subsets before sampling. For example, you could divide the potential samples based on
gender, race or occupation. You then draw a random sample from each subset. Stratified random
sampling is common because it ensures that each subgroup of the larger group is adequately represented
in the sample.
Proportional allocation:
\[ n_h = \left(\frac{N_h}{N}\right) n \]
where $n_h$ = sample size for each stratum
$N_h$ = stratum size
$N$ = population size
$n$ = sample size

Example: Suppose a school has five departments composed of the following number of students. Determine
the number of students to be part of the sample when the researcher needs 363 respondents.
Department                      Nh       nh
a. Business Administration      1,500
b. Management                   1,200
c. Finance                        850
d. Entrepreneurship               200
e. Culinary Arts                  150
Total                           3,900
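The proportional allocation for this example can be computed as in the sketch below, which applies the formula to each department and then rounds to the nearest whole student. The rounding step is an assumption, since the notes do not say how fractions should be handled.

```python
strata = {
    "Business Administration": 1500,
    "Management": 1200,
    "Finance": 850,
    "Entrepreneurship": 200,
    "Culinary Arts": 150,
}
n = 363                   # required number of respondents
N = sum(strata.values())  # population size, 3,900

# n_h = (N_h / N) * n for each stratum
allocation = {name: N_h * n / N for name, N_h in strata.items()}
rounded = {name: round(value) for name, value in allocation.items()}

for name in strata:
    print(f"{name:26s} {allocation[name]:8.3f} -> {rounded[name]}")
print("Total after rounding:", sum(rounded.values()))
```

Rounding each stratum separately gives 140, 112, 79, 19, and 14, which adds up to 364, one more than the target of 363. The researcher then has to decide which stratum to adjust, or simply accept the slightly larger sample.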
Cluster Random Sampling

In cluster random sampling, you randomly select clusters instead of individual samples in the first stage of
sampling. For example, a cluster might be a school, a team or a village. This technique is used when no list
of individual samples is available. Usually, the way this type of sampling is done is by starting at the higher
level clusters and then sampling at subsequent levels until individual samples are reached.
Multi-stage sampling
This method uses several stages or phases in getting random samples from the general population.
B. Non-Probability Sampling
This is a sampling method that does not involve random selection of samples. With non-probability
samples, the population may or may not be represented well, and it will often be difficult to know how well
the population has been represented. Some forms of non-probability sampling are:
1. Accidental or Haphazard or Convenience sampling

- one of the most common methods of sampling. The resulting samples are normally biased since
the researcher considers his/her convenience in the collection of the data.
2. Purposive sampling
- sampling is based on certain criteria laid down by the researcher. People who satisfy the criteria
are interviewed.
Subcategories of Purposive sampling:
a. Modal instance sampling

- When we do modal instance sampling, we are sampling the most frequent case. The problem
with modal instance sampling is identifying the “modal” case. Modal instance sampling is only
sensible for informal sampling contexts.
b. Expert sampling
- Involves the assembling of a sample of persons with known or demonstrable experience and
expertise in some area.
Two reasons we might do expert sampling:
1. It would be the best way to elicit the views of persons who have specific expertise.
2. To provide evidence for the validity of another sampling approach you’ve chosen.

c. Quota sampling
- Select items nonrandomly according to some fixed quota.
d. Snowball sampling
- Begin by identifying someone who meets the criteria for inclusion in your study. You then ask
them to recommend others who they may know who also meet the criteria.
Advantages of Sampling
1. Faster – a smaller group under study requires less time for data collection and processing
2. Cheaper – the cost of studying only a part of the population is much lower than in investigations
involving the whole population
3. Better quality of information may be collected – a smaller study group allows a more accurate
execution of technical procedures
4. More comprehensive data may be gathered.
Good Sampling Design

1. Representative – samples to be collected should reflect the characteristics as well as the variability of
the population
2. Feasible – sampling procedure should be simple enough to be implemented and can be carried out
and sustained according to plan
3. Adequate – the sample size should be sufficiently large to provide reliable generalization
4. Economic – sampling design should be efficient enough to produce the most information at the least cost
DATA COLLECTION
ASSIGNMENT No.1 (Due Date: ________________________ )
Title: DATA COLLECTION METHODS and PRESENTATIONS
- To be submitted by group with 3 – 5 members only
- Use short (8.5” x 11”) bond paper.
- Submit it with a cover page (in no particular format) but including the Title, Group members’ names,
Class Schedule, Name of the Instructor, Date Due and Date Submitted.
- Strictly handwritten (in no particular format) with 1” all sides as imaginary margin. Necessary images
may be drawn or pasted. Use black/blue pen only.
Methods of Data Collection

1. Interview method
2. Questionnaire method
3. Observation method
4. Registration method
5. Experimentation method
ASSIGNMENT No.1a
a. Define each of the methods of data collection
b. Identify/Specify the positive (pros) and negative (cons) aspects of each method.
c. Give a situational example where the method is appropriately applied.

DATA PRESENTATION
Methods of Data Presentation
A. Textual Presentation – This type of presentation incorporates data in a set of narrative sentences or
paragraphs. It emphasizes and compares important figures. However, it can be tedious to read,
especially if it consists of lengthy paragraphs and some figures or words are repeated many times.
2000 Census of Population
The population of the Philippines as of May 1, 2000 is 75.33 million. This figure is
higher by 6.71 million from the 1995 population.
The annual growth rate from 1995 to 2000 is 2.02 percent, which is lower by 0.30
percentage point from the 1995 figure of 2.32 percent and by 0.33 percentage points
from the 1990 figure of 2.35 percent
Source: NSO Monthly Bulletin of Statistics, August 2000
B. Tabular Presentation – This is a systematic way of categorizing related data in rows and columns. This
methodical arrangement, called a statistical table, presents data more concisely and in greater detail
than the textual or graphical form.
Parts of a statistical table:
Table number
Table title (the table number and title make up the table heading)

Stub Head      Column Caption   Column Caption
Row Caption    BODY
Row Caption
C. Graphical Method – This is a method of presenting quantitative data in pictorial form, producing a device
often referred to as a graph or chart. Graphs have visual appeal that can better attract and hold
the reader’s interest.
Qualities of a Good Graph

1. Accurate – It must be accurately constructed using correct and reliable data in order to produce
a correct interpretation. It should not be deceiving, imprecise, or confusing, so that it does not create a false
impression.
2. Clear – An effective chart is easy to read and understand. It should emphasize the information it wants
to present, supported with definite details. It should be useful in the interpretation of facts.
3. Simple – Its design should be uncomplicated and straightforward. It should contain only necessary and
relevant data or symbols to achieve efficient visual communication.
4. Attractive – Its appearance should be neat and with a scholarly or professional look. The overall design
elements should be harmonious, consistent in style and balanced.

Types of Graph
1. Line chart
2. Bar charts and Histograms
3. Pie Chart
4. Pictograph
5. Dot plot
6. Stem and leaf plot
7. Boxplot (box and whisker plot)
8. Scatterplot
ASSIGNMENT No.1b
a. Define/describe the characteristics and features of each type of graph.
b. State the procedure in the construction of each graph.
c. Give an example of each graph. Indicate all the parts, labels, legends and the interpretation of the
example shown.
MEASURES OF CENTRAL TENDENCY
Measures of Central Tendency are numerical values that tend to locate in some sense the middle of a set of
data. The term average is often associated with these measures. The most important measures of central
tendency are (1) the mean, (2) the median, and (3) the mode.
A. MEAN, 𝜇 or 𝑥̅
1. Arithmetic Mean – it is obtained by adding all the observations and dividing the sum by the number of
observations, thus it is called a computational average.
Population mean: If a set of data $x_1, x_2, \ldots, x_N$ represents a finite population of size $N$, then the population
mean $\mu$ is
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]
Sample Mean: If a set of data $x_1, x_2, \ldots, x_n$ represents a finite sample of size $n$, then the sample mean $\bar{x}$ is
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Example:
Suppose you are to choose ten people who enter the campus and whose ages are as follows:
15 25 18 20 25 18 18 20 25 15
What is the mean age of this sample?
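The sample mean for this example can be checked with a few lines of Python. The sketch computes it from the definition and also with the standard library's `statistics.mean`; the variable names are only illustrative.

```python
from statistics import mean

ages = [15, 25, 18, 20, 25, 18, 18, 20, 25, 15]

# Definition: sum of the observations divided by the number of observations
x_bar = sum(ages) / len(ages)
print(x_bar)       # 19.9
print(mean(ages))  # same value from the standard library
```

The ten ages add up to 199, so the mean age of the sample is 19.9 years.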
2. Weighted Mean – if the data set $x_1, x_2, \ldots, x_k$ have assigned weights $w_1, w_2, \ldots, w_k$, respectively, then the
weighted mean is computed as follows:
\[ \bar{x} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i} \]
Example:
A student was taking five subjects in college during the first semester. Find his average grade if his final
grades were as follows:
Subject   Math   Physics   English   Speech   Statistics
Grade     1.75   2.50      2.25      1.50     3.0
Units     3      5         3         2        4
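In this example the units act as the weights, so the average grade is a weighted mean. A minimal sketch using the five subjects listed in the table:

```python
grades = [1.75, 2.50, 2.25, 1.50, 3.0]  # Math, Physics, English, Speech, Statistics
units = [3, 5, 3, 2, 4]                 # weights

# Weighted mean: sum of (weight x grade) divided by the sum of the weights
weighted_mean = sum(w * x for w, x in zip(units, grades)) / sum(units)
print(round(weighted_mean, 2))  # 2.32
```

The weighted sum is 39.5 over 17 units, giving an average grade of about 2.32. The unweighted mean of the five grades is 2.20, so the weights do make a difference here.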
B. MEDIAN, 𝜇̃ or 𝑥̃
- a value that divides the distribution into two equal parts (after arranging the values/scores in ascending or
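descending order). As such, it is a positional average. The median is defined by
\[ \tilde{\mu} \text{ or } \tilde{x} = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even} \end{cases} \]
where $x_k$ is the $k$th value in the ordered data.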
Example:
Find the median:
(a) 12, 15, 18, 8, 9,10, 6
(b) 23, 18, 15, 12, 10, 9, 8, 6
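The definition of the median translates directly into code. The sketch below sorts the data first and then applies the odd and even cases; the only adjustment is that Python lists are indexed from 0, while the formula counts from 1.

```python
def median(values):
    """Median by the positional definition: middle value or mean of the two middle values."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]           # x_{(n+1)/2}, shifted to 0-based indexing
    return (x[n // 2 - 1] + x[n // 2]) / 2   # mean of x_{n/2} and x_{n/2 + 1}


print(median([12, 15, 18, 8, 9, 10, 6]))      # (a) 10
print(median([23, 18, 15, 12, 10, 9, 8, 6]))  # (b) 11.0
```

In (a) there are seven values, so the median is the 4th ordered value. In (b) there are eight, so it is the mean of the 4th and 5th ordered values, 10 and 12.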
C. MODE, 𝜇̂ or 𝑥̂
- the value in the distribution with the highest frequency. It locates the point where the observation values
occur with the greatest density. It can be used for quantitative as well as qualitative data.
Example:
Find the mode of the following data:
15 12 4 9 6 10 5 15 12 4 12 6 12
5 15 12 4 15 4 6 5
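Counting frequencies by hand is error-prone with this many values, so a short tally is a useful check. The sketch below uses `collections.Counter` and keeps every value that attains the highest frequency, in case there is more than one mode.

```python
from collections import Counter

data = [15, 12, 4, 9, 6, 10, 5, 15, 12, 4, 12, 6, 12,
        5, 15, 12, 4, 15, 4, 6, 5]

counts = Counter(data)                 # frequency of each distinct value
highest = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == highest)

print(dict(sorted(counts.items())))
print(modes, highest)  # [12] 5
```

The value 12 occurs five times, more than any other value, so the data set has a single mode of 12.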
Evidently, a distribution can have no mode, one mode, or more than one mode. Thus, the mode is not a very
reliable measure of central tendency. However, there are instances when no other measure can be used
except the mode. In determining the prevalent gender, civil status, or highest educational attainment, only the
mode can be used because no numerical values can be assigned to these variables.
D. MIDRANGE
- the mean of the largest and smallest values in the data set.
Remarks
Mean:
1. All the scores or measurements are considered in the computation of the mean.
2. Very high or very low scores or measurements affect the mean.
Median:
1. Only the middle scores or measurements are considered in the computation of the median.
2. Very high or very low scores do not affect the median.
Mode:
1. It is very easy to compute but is seldom used because it is very unstable.
2. It is most appropriate for nominal scale as a measure of popularity.
MEASURES OF LOCATION
There are several other measures of location that describe or locate the position of certain non-central pieces
of data relative to the entire set of data. These measures, often referred to as quantiles or fractiles are values
below which a specific fraction or percentage of the observations in a given set must fall.
PERCENTILES
Percentiles are values that divide a set of observations into 100 equal parts. These values, denoted by
𝑃1 , 𝑃2 , … , 𝑃99 , are such that 1% of the data falls below 𝑃1 , 2% falls below 𝑃2 , …, and 99% falls below 𝑃99 .
The $k$th percentile, $P_k$ ($k = 1, 2, 3, \ldots, 99$), can be determined using the following procedure:
1. Arrange the data in increasing order and compute the value of the index $i = \left(\frac{k}{100}\right) n$, where $n$ is the
number of observations.
2. If $i$ is an integer, $P_k = \frac{x_i + x_{i+1}}{2}$. If $i$ is not an integer, round $i$ up to the next integer and take $P_k = x_i$.
DECILES
Deciles are values that divide a set of observations into 10 equal parts. These values, denoted by 𝐷1 , 𝐷2 , … , 𝐷9 ,
are such that 10% of the data falls below 𝐷1 , 20% falls below 𝐷2 , …, and 90% falls below 𝐷9 .
The $k$th decile, $D_k$ ($k = 1, 2, \ldots, 9$), can be determined using the following procedure:
1. Arrange the data in increasing order and compute the value of the index $i = \left(\frac{k}{10}\right) n$, where $n$ is the
number of observations.
2. If $i$ is an integer, $D_k = \frac{x_i + x_{i+1}}{2}$. If $i$ is not an integer, round $i$ up to the next integer and take $D_k = x_i$.
QUARTILES
Quartiles are values that divide a set of observations into 4 equal parts. These values, denoted by 𝑄1 , 𝑄2 , and
𝑄3 , are such that 25% of the data falls below 𝑄1 , 50% falls below 𝑄2 and 75% falls below 𝑄3 .
The $k$th quartile, $Q_k$ ($k = 1, 2, 3$), can be determined using the following procedure:
1. Arrange the data in increasing order and compute the value of the index $i = \left(\frac{k}{4}\right) n$, where $n$ is the
number of observations.
2. If $i$ is an integer, $Q_k = \frac{x_i + x_{i+1}}{2}$. If $i$ is not an integer, round $i$ up to the next integer and take $Q_k = x_i$.
Examples
1. Find the quartiles, interquartile range, 3rd and 7th deciles, and 12th, 37th, 95th percentiles for the
following examination scores given in the stem-and-leaf plot.
Exam Scores
4 |568
5 |34569
6 |2356699
7 |01133455578
8 |122369
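The percentile, decile, and quartile procedures differ only in the divisor (100, 10, or 4), so one helper can serve all three. The sketch below follows the two-step rule given above; the function name and the `parts` argument are illustrative, and statistical software often uses slightly different interpolation rules, so other tools may give slightly different answers.

```python
def quantile(sorted_data, k, parts):
    """k-th quantile when the data are divided into `parts` equal parts
    (4 for quartiles, 10 for deciles, 100 for percentiles)."""
    n = len(sorted_data)
    i, remainder = divmod(k * n, parts)  # index i = (k / parts) * n, in integers
    if remainder == 0:
        # i is an integer: average x_i and x_{i+1} (1-based positions)
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # i is not an integer: round up, i.e. take position floor(i) + 1
    return sorted_data[i]


# Rebuild the 32 exam scores from the stem-and-leaf plot
stem_leaf = {4: "568", 5: "34569", 6: "2356699", 7: "01133455578", 8: "122369"}
scores = sorted(10 * stem + int(leaf)
                for stem, leaves in stem_leaf.items() for leaf in leaves)

q1, q2, q3 = (quantile(scores, k, 4) for k in (1, 2, 3))
print("Quartiles:", q1, q2, q3, "IQR:", q3 - q1)
print("D3, D7:", quantile(scores, 3, 10), quantile(scores, 7, 10))
print("P12, P37, P95:", [quantile(scores, k, 100) for k in (12, 37, 95)])
```

With n = 32 the quartile indices 8, 16, and 24 are integers, so each quartile is the mean of two neighboring scores: 60.5, 70.5, and 76, for an interquartile range of 15.5. The decile and percentile indices are not integers, so they are rounded up, giving D3 = 63, D7 = 75, P12 = 53, P37 = 66, and P95 = 86.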
2. As part of a quality-control study aimed at improving a production line, the weights (in ounces) of 50
bars of soap are measured. The results are as follows, sorted from smallest to largest. Find the
interquartile range, the 3rd and 9th deciles, and the 12th, 43rd, and 61st percentiles.
11.6 12.6 12.7 12.8 13.1 13.3 13.6 13.7 13.8 14.1
14.3 14.3 14.6 14.8 15.1 15.2 15.6 15.6 15.7 15.8
15.8 15.9 15.9 16.1 16.2 16.2 16.3 16.4 16.5 16.5
16.5 16.6 17.0 17.1 17.3 17.3 17.4 17.4 17.4 17.6
17.7 18.1 18.3 18.3 18.3 18.5 18.5 18.8 19.2 20.3
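The soap data can be handled with the same two-step rule. In the sketch below the helper is written compactly so that the block runs on its own; as before, the function name is hypothetical.

```python
def quantile(sorted_data, k, parts):
    """Two-step rule from the notes: average neighbors if the index is whole, else round up."""
    i, remainder = divmod(k * len(sorted_data), parts)
    if remainder == 0:
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    return sorted_data[i]


weights = [
    11.6, 12.6, 12.7, 12.8, 13.1, 13.3, 13.6, 13.7, 13.8, 14.1,
    14.3, 14.3, 14.6, 14.8, 15.1, 15.2, 15.6, 15.6, 15.7, 15.8,
    15.8, 15.9, 15.9, 16.1, 16.2, 16.2, 16.3, 16.4, 16.5, 16.5,
    16.5, 16.6, 17.0, 17.1, 17.3, 17.3, 17.4, 17.4, 17.4, 17.6,
    17.7, 18.1, 18.3, 18.3, 18.3, 18.5, 18.5, 18.8, 19.2, 20.3,
]  # already sorted, n = 50

iqr = quantile(weights, 3, 4) - quantile(weights, 1, 4)
print("IQR:", round(iqr, 2))
print("D3, D9:", round(quantile(weights, 3, 10), 2), round(quantile(weights, 9, 10), 2))
print("P12, P43, P61:", [round(quantile(weights, k, 100), 2) for k in (12, 43, 61)])
```

Here Q1 is the 13th value (14.6) and Q3 is the 38th value (17.4), so the interquartile range is about 2.8 ounces. The deciles D3 and D9 fall on whole indices (15 and 45) and are averages of neighboring values, 15.15 and 18.4. The percentiles are P12 = 13.45, P43 = 15.9, and P61 = 16.5.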
MEASURES OF VARIABILITY OR DISPERSION
The measures of central tendency do not by themselves give an adequate description of the data. It is also
very important for us to know how the observations spread out from the average. The measures of variation
indicate the extent to which individual items in a series are scattered about the average. They are used to determine
the extent of the scatter so that steps may be taken to control the existing variation.

Let us consider the following measurements for two samples of data:
Sample A   P24,500   20,700   22,900   26,000   24,100   23,800   22,500
Sample B   P24,900   17,500   21,600   29,700   25,300   23,800   21,700
Both samples have the same mean, but the measurements for sample A are clearly more
uniform, that is, the values are closer to each other than those in sample B.
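A quick computation confirms this comparison. The sketch below reports the mean, the range, and the sample standard deviation of each sample, using Python's `statistics` module; the range and the standard deviation are measures of spread defined later in this chapter.

```python
from statistics import mean, stdev

sample_a = [24500, 20700, 22900, 26000, 24100, 23800, 22500]
sample_b = [24900, 17500, 21600, 29700, 25300, 23800, 21700]

for name, data in (("A", sample_a), ("B", sample_b)):
    spread = max(data) - min(data)  # range
    print(f"Sample {name}: mean = {mean(data)}, range = {spread}, "
          f"sd = {stdev(data):.1f}")
```

Both means equal 23,500, but sample B has a range of 12,200 against 5,300 for sample A, and a standard deviation more than twice as large. The mean alone would hide this difference.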
General Classifications of Measures of Variation

- Measures of Absolute Dispersion
- Measures of Relative Dispersion
Measures of Absolute Dispersion

The measures of absolute dispersion are expressed in the units of the original observations. They cannot
be used to compare variations of two data sets when the averages of these data sets differ a lot in value or
when the observations differ in units of measurement. The most common statistics for measuring the variability
of a set of data are the range, variance, and the standard deviation.
RANGE
The range measures the distance between the largest and the smallest values and, as such, gives an idea of
the spread of the data set. However, the range does not use the concept of deviation. It is affected by outliers
and does not consider all values in the data set. Thus, it is not a very useful measure of variability.
𝑅𝑎𝑛𝑔𝑒 (𝑅) = 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒 – 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒
MEAN ABSOLUTE DEVIATION

The mean absolute deviation (MAD) utilizes deviations of the data values from the mean in its computation.
The MAD is the average of the absolute values of the deviations from the mean, computed using the formula
population: \[ MAD = \frac{\sum |x_i - \mu|}{N} \]
sample: \[ MAD = \frac{\sum |x_i - \bar{x}|}{n} \]
If a data set A has a greater MAD than data set B, then it is reasonable to believe that the values in data set A
are more spread out (variable) than the values in set B.
VARIANCE AND STANDARD DEVIATION

The variance and the standard deviation are the most common and useful measures of variability. These two
measures provide information about how the data vary about the mean. The variance 𝜎 2 or 𝑠 2 is a measure of
variation which considers the position of each observation relative to the mean of the set. It is an approximate
average of the squared deviations from the sample mean. The standard deviation 𝜎 or 𝑠 is the square root of
the variance.
Population Variance: Given the finite population $x_1, x_2, \ldots, x_N$, the population variance, which is exact, is
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{or} \quad \sigma^2 = \frac{N \sum x_i^2 - \left(\sum x_i\right)^2}{N^2} \]
Sample Variance: Given a random sample $x_1, x_2, \ldots, x_n$, the sample variance is
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{or} \quad s^2 = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)} \]

where: $\sigma$ = population standard deviation
$s$ = sample standard deviation
$x_i$ = $i$th observation
$\mu$ = population mean
$\bar{x}$ = sample mean
$N$ = population size
$n$ = sample size
If the data are clustered around the mean, then the variance and the standard deviation will be somewhat
small. If, however, the data are widely scattered about the mean, the variance and the standard deviation will
be somewhat large.
Notes:
1. We divide by the quantity 𝑛 − 1 in order to make the sample variance an unbiased estimator of the
population variance. (An estimator is unbiased if its average value is equal to the parameter it is
estimating.)
2. The unit of the standard deviation is the same as that of the raw data, so it is preferable to use the
standard deviation as a measure of variability instead of the variance.
3. The range is a quick but rough measure of variation since it considers only the highest value and the
lowest value of the observations.
Measures of Relative Dispersion

The measures of relative dispersion are unitless and are used when one wishes to compare the
dispersion of one distribution with another distribution.
COEFFICIENT OF VARIATION (CV)

The coefficient of variation standardizes the variation by dividing the standard deviation by the mean. Because of this
property, it can be used to compare variations for different variables with different units.
population: \[ CV = \left(\frac{\sigma}{\mu}\right) 100\% \]
sample: \[ CV = \left(\frac{s}{\bar{x}}\right) 100\% \]
This is only defined for a non-zero mean, and is most useful for variables that are always positive. It is also known
as unitized risk or the variation coefficient. The CV is unitless, so it can be used to compare the dispersion of two or more data
sets with the same or different units. The higher the CV, the more variable the data set is relative to its mean.
Example:
Several measurements of the diameter of a spherical instrument bearing made with one micrometer
had a mean of 2.49 mm and a standard deviation of 0.12 mm, and several measurements of the
unstretched length of a spring made with another micrometer had a mean of 0.75 in. with a standard
deviation of 0.02 in. Which of the two micrometers is relatively more precise?
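Because the two sets of measurements are in different units (millimeters and inches), their standard deviations cannot be compared directly, but their coefficients of variation can. A minimal sketch, with a hypothetical helper name:

```python
def coefficient_of_variation(sd, mean):
    """CV in percent: (standard deviation / mean) x 100."""
    return sd / mean * 100


cv_diameter = coefficient_of_variation(0.12, 2.49)  # first micrometer, in mm
cv_length = coefficient_of_variation(0.02, 0.75)    # second micrometer, in inches

print(f"Diameter: CV = {cv_diameter:.2f}%")
print(f"Length:   CV = {cv_length:.2f}%")
```

The first micrometer has a CV of about 4.82% and the second about 2.67%. The second set of measurements varies less relative to its mean, so the second micrometer is relatively more precise.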
Example:
Blood samples from 10 persons were sent to each of two laboratories for cholesterol determination.
Measurements were as follows (Kuzma and Bohnenblust, 2005):
Subject   1     2     3     4     5     6     7     8     9     10
Lab1      296   268   244   272   240   244   282   254   244   262
Lab2      318   287   260   279   245   249   294   271   262   285
Compare the data sets recorded by the two laboratories by considering the following descriptive
measures: mean, median, mode, first quartile, third quartile, range, standard deviation, variance,
mean absolute deviation, and coefficient of variation.
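All of the requested measures can be computed in one pass per laboratory. The sketch below treats each data set as a sample (so the variance divides by n − 1), uses the quartile rule from the notes, and relies on `statistics.multimode`, which needs Python 3.8 or later. The function names are hypothetical.

```python
from statistics import mean, median, multimode, stdev, variance


def quartile(data, k):
    """k-th quartile by the two-step index rule in the notes."""
    x = sorted(data)
    i, remainder = divmod(k * len(x), 4)
    return (x[i - 1] + x[i]) / 2 if remainder == 0 else x[i]


def describe(data):
    m, s = mean(data), stdev(data)
    return {
        "mean": m,
        "median": median(data),
        "mode": multimode(data),  # every value is listed when none repeats (no mode)
        "Q1": quartile(data, 1),
        "Q3": quartile(data, 3),
        "range": max(data) - min(data),
        "sd": s,
        "variance": variance(data),
        "MAD": sum(abs(x - m) for x in data) / len(data),
        "CV (%)": s / m * 100,
    }


lab1 = [296, 268, 244, 272, 240, 244, 282, 254, 244, 262]
lab2 = [318, 287, 260, 279, 245, 249, 294, 271, 262, 285]
summary1, summary2 = describe(lab1), describe(lab2)

for key in summary1:
    print(f"{key:10s} {summary1[key]!s:>22} {summary2[key]!s:>22}")
```

Lab 2 reports higher values on average (mean 275 against 260.6, median 275 against 258) and a wider range (73 against 56). Lab 1 has a mode of 244, while no value repeats in the Lab 2 data.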

CORRELATION and REGRESSION ANALYSIS
Correlation analysis is a technique used to describe the relationship or association between variables. If
we want to know the degree of relationship between two variables which are measured in at least an interval
scale, the Pearson Product Moment Correlation Coefficient (r) may be obtained.
Interpreting the Correlation Coefficient:

The value of the correlation coefficient indicates the degree as to how the variables are related with
each other. The correlation coefficient is a value between -1 and +1 inclusive where if the value of r is negative,
there is a negative relationship between the variables while if r is positive, the relationship is said to be positive.
The value of r is interpreted as follows:
Correlation Coefficient    Linear Relationship
0                          None
±0.01 to ±0.20             Very Weak
±0.21 to ±0.40             Weak
±0.41 to ±0.60             Moderate
±0.61 to ±0.80             Strong
±0.81 to ±0.99             Very Strong
±1                         Perfect Linear
Pearson Product Moment Correlation Coefficient ρ
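The estimator of the true population Pearson Product Moment Correlation Coefficient (ρ) is given by
\[ r = \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}{\sqrt{\left[\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right]\left[\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right]}} \]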
Properties of the Correlation Coefficient (r):

1. It is a unitless quantity.
2. It is always some number between -1 and +1, inclusive.
3. The magnitude of r is simply a measure of how closely the points cluster about a certain trend line
which is known as the regression line.
Example: Consider the scores obtained in Math (X) and Statistics (Y) by 10 students.
Student          1    2    3    4    5    6    7    8    9    10
Math Score (X)   5    8    10   12   12   14   15   16   18   20
Stat Score (Y)   2    7    8    9    10   12   14   10   16   12
Compute the correlation coefficient, r.
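The computational formula for r needs only five sums: Σx, Σy, Σxy, Σx², and Σy². The sketch below evaluates them for the ten students; the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson r from the computational (sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


math_scores = [5, 8, 10, 12, 12, 14, 15, 16, 18, 20]  # X
stat_scores = [2, 7, 8, 9, 10, 12, 14, 10, 16, 12]    # Y

r = pearson_r(math_scores, stat_scores)
print(round(r, 3))
```

Here Σx = 130, Σy = 100, Σxy = 1440, Σx² = 1878, and Σy² = 1138, so r = 140 / √(188 × 138) ≈ 0.87. By the interpretation table, this indicates a very strong positive linear relationship between the Math and Statistics scores.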

Correlation and regression analysis are closely related since both involve the relationship between two
variables and they both use paired observations obtained from the same (or matched) subjects. While
correlation is used to determine the degree as well as the direction of relationship between variables,
regression analysis deals with the use of the relationship for forecasting or predicting the value of a dependent
variable. The primary goal of regression analysis is to develop a statistical (regression) model that will
characterize the association of the variables and also to determine the statistical relationship, if any, between
variables. If the regression model is found to be adequate, it can then be used to estimate or forecast values of
the dependent variable.
Before proceeding with regression analysis, a scatter diagram of Y versus X can be done. It may give
an idea of the form of relationship between them.
* Simple Linear Regression
- A statistical tool that is used to

o Describe the dependence of variable Y on the independent variable X.
o Lend support to the hypothesis regarding the possible causation of changes in Y brought about
by changes in X.
o Predict Y in terms of X.
o Explain some of the variations of Y by X.
The Simple Linear Regression Model
In most real situation, the relation between the two variables is not perfect. For example, if a student
obtained a grade of 85%, it cannot be solely attributed to the students’ IQ. The student’s performance is also
affected by other factors aside from the student’s IQ level.
The simple linear regression model expresses the response (or dependent) variable (Y) as a function of
one predictor (or independent) variable (X), as
Yi = β0 + β1Xi + εi
where
Yi = observed value of the dependent variable for the ith observation
Xi = observed value of the independent variable for the ith observation
β0 = true regression intercept, or the value of the response variable when X is zero
β1 = true regression slope, or the change (increase if positive, decrease if negative) in the
response variable brought about by an increase of one unit in the independent variable
εi = random error component, which captures all other factors that affect the response variable
but are not included in the model
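To make the role of the error term concrete, here is a minimal Python sketch that generates responses from Yi = β0 + β1Xi + εi. The parameter values, the X values, and the choice of normally distributed errors are assumptions made only for illustration; the model itself only requires that εi be a random error.

```python
import random

def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Generate Y_i = beta0 + beta1 * X_i + eps_i.

    eps_i is drawn from a normal distribution with mean 0 and standard
    deviation sigma (an assumption made for this illustration).
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

xs = [1, 2, 3, 4, 5]                      # hypothetical X values
exact = simulate_responses(xs, beta0=2.0, beta1=0.5, sigma=0.0)  # no error: points on the line
noisy = simulate_responses(xs, beta0=2.0, beta1=0.5, sigma=1.0)  # with random error

for x, y_line, y_obs in zip(xs, exact, noisy):
    print(f"X={x}  line={y_line:.2f}  observed={y_obs:.2f}  error={y_obs - y_line:+.2f}")
```

With sigma set to 0 every point falls exactly on the line β0 + β1X; with a positive sigma the observed values scatter around it, which is the imperfect relationship the model is meant to describe.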
Estimation of the Parameters β0 and β1:

The values of the parameters in the regression equation or model are often unknown. The
common practice is to take sample observations and estimate the parameters from the sample data.
The estimate of the parameter β1 is the statistic b1, given by
$$b_1 = \frac{\sum x_i y_i - \dfrac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}$$
The estimate of the parameter β0, on the other hand, is given by the statistic b0 where
$$b_0 = \bar{y} - b_1 \bar{x}$$
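The two estimates can be computed directly from the sums that appear in the formulas: b1 is the ratio of Σxy − (Σx)(Σy)/n to Σx² − (Σx)²/n, and b0 is the mean of Y minus b1 times the mean of X. The Python sketch below is one way to do this; the function name and the small data set (chosen to lie exactly on Y = 2 + 3X) are hypothetical and serve only as a check.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1), the estimates of the intercept and slope."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    sxx = sum_x2 - sum_x ** 2 / n          # denominator of b1
    if sxx == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    b1 = (sum_xy - sum_x * sum_y / n) / sxx
    b0 = sum_y / n - b1 * (sum_x / n)      # b0 = y-bar - b1 * x-bar
    return b0, b1

# Hypothetical data lying exactly on the line Y = 2 + 3X
xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]
b0, b1 = fit_simple_linear_regression(xs, ys)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

Because the hypothetical points have no error, the estimates recover the intercept 2 and the slope 3. With real sample data the estimates will generally differ from the true β0 and β1.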

Example:
1. A corporation administers an aptitude test to all new sales representatives. Management is interested in the
extent to which this test is able to predict their eventual success. The accompanying table records average
weekly sales (in thousands of pesos) and aptitude test scores for a random sample of eight representatives.
Test Scores     55  60  85  75  80  85  65  60
Weekly Sales    10  12  28  24  18  16  15  12
a.) Estimate the linear regression of weekly sales on aptitude test scores.
b.) Interpret the estimated slope of the regression line.
2. The IQ test scores and freshman algebra grades of a sample of students were recorded and are given in
the following table. Find the regression equation and draw the regression line. What is the predicted algebra
grade of a student with an IQ score of 88?
Student          1    2    3    4    5    6    7    8    9   10
IQ Test Score   80   75   90  105   97   85   92  100   94   78
Algebra Grade   79   83   80   88   90   89   82   88   91   87
