
2. DATA ANALYSIS (Part 1)

HANIM AWAB, Department of Chemistry, Faculty of Science, UTM

Statistics in Analytical Chemistry

Statistical methods enable a chemist to base assessment on fewer data (generally >25) or on data accumulated from the analysis of similar samples.

The problem is examined with respect to the precision, accuracy and reliability required of the results.

Analysis of the results obtained is resolved into two stages:
- examination of the reliability of the results
- assessment of the meaning of the results

Errors in Chemical Analysis

TYPES OF ERROR
1. GROSS ERROR (eg. contaminated reagents, faulty instrument)
- Serious, obvious errors that give outlier readings
- Detectable with sufficient replicate measurements
- Experiments with gross errors must be repeated

2. RANDOM/INDETERMINATE ERROR (eg. inaccurate manipulation of procedure)
- Data scattered symmetrically about a mean value
- Deviations of measurements from the mean are shown using the Gaussian or normal error curve
- Cannot be eliminated but can be minimized
- Error can be assessed by statistical tests

The goal of data analysis: to minimize errors and calculate the size of the errors.

Errors cannot be eliminated, but can be minimized and approximated to an acceptable precision.

Some ways to overcome errors:
- Carry out replicate measurements
- Analyse accurately using known standards or standard reference materials (SRM)
- Perform statistical tests on data

3. SYSTEMATIC/DETERMINATE ERROR (operator error/instrument error/method error)
- All data too high/too low, or the error increases with the magnitude of the measurement
- Causes bias in technique (either +ve or -ve)
- Affects accuracy
- May be detected by:
  - blank determinations
  - analysis of standard samples
  - independent analyses by alternative/dissimilar methods
- Can be avoided/eliminated by correcting instrument, method and personal errors*

*Ways to minimize/eliminate systematic errors

Instrument errors:
- Careful recalibration and good maintenance of apparatus (eg glassware) and instruments (eg AAS, GC)

Method errors:
- Analysis of certified standard reference materials (SRM)
- Use 2 or more independent methods
- Analysis of blanks

Personal errors:
- Training of operator, care and self-discipline

Basic Statistical Concepts/Definitions

True value - value that remains unknown except when a standard sample is analyzed (value estimated from results of varying precision depending on the method used)
Accuracy - nearness of a measurement or result to the true value (expressed in terms of error)
Precision - variability of a measurement (standard deviations are precision indicators)
Spread - difference between the highest and lowest results in a set (spread is a measure of precision)
Mean - average of a replicate set of results
Median - middle value of a replicate set of results
Degrees of Freedom - number of independent results in a set (each time another quantity is derived from the set, the degrees of freedom are reduced by 1)
Range - difference between the highest and lowest value of the results
Deviation - difference, with respect to sign, between an individual result and the mean or median of the set
Standard Deviation (s or σ) - measure of the spread of individual results about the mean
Relative Standard Deviation (RSD) - also known as the coefficient of variation, often used in comparing precisions
Variance (V) - square of the standard deviation (σ² or s²)

Determinations/Formula

MEAN (AVERAGE)
- Average value
- Sum of measurements divided by number of measurements, N

$$ \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} $$

MEDIAN
- Middle value
- Arrange the data in ascending order; if the number of data is odd, record the middle value as the median
- Arrange the data in ascending order; if the number of data is even, average the two middle values

STANDARD DEVIATION
- Measure of spread about the mean
- Estimates the variability of individual measurements
(The standard deviation is better estimated by the pooling of results from more than one set)

Small sample (N < 30), s (N-1 = degrees of freedom):

$$ s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} $$

Large sample (N ≥ 30), aka population, σ (N = number of replicates):

$$ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} $$

xi = values of x
N = # of measurements
$\bar{x}$ is the sample mean
$\mu$ is the population mean

RELATIVE STANDARD DEVIATION (RSD)/COEFFICIENT OF VARIATION (CV)
Standard deviation divided by mean (depends on the units used)

VARIANCE
The square of the standard deviation
- Sample variance (N < 30): V = s²
- Population variance (large #): V = σ²

Example: Determine the content of Se in a batch of brown rice

Sample    Se (mg/g)    (xi - mean)²
1         0.07         4.9 x 10^-5
2         0.07         4.9 x 10^-5
3         0.08         9.0 x 10^-6
4         0.07         4.9 x 10^-5
5         0.07         4.9 x 10^-5
6         0.08         9.0 x 10^-6
7         0.08         9.0 x 10^-6
8         0.09         1.69 x 10^-4
9         0.08         9.0 x 10^-6

Mean = Σxi/N = 0.077
Σ(xi - mean)² = 4.01 x 10^-4

$$ \text{S.D.} = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{4.01 \times 10^{-4}}{8}} = 0.007 $$

Content of Se = 0.077 ± 0.007 mg/g
What does this result mean?
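The Se calculation can be checked with a few lines of Python. This is only a minimal sketch using the nine values from the example; the variable names are illustrative and not part of the original slides.

```python
import math
import statistics

# Se content (mg/g) of the nine brown-rice samples from the example
se = [0.07, 0.07, 0.08, 0.07, 0.07, 0.08, 0.08, 0.09, 0.08]

n = len(se)
mean = sum(se) / n
# Sample standard deviation: divide by N - 1 degrees of freedom
s = math.sqrt(sum((x - mean) ** 2 for x in se) / (n - 1))
rsd_percent = 100 * s / mean   # relative standard deviation in %
variance = s ** 2              # sample variance, V = s^2

print(f"mean = {mean:.3f} mg/g, s = {s:.3f} mg/g")
print(f"RSD = {rsd_percent:.1f} %, V = {variance:.2e}")
```

The hand-rolled formula agrees with `statistics.stdev`, which also divides by N - 1.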

STD. DEV. FOR POOLED DATA (s_pooled)
To achieve a value of s that is a good approximation to σ when N < 30, it is sometimes necessary to pool data from a number of sets of measurements.
Suppose there are t small sets of data, comprising N1, N2, ..., Nt measurements; the equation for the resultant sample standard deviation is:

$$ s_{pooled} = \sqrt{\frac{\sum_{i=1}^{N_1} (x_i - \bar{x}_1)^2 + \sum_{i=1}^{N_2} (x_i - \bar{x}_2)^2 + \sum_{i=1}^{N_3} (x_i - \bar{x}_3)^2 + \dots}{N_1 + N_2 + N_3 + \dots - t}} $$

(Note: 1 degree of freedom is lost for each set of data)

Example of Pooled Standard Deviation

Analysis of 6 bottles for sugar

Bottle    Sugar (%)    Obs    Deviations from mean             Σ(xi - mean)²    s_n
1         0.94         3      0.05, 0.10, 0.08                 0.0189           0.097
2         1.08         4      0.06, 0.05, 0.09, 0.06           0.0178           0.077
3         1.20         5      0.05, 0.12, 0.07, 0.00, 0.08     0.0282           0.084
4         0.67         4      0.05, 0.10, 0.06, 0.09           0.0242           0.090
5         0.83         3      0.07, 0.09, 0.10                 0.0230           0.107
6         0.76         4      0.06, 0.12, 0.04, 0.03           0.0205           0.083
Total                  23                                      0.1326

For bottle 1:

$$ \sum (x_i - \bar{x}_1)^2 = (0.05)^2 + (0.10)^2 + (0.08)^2 = 0.0189, \qquad s_1 = \sqrt{\frac{0.0189}{3 - 1}} = 0.097 $$

For all data, calculate s_n in the same way, then pool:

$$ s_{pooled} = \sqrt{\frac{0.1326}{23 - 6}} = 0.088\,\% $$
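As a quick check of the pooled value, the sums of squared deviations and observation counts from the sugar example can be pooled in Python. This is a sketch only; `pooled_std` is a hypothetical helper name, not something from the slides.

```python
import math

# Sum of squared deviations and number of observations per bottle,
# taken from the sugar example
sum_sq = [0.0189, 0.0178, 0.0282, 0.0242, 0.0230, 0.0205]
n_obs = [3, 4, 5, 4, 3, 4]

def pooled_std(sum_sq, n_obs):
    """Pooled s: total squared deviation over (total N - number of sets)."""
    dof = sum(n_obs) - len(n_obs)  # one degree of freedom lost per set
    return math.sqrt(sum(sum_sq) / dof)

s_pooled = pooled_std(sum_sq, n_obs)
print(f"s_pooled = {s_pooled:.3f} %")
```

Each set contributes N - 1 degrees of freedom, so the six bottles with 23 observations give 17 degrees of freedom in total.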

Solve this Problem
Given a set of diameters of four cells in units of µm: 120, 135, 160, 150
(a) Use the functions available in your calculator
(b) Use the Excel spreadsheet (in your own time; submit the data and a printout of the results)
Calculate the following:
- Mean
- Median
- Standard Deviation
- Relative Standard Deviation (RSD)
- Variance

ACCURACY versus PRECISION

PRECISION
- Reproducibility (repeatability) of repeated measurements, ie how similar are values obtained in exactly the same way?
- Useful for measuring deviation from the mean:

$$ d_i = x_i - \bar{x} $$

ACCURACY
- Nearness (proximity) to the true value, ie a measure of agreement between the experimental mean and the true value (which may not be known!)
- Measures of accuracy:
  - Absolute error: $E = x_i - \mu$ (where $\mu$ = true or accepted value)
  - Relative error: $E_R = \left| \frac{x_i - \mu}{\mu} \right| \times 100\%$
- Relative error is more useful in practice

[Figure: ACCURACY versus PRECISION (graphical illustration)]

Discussion Question 1
Four students analyzed the Fe content in a sample. Each student performed 5 replicates and the results are illustrated below. Comment on the accuracy and precision of each set of results (Hint: Student C obtained the best results)

[Figure: results of students A, B, C and D plotted on an axis from 9.80 to 10.20, with the true value at 10.00; means listed for A to D: 10.10, 9.90, 10.01, 10.01]

Discussion Question 2
- Comment on the accuracy and precision of the following results. Explain or show proof.
- Which set of data has to be thrown out (discarded)? Why?

Student       A        B        C        D        E
DATA VALUE    10.00    10.10    9.65     9.97     9.80
              10.00    10.08    9.75     9.98     9.89
              10.00    10.09    9.78     10.02    10.01
              10.00    10.07    10.07    10.03    10.13
              10.00    10.08    10.24    10.05    10.22
MEAN          10.00    10.10    9.90     10.01    10.01
STD DEV       0.00     0.01     0.25     0.03     0.17

USE OF STATISTICS IN DATA EVALUATION

CONFIDENCE LIMIT & CONFIDENCE INTERVAL
- Confidence Interval (CI) is the range of values surrounding the mean, x̄, within which the population mean, µ, is expected to lie with a certain degree of probability
- The boundaries of the range are called the Confidence Limits
- Confidence Level (CL) is the probability that the true mean lies within a certain interval (expressed as %)

Example: It is 99% probable that µ for a set of measurements is 7.25 mg ± 0.15 mg. Thus, the true mean should lie in the interval from 7.10 mg to 7.40 mg with 99% probability

Expressions of Confidence Interval

CI for a large no. of data (>30) with known population std deviation, σ:

$$ \mu = \bar{x} \pm \frac{z\sigma}{\sqrt{N}} $$

CI for a small no. of data (<30) without knowing σ (know s):

$$ \mu = \bar{x} \pm \frac{ts}{\sqrt{N}} $$

Values of z for determining confidence limits

Confidence level (%)    z
50                      0.67
68                      1.00
80                      1.29
90                      1.64
95                      1.96
96                      2.00
99                      2.58
99.7                    3.00
99.9                    3.29

N = number of measurements
z = values from the normal distribution curve (read from the z-table)
t = values from the t distribution, which depend on the degrees of freedom (N-1) (read from the t-table)
t is also known as Student's t, generally used in hypothesis tests

Values of t for various levels of probability

Degrees of Freedom (N-1)    80%     90%     95%     99%
1                           3.08    6.31    12.7    63.7
2                           1.89    2.92    4.30    9.92
3                           1.64    2.35    3.18    5.84
4                           1.53    2.13    2.78    4.60
5                           1.48    2.02    2.57    4.03
6                           1.44    1.94    2.45    3.71
7                           1.42    1.90    2.36    3.50
8                           1.40    1.86    2.31    3.36
9                           1.38    1.83    2.26    3.25
19                          1.33    1.73    2.10    2.88
59                          1.30    1.67    2.00    2.66
∞                           1.29    1.64    1.96    2.58

SAMPLE QUESTION (CONFIDENCE INTERVAL)
Calculate the confidence interval (CI) at the 95%, 90% and 99% confidence levels given the following data for the analysis of Ca in a rock sample: 14.35, 14.41, 14.40, 14.32, 14.37

Mean = 14.37, s = 0.037
From table: at confidence level 95% and N-1 = 4, t = 2.78

$$ \text{CI} = \mu = \bar{x} \pm \frac{ts}{\sqrt{N}} = 14.37 \pm \frac{2.78 \times 0.037}{\sqrt{5}} = 14.37 \pm 0.05 $$

Confidence interval is 14.37 ± 0.05, or 14.32 < µ < 14.42

Summary of results (calculate the rest by yourselves):

Confidence level    Confidence interval (CI)
90%                 µ = 14.37 ± 0.04
95%                 µ = 14.37 ± 0.05
99%                 µ = 14.37 ± 0.08

If the confidence level increases, the CI widens, and the probability of µ appearing in the interval also increases
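The three intervals for the Ca data can be reproduced with a short Python sketch. It assumes the t values for 4 degrees of freedom quoted in the t table (2.13, 2.78, 4.60) and, like the worked answer, rounds s to 0.037 before use; the names `T_4DOF` and `half_widths` are illustrative only.

```python
import math
import statistics

# Student's t for 4 degrees of freedom, copied from the t table
T_4DOF = {90: 2.13, 95: 2.78, 99: 4.60}

ca = [14.35, 14.41, 14.40, 14.32, 14.37]   # Ca results for the rock sample

n = len(ca)
mean = statistics.mean(ca)
s = round(statistics.stdev(ca), 3)          # 0.037, rounded as in the worked answer

# Half-width of the confidence interval: t * s / sqrt(N)
half_widths = {level: t * s / math.sqrt(n) for level, t in T_4DOF.items()}

for level, hw in sorted(half_widths.items()):
    print(f"{level}%: mu = {mean:.2f} +/- {hw:.3f}")
```

Rounding s before multiplying matters slightly at the 90% level, where the half-width is close to the boundary between 0.03 and 0.04.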

Sample Question (Confidence Limit when σ is known)

AAS analysis of Cu in aircraft engine oil gave a mean value of 8.53 mg Cu/mL. Pooled results of many analyses showed that s → σ = 0.32 mg Cu/mL. Calculate the confidence intervals (CI) at the 90% and 99% confidence levels based on (a) 1, (b) 4 and (c) 16 measurements.

$$ \text{Confidence limit (CL)} = \mu = \bar{x} \pm \frac{ts}{\sqrt{N}} $$

(For many analyses, t is read at ∞ degrees of freedom)

(a) N = 1
At 90%, CL = 8.53 ± (1.64)(0.32)/√1 = 8.53 ± 0.52 mg Cu/mL
At 99%, CL = 8.53 ± (2.58)(0.32)/√1 = 8.53 ± 0.83 mg Cu/mL

(b) N = 4
At 90%, CL = 8.53 ± (1.64)(0.32)/√4 = 8.53 ± 0.26 mg Cu/mL
At 99%, CL = 8.53 ± (2.58)(0.32)/√4 = 8.53 ± 0.41 mg Cu/mL

(c) N = 16
At 90%, CL = 8.53 ± (1.64)(0.32)/√16 = 8.53 ± 0.13 mg Cu/mL
At 99%, CL = 8.53 ± (2.58)(0.32)/√16 = 8.53 ± 0.21 mg Cu/mL
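The effect of N on the width of the interval is easy to see numerically. The sketch below recomputes the six half-widths for the Cu example, assuming the z values 1.64 and 2.58 from the z table; `z_half_width` is a hypothetical helper name.

```python
import math

Z = {90: 1.64, 99: 2.58}    # z values from the z table
mean, sigma = 8.53, 0.32    # mean and known sigma from the Cu example

def z_half_width(z, sigma, n):
    """Half-width z * sigma / sqrt(N) of the confidence interval."""
    return z * sigma / math.sqrt(n)

results = {
    (level, n): z_half_width(z, sigma, n)
    for level, z in Z.items()
    for n in (1, 4, 16)
}

for (level, n), hw in sorted(results.items()):
    print(f"{level}%, N = {n:2d}: {mean} +/- {hw:.2f}")
```

Because the half-width scales as 1/√N, quadrupling the number of measurements halves the interval.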

Sample Question (Confidence Limits when σ is not known)

Analysis of an insecticide gave the following values for % of Lindane: 7.47, 6.98, 7.27. Calculate the CL for the mean value at the 90% confidence level

xi (%)    xi²
7.47      55.8009
6.98      48.7204
7.27      52.8529

Σxi = 21.72, Σxi² = 157.3742

$$ \bar{x} = \frac{\sum x_i}{N} = \frac{21.72}{3} = 7.24 $$

$$ s = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N - 1}} = \sqrt{\frac{157.3742 - \frac{(21.72)^2}{3}}{2}} = 0.246 \approx 0.25\,\% $$

$$ \text{At 90\%, CL} = \bar{x} \pm \frac{ts}{\sqrt{N}} = 7.24 \pm \frac{(2.92)(0.25)}{\sqrt{3}} = 7.24 \pm 0.42\,\% $$

OTHER USAGE OF CONFIDENCE INTERVAL
- To determine # of replicates (N) needed for the mean to be within the confidence interval
- To determine systematic error

Example 1: Calculate the number of replicates needed to obtain a confidence interval of ±1.5 µg/mL at the 95% confidence level, given s = 2.4 µg/mL
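One possible way to approach Example 1 is sketched below. It rests on two assumptions that the slide does not state: that 1.5 is the desired half-width of the interval, and that s = 2.4 is a good enough estimate of σ for the z value at 95% (1.96) to be used. Rearranging half-width = zσ/√N gives N = (zσ/half-width)², rounded up to a whole number of replicates. The function name is hypothetical.

```python
import math

def replicates_needed(z, sigma, half_width):
    """Smallest N for which z * sigma / sqrt(N) <= half_width."""
    return math.ceil((z * sigma / half_width) ** 2)

# Assumed inputs: z = 1.96 (95%), sigma ~ s = 2.4, target half-width 1.5
n_required = replicates_needed(z=1.96, sigma=2.4, half_width=1.5)
print(f"replicates needed: {n_required}")
```

If s cannot be treated as σ, t would have to be used instead of z, and since t itself depends on N the calculation becomes iterative.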

Example 2: Ten measurements on a sample gave a mean of 0.461, with a std dev of 0.003. A solution gave a reading of 0.470. Show whether systematic error exists at the 95% confidence level

At the 95% confidence level, (N - 1) = 9, t = 2.26

$$ \mu = \bar{x} \pm \frac{ts}{\sqrt{N}} = 0.461 \pm \frac{2.26\,(0.003)}{\sqrt{10}} = 0.461 \pm 0.002 $$

This means 0.459 < µ < 0.463, ie 95% of the time the true value lies between 0.459 and 0.463
Therefore, the reading 0.470 is NOT in the range, and systematic error EXISTS

DISTRIBUTION OF ERRORS
The NORMAL or GAUSSIAN distribution (bell-shaped, symmetrical curve) gives limits within which the population mean (µ) is expected to lie with a given degree of probability (without any systematic error)

[Figure: normal error curves, dN/N plotted against deviation from the mean on an axis from -4σ to +4σ, with the areas for 50% (±0.67σ), 80% (±1.29σ) and 95% (±1.96σ) marked]

Based on the curve, the percentages of area under the curve between certain limits of z are as follows:
50% of the area lies between ±0.67σ
80% of the area lies between ±1.29σ
90% of the area lies between ±1.64σ
95% of the area lies between ±1.96σ
99% of the area lies between ±2.58σ

When we say that at a confidence level of 80% the confidence limits are ±1.29σ, we mean that:
- 80% of the time the true mean will lie within ±1.29σ of the measurements made
- or in other words, 20% of the time the true mean will NOT lie within ±1.29σ

SIGNIFICANCE TESTS
- Test whether the difference between two results is significant (or merely due to random variations)
- Used to decide whether the difference between the measured and known values can be explained by random errors

The NULL HYPOTHESIS, Ho
- If Ho is accepted: there is NO significant difference between the observed and known values (other than that due to random variation)
- If Ho is rejected: the difference is significant

The t Test (Student's t-test)

Has two uses:

(1) Comparison of the true value, µ, and the mean, x̄, to detect if the difference is significant
- Used to detect the existence of systematic error or bias
- Calculate t (generally for the 95% confidence level)
- If tcalculated < tcritical (ie tcalc < ttable), ACCEPT the null hypothesis, Ho: µ = x̄
- Accepting Ho means that there is NO significant difference (or no systematic error) at the 95% confidence level, but there is a 5% probability that there is a significant difference

(2) Comparison of the means (x̄1, x̄2) of two samples
- eg Compare the mean of a new method with that of a reference (or standard) method
- Accept the null hypothesis (Ho) if there is NO significant difference between the methods, ie the results are the same, or x̄1 - x̄2 = 0
- Calculate t; if tcalc < ttable, accept Ho to show that there is NO significant difference in results
- Use the pooled estimate of std dev, s² = {(n1-1)s1² + (n2-1)s2²} / (n1+n2-2), or
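The first use of the t test can be sketched in Python. The slides do not show the expression for t, so the usual one-sample form t = |x̄ - µ|√N / s is assumed here, and the pooled variance is taken from the formula quoted in (2). The data are the ten measurements from the earlier confidence-interval example (mean 0.461, s 0.003, reference reading 0.470, t = 2.26 for 9 degrees of freedom). Function names are illustrative.

```python
import math

def one_sample_t(mean, mu, s, n):
    """Assumed standard form: t = |mean - mu| * sqrt(N) / s."""
    return abs(mean - mu) * math.sqrt(n) / s

def pooled_variance(s1, n1, s2, n2):
    """Pooled s^2 = ((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2)."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

T_CRIT = 2.26   # 95% confidence level, 9 degrees of freedom (t table)

t_calc = one_sample_t(mean=0.461, mu=0.470, s=0.003, n=10)
reject_h0 = t_calc > T_CRIT

print(f"t_calc = {t_calc:.2f}, t_table = {T_CRIT}")
print("systematic error indicated" if reject_h0 else "no significant difference")
```

This reaches the same conclusion as the confidence-interval argument: the reference reading lies well outside what random error alone would explain.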

EXAMPLE 1: Detection of Systematic Error (Bias)

EXAMPLE 2: Determine if results differ significantly

The F Test

Compares std dev (ie random errors of 2 data sets)

- One-tailed test: test whether method A is more precise than method B (assumes A is always precise)
- Two-tailed test: test whether methods A and B differ in their precision (ie any method can be precise)
- F is the ratio of two sample variances:

$$ F = \frac{s_1^2}{s_2^2} $$

Ho: Population variances are equal (σ1²/σ2² = 1)
[F is always >1, thus the smaller variance, ie the more precise, is always the denominator]
If Fcalc < Ftable (accept Ho), there is NO significant difference in precision between the two methods

(F-TABLE: critical values of F)

Example Question: ONE-TAILED F TEST
A proposed method for COD of wastewater was compared with a standardized method. The results are given as follows:
Standardized method (8 determinations): mean = 72 mg/L, s = 3.31 mg/L
Proposed method (9 determinations): mean = 72 mg/L, s = 1.51 mg/L (set as denominator)
Is the proposed method significantly more precise than the standardized method?

F = (s_std)²/(s_prop)² = (3.31)²/(1.51)² = 4.8

Data values: 8 for the standardized method and 9 for the proposed method, thus from the F-table the degrees of freedom (N-1) are 7 (numerator) and 8 (denominator), Fcrit = 3.50
Since Fcalc > Ftable, reject Ho. Thus there is a significant difference between the methods, and the proposed method is significantly more precise
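The COD comparison can be checked with a small Python sketch. The critical value 3.50 is the one quoted from the F-table in the example; `f_ratio` is a hypothetical helper that always places the larger variance in the numerator so that F is at least 1.

```python
def f_ratio(s_a, s_b):
    """F = larger variance / smaller variance, so that F >= 1."""
    v_a, v_b = s_a ** 2, s_b ** 2
    return max(v_a, v_b) / min(v_a, v_b)

F_CRIT = 3.50   # from the F-table: 7 (numerator) and 8 (denominator) d.o.f.

f_calc = f_ratio(3.31, 1.51)   # standardized vs proposed method
significant = f_calc > F_CRIT

print(f"F_calc = {f_calc:.1f}, F_table = {F_CRIT}")
print("reject Ho" if significant else "accept Ho")
```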

Example: Determination of CO using a standard procedure gave an s value of 0.21 ppm. The method was modified twice, giving s1 of 0.15 and s2 of 0.12 (both with 9 degrees of freedom). Are the modified methods significantly more precise than the standard?

Ho: s1 = s_std
Ho: s2 = s_std

$$ F_1 = \frac{s_{std}^2}{s_1^2} = \frac{0.21^2}{0.15^2} = 1.96 $$

$$ F_2 = \frac{s_{std}^2}{s_2^2} = \frac{0.21^2}{0.12^2} = 3.06 $$

In standard methods the # of data is large, thus s → σ and the degrees of freedom become infinite
From the F-table, num = ∞, den = 9; Fcrit = 2.71
F1 < Ftable: accept Ho, but F2 > Ftable: reject Ho
Only the 2nd modified method is significantly more precise than the standard method

The Q TEST or DIXON'S TEST (Detection of gross errors)
The Q-test is used for detecting an outlier (suspected unreasonable data) which statistically does not belong to the set

Example: 10.05, 10.10, 10.15, 10.05, 10.45, 10.10

By inspection, 10.45 seems to be out of the normal range of the data (more easily observed when the numbers are arranged in decreasing or increasing order):
10.05, 10.05, 10.10, 10.10, 10.15, 10.45

Use this equation:

$$ Q = \frac{|\text{suspect value} - \text{nearest value}|}{\text{largest value} - \text{smallest value}} $$

Qcal is compared with Qtable and the null hypothesis, Ho, is checked

$$ Q_{expt} = \frac{10.45 - 10.15}{10.45 - 10.05} = 0.75 $$

From the Q-table (at 95% and N = 6), Q = 0.625
Qcal > Qtable, so the data value (10.45) can be rejected

Can/should this data value be eliminated? (The mean will change from the original value if it is removed!)

Q TABLE

No. of Observations    Confidence Level
                       90%      95%      99%
3                      0.941    0.970    0.994
4                      0.765    0.829    0.926
5                      0.642    0.710    0.821
6                      0.560    0.625    0.740
7                      0.507    0.568    0.680
8                      0.468    0.526    0.634
9                      0.437    0.493    0.599
10                     0.412    0.466    0.568

EXAMPLE QUESTION: Q-TEST
The following data were obtained for the determination of nitrite concentration (mg/L) in a sample of river water: 0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.411
Should the data value 0.380 be retained?

Q = |0.380 - 0.400| / |0.413 - 0.380| = 0.606

From the Q-table: sample size 7, Qtable ≈ 0.570 (0.568 at the 95% confidence level)
Qcalc > Qtable, thus the suspect outlier is rejected
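Both Q-test examples can be run through a short Python sketch. It assumes the usual form of Dixon's Q (gap between the suspect value and its nearest neighbour, divided by the range) and the 95% critical values from the Q table; `dixon_q` is a hypothetical helper that tests whichever end of the sorted data has the larger gap.

```python
def dixon_q(data):
    """Return (suspect value, Q) for the more isolated end of the data."""
    xs = sorted(data)
    spread = xs[-1] - xs[0]          # largest - smallest
    gap_low = xs[1] - xs[0]          # gap at the low end
    gap_high = xs[-1] - xs[-2]       # gap at the high end
    if gap_low > gap_high:
        return xs[0], gap_low / spread
    return xs[-1], gap_high / spread

# 95% critical values from the Q table, keyed by number of observations
Q_CRIT_95 = {6: 0.625, 7: 0.568}

nitrite = [0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.411]
suspect, q = dixon_q(nitrite)
print(f"suspect = {suspect}, Q = {q:.3f}, reject = {q > Q_CRIT_95[len(nitrite)]}")

first = [10.05, 10.10, 10.15, 10.05, 10.45, 10.10]
suspect2, q2 = dixon_q(first)
print(f"suspect = {suspect2}, Q = {q2:.2f}, reject = {q2 > Q_CRIT_95[len(first)]}")
```

Sorting first makes the nearest neighbour of each extreme value simply the adjacent element, which is why the slides recommend arranging the data in order.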