
G606 Geostatistics and Spatial Modeling

Lecture Notes

Spring Semester Idaho State University Department of Geosciences

2004 John Welhan Idaho Geological Survey

Contents

Syllabus
  General Information
  Course Content
Basic Statistics Review
  Self-Evaluation Quiz
  Univariate vs. Multivariate Data and Geostatistics
1. Introduction, Definitions, Software, Basic Statistics Review
  S.1.1. Generalized geostatistical analysis and modeling sequence
2. Lecture Notes: Parametric Tests, Nonparametric Tests
  Problem Set I. Review: Statistical summarization
  S.2.1. Summary of hypothesis testing
  S.2.2. Interpreting p-values returned by a statistical test
  S.2.3. The power of an hypothesis test
  S.2.4. Classification and measures of classification performance
3. Bivariate Data, Regionalized Variables, EDA
  Problem Set II. Exploratory Data Analysis
  S.3.1. Hypothesis testing of regression parameters
4. Autocorrelation and Spatial Continuity
  Problem Set III. Spatial Correlation 1 - Experimental Variography
5. Identifying Spatial Autocorrelation Structure
  Problem Set IV. Spatial Correlation 2 - Variogram Modeling
  S.5.1. Words of Wisdom on Experimental Variogram Analysis
  S.5.2. Variogram Analysis Software
6. Modeling Spatial Autocorrelation Structure
  Problem Set V. Introduction to ArcGIS: EDA and Variogram Modeling
  S.6.1. Words of Wisdom on Variogram Modeling
7. Kriging Concepts, Introduction to the Kriging System of Equations
  Problem Set VI. Kriging Estimation: the Neighborhood Search Process
8. Cross-autocorrelation and Cross-variograms
  Problem Set VII. Cross-variogram Analysis with ArcGIS
9. Practical Implementation of Kriging
  Problem Set VIII. Cross-Validation and Validation of Kriging Models

10. Assessing Kriging Uncertainty; Probability Mapping
  Problem Set IX. Introduction to Indicator Variables and Indicator Kriging
11. Introduction to Stochastic Simulation
  Problem Set X. Introduction to Sequential Gaussian Simulation
  S.11.1. Evaluating bivariate normality
12. Indicator Variables, Multiple Indicator Kriging and Simulation
  Problem Set XI. Introduction to Sequential Indicator Simulation
13. Other Simulation Techniques
14. Change of Support
Geostatistics on the Internet

G606 - Geostatistics and Spatial Modeling, 4 Credits, Spring Semester

Instructor: John Welhan, Idaho Geological Survey
Textbook: Isaaks and Srivastava (1989)
Meeting: Tue/Thur, 8:30-10:00 am, plus a 2-hour lab, schedule TBA

An introduction to the description, analysis, and modeling of geospatial data and of the resulting uncertainty in the models. Theory and its correct application will be integrated with the use of various software tools (including GIS) and appropriate examples to emphasize the cross-disciplinary applicability of geostatistical analysis and modeling. Prerequisites are an introductory applied statistics course, familiarity with the Windows operating environment, and basic spreadsheet-based data manipulation. Knowledge of ArcView or other GIS software is strongly encouraged. All software will be on a Windows/Intel platform. Readings will be assigned to stimulate in-class discussion, and all students will research, present, and lead a class discussion on published articles of their choosing that focus on concepts and applications of geostatistical analysis and modeling.

Office hours: one hour after each class and by appointment.

Grading:
  40% - weekly computer lab tutorial / problem sets
  40% - project, oral presentation, final report
        **Students must bring a suitable spatial data set to be analyzed as a term project**
  10% - analysis & presentation of published literature
  10% - discussion of readings, class participation

Assigned Readings taken from (* Oboler Library reserve; + instructor check-out):

General Statistics References:
* Till, Roger (1974) Statistical Methods for the Earth Scientist; Wiley, NY
  Davis, J.C. (2002) Statistics and Data Analysis in Geology (3rd ed.); Wiley & Sons, NY

Geostatistics (G), Kriging (K), Stochastic Simulation (S), and Software (W) References:
* Isaaks and Srivastava (1989) Introduction to Applied Geostatistics; Oxford Univ. Press (G,K)
+ Deutsch and Journel (1997) Geostatistical Software Library and User's Guide; Oxford (W)
+ Deutsch (2003) Reservoir Modeling; Oxford (G,S)
+ Goovaerts, P. (1997) Geostatistics for Natural Resources Evaluation; Oxford (G,K,S)
* Houlding, S.W. (1999) Practical Geostatistics; Springer (G,K,S -- although geology-oriented)
* Clark, I. (1979) Practical Geostatistics; Applied Science Publishers (G,K, with a mining focus)
* Yarus, J.M. and Chambers, R.L. (1994) Stochastic Modeling and Geostatistics; AAPG (G,K,S)
+ Pannatier, Y. (1996) VarioWin: Software for Spatial Data Analysis in 2D; Springer (W)

GIS-Remote Sensing Applications:
+ Heuvelink (1998) Error Propagation in Environmental Modelling; Taylor & Francis, Bristol, PA
+ Stein et al. (1999) Spatial Statistics for Remote Sensing; Kluwer Academic Publ., Boston, MA
+ Johnston, K. et al. (2001) Using ArcGIS Geostatistical Analyst; GIS by ESRI, Redlands, CA

Course Content

Week 1. Overview, Course Topics and Case Study: Overview of applications and techniques to be covered: spatial continuity (autocorrelation) analysis; statistical modeling vs. data modeling; estimation; simulation; prediction uncertainty.

Weeks 1-3. Statistics Review / Exploratory Data Analysis: Statistical summarization, analysis, and modeling; representing spatial data; continuous vs. categorical data; frequency distributions; correlation and conditional correlation; transformations (logarithmic, normal-score, indicator, rank-order); evaluating classification performance; software applications.

Weeks 4-6. Analysis and Quantification of Spatial Continuity: Statistical measures of autocorrelation; experimental variograms; autocorrelation function models; modeling anisotropy and nested functions; indicator variograms; cross-autocorrelation (co-spatial variability of multiple variables).

Weeks 7-9. Best Linear Unbiased Spatial Estimation: Techniques of spatial estimation; limitations of biased estimators; kriging as a 'best', linear, unbiased estimator; the kriging system of equations; use and misuse of the kriging variance; sensitivity of kriging to variogram and search-strategy decisions; cross-validation and validation as measures of kriging performance; cokriging.

Week 10. Spring Break

Weeks 11-14. Limitations of Kriging / Stochastic Simulation / Indicator Estimation: Simulation vs. kriging: differences, philosophy, applications; adaptation of the kriging system of equations to simulation; multiple indicator kriging and its advantages; basic Gaussian and indicator simulation algorithms; other simulation approaches.

Week 15. Change of Support: Data measurement scale; impacts on modeling and choice of scales; regularization (changing scales); numerical techniques for addressing scaling issues (block kriging, averaging techniques).

Week 16. Analysis of Uncertainty: Probability mapping; threshold exceedance; estimation vs. simulation approaches to uncertainty analysis; error propagation.

Statistics Review and Relevant Background

You should be familiar with the statistical terminology below, including how to calculate, use, and describe these most basic statistical concepts. If you have trouble with the Self-Evaluation Quiz (below), your stress level in this course will be high. If you are not already familiar with the non-underlined terminology, you will need to be by the end of the review phase of this course. The review material covered in Chapters 1-3, below, includes applied statistical material that you will need to be familiar with and on which you will be tested in Week 3. If you do poorly on Week 3's test, a one-on-one appointment will be arranged to evaluate your level of preparedness for this course.

Basic statistical terminology:
- Classical statistics - the analysis and description of variability ("modeling") in order to estimate the likelihood of a future outcome ("predicting"), based on that model. For example, fitting a normal probability distribution model to a histogram creates a statistical model of variability, from which a prediction can be made of the probability that a specified value will be exceeded in future sampling. Classical statistics is predicated on the assumption that all outcomes in a sample are independent of one another (i.e., measurements made at one location or time have no bearing on other measurements made nearby).
- Populations - parent, sample; frequency distributions, histogram, probability distribution function, cumulative distribution function, homogeneity (unimodal / multi-modal), ergodicity, homoscedasticity, stationarity
- Measures of central tendency - mode / median / mean / expected value
- Measures of dispersion - variance / standard deviation / inter-quartile range / skewness / kurtosis
- Parametric vs. non-parametric statistics - the Gaussian distribution / Gaussian probability tables
- Bivariate correlation - regression, covariance, statistical tests of regression
- Hypothesis testing - level of significance, p-values, tests of normality and population similarity, Student's t-test, the χ² statistic, the Kolmogorov-Smirnov (K-S), Mann-Whitney, and other non-parametric tests
- Data transformations - the standard normal deviate / the lognormal transformation

You are responsible for the prerequisite knowledge required in this course. Use the review chapters and suggested reference material or your own reference material to brush up on these statistical concepts and applications--they are vital to the subsequent development of the concepts and application of spatial statistics!

Self-Evaluation Quiz: Note: if you cannot answer questions (a) through (h), this course may present difficulties!

a) What is the name commonly used for the expected value of Gaussian probability distributions?
b) What measure of dispersion is used to characterize a bell-shaped probability distribution?
c) What values of skewness and kurtosis would a normal probability distribution have?
d) What is a log-normal probability distribution? Which of the following means most closely approximates the mode of a log-normal distribution: arithmetic mean, geometric mean, harmonic mean?
e) What percentage of outcomes fall between the first and third quartiles of a sample population?
f) What statistical test could be applied to determine if two sets of measurements were drawn from Gaussian populations having similar variances but different means?
g) Consider a histogram that has two peaks; does this indicate a statistically homogeneous population? Could such a distribution arise in a stationary population?
h) The following frequency tables describe two histograms of surface temperature measured in two small, adjacent lakes. Fill in the required information for the samples' descriptive statistics.
Lake A:
  T (°C):     12  13  14  15  16  17  18  19
  Frequency:   1   2   2   4   5   4   1   1
  n = 20    mode = ?    mean = ?    variance = ?

Lake B:
  T (°C):     12  13  14  15  16  17  18  19
  Frequency:   1   5   5   4   2   2   1   0
  n = 20    mode = ?    mean = ?    variance = ?

i) Apply a suitable (quantitative) statistical test to evaluate the hypothesis that both sets of temperatures are drawn from the same population.
j) In fact, both lakes overlie the same aquifer and are fed by the same ground water source (at a uniform 12 °C); they are the surface expression of the same ground water table. How is it possible that the temperature data portray such a different picture between the two samples?
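For checking your answers to (h) and (i), here is a minimal Python sketch of how frequency-table statistics and one possible two-sample test could be computed; the arrays come from the tables above, and the choice of the Mann-Whitney U test is only one reasonable option, not the required answer.

```python
import numpy as np
from scipy import stats

temps = np.arange(12, 20)                      # T, degrees C
freq_a = np.array([1, 2, 2, 4, 5, 4, 1, 1])    # Lake A frequencies (n = 20)
freq_b = np.array([1, 5, 5, 4, 2, 2, 1, 0])    # Lake B frequencies (n = 20)

def describe(temps, freq):
    """Mode, mean, and sample variance from a frequency table."""
    n = freq.sum()
    mean = (temps * freq).sum() / n
    var = ((temps - mean) ** 2 * freq).sum() / (n - 1)   # n - 1 degrees of freedom
    mode = temps[freq.argmax()]
    return mode, mean, var

print("Lake A:", describe(temps, freq_a))
print("Lake B:", describe(temps, freq_b))

# Expand the tables back to individual measurements and compare the two
# samples with a non-parametric test, since normality is not assumed here.
a = np.repeat(temps, freq_a)
b = np.repeat(temps, freq_b)
print(stats.mannwhitneyu(a, b, alternative="two-sided"))
```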

Univariate vs. Multivariate Data and Geostatistics

Univariate observations - a single dependent variable measured in a sample drawn from a population (e.g., gold assays in drill cores) and analyzed without regard to position.
Multivariate observations - a single dependent variable measured with respect to the independent variables of position; OR multiple dependent variables measured with or without accompanying position information (e.g., gold assays in rock samples referenced to x, y, z position; water levels in a well referenced to temporal position; gold, silver, and sulfur assays in each of multiple ore samples). E.g., all GIS (spatial) data are multivariate. An attribute measured at spatial coordinates at different times is a multivariate variable in which space and time are the independent variables. Thus, water levels, Zi, in multiple wells, i, measured at three different times, tn, can be viewed as three multivariate variables with respect to space, Zi[xi, yi]t1, Zi[xi, yi]t2, Zi[xi, yi]t3, or as a single multivariate variable Z[x, y, t]. Similarly, time-series measurements of water level at a single location are bivariate, with position represented by time as the independent variable.

Geospatial data are a type of multivariate data. There may be only one variable of interest (the dependent variable), but its values are related to position (the independent variables of location and/or time). Note that multivariate data can be analyzed as separate, univariate distributions by considering one variable at a time. However, the interrelationships between independent variables in space can often be exploited to learn more about the relevant physical process(es) and spatial statistical structure; such relationships can be analyzed using a variety of multivariate statistical methods (e.g., multiple linear regression, generalized analysis of variance, discriminant analysis, factor analysis, canonical correlation, etc.) which are not within the scope of this course.

Geostatistics - the analysis and modeling of spatial (or 'geospatial') data that are distributed in a coordinate system of space and/or time; a.k.a. 'Spatial Statistics'. Geostatistics is a class of statistical methods that considers the interrelationship and spatial dependencies among one or more correlated dependent variables and the independent variables of position (x, y, z, or time). This course does not deal with other multivariate methods such as multiple linear regression, discriminant analysis, or factor analysis. See Koch and Link (1971), Volume 2, Chapter 10, for an excellent summary of the concepts in (non-spatial) multivariate data analysis and good introductions to these multivariate methods of analysis.

1. Review-I - Introduction, Definitions, Software, Basic Statistics Review

1.1. What is Geostatistics? A class of statistical tools that are used to analyze and model spatial variability, to better understand physical processes, and to quantify prediction uncertainty in spatial models.
- geostatistical description and prediction does not model a physical or biological process and is of little use in extrapolative prediction (prediction beyond the spatial bounds of available measurements); it is best suited for interpolative prediction.
- geostatistics does not replace deterministic modeling techniques where the process model, its input variables, and the parameter values that describe spatial variability are sufficiently well constrained to construct an accurate quantitative prediction. However, even with a good process model, the random or non-trend component of spatial variability is best modeled with a geostatistical approach.
- geostatistics is most valuable in the analysis of attribute values which are distributed--and physically correlated--in space and/or time (as are most GIS data).

1.2. Five Steps to Geostatistical Modeling.
- the analysis of the statistical characteristics of spatial variability is called "structural analysis" or "variogram modeling." From this structural analysis, predictions of attribute values at unsampled locations can be made using two broad classes of modeling techniques known as "kriging" and "stochastic simulation."
- some steps, like (1), (2), and (3) below, are mandatory and may have to be repeated iteratively before appropriate decisions can be made; (4) and (5) are optional depending on the goals of the study and the types of predictions and uncertainty analyses required (see Section S.1.1.):
(1) EDA-I (Exploratory Data Analysis): classical descriptive statistics and analysis of stationarity and population homogeneity, population outliers, basic statistical hypothesis testing, and correlations among attributes;
(2) EDA-II: understanding the spatial nature of variability, spatial data density and sample availability, data clustering, spatial trends and discontinuities; identifying possible population regroupings and data and coordinate transformations; repeat (1) if necessary;
(3) Variography (EDA-III): spatial autocorrelation analysis, identification and treatment of spatial outliers, structural insights into process, population regroupings, data transformations; repeat (1) and/or (2) where necessary;
(4) Decision time: what can be done with the data? is spatial autocorrelation present? is it strongly expressed? can it be described confidently? will geostatistical prediction be of use? which prediction method will be used and for what reasons?
(5) Prediction: The BLUE method -- two maps based on linear, weighted averaging, one to predict large-scale spatial variability and the other, a statistical measure of prediction uncertainty; The Stochastic method -- an unlimited number of maps based on BLUE prediction or other methods to represent spatial variability at all scales and to assess prediction uncertainty better than BLUE methods can.

1.3. Kriging vs. Simulation:
- kriging is a statistical weighting procedure that produces a best linear unbiased estimate (B.L.U.E.) and the variance of the estimate
- stochastic simulation is a probabilistic depiction of local variability superimposed on the regional spatial variability described by kriging
- kriging is a smoothing interpolator that predicts large-scale spatial variability
  - honors global statistics and local data
  - produces a smoothed representation of large-scale spatial variability
  - quantifies statistical uncertainty at each location where an estimate is made
  - used for best estimation of expected values
- simulation is a probabilistic representation of local variability
  - honors global and local statistics and local data
  - reproduces local variability
  - best for representing local variability and local uncertainty
- both methods can incorporate and honor different types of hard data, but only simulation can incorporate soft information and other constraints (e.g.: physical geometry)

1.4. Statistics Review -- Some Basic Definitions:
- probability: the expectation of an outcome of a random event or measurement
- stochastic: synonymous with probabilistic
- dependent variable: a qualitative or quantitative measure of a physical attribute
- sampling item (or event): a single outcome or measurement
- (global) population: the set of all possible outcomes of a process that can be sampled
- sample (population): a set of measurements or outcomes drawn from a global population. A specimen is an event; a sample is an experiment in which multiple specimens are collected, from which we attempt to estimate the statistical characteristics of the underlying population
- regionalized variable: a variable whose value is dependent on spatial and/or temporal location

1.5. Additional Definitions:
- variance: the variation of a single variable about its mean
- covariance [as in bivariate regression]: the joint variation of two correlated variables about their common mean
- (auto)Covariance [as in geostatistics]: the variation of a single regionalized variable
- cross-Covariance [as in geostatistics]: the joint variation of two correlated regionalized variables
- correlation structure: a statistical description of the Covariance of a regionalized variable

1.6. Reference Sources - Basic Statistics and Review: Till (1974) and Davis (2002) are highly recommended for their clear presentation of concepts and methods, using numerous (albeit geological) examples. An excellent review of statistical techniques developed by the Biological Sciences Department at Manchester Metropolitan Univ. is available on the web. You may find it helpful in scraping off the rust: http://asio.jde.aca.mmu.ac.uk/teaching.htm (click on M.Sc. Research Methods)

- Introductory Geostatistics: Isaaks and Srivastava (1989) is an excellent text and reference source for the beginner or experienced user. It covers the fundamentals of geostatistics through variography, kriging, and cokriging. More advanced treatments are found in Cressie (1993) and in Journel and Huijbregts (1978).
- Advanced Geostatistics and Simulation: Although not on library reserve, Deutsch and Journel (1992, 1998) and Goovaerts (1997) are written as companion texts and are highly recommended as the most comprehensive coverage of software applications and theory for many facets of geostatistical analysis, estimation, and simulation (intermediate level). Another reference source not available on library reserve is the ESRI documentation for ArcGIS's Geostatistical Analyst extension; it is both a review of statistical concepts and a tutorial for ArcGIS's geostatistical capabilities.

1.7. Software Overview
- Software used in this course is a combination of ArcGIS, third-party proprietary, and third-party public domain software. ArcGIS now offers an excellent, user-friendly interface to a goodly subset of exploratory statistical analysis methods, variogram and cross-variogram analysis, various kriging methods, cokriging, and cross-validation analysis, as well as several deterministic (non-statistical) interpolation methods. However, its general statistical capabilities are limited, and its variogram and cross-variogram analysis capabilities (though robust and powerful) are overly automated (designed for users who "can't wait to krige" but can't be bothered with understanding what the automation is doing or not doing). Furthermore, only 2-dimensional data analysis is permitted, and stochastic methods of modeling and uncertainty analysis are not currently available.
- In order to introduce the various geostatistical concepts--particularly in exploratory data analysis (general statistical treatment of data) and variography--as well as to introduce stochastic simulation concepts and 3-dimensional geostatistical analysis, which ArcGIS does not currently support, other software packages will be introduced throughout this course.

1.8. Data Measurement Scales (Till, p.3-5):
- measurements can be made on four different typologic scales, differing in their information content from qualitative to highly quantitative
- nominal scale: categorization into arbitrary classes or categories; e.g.: colors, species, rock types
- ordinal scale: ranking into a sequence of classes whose sizes may be arbitrary or constant; e.g., the mineral hardness scale; high-, medium-, and low-valued categories
- interval scale: continuous, numerical measurements relative to an arbitrary zero (e.g.: the °C and °F temperature scales)
- ratio scale: continuous, numerical measurements relative to a true zero (e.g.: the kelvin (K) temperature scale; permeability; nutrient abundance)
- the ratio scale of measurement has the highest information content, but it is not always possible or desirable to make such measurements; the nominal scale is the basis on which the natural sciences are founded (classification and description) and is still extremely useful in quantitative analysis; i.e., you choose the data type that makes most sense for your problem--even if it means transforming the data from a higher to a lower typology; e.g.,

classification of ordinal and interval data into bins or classes, or transformation to a binary indicator variable (above and below some threshold)
- an extremely important consideration in collecting and analyzing data of any type is the 'support', or size of the 1-, 2-, or 3-dimensional space over which a measurement is made; e.g., the choice of pixel size for spectral images determines how ground-based measurements will be collected and analyzed; an ore assay on a chip sample would not provide as good an estimate of the average ore grade of a mine stope as a properly homogenized sample from a 10-ton truckload from the stope; topographic contours estimated from 10-meter pixel imagery will have a different accuracy level than 1000-meter pixel estimates

1.9. Sampling Design: (Till, p.51 - sampling is like religion: all are for it, no one practices it)
- the purpose of sampling is to make statistical inferences about the underlying population as efficiently and as accurately as possible
- systematic vs. random sampling is dictated by practicality (availability) and the nature of what is being sampled. For example, if a process is known to produce a patchy spatial variation of high and low values, then a regular grid won't be efficient or even able to capture the statistical properties of that patchiness
- steps in designing a sampling strategy:
  1. develop a conceptual framework: the purpose of the sampling campaign, expected populations to be encountered, types of variables (continuous, categorical), expected sources of variability
  2. form a working statistical model based on the conceptual framework (e.g., a normal population)
  3. choose a sampling plan based on the model that will achieve the stated purpose
  4. decide on the number of samples to collect to achieve the accepted levels of precision vs. accuracy (repeatability vs. truth)
- types of sampling: regular (gridded or geometric); random; biased (historical or available)
- sampling goals will differ depending on project goals: scales of variability, areas representative of the study region, number and spacing constraints of samples, sample support size, etc.
- Note: the goals of sampling are as varied as the earth is complex; different sampling campaigns within the same project can have very different objectives, and their impacts on spatial data analysis can be enormous! For example, obtaining the best global estimates of environmental contamination may require spatially random (unbiased) sampling, whereas the most efficient sampling plan for locating zones of high-grade ore and estimating their size may involve progressively biasing the collection of data towards high-valued areas. There is nothing to prevent a sampling campaign from evolving from a random or gridded (unbiased) scheme to a localized, "hot-spot" (biased) campaign
- sampling campaigns that target the high or low end of the data distribution produce spatially clustered data: that is, the mean, variance, and frequency distribution of the sample data are biased by the inclusion of a disproportionate number of samples with high or low values. Declustering of such a data set is unnecessary for kriging, which automatically accounts for data clustering, but is important for simulation, because the underlying population distribution must be accurately estimated to make predictions where neighborhood data are unavailable. The technique known as cell-declustering superimposes regular grids of various cell sizes

over the data domain and assigns a declustering weight to each sample that falls within a cell, proportional to the inverse of the number of samples within that cell. The global declustered mean for a given cell size is then defined as m_declus = (1/N) Σ_i d_i·x_i, where N is the number of samples and d_i is the cell-declustering weight of sample i. Optimum declustering weights are chosen for the cell size at which the global declustered mean is a minimum (or maximum, if low values are preferentially clustered).
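The following is a rough Python sketch of the cell-declustering idea described above; it is illustrative only (the course itself uses GSLIB's DECLUS program), and the coordinates and values are synthetic.

```python
import numpy as np

def cell_decluster(x, y, z, cell_size):
    """Cell-declustering weights: each sample's weight is inversely
    proportional to the number of samples falling in its grid cell,
    scaled so that the weights sum to the number of samples N."""
    ix = np.floor(x / cell_size).astype(int)
    iy = np.floor(y / cell_size).astype(int)
    cells = ix + 1j * iy                                   # encode (ix, iy) as one key
    _, inverse, counts = np.unique(cells, return_inverse=True, return_counts=True)
    d = 1.0 / counts[inverse]
    d *= len(z) / d.sum()                                  # normalize: sum(d) = N
    return d, np.mean(d * z)                               # weights and declustered mean

# Try a range of cell sizes and pick the one giving the minimum declustered
# mean (appropriate when high values are preferentially clustered).
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 100, 200), rng.uniform(0, 100, 200)
z = rng.lognormal(mean=1.0, sigma=0.5, size=200)
for size in (5, 10, 20, 40):
    _, m = cell_decluster(x, y, z, size)
    print(f"cell size {size:>3}: declustered mean = {m:.3f}")
```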

1.10. Probability: (Till, Chapters 2, 3)
- a sample of a population is the outcome of a statistical experiment (a sampling campaign)
- if multiple outcomes are possible for each experiment, there is an uncertainty associated with each outcome, i (e.g.: if the population consists of 50 red and black marbles in a bag and the sample size is 50, then only one possible outcome exists; there is no uncertainty in the sample or the inferences made from the sample)
- the probability of an outcome, p_i, is the proportion of outcome i among all possible outcomes: p_i = n_i/N, where N = Σ n_i
- for any population, there is an unknowable "true" probability of an outcome (a population statistic), which cannot be known, only estimated (a sample statistic)
- a process (or a sample drawn from the population created by the process) is considered random when the nth outcome is independent of the (n-1)th outcome (an outcome has no "memory" of preceding outcomes)
- a Markov process is a random process in which the probability of outcome i, p_i, depends on a preceding outcome j, p_j (e.g., prograding fluvial vs. deltaic vs. lacustrine depositional environments; the temporal vegetation succession following a forest fire)
- a Markov chain is a series of possible states, with the probability of transition from state i to state j defined for all possible transitions
- 1st-order Markov chain: j = i+1; nth-order: j = i+n
- a stationary process exists where the transition probabilities are constant in time (space)
- example: see the cyclothem example in Till, p.11-14, for transition probabilities
- the concept of Markov processes is related to the concept of spatial correlation, which is central to geostatistics; i.e., values of a spatial variable are not randomly distributed but depend on their spatial context
- a key concept in geostatistics is that of the random function model. This is the hypothetical underlying probabilistic model with which we describe the statistical properties of a variable and all possible spatial arrangements of its values; it is a purely theoretical concept, because only one possible spatial arrangement is ever available to us for sampling (i.e., the Earth as it exists at the time of sampling). The random function (or R.F.) model allows for an infinite number of possible spatial arrangements, or so-called realizations, of a regionalized variable. For example, the bag of 50 marbles constitutes a "population" in classical statistics, because it is the only one we have to sample; but if it is viewed as just one realization of a random function, then there are many possible arrangements and proportions of red and black marbles in the bag. In geostatistics we are not concerned with the other possible arrangements, only in inferring the statistical properties of the R.F. so we can use it to make better predictions of what's in the bag.
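A small illustrative sketch of a stationary first-order Markov chain; the facies states and transition probabilities below are invented for demonstration, and the chain is stationary because the transition matrix does not change along the sequence.

```python
import numpy as np

states = ["fluvial", "deltaic", "lacustrine"]          # hypothetical facies states
P = np.array([[0.6, 0.3, 0.1],                         # transition probabilities:
              [0.2, 0.5, 0.3],                         # row i gives the probability of
              [0.1, 0.3, 0.6]])                        # moving from state i to state j

rng = np.random.default_rng(42)
seq = [0]                                              # start in "fluvial"
for _ in range(20):
    seq.append(rng.choice(3, p=P[seq[-1]]))            # next state depends only on
                                                       # the current state
print(" -> ".join(states[i] for i in seq))
```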

1.11. Describing a Sample Distribution: (see Cheeney, p.13; Till, p.90-91)
- frequency is the number of specimens/events in a sample
- a histogram is a probability frequency distribution (pfd); note that it is the area (not the height) of the bars that is proportional to frequency, i.e., the bin sizes (class intervals) of the bars can vary. The appearance of a histogram can be greatly altered by the choice of class (bin) sizes!
- probability density function (pdf): where the number of data values is large, the number of histogram classes can be made arbitrarily large to more accurately reflect the shape of the underlying population histogram; in the limit, as histogram class size approaches zero, the histogram smooths out and approaches a continuous curve: this curve is the probability density function (pdf) and is the basis for predicting the probability that a variable lies within a specified range, or above or below a specified threshold. The most commonly encountered pdf's are the normal (gaussian) and lognormal forms
- the cumulative frequency distribution (cfd) is the cumulative analog of the histogram; the cumulative distribution function (cdf) is the cumulative analog of the pdf
- the height of the cdf at a threshold, z, is equivalent to the area, Φ(z), beneath its pdf to the left of z; tabulated values, Φ(z), for the standard normal distribution are available to define the probability (area under the pdf) that a variable's value is less than z (see Section 1.13, below)
- the q-quantile (or quantile) on a cdf is the height, q, of the cdf at a given value of the variable; a Q-Q plot therefore compares the shapes of two cdfs (or cfds)
- the p-quantile (or probability) on a cdf is the value of the variable that a proportion, p, of the data does not exceed; thus, a P-P plot compares cumulative probabilities of two cdfs (or cfds)
- common measures of central tendency: mode (highest-frequency class); median; mean (arithmetic, geometric, harmonic):
  m_arithmetic = (1/n) Σ x_i ;  m_geometric = [Π x_i]^(1/n) ;  m_harmonic = n / Σ(1/x_i)
- measures of dispersion (shape) about the central tendency: range (max, min, interquartile); variance / std. dev.; skewness (approximately equal to 3·[mean - median]/std. dev., ranging from -1 to +1)
- common statistical "moments": 1st moment (mean) = (1/n) Σ x_i ; 2nd (variance) = (1/n) Σ (x_i - μ)² ; 3rd (skewness) = (1/n) Σ [(x_i - μ)/σ]³ ; 4th (kurtosis) = f[Σ (x_i - μ)⁴]
- Note: the pdf and cdf exist only for continuous measurement data, but a discontinuous type of "cfd" can be plotted for any categorical variable, just as a histogram can (using constant or varying class intervals) for frequency analysis of categorical data. Also, the concept of a cdf is often used interchangeably with that of a "cfd", even where the cdf is unknown--so be aware of its specific usage in a particular context.
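A short Python sketch of the three means and the moment-based statistics defined above, computed on a made-up skewed sample and cross-checked against scipy; the data array z is an assumption, not course data.

```python
import numpy as np
from scipy import stats

z = np.random.default_rng(1).lognormal(0.0, 0.5, size=500)   # example (skewed) data

m_arith = z.mean()                          # (1/n) * sum(x_i)
m_geom  = np.exp(np.log(z).mean())          # [prod(x_i)]^(1/n)
m_harm  = len(z) / np.sum(1.0 / z)          # n / sum(1/x_i)

mu, sigma = z.mean(), z.std()
variance = np.mean((z - mu) ** 2)           # 2nd moment about the mean
skewness = np.mean(((z - mu) / sigma) ** 3) # 3rd standardized moment
kurtosis = np.mean(((z - mu) / sigma) ** 4) # 4th standardized moment

print(m_arith, m_geom, m_harm)
print(variance, skewness, kurtosis)
print(stats.skew(z), stats.kurtosis(z, fisher=False))   # scipy cross-check
```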

1.12. Degrees of Freedom: (Till, p.56-57)
- in general, the D.F. of a statistic is equal to the number of outcomes in the sample (n) less the number of statistical parameter estimates that are required to calculate the statistic
- for example, to calculate the sample mean, we only need the n outcomes (no other statistical information), so the D.F. for the mean is simply n. In calculating the variance, however, we require an estimate of the mean, so the variance's D.F. is n - 1.
- when calculating a population statistic, by definition we have all the information about the population and its statistics, so estimates are not required. That is, in calculating the variance of a global population, the exact mean is known, so no estimates are required. E.g.: in a bag of 50 red and black marbles the population N is 50; if the sample size n < 50, then calculating the variance requires an estimate of the mean; hence, the variance's D.F. = n - 1. However, if the sample size is 50 (all of the population is represented in the sample), then the mean no longer needs to be estimated (it is known with certainty), and so the variance's D.F. is N.
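A tiny numeric illustration of how this degrees-of-freedom argument shows up in practice: numpy's ddof argument is simply the number of parameters estimated from the data.

```python
import numpy as np

sample = np.array([12., 13., 14., 15., 16., 17., 18., 19.])

# Sample variance: the mean had to be estimated, so divide by n - 1 (ddof=1).
print(np.var(sample, ddof=1))

# Population variance: if these eight values ARE the whole population,
# the mean is known exactly and the divisor is N (ddof=0).
print(np.var(sample, ddof=0))
```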
1.13. Normal Probability Distribution Function (pdf): (see Till, p.33)

(1.1)   y = [1/(σ√(2π))] exp( -(x-μ)²/(2σ²) )    or    y = [1/(σ√(2π))] exp( -z²/2 )

where z = (x-μ)/σ is the standard normal deviate.
- the standard normal pdf has μ = 0, σ = 1, so that y = [1/√(2π)] exp( -z²/2 )   (see Till, p.37 for μ ± 3σ)

- the total area under the standard normal curve is 1.0 (i.e., the probability that -∞ ≤ x, z ≤ +∞ is 1)
- the area under the curve over [a ≤ x ≤ b] is equal to the cumulative distribution function (cdf) and is the probability that x lies between a and b:

(1.2)   F(a,b) = p[a ≤ x ≤ b] = ∫_a^b [1/(σ√(2π))] exp( -(x-μ)²/(2σ²) ) dx

- tabulated values of the standard normal curve list the inverse cdf area, 1-Φ(z)
- since Φ(z) is the probability (the area of the pdf from -∞ to z) that the standardized random variable is less than z, then 1-Φ(z) is the probability that it is greater than z:

(1.3)   1 - Φ(z) = p[z ≤ (x-μ)/σ ≤ ∞] = ∫_z^∞ [1/√(2π)] exp( -x²/2 ) dx

[sketch: the shaded tail area 1-Φ(z) lies to the right of z under the standard normal pdf]

- tabulated values of 1-Φ(z) (Till, p.34) range from 0.50 at z = 0 to very small values as z increases
- note that since the normal distribution is symmetrical about a mean of zero, only positive z values are tabulated; for negative z values, the tabulated values give Φ(-z) ≡ 1-Φ(|z|)
- Note: the tabulated values correspond to a single tail, α/2; e.g., at z = 1.96, 1-Φ(z) = 0.025. That is, the sum of the two tails is 0.05, so the probability that values fall within ±1.96 of the mean is 1 - 2[1-Φ(z)] = 95%


Example: if porosity (in percent) is normally distributed with mean μ = 20 and σ = 2, the probability that porosity lies between 17.5 and 23 can be found as follows:
- let the variable x represent porosity, so its standardized form is z:
  x1 = 17.5, so z1 = (x1-μ)/σ = (17.5-20)/2 = -1.25
  x2 = 23,   so z2 = (x2-μ)/σ = (23-20)/2 = +1.5
- the tabulated value for 1-Φ(1.5) = 0.0668; therefore, area A2 = 1 - [1-Φ(1.5)] = Φ(1.5) = 0.9332 = p[(x-μ)/σ ≤ 1.5] = probability that x < 23
- for z = -1.25, look up the value for 1-Φ(1.25) = 0.1056; and since the curve is symmetric about zero, A1 = 0.1056 = p[(x-μ)/σ ≤ -1.25]
- therefore the desired probability = A2 - A1 = 0.9332 - 0.1056 = 0.8276; i.e., the probability that the porosity lies between 17.5 and 23% is about 83%
- because of the definition of the standard normal deviate, the value of z is equivalent to the number of standard deviations, σ, away from the mean (see Till, Fig. 3.13)
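The same probability can be checked numerically; a minimal scipy sketch:

```python
from scipy.stats import norm

mu, sigma = 20.0, 2.0                     # porosity mean and std. deviation
p = norm.cdf(23, mu, sigma) - norm.cdf(17.5, mu, sigma)
print(round(p, 4))                        # about 0.8275 (the rounded table values give 0.8276)

# Equivalent calculation with standardized values and the standard normal cdf
z1, z2 = (17.5 - mu) / sigma, (23 - mu) / sigma
print(norm.cdf(z2) - norm.cdf(z1))
```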

1.14. The Normal-Score Transform:
- any non-normal distribution (including lognormal ones) can be transformed into a standard normal distribution by a numerical or graphical method known as the normal-score transform; the result can be treated as a perfect normal distribution and then back-transformed by the inverse procedure; see Isaaks and Srivastava, p.469-470, and Hohn, p.171-172, p.175-185; note that this procedure is always safer than log-transformation, because back-transformation of log-estimated values introduces a systematic bias in the estimates (Deutsch and Journel, p.93)
- normalization of the data distribution is not necessary for kriging per se but is essential for stochastic simulation of continuous variables based on gaussian-type simulation algorithms. It also circumvents the dual problems of choosing an appropriate (and usually arbitrary) transformation algorithm for an irregular frequency distribution and of appropriately interpreting the back-transform of the linear estimate (as is the case for log-transforms). It makes correlation structure in variogram analysis easier to identify and conceptualize; and, finally, it minimizes the chance of numerical instability in solving the kriging matrices
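A rough sketch of the graphical normal-score transform idea (this is not the GSLIB NSCORE implementation): rank the data, convert ranks to cumulative probabilities, and map them through the inverse standard normal cdf; the back-transform interpolates in the opposite direction.

```python
import numpy as np
from scipy.stats import norm, rankdata

def nscore(z):
    """Normal-score transform: map each value to the standard normal
    quantile of its (mid-point) cumulative probability."""
    p = (rankdata(z) - 0.5) / len(z)      # cumulative probability of each rank
    return norm.ppf(p)

def back_transform(y, z_original):
    """Back-transform normal scores by interpolating against the
    sorted original values (a simple graphical inverse)."""
    z_sorted = np.sort(z_original)
    p_sorted = (np.arange(1, len(z_original) + 1) - 0.5) / len(z_original)
    return np.interp(norm.cdf(y), p_sorted, z_sorted)

z = np.random.default_rng(3).lognormal(1.0, 0.8, 300)   # skewed example data
y = nscore(z)
print(y.mean(), y.std())                  # approximately 0 and 1
print(np.allclose(back_transform(y, z), z))
```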

1.15. Other Transforms
- transforms are useful for changing any distribution into one of a different form
- the simplest transform is a linear rescaling of values (e.g., multiplying fractional values by 100 to generate values in percent); the normal-score transform is an example of a non-linear transform
- the logarithmic transform (w = ln[z]) is a commonly applied transform used to convert a positively-skewed distribution into a more symmetric, normal-like distribution; although the simple back-transform (z' = exp[w]) of any value returns the original value, the simple back-transform of a statistical estimate derived from the normally-distributed transformed values will be biased, requiring a special transform (Deutsch and Journel, 1997, p.75-76)
- values of the back-transformed confidence limits, mean, and std. dev'n are derived from their log-transformed counterparts as:

(1.4)   x = exp(ln x)   for the non-log value of x
(1.5)   μ_x = exp( μ_ln x + ½ σ²_ln x )   for the estimate of the (non-log) mean of x
(1.6)   σ²_x = μ_x² [ exp(σ²_ln x) - 1 ]   for the estimate of the (non-log) variance of x (its square root gives the std. dev'n)
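An illustrative check (with made-up lognormal parameters) of why the naive exp() back-transform of a log mean is biased and how equation (1.5) corrects it:

```python
import numpy as np

rng = np.random.default_rng(7)
z = rng.lognormal(mean=2.0, sigma=0.7, size=100_000)   # positively skewed data

w = np.log(z)
naive = np.exp(w.mean())                               # biased: this is the geometric mean
corrected = np.exp(w.mean() + 0.5 * w.var())           # equation (1.5)

print("true arithmetic mean :", z.mean())
print("naive back-transform :", naive)                 # underestimates the mean
print("eq. (1.5) estimate   :", corrected)
```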

1.16. The Indicator Transform:
- a very useful non-linear transform that is widely used in geostatistics; indicator transformation is the basis for indicator kriging and indicator simulation. Ordinary kriging and Gaussian simulation make estimates on the basis of an assumed Gaussian form of the global cdf. Where such an assumption is inapplicable, indicator geostatistics are used
- an indicator transform value, I_zc, is assigned on the basis of a decision rule for a threshold value, zc:

    I_zc = K        if z < zc
    I_zc = NOT{K}   if z ≥ zc

- indicator transforms take on values of K = 0 or 1, with the sense determined by the application; where the transform is applied to estimate a probability of exceeding the zc threshold, K is set to 0 and NOT{K}, to 1 (for estimating probability of not exceeding zc, the sense of K would be the opposite). However, where indicator transforms are used to estimate a cdf (see the following section), K is always assigned a value of 1 and NOT{K}, a value of 0. - for example, to estimate the probability of exceeding zc = 10, the indicator transform of values less than 10 would be assigned a value of 0 and values of 10 or greater would have an indicator value of 1. Ordinary kriging of the transformed variable produces estimates of the probability that zc is exceeded. - in general, indicator transforms are useful in the following circumstances: 1) to estimate the distribution of values above or below one or more specified threshold(s), zc(i) (e.g., the proportions of measurements classified as high- and low-permeability); 2) to represent the cdf of a categorical variable (nominal or ordinal data), by assigning K = 1 to indicate Presence of a particular category, 0 to indicate Absence; 3) to model a continuous variable with a non-Gaussian cdf in a "non-parametric" form, by estimating the cdf with multiple indicator transforms
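A minimal sketch of the binary indicator transform for a single threshold, using the K = 1 (cdf-estimation) sense described above; the data values are invented.

```python
import numpy as np

def indicator(z, zc):
    """Indicator transform in the cdf-estimation sense: 1 where z < zc, else 0.
    The mean of the indicators estimates F(zc) = Prob[z < zc]."""
    return (np.asarray(z) < zc).astype(int)

z = np.array([4.2, 11.7, 8.3, 15.0, 9.9, 2.1, 10.0])
i10 = indicator(z, zc=10.0)
print(i10)                     # [1 0 1 0 1 1 0]
print(1.0 - i10.mean())        # estimated probability of equalling or exceeding zc = 10
```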


1.17. Estimating a Distribution with Indicator (non-parametric) Statistics:
- where the form of a cdf is unknown or not well defined by the available data, it can be estimated with indicator transformation around multiple thresholds, zc(i); in essence, ratio- or interval-scale data are converted into a few ordinal-scale classes around two or more thresholds known as indicator cutoffs; from the relative proportions of outcomes falling below each threshold, the form of the cdf can be estimated, as in the following example:
[Figure: approximating a population cdf with indicator probabilities. Panels: raw (noisy) data pfd; choice of indicator cutoffs C1, C2, C3 (pdf thresholds I1, I2, I3), where the area below each cutoff is p(I1), p(I2), p(I3); the approximate population cdf built from the indicator probabilities; and the reconstructed population pdf/cfd.]
- see the spreadsheet "IndicatorCDF.wk4" for an example of how the cdf of a continuous distribution can be approximated by indicator transforms.
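Since the spreadsheet itself is not reproduced here, the following short sketch shows the same idea in Python: estimate the cdf at a few hypothetical cutoffs from the indicator proportions; the data and cutoff values are invented.

```python
import numpy as np

rng = np.random.default_rng(11)
z = rng.lognormal(1.0, 0.6, size=500)           # noisy, skewed "raw data"

cutoffs = np.array([1.5, 2.5, 4.5])             # hypothetical cutoffs C1 < C2 < C3
F_hat = [(z < c).mean() for c in cutoffs]       # indicator proportions p(I1), p(I2), p(I3)

for c, p in zip(cutoffs, F_hat):
    print(f"F({c:4.1f}) ~ {p:.2f}")             # estimated cdf at each cutoff
# Linear interpolation between the (cutoff, F_hat) pairs gives a non-parametric
# approximation of the cdf between the lowest and highest cutoffs.
```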

1.18. Introduction to Problem Sets: The Walker Lake Data Set - refer to data files and details given in class

S.1.1. Generalized Sequence in Geostatistical Analysis and Modeling
- this sequence of steps is intended to provide a general idea of the process of data manipulation, analysis, and evaluation in a geostatistical modeling project. It is not a "laundry list" to be followed in strict sequence; rather, all or most of these steps need to be addressed in the order that is appropriate for a particular problem and data set
- software that can be used at each step is also indicated ("GsA" stands for ArcMap's Geostatistical Analyst)

EDA-I and EDA-II
1. Evaluate data distribution -- spreadsheet, StatMost, HISTPLT, PROBPLT, SCATPLT
   - global statistical character (normal/skewed, outliers, univariate/bivariate summary)
   - GsA has limited capabilities, although the interface is convenient
2. Create a plot of sample locations (a post plot) -- GsA
   - location errors, visual examination (clustering, hi/lo values, etc.)
3. Data and coordinate manipulations (if necessary) -- spreadsheet, ROTCOORD
   - remove trends in raw data; indicator transformation (if necessary); rotate or transform spatial coordinates to remove non-orthogonal spatial arrangements
4. Trend analysis -- GsA
   - identify possible trends in raw data
5. Compute declustering weights -- DECLUS
   - for estimating unbiased statistics; for input to simulations
6. Normal-score transformation -- spreadsheet, NSCORE, GsA
   - for improving variogram analysis and kriging results; necessary for Gaussian simulation
Decision points: spatial discontinuities, segregate and/or regroup populations, apply data transformations; repeat 1-6 where necessary

Variography (EDA-III):
7. Construct experimental variograms -- GsA, VarioWin
   - identify overall autocorrelation structure, optimal lag classes, anisotropy, data outliers
   - steps in variogram analysis (VarioWin):
     - construct the isotropic variogram (if any), choose optimal bin parameters
     - construct anisotropic variograms, identify principal orientations (if any)
     - look for internal consistency in alternative measures of autocorrelation
Decision points: treatment of spatial outliers, structural insights into process, population regroupings, data transformations; is spatial autocorrelation present? will geostatistical prediction be of use? which prediction method will be used and for what reasons? repeat 1-7 where necessary

Variogram Modeling:
8. Model the variograms' autocorrelation structures -- GsA, VarioWin
   - identify and fit appropriate variogram model(s), depending on whether kriging or simulation will be performed, whether raw or normal-score data are modeled, etc.

Prediction - Kriging
9. Perform kriging and/or indicator kriging -- GsA, KT3D, IK3D
10. Statistically evaluate the prediction process -- GsA, KT3D, IK3D
    - perform cross-validation to evaluate kriging errors
11. Spatially evaluate the prediction process -- (various software)
    - look for systematic spatial bias and trends in estimated values and kriging errors
12. If applicable, back-transform estimated variable(s) -- BACKTR
    - reproduce the original (detrended) variable's range and values

Prediction - Simulation
9. Perform sequential simulation -- SGSIM, SISIM
   - estimate local and global uncertainty and the spatial character of variability
10. Post-process multiple simulations -- POSTSIM
    - calculate expected values, variances, exceedance probabilities from n simulations
11. If necessary, back-transform estimated variable(s) -- BACKTR
    - reproduce the original (detrended) variable's range and values
12. If necessary, post-process indicator simulations -- POSTIK
    - perform corrections and other final adjustments to the simulation results

Post-Prediction Analysis
13. If applicable, evaluate prediction performance -- GsA
    - compare estimates with a subset of the data that was held back for validation purposes
    - compare prediction performance of alternative variogram/search parameter choices
14. If applicable, restore the trend surface -- (various software)
    - reproduce the original range of values and spatial trends in the data
15. Evaluate overall performance -- (various software)
    - compare original data values with estimates; check reproduction of global cdfs, global variograms, bivariate correlations, etc.
Decision points: has the prediction process produced satisfactory results? could performance be improved by regrouping, alternative variogram models, and/or alternative prediction search strategies? repeat 8-15 as necessary

2. Review-II - Parametric Tests, Nonparametric Tests

2.1. Statistical Tests
- one of the most fundamental applications of statistics is in deciding whether a result (be it a confidence interval, a regression slope, two or more sample distributions, or estimates of those distributions' statistical characteristics) is meaningful in a probabilistic sense. For example, is the population mean estimated from sample #1 statistically different from that estimated from sample #2? Do the distributions in sample #1 and sample #2 represent a normal population? If variable y is correlated with variable x according to a calculated regression slope, b, is the slope statistically meaningful?
- statistical tests are conducted by formulating a testable hypothesis. A null hypothesis, Ho, is formulated (e.g., 'the mean of population 1 = the mean of population 2'; or 'the regression slope = zero') and tested statistically. The alternative hypothesis, Ha, is the antithesis of Ho. The result of the statistical test is to accept the null hypothesis, to accept the alternative hypothesis, or to conclude that either could be true, depending on the choice of the statistical level of significance.

2.2. Parametric Tests - are applied to statistical data that are known or assumed to be derived from distributions of a particular form (e.g., a normal or Gaussian distribution, a lognormal distribution, etc.). Parametric tests are useful for comparing the means or variances of two populations, for determining whether two samples were drawn from the same or different populations, and for quantifying the confidence or probability that the mean of a sample falls within or outside of specified thresholds. Parametric tests are applicable only to interval or ratio-scale measurements.

2.3. Student's t-test: (Till, p.56-61)
- the t-test is a method to compare two sample means or to determine whether a sample was drawn from a normal population of a given mean, either for known or unknown variances, but always assuming a normally-distributed population! A similar comparative test of multiple sample distributions is performed by the Analysis of Variance (ANOVA) test (see Till, p.106).
- given a population with mean μ and variance σ², draw a sample of size n, whose sample mean is m and sample variance is s²
- draw all possible samples of size n and calculate t = (m - μ)/(s/√n) for each sample, where s/√n can be thought of as the standard deviation standardized by the sample size
- plot the pdf of the t-statistic, which defines the Student's t distribution (Till, p.57)
- the level of significance, α, is defined as the probability of obtaining a value further from the mean than the specified value |t|
- a one-tailed test is used if the test is formulated as to whether a statistic is greater than or less than a given threshold (i.e., when the probability refers to only one side (tail) of the pdf); in that case, look up the tabulated t-value for the α level of significance

- a two-tailed test is used for a confidence interval (C.I.) or a test of difference or, in general, if the probability being tested refers to the entire pdf regardless of whether the difference is more or less than a specified value; in that case, look up the tabulated t-value for the α/2 level of significance (e.g., if your test is two-tailed at the 95% level, look up the t-value for α = 0.025)
- Note: the t-statistic has (n-1) D.F., since we can compute m and s from the data but we need to estimate μ

Example: a C.I. estimate based on 16 measurements (D.F. = 15) gives m = 9.26, s = 2.66; from a table of Student's t critical values, for n-1 = 15 and α/2 = 0.025, t(α/2, 15) = 2.131. Therefore:

(2.1)   -2.131 ≤ (m - μ)/(s/√16) ≤ +2.131

or, rearranging:

(2.2)   m - 2.131·s/√16 ≤ μ ≤ m + 2.131·s/√16

substituting m and s: C.I.95 = 7.8 ≤ μ ≤ 10.7, i.e., it is 95% likely that μ is within this range

Example: To compare two sample means (Till, p.62; Cheeney, p.68), use a pooled t-statistic and a pooled variance (based on the combined sizes of both sample populations); then, if the computed t-statistic is less than the tabulated t(α/2) critical value, the hypothesis that the means come from the same population cannot be rejected at the α level of significance
- Types of t-tests:
  General t-test: samples are drawn from populations of equal variance; are the means different?
  Unpaired t-test: samples are drawn from populations with different variances; are the means different?
  Paired t-test: are two sets of outcomes drawn from the same population? (e.g., are the means of duplicate sets of analyses the same?)
- The power of an hypothesis test: (see Section S.2.3.; also Till, p.63-65)
  The level of significance, α, is the risk of rejecting Ho when it should be accepted (a Type I error), whereas the risk of accepting Ho when it is in fact false is β (a Type II error). The "power" of a test is 1-β; the higher the power, the better the test; but tightening the level of significance (reducing α) always reduces the power of a test. A common point of compromise is α = 0.05
- in general, non-parametric tests require fewer assumptions about a population but have a lower power, and hence a greater risk of Type II error, than a comparable parametric test
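A small scipy sketch reproducing the confidence-interval arithmetic of the worked example (m = 9.26, s = 2.66, n = 16) and showing a two-sample t-test on invented data:

```python
import numpy as np
from scipy import stats

m, s, n = 9.26, 2.66, 16
t_crit = stats.t.ppf(1 - 0.025, df=n - 1)        # 2.131 for alpha/2 = 0.025, 15 D.F.
half_width = t_crit * s / np.sqrt(n)
print(f"95% C.I. for the mean: {m - half_width:.1f} to {m + half_width:.1f}")   # 7.8 to 10.7

# Two-sample comparison of means (equal variances assumed: the "general" t-test)
rng = np.random.default_rng(5)
a = rng.normal(10.0, 2.5, 30)
b = rng.normal(11.5, 2.5, 30)
t_stat, p_value = stats.ttest_ind(a, b, equal_var=True)
print(t_stat, p_value)          # reject Ho of equal means when p_value < alpha
```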

2.4. The F-test for Comparison of Variances: (Till, p.66)
- sample two normal populations for all possible sample sizes n1, n2; define F = s1²/s2² for all possible combinations of n1, n2 (i.e., an infinite family of F distributions)
- D.F. = n1-1, n2-1; the F-test is a one-tailed test

2.5. The χ²-test: Goodness-of-Fit
- used to test how well a sample distribution fits a theoretical distribution (Till, p.69); this is a goodness-of-fit test. However, it is also useful for nominal data, in a non-parametric analysis

of occurrence frequency in contingency tables (preferably for n > 40: Till, p.121, 124)
- from repetitive sampling of a normal distribution, calculate z = (x-μ)/σ for each member of a sample of size n and define χ² = Σ_{i=1..n} z_i² for all samples of size n: this defines the χ² distribution
- transform the sample data to standard normal deviates, group the z-values into r classes (each with at least 5 measurements), and compute the test statistic:

(2.3)   X² = Σ_{i=1..r} (Observed Value [ith class] - Expected Value [ith class])² / Expected Value [ith class]  =  Σ (O-E)²/E

- D.F. is defined as r-k-1, where k is the number of parameters to be compared against the theoretical distribution (e.g., if m, s are to be compared with μ, σ of the normal distribution, then k = 2), and r is the number of classes used in the comparison
- Note: the test is sensitive to the number of classes used (Ho is more likely to be rejected if a large number of classes are used; if the data are grouped into too few classes, the power of the test decreases, i.e., Ho may be falsely accepted, just as binning a histogram into too few classes may result in too simplistic a visual comparison of relative frequencies). Furthermore, there should be at least five observations within each class.
- see Till, p.69-70 for an example of a χ² parametric test of goodness-of-fit to a theoretical distribution
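A brief scipy sketch of the goodness-of-fit recipe above: standardize the data, bin them into r classes, and compare observed with expected normal counts. The bin edges and sample are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.normal(50.0, 5.0, 200)

z = (x - x.mean()) / x.std(ddof=1)                  # standard normal deviates
edges = [-np.inf, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, np.inf]   # r = 8 classes
observed, _ = np.histogram(z, bins=edges)
expected = np.diff(stats.norm.cdf(edges)) * len(z)  # expected counts under normality

X2 = ((observed - expected) ** 2 / expected).sum()  # equation (2.3)
df = len(observed) - 2 - 1                          # r - k - 1, with k = 2 (m, s estimated)
p_value = stats.chi2.sf(X2, df)
print(X2, p_value)                                  # Ho (normality) rejected if p_value < alpha
```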

2.6. Non-Parametric Tests:
- used for testing distributions of unknown form, for populations of unequal or unknown variances, and for nominal or ordinal-scale measurements (Till, p.121-124)

2.7. The χ²-test: Contingency Table Analysis
- used for comparing sample populations where the distributions are non-normal or unknown
- useful for testing sample populations that are measured on a nominal scale (regardless of the type of distribution); the data are grouped and counted in a contingency table
- e.g., a number of high-, medium- and low-permeability measurements (lognormally distributed) are made in two different rock types: do the two rock types have the same permeability distributions? (for a numerical example, see Table 7.4 in Till, p.121)
                 Number of Measurements in:
  Categories     Gravelly Sediment   Sandy Sediment   Total Number
  High           a                   d                a+d
  Medium         b                   e                b+e
  Low            c                   f                c+f
  Totals         a+b+c               d+e+f            n

- note that the data have been transformed into a nominal scale (high, medium, low classes), so that the only statistical analysis possible is the counting of occurrences within/between classes
- set up the table, with i = 3 rows and j = 2 columns, and with marginal row and column totals, Ti and Tj


- the expected probability of finding a measurement of a given permeability class i in either type of sediment is p_i = T_i / n (the marginal row total divided by the grand total); that is, by chance alone, we would expect the probability of measuring a low-permeability value in either sediment type to be (c+f)/n
- the expected count of a given permeability class i in a given sediment type j follows from the joint probability: E_ij = (T_i · T_j)/n
- e.g., if the two sediment types were hydraulically identical, a purely random distribution of permeability values should exist between, as well as within, the sediment groupings; therefore, the number of measurements expected by chance in the high-permeability class in gravel alone is: E_1,1 = (a+d)·(a+b+c)/n
- define the null hypothesis, Ho: no significant difference exists in the distribution of permeability between gravelly and sandy sediments
- calculate the test statistic:

(2.4)  X² = Σ_{i=1}^{r} Σ_{j=1}^{k} (O_ij - E_ij)² / E_ij

where r, k = number of rows and columns in the contingency table
- D.F. is defined as (r-1)·(k-1); for the above contingency table, D.F. = 2
- from χ² tables, look up the significance level, α, and D.F. = 2, to find the critical value of χ²; e.g., for α = 0.05 and D.F. = 2, χ²(0.05, 2) = 5.99
- compare the test statistic, X², with the critical χ² value; if X² < critical value, the null hypothesis is accepted (i.e., at a level of significance of 0.05, the variations of a, b, c vs. d, e, f in the table would occur by chance 95 times out of 100); conversely, if X² > critical value, the null hypothesis is rejected (that is, in only 5 out of 100 times would the observed permeability differences between sediment types arise by chance)
- in terms of the p-value: when statistical analysis software is used to analyze a contingency table, the calculated p-value is the probability that the observed differences could arise if the null hypothesis were true
- Note: for use of the χ² test of RxC tables in StatMost, see p.274-276 in the user's guide
- use Statistics | Contingency Table | RxC Table to run StatMost's built-in chi-square contingency table analysis; note that the contingency table data are entered without marginal totals
- rather than requiring a confidence level to be specified, StatMost computes the effective p-value corresponding to the computed X² statistic
- e.g., for Till's p.121 example, the null hypothesis is rejected for significance levels (α) greater than about 0.02 (see "chi-sq RxC example.dmd" for this worked example)
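For comparison with the StatMost procedure, a minimal RxC contingency-table sketch assuming scipy is available; the permeability counts below are hypothetical and are not Till's Table 7.4 values:

    import numpy as np
    from scipy.stats import chi2_contingency

    # rows = permeability class (high, medium, low); columns = gravelly, sandy
    table = np.array([[22,  8],
                      [15, 14],
                      [ 9, 21]])

    X2, p, dof, expected = chi2_contingency(table)   # expected cells = Ti*Tj/n
    print(X2, p, dof)    # dof = (r-1)*(k-1) = 2; reject Ho if p < alpha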

2.8. The Kolmogorov-Smirnov Goodness-of-Fit Test for Normality: (Cheeney, p.62-64)
- this is a non-parametric test that is used to determine whether a subsequent parametric test is justified
- the sample distribution is normalized to an appropriate form (e.g., if the test is whether the distribution could be Gaussian, the sample data would be transformed to standard normal deviates, thus "normalizing" the distribution to a mean of 0.0 and a standard deviation of 1.0)
- the D test-statistic for the normalized sample cdf is defined as the maximum class-interval departure from the theoretical cdf (in this example, the standard normal distribution)
- the test D-statistic is compared to a critical D-statistic; e.g., for a one-sample test with n > 15, the critical D-statistic is A/√n (for α = 0.1, A = 1.07; α = 0.05, A = 1.22; α = 0.01, A = 1.51; α = 0.005, A = 1.63) (Cheeney, p.46; Rock, p.96)
- if the test D-statistic > the critical D-statistic, the sample population is not normally distributed
- Note: the K-S test is designed to test goodness-of-fit to any distribution. For this reason, StatMost does not automatically transform a sample distribution prior to applying the K-S normality test, so the sample data must first be normalized! The Lilliefors normality test is a modification of the K-S method that does not require the data to be normalized (note that StatMost's implementation uses an estimation method to determine the critical statistic, so the calculated Lilliefors p-value will differ slightly from that calculated in the K-S test; see Davis, p.109)
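A minimal sketch of the one-sample K-S normality test as described above (standardize first, then compare to the standard normal cdf), assuming Python with scipy is available; the skewed sample is synthetic:

    import numpy as np
    from scipy import stats

    x = np.random.default_rng(2).lognormal(mean=1.0, sigma=0.5, size=100)   # a skewed sample
    z = (x - x.mean()) / x.std(ddof=1)        # transform to standard normal deviates

    D, p = stats.kstest(z, 'norm')            # compare the sample cdf to the standard normal cdf
    print(D, p)                               # reject normality if D > Dcrit (equivalently, p < alpha)
    # caveat: because the mean and s.d. were estimated from the data, the Lilliefors
    # correction noted above is strictly more appropriate than the plain K-S critical value.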

2.9. A Non-Parametric Test of Similarity: The Kolmogorov-Smirnov Test (see Cheeney, p.45-46; Till, p.125-130 for details and an example)
- use the K-S test to compare any two (normal or non-normal) sample populations, or to compare an unknown sample distribution to a normal distribution
- the test compares the forms of two distributions; statistical dissimilarity between them is identified regardless of whether it arises from differences in the mean, variance, or skewness; the test does not identify why the distributions differ. Because of this, it loses statistical power at small values of n, and it is always less powerful (in a statistical sense) than a comparable parametric test (such as a t-test)
- define the null hypothesis as m1 = m2 and the alternate hypothesis as m1 ≠ m2 (two-tail) or m1 > m2 (one-tail), where mi = sample mean (Till, p.128)
- the critical D-statistic is determined from the sample sizes of both populations and whether a one- or two-tail test is made; e.g., for different sample sizes, the critical D value for a two-tailed test is A·√[(n1 + n2)/(n1·n2)] (where A = 1.36 for α = 0.05; A = 1.63 for α = 0.01)
- Note: StatMost only provides the option of a two-tailed test; it computes a test D-statistic but does not report the critical D-statistic. Instead, it calculates a p-value, the smallest significance level at which the null hypothesis would be rejected; thus, the computed probability gives more information about the test's level of significance (i.e., how close the test is to a borderline rejection) than a manual comparison of D-statistics; see the "Till p.127 K-S example.dmd" data file for worked examples of K-S test comparisons
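A minimal two-sample K-S sketch, assuming scipy is available; the samples and the hand-computed critical D are illustrative:

    import numpy as np
    from scipy import stats

    a = np.random.default_rng(3).normal(10.0, 2.0, 60)
    b = np.random.default_rng(4).normal(11.0, 2.5, 45)

    D, p = stats.ks_2samp(a, b)                                      # two-tailed by default
    Dcrit = 1.36 * np.sqrt((a.size + b.size) / (a.size * b.size))    # alpha = 0.05, A = 1.36
    print(D, p, Dcrit)    # reject Ho (same distribution) if D > Dcrit, i.e., p < 0.05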

2.10. Other Non-Parametric Tests (useful for ordinal data and classification)
- see Cheeney (Ch.6-7) and Till (Ch.7) for examples
- Mann-Whitney U-test; Kruskal-Wallis test; Spearman's rank correlation coefficient; Kendall's tau
- these tests are commonly available in StatMost, SPSS, and other standard statistical packages
- before applying any test, be familiar with the test procedure and assumptions and its nomenclature, and perform a dummy test on a known data distribution to ensure that you know how to interpret the results correctly

2.11. Problem Set I. Statistical Summarization: Exploratory Data Analysis-1 - introduction to the use of software, using the demo data set (Walker Lake): univariate summary statistics, box plots, frequency and cumulative frequency distributions (histograms, cfds), K-S tests of normality

Readings: Isaaks and Srivastava (1989) Introduction to Applied Geostatistics (on library reserve) p. 4-6, Ch. 6 - the Walker Lake data set Chapter 3 - bivariate correlation, q-q plots, conditional expectation p. 40-55 - spatial description, the proportional effect, skewed data, h-scatterplots

S.2.1. Summary of Hypothesis Testing: Rationale Behind Statistical Tests
- from the size, shape, and dispersion of the sample data, compare the sample distribution to a known distribution or to another sample distribution, to determine similarities or differences at a specified level of significance (the acceptable probability of being wrong).
Parametric vs. Non-Parametric Tests
- if a normal distribution is known or inferred, a parametric t-test is the most powerful; if the parent population distribution is not normal or is not known, a parametric test cannot be applied and non-parametric comparisons (e.g., K-S) must be used.
Types of Tests:
One-Sample Tests
  One-Tailed Tests:       1. is the mean < or > a specified value? (t-test)
  Two-Tailed Tests:       2. is the sample mean equal to a specified value? (t-test)
  Goodness-of-Fit Tests:  4. is the sample population Gaussian? (a two-tailed test) (K-S)
Two-Sample Tests
                          3. are two sample sets drawn from equivalent populations? (t-test or K-S)
                             t-tests: general, paired, unpaired

Define the Null Hypothesis:
  case 1. m < specified value or m > specified value
  case 2. m = specified value
  case 3. m(sample population 1) = m(sample population 2)
  case 4. sample's normalized z-score distribution = standard normal distribution

Specify the Confidence Level:
  For one-tailed tests, the confidence level of the test-statistic is the same as the specified confidence level of the test; e.g., a 0.95 confidence level is desired, so the confidence level applied to the test-statistic is also 0.95 (and the significance level, α, is 0.05).
  For two-tailed tests, the test-statistic refers equally to both tails of the outcome, so at a specified confidence level (e.g., 0.95) the probability of rejecting the null hypothesis is shared equally by both tails, and the significance level applied to each tail is half of α (e.g., 0.025).

Specify the Degrees of Freedom:
  case 1., 2. (t-test)  D.F. = n - 1
  case 3.     (t-test)  D.F. = n1 + n2 - 2

Note: for a Kolmogorov-Smirnov test, the critical D-statistics are defined as
  case 3. (n > 40)  Dcrit = 1.36·√[(n1 + n2)/(n1·n2)]  (for α = .05);  = 1.63·√[(n1 + n2)/(n1·n2)]  (α = .01)
  case 4. (n > 15)  Dcrit = 1.22/√n  (for α = .05);  = 1.51/√n  (α = .01)

S.2.2. Interpreting P-Values Returned by a Statistical Test


(modified from: http://www.graphpad.com/articles/interpret/principles/p_values.htm)

What is a p-value? Observing different sample means is not enough to conclude that they represent populations with different means. It is possible that the samples represent the same population and that the difference you observed is simply a coincidence. There is no way you can ever be sure whether the difference you observed reflects a true difference or is just a coincidence of random sampling. All you can do is calculate probabilities. Statistical calculations can answer this question: if the populations really have the same mean, what is the probability of observing a difference between sample means as large as (or larger than) the one observed in an experiment of this size? The answer to this question is called the p-value. The p-value is a probability, with a value ranging from zero to one. If the p-value is small, the difference between sample means is unlikely to be a coincidence.

The null hypothesis: In general, the null hypothesis states that there is no difference between what is being compared. The p-value is the probability that the observed difference could have arisen by chance if the null hypothesis were true. For example, consider this output from StatMost's t-test:
              Sample Size     Mean      Variance
Sample A           9         1.7778      1.1944
Sample B           9         4.7778      0.9444
Difference = -3.0000         Ratio = 1.2647

                      General         UnPaired
t-Value               -6.1539         -6.1539
Probability         1.387E-005      1.387E-005
DF                       16              16
Critical t-Value       2.1199          2.1199

If the mean of sample A actually were the same as sample B's, then the probability is less than 0.002% that two sample means would differ this much by chance; that is, fewer than 2 times in 100,000 would random sampling of a single population produce a difference this large, so the null hypothesis can safely be rejected at the 99% confidence level. On the other hand, if the observed difference in sample means were quite small, the calculated p-value would be large, indicating that a difference at least this large could easily arise by chance.

Common misinterpretation of the p-value: If a p-value is reported as 0.03, it would be incorrect to say that there is a 97% probability that the observed difference reflects the actual difference between populations and a 3% probability that it does not. Rather, it means that there is a 3% chance of obtaining a difference as large as the observed difference if the two samples were drawn from one population. That is, 97% of the time, random samplings of the same population would produce a difference smaller than the observed difference, and only 3% of the time could it be as large or larger.

S.2.3. The Power of an Hypothesis Test (or Minimizing the Risk of Falsely Accepting Ho):
(see Till, p. 63-66; also http://asio.jde.aca.mmu.ac.uk/rd/power.htm)

Consider two samples with very different sample means for which a t-test rejects the null hypothesis at the 95% confidence level; we conclude that the samples were not drawn from the same population. But what if the samples actually came from a single population and the difference in sample means was the result of a chance draw of extreme values? We would be wrong to reject Ho. The risk of such an error at the 95% confidence level is 5%, or 0.05. This is the test's level of significance, α; it is the risk of committing a Type-I error--the probability of rejecting Ho when it is in fact true.

Consider the example of temperature measurements taken from two lakes (such as the data set on the first-day quiz). A t-test returns a p-value of 0.068, indicating there is a 6.8% chance that the difference in sample means could arise purely by chance in the process of sampling a single underlying population; thus, 93.2% of the time we would expect to be correct in accepting the null hypothesis. The risk of making a Type-I error is 6.8%.

A second type of error can also occur. This is known as a Type-II error, β, the risk of accepting Ho when it is actually false. The power of an hypothesis test is defined as 1-β. Determining the power of an hypothesis test is an involved process; it is typically most important in the analysis of trends. See http://www.mp1-pwrc.usgs.gov/powcase/steps.html for a discussion of the procedure and an example of power analysis software.

Perhaps the concept can be more easily envisioned in light of the sample t-statistic and the critical t-statistic (StatMost prints out tcritical and tsample when it reports the p-value for the t-test). In our lake example, at the 95% confidence level and 38 D.F., tcritical is 2.02; the calculated tsample is 1.88. Since |tsample| < |tcritical|, we accept Ho. If we repeat the test at the 90% confidence level, the critical t-statistic is 1.68; since |tsample| > |tcritical|, we would be forced to reject Ho at this confidence level. In this example, we know that Ho is true. So, in rejecting it at the 90% confidence level, we'd be making a Type-I error. On the other hand, by rejecting Ho we have completely eliminated the possibility of making a Type-II error! That is, at α = 0.1, the value of β has dropped from some finite value (0 < β < 1) when α was specified as 0.05, to β = 0; the power of the test has become 1-β, or 100%. If we always rejected Ho, we'd always be assured of the highest possible power for the test, but it would be counter-productive (we'd be 100% sure of avoiding Type-II errors but unable ever to determine whether Ho were true).
(to determine a test's power, see Till, p.64-65, and http://www.mp1-pwrc.usgs.gov/powcase/steps.html)

It should be apparent that the goal in choosing an appropriate α level is minimizing the risk of Type-I errors (by setting α as low as possible) while maximizing the power of the test. Because the power of the test decreases as α decreases, a common point of compromise is to choose α = 0.05. Algorithms that report p-values provide more information to assist in choosing an optimum α. For example, comparing only deep-water temperature measurements from both lakes (those least affected by solar heating), the difference in sample means is only 0.6°, and the t-test returns a p-value of 0.22; i.e., we could reject Ho only if we specified a confidence level of 78% (α = 0.22), but at that level the power of the test would be the highest. If lowering the risk of a Type-II error were important, we might be willing to test Ho at, say, the 85% or 90% confidence level instead, accepting a larger α in exchange for greater power.
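For readers who want to compute power directly rather than rely on the web tools cited above, the following sketch (assuming Python with scipy is available) evaluates the power of a two-sided, two-sample t-test from the noncentral t distribution; the effect size and sample sizes are illustrative, not the lake data:

    import numpy as np
    from scipy import stats

    def t_test_power(effect_size, n1, n2, alpha=0.05):
        """Power of a two-sided, two-sample t-test for a given standardized effect size."""
        df = n1 + n2 - 2
        ncp = effect_size * np.sqrt(n1 * n2 / (n1 + n2))    # noncentrality parameter
        tcrit = stats.t.ppf(1 - alpha / 2, df)               # critical t at alpha/2
        # probability that |t| exceeds tcrit when Ho is actually false:
        return (1 - stats.nct.cdf(tcrit, df, ncp)) + stats.nct.cdf(-tcrit, df, ncp)

    print(t_test_power(0.6, 20, 20, alpha=0.05))   # roughly 0.45: modest power at small n
    print(t_test_power(0.6, 20, 20, alpha=0.10))   # relaxing alpha raises the power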

S.2.4. Classification and Measures of Classification Performance:

Classification into categorical outcomes can be essentially error-free (e.g., a gold analysis of a rock sample determines if that particular sample is ore grade or not in an economic sense). More often, however, some degree of classification error is involved because classifications are based on a decision threshold (e.g., classifying land cover based on a remotely sensed vegetation index carries uncertainty; using vegetation type to predict the presence of an animal known to inhabit a particular vegetation type carries uncertainty; can an ore deposit be termed economic or not, based on the average gold content of 300 samples?). The simplest type of classification is a dichotomous (or binary) state: presence/absence, yes/no, high/low, uranium-bearing/not uranium-bearing, etc. Whether binary or multiple classification categories are used, we need a quantitative measure of how well a classification scheme performs: that is, we need a measure of relative classification performance. For example, is land use type predicted significantly better with a combination of derived remotely sensed measures than by a single, accurate vegetation index measure? Is one classification scheme more accurate / less inaccurate than another? To minimize the error of classifying a rock as uranium-bearing when it is not, is a decision threshold of 20 cps better than 10 cps?

Methods of evaluating classification performance can be grouped into two types: threshold-dependent and threshold-independent. A threshold-dependent statistical measure summarizes the proportions of correct and incorrect classifications and presents the result as a summary statistic. That is, once a threshold has been decided and a classification produced (e.g., uranium-bearing if >20 cps, not uranium-bearing if <20 cps), the classification results can be grouped into a contingency table to summarize classification performance. The simplest summary table is for a binary outcome, known as a 2x2 confusion or error matrix:

                      Actually Present   Actually Absent
Predicted Present            a                  b
Predicted Absent             c                  d

where a, b, c, d represent frequencies of occurrence of possible outcomes from N (=a + b + c + d) total outcomes. Those outcomes (a, d) which are predicted correctly are known as True Positive and True Negative outcomes, respectively; the proportions of misclassified outcomes are known as False Positives (b) and False Negatives (c). Note that classification performance can be evaluated using this formalism for any number of categories, not just the binary case. A variety of different measures of classification performance can be defined from the information presented in an error matrix. A few of these are:
Sensitivity                   a/(a + c)
Specificity                   d/(b + d)
Positive Predictive Power     a/(a + b)
Classification Rate           (a + d)/N
Misclassification Rate        (b + c)/N


All of these measures have different characteristics, and some are overly sensitive to sampling bias, as reflected in the prevalence ratio ([a + c]/N = the proportion of positive cases in the sample data set). For a comparison of the effect of prevalence on the predictive power of various measures of classification performance, see http://asio.jde.aca.mmu.ac.uk/resdesgn/presabs.htm. One of the most useful statistical measures of classification performance is the κ (kappa) statistic. For a 2x2 error matrix, it can be defined as:

    κ = [(a + d) - ((a + c)(a + b) + (b + d)(c + d))/N] / [N - ((a + c)(a + b) + (b + d)(c + d))/N]

The κ statistic represents the proportion of specific agreement among correct and incorrect classifications. Unlike other measures of classification performance, κ makes use of all of the information in the error matrix.

The Kappa Statistic
As stated above, κ is a measure of agreement. Although the χ² test is also a measure of agreement, it provides no direct information about how good or poor the agreement is, only whether it is statistically significant or not. In contrast, the κ statistic quantifies more of the information in the error matrix, so that it can be used to compare relative classification performance. Both the χ² and the κ statistics are calculated from RxC contingency tables that summarize frequencies of responses (simple counts). In contrast to the χ² statistic, however, κ is only defined for a square contingency table (R = C). For example, the relative performance of a classifier that produces three categorical states can be compared between two classification outcomes. A 3x3 contingency table would look like:

                                  Outcome-2
Outcome-1        Category1     Category2     Category3      Totals
Category1          a = 88        b = 10        c = 2      a+b+c = 100
Category2          d = 14        e = 40        f = 6      d+e+f = 60
Category3          g = 18        h = 10        i = 12     g+h+i = 40
Totals          a+d+g = 120   b+e+h = 60    c+f+i = 20        n = 200

where a through i designate occurrence frequencies. The κ statistic can be used as an index of agreement between two outcomes, expressing the percentage of times that the outcomes agree in each category. In the above example, agreements are shown in the diagonal cells (the cells with counts a, e, and i). So, 88+40+12 = 140 out of 200 total comparisons agree; this is the observed rate of agreement, or probability of agreement, Po = 0.7. We don't know if this is good or not, because we don't know what the level of agreement would be by pure chance alone (that is, the expected probability, Pe). The chance level of agreement is given by the expected counts for the same three cells. The expected counts are found in the same manner as the expected frequencies for the χ² test; the expected probabilities are expressed as Pe = (row total × column total)/N². Thus, the expected probabilities in the above example are:

                                  Outcome-2
Outcome-1        Category1     Category2     Category3      Totals
Category1          60/200        30/200        10/200       100/200
Category2          36/200        18/200         6/200        60/200
Category3          24/200        12/200         4/200        40/200
Totals            120/200        60/200        20/200       200/200

The sum of the expected counts in the diagonal cells (cells a, e, and i) gives the expected frequency of agreement (60+18+4 = 82), for an expected probability of agreement of 82/200, or Pe = 0.41. Kappa compares the observed level of agreement with the level of agreement expected in a purely random classification outcome. The kappa statistic is defined for a generalized n x n error matrix as: κ = (Po - Pe)/(1 - Pe). In the above example, then, κ = (0.7 - 0.41)/(1 - 0.41) = 0.49, which represents the proportion of agreements remaining after chance agreement has been excluded. Its upper limit is +1.00 (total agreement). If two outcomes agree purely at a chance level, κ = 0.0. The value of κ can be used in a quantitative sense to compare classification performance among two or more different classification outcomes. A rule of thumb for interpreting the kappa statistic is:

κ = 1.0     perfect agreement (in a non-spatial statistical sense)
κ > 0.75    excellent agreement
κ > 0.4     good agreement
κ < 0.4     poor or marginal agreement
κ = 0.0     indistinguishable from random agreement
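The kappa calculation is easy to script. A minimal sketch assuming numpy is available, using the 3x3 counts from the worked example above (the function name is illustrative, not from any course software):

    import numpy as np

    def kappa(confusion):
        """Cohen's kappa from a square R x R matrix of counts."""
        confusion = np.asarray(confusion, dtype=float)
        n = confusion.sum()
        po = np.trace(confusion) / n                                          # observed agreement
        pe = (confusion.sum(axis=1) * confusion.sum(axis=0)).sum() / n**2     # chance agreement
        return (po - pe) / (1.0 - pe)

    table = [[88, 10,  2],
             [14, 40,  6],
             [18, 10, 12]]
    print(kappa(table))   # ~0.49, matching the worked example (Po = 0.7, Pe = 0.41)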

Example 1: Evaluate whether an outcome is significantly better than a chance outcome (category counts are randomly distributed across all categories). A simple example: 10 events classified into 2 categories
  Outcome 1 = result of a supervised classification scheme
  Outcome 2 = random assignment (equal category counts, random location)

Outcome Results (10 paired classifications):
  Outcome 1:  A  A  B  B  B  B  A  A  A  A
  Outcome 2:  B  A  A  B  A  B  A  A  B  B

Outcome 1 compared to Outcome 2:
  Number of correspondences in Category A = 3
  Number of correspondences in Category B = 2
  Number of False Positives               = 3
  Number of False Negatives               = 2
  Total Comparisons                       = 10

RxC table:
                     Outcome 1:
                      A     B
  Outcome 2:    A     3     2
                B     3     2

The result, κ = 0.0, indicates that Outcome 1 is not at all similar to Outcome 2 (a random classification outcome), so it represents an improvement over a random classification.

Example 2: Compare two different classification schemes to evaluate their relative performance. A κ value of 1.0 would indicate the two classification schemes perform equally well (though not necessarily identically, particularly in a spatial sense), whereas a κ value of 0.0 would indicate that the correspondence between the two outcomes is purely random: 10 events classified into 2 categories
  Outcome 1 = result of a supervised classification scheme
  Outcome 2 = result of a different classification scheme

Outcome Results (10 paired classifications):
  Outcome 1:  A  A  A  A  A  A  B  B  B  B
  Outcome 2:  B  B  B  B  B  A  A  A  A  A

Outcome 1 compared to Outcome 2:
  Number of correspondences in Category A = 1
  Number of correspondences in Category B = 0
  Number of False Positives               = 5
  Number of False Negatives               = 4
  Total Comparisons                       = 10

RxC table:
                     Outcome 1:
                      A     B
  Outcome 2:    A     1     4
                B     5     0

The result, κ = -0.80, indicates that Outcome 1 is almost the exact antithesis of Outcome 2.


This example demonstrates that κ can measure antithetical as well as direct correspondence; i.e., a κ value of -1.0 indicates that two classification outcomes are mirror images of one another (but only in a non-spatial statistical sense; the two classification outcomes may still have very different spatial patterns). Note: StatMost reports all κ values as |κ|; therefore, StatMost's reported Po and Pe values must be examined in order to determine whether κ < 0.

Threshold-Independent Measures: the Receiver Operating Characteristic Curve: In classifying results into categories on the basis of a decision threshold, the classification results will differ according to the value of the threshold. Consider a classification threshold that is used to segregate two populations of measurements or probabilities (e.g., those indicating Presence and those indicating Absence) into two classes: Predicted Present and Predicted Absent.

The proportions of misclassified and correctly classified measurements are indicated as True and False Negative and Positive outcomes (TN, FP, etc.). In a binary error matrix, overall classification performance is represented as:

                      Predicted Present   Predicted Absent
Actually Present           a (TP)              c (FN)
Actually Absent            b (FP)              d (TN)

If a different threshold is applied, for example to minimize False Negative classifications, then the other classification rates will be affected.

That is, by lowering the decision threshold, the False Negative misclassification rate is made much lower, but at the expense of a higher False Positive misclassification rate, as well as a higher True Positive and lower True Negative classification rate. A threshold-independent measure of classification performance summarizes classification performance over all possible decision thresholds. It is therefore a more powerful measure of performance and can also be used to guide the optimal choice of a threshold to meet specified classification performance criteria. The Receiver Operating Characteristic curve, or ROC curve, is one such threshold-independent measure of classification performance for dichotomous classification outcomes. The ROC curve is defined by calculating the True Positive (TPr) and False Positive (FPr) classification rates at all possible decision thresholds, z, that span the measurement (or probability) range used to make the dichotomous classification:

    TPr(z) = TP(z) / [TP(z) + FN(z)]

    FPr(z) = FP(z) / [FP(z) + TN(z)]

where TP, FP, etc. refer to the proportions of True Positive, False Positive, etc. classifications that result from classifying the measurement (or probability) into one of two possible outcomes at a particular decision threshold (z). A plot of TPr vs. FPr defines the ROC curve.

The area under the curve can vary between 0.0 and 1.0; an area of 0.5 (a curve lying along the diagonal) indicates classification performance no better than chance. Analogous to a negative kappa value, a curve that lies below the diagonal represents some degree of antithetical correlation. The ROC curve represents the ability of a measurement or probability variable to correctly classify an outcome. Thus, different classification schemes can be compared and ranked on the basis of the areas under their ROC curves. Note that, like the kappa statistic, this type of performance measure is strictly applicable only in a non-spatial statistical sense, meaning that it should not be used as the sole determiner of classification performance for spatial data. To illustrate this, consider a situation where a particular condition is present only at several known sites (o) and nowhere else:

[Two map sketches, not reproduced here: the first shows the few known sites (o) where the condition is actually present, with "-" representing locations that were not sampled as well as locations where the condition truly does not occur; the second shows the predictions of two classification schemes based on the same data and probability information, with p = predicted occurrence and x = predicted non-occurrence at unsampled locations.]

These two classifications would be indistinguishable in a non-spatial statistical sense (with identical ROC curves or kappa statistics), but the prediction on the right is obviously more accurate in a spatial sense. ROC analysis is available in SPSS and other statistical packages. "ROC_calculation.xls" is a spreadsheet that demonstrates a simple method for calculating ROC curves for any data set; it can be used as is or as a template for designing a custom application. Rather than calculating classification performance rates over a continuous range of decision thresholds, the spreadsheet calculates classification performance at ten decision thresholds corresponding to histogram classes defined by the user. The ROC curve in the figure example above was calculated with this spreadsheet from a set of training data (not reproduced here).

If specific classification performance criteria can be defined (for example, on the basis of relative cost or relative risk), the ROC curve can also be used to assist in choosing a decision threshold that best meets the specified criteria. See http://asio.jde.aca.mmu.ac.uk/resdesgn/roc.htm and the spreadsheet "ROC_calculation.xls" for more information.
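As an alternative to the "ROC_calculation.xls" spreadsheet, here is a minimal sketch of the ROC calculation assuming Python with numpy is available; the scores, true classes, and threshold grid are synthetic illustrations, and the function name is not from any package discussed above:

    import numpy as np

    def roc_points(scores, actual, thresholds):
        """True- and false-positive rates at each decision threshold."""
        tpr, fpr = [], []
        for z in thresholds:
            predicted = scores >= z                    # classify as "present" above the threshold
            tp = np.sum(predicted & actual)
            fp = np.sum(predicted & ~actual)
            fn = np.sum(~predicted & actual)
            tn = np.sum(~predicted & ~actual)
            tpr.append(tp / (tp + fn))                 # TPr(z)
            fpr.append(fp / (fp + tn))                 # FPr(z)
        return np.array(fpr), np.array(tpr)

    rng = np.random.default_rng(5)
    actual = rng.random(200) < 0.3                               # hypothetical presence/absence "truth"
    scores = rng.normal(loc=actual.astype(float), scale=0.8)     # higher scores where truly present
    fpr, tpr = roc_points(scores, actual, np.linspace(scores.min(), scores.max(), 25))
    fpr, tpr = fpr[::-1], tpr[::-1]                              # order by increasing FPr for the area sum
    auc = np.sum(np.diff(fpr) * 0.5 * (tpr[1:] + tpr[:-1]))      # trapezoid-rule area under the curve
    print(auc)    # 0.5 = chance performance; values approaching 1.0 indicate better classification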


Additional Information on the Kappa Statistic:
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Kraemer, H. C. (1982). Kappa coefficient. In S. Kotz and N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences. New York: John Wiley & Sons.
Fielding, A. H. and Bell, J. F. (1997). A review of methods for the assessment of prediction error in conservation presence/absence models. Environmental Conservation, 24, 38-49.
Manchester Metropolitan University, Dept. of Biological Sciences:
http://asio.jde.aca.mmu.ac.uk/resdesgn/presabs.htm

Additional Information on ROC Analysis: Beck, J.R. and Schultz, E.K., (1986). The use of relative operating characteristic (ROC) curves in test performance evaluation: Archives of Pathological Laboratory Medicine, 110, p.13-20. Zweig, M.H., & Campbell, G. (1993). Receiver-Operating Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine. Clin. Chem., 39 (4), pp. 561-577. Simplified primer:
http://www.mieur.nl/mihandbook/r_3_3/booktext/booktext_15_04_01_02o.htm

Excellent, clear, and straightforward description of how ROC curves are calculated:
http://www.anaesthetist.com/mnm/stats/roc/#make

3. Correlation, Regionalized Variables, Exploratory Data Analysis

3.1. Definitions:
Geostatistics (spatial statistics): A branch of applied statistics focusing on the characterization of the geospatial dependence of one or more attributes whose values vary over space (in 1-D, 2-D, or 3-D), and the use of that spatial dependence to predict (model) values at unsampled locations. (Time-series analysis--hydrographs, the stock market--is a close, 1-D relative of spatial statistics.)
Prediction (estimation, interpolation, modeling) methods: Any of a number of methods to produce estimates of a variable at unsampled locations based on values at discrete points. Examples include: tessellation (Thiessen polygons, triangular irregular networks, Delaunay triangulation, etc.), moving averages, inverse distance weighting, spline functions, and trend surfaces. The geostatistical equivalent is kriging, a statistically unbiased linear estimator.
Spatial dependence (autocorrelation): Most physical processes generate spatial variability such that two data values sampled close together tend to be more similar than two values sampled far apart. Where a strong spatial dependence exists, spatial statistical tools can be used to predict (model) values at unsampled locations better than other interpolation procedures.
Bivariate and multivariate dependence (cross-correlation): A physical process produces correlated variability in the values of two or more attributes, whose correlation can be used to understand the process and/or to make predictions.

3.2. Bivariate Correlation:
- an analysis of variation of two variables za, zb drawn from different but related populations
- analysis of bivariate regression is potentially important in spatial data analysis: if an extensively sampled secondary variable (eg: topographic elevation) is correlated to the primary variable we wish to estimate (eg: water table elevation), the spatial cross-correlation between the two variables can greatly help in estimating the primary variable by exploiting the spatial correlation information in the correlated secondary variable if its values are known at locations where the primary variable is unsampled
- bivariate regression analysis is predicated on the assumption that the distributions of both za and zb are Gaussian; in some cases this condition can be relaxed, e.g., the independent variable can take on discrete values
- theoretically, linear regression analysis is strictly valid only under the following conditions:
  - both variables are measured with no (or negligible) error
  - both variables are normally distributed
  - the variables are linearly correlated
  - the X values are independent
  - prediction error is homoscedastic (constant variance) and Gaussian
  - prediction errors are independent (not autocorrelated)

- So, obviously, just about any real set of data shouldn't be regressed! In practice, though, any of the above requirements can be relaxed and/or ignored. In fact, all the above assumptions can be thrown out IF the sole purpose of linear regression is to predict U from V, and some argue that regression coefficients should only be analyzed with a related technique (linear function analysis) because of the measurement errors present in both variables (Use and Abuse of Statistical Methods in the Earth Sciences, W.B. Size, ed., 1987, Oxford Univ Press; p. 78).
- in a bivariate relationship, the overall variance of the two variables can be thought of as composed of three variance components: 1) the variance of variable A, 2) the variance of variable B, and 3) the variance arising from the correlation of A vs. B
- this latter variance is known as the bivariate covariance and is defined by:

(3.1)  Cov(z_a, z_b) = [1/(n-1)] Σ_{i=1}^{n} (z_{a,i} - m_a)(z_{b,i} - m_b)

- and the correlation coefficient is:

(3.2)  r = Cov(z_a, z_b) / (s_{za} · s_{zb})

- the square of the correlation coefficient (r²) represents the fraction of the total variance of za and zb that is due to their linear correlation
- where the population is not bivariate normal, use a non-parametric correlation method on the ranked (ordinal) form of the data, such as Kendall's tau or Spearman's rank correlation coefficient (see Till, pp. 131-134; Cheeney, Ch. 6), which are based on the differences in rank position of all observations between two samples or between x and y values
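A minimal sketch of equations (3.1) and (3.2), assuming numpy is available; za and zb are synthetic, correlated samples rather than course data:

    import numpy as np

    rng = np.random.default_rng(6)
    za = rng.normal(50.0, 10.0, 80)
    zb = 0.7 * za + rng.normal(0.0, 6.0, 80)     # a correlated secondary variable

    cov_ab = np.sum((za - za.mean()) * (zb - zb.mean())) / (za.size - 1)    # equation (3.1)
    r = cov_ab / (za.std(ddof=1) * zb.std(ddof=1))                          # equation (3.2)
    print(cov_ab, r, r**2)   # r**2 = fraction of total variance explained by the linear correlation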

3.3. Hypothesis Test of the Significance of Correlation
- from the definition of the correlation coefficient in equations (3.1) and (3.2), it is apparent that if values of the dependent variable, zb, are close to their mean (i.e., there is little variation) and the standard deviation of one or both variables, sza or szb, is relatively large, then the value of r can be small; in other words, by itself the value of r may be a poor estimator of the degree of correlation
- r is only a relative measure; different data sets cannot be compared on the basis of r values alone
- What constitutes a significant correlation? This is entirely subjective and depends on the data, the variables examined, and the purposes of the analysis
- r by itself does not indicate anything about the statistical significance of a correlation
- furthermore, if the assumption of bivariate normality is violated (as in a conditional variation, e.g., X is classified into categories such as hi, med, lo), then the r statistic is meaningless
- in other words, a more robust test is required to determine the statistical significance of a bivariate correlation
- if the two variables za, zb are normally distributed, then various types of t-tests can be applied to determine if the bivariate relationship between them is statistically significant (Till, p.86)
- for example, to test whether the value of the correlation coefficient represents a statistically significant bivariate correlation, a null hypothesis is set up to represent the situation where the population correlation coefficient, ρ, is zero (i.e., the variables are not correlated), i.e., Ho(ρ = 0), and the alternative hypothesis is Ha(ρ ≠ 0); this is an example of a two-tailed test, and the t-statistic is defined as:

(3.3)  t = r · √[(n-2)/(1-r²)] ,  with D.F. = (n-2)

where n is the number of sample data pairs in the regression
- therefore, if the value of |t| calculated from equation (3.3) is larger than t_(n-2),α/2, we can reject Ho and conclude that a significant bivariate correlation (ρ ≠ 0) exists at a confidence level of 100(1-α)%
- for more on hypothesis testing of regression parameters, see Section S.3.1.
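A minimal sketch of the significance test in equation (3.3), assuming scipy is available; the r and n values are illustrative, and the helper function is not part of any course software:

    import numpy as np
    from scipy import stats

    def r_significance(r, n, alpha=0.05):
        """Two-tailed t-test of Ho: rho = 0 for a sample correlation coefficient r."""
        t = r * np.sqrt((n - 2) / (1.0 - r**2))          # equation (3.3)
        tcrit = stats.t.ppf(1 - alpha / 2, n - 2)        # critical t at alpha/2, D.F. = n-2
        p = 2 * stats.t.sf(abs(t), n - 2)
        return t, tcrit, p

    t, tcrit, p = r_significance(r=0.65, n=80)
    print(t, tcrit, p)   # reject Ho (no correlation) if |t| > tcrit, i.e., p < alpha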

3.4. Conditional Expectation - where a non-linear correlation is apparent, an alternative regression technique is to specify the means of U that correspond to different classes of V. This produces a conditional expectation: an expected value of U is defined within each V class; the expected values of U are conditional because they depend on the V class that is specified.

3.5. Regionalized Variables - unlike a random variable which results from a purely random process (eg: the results of throwing dice), a regionalized variable (or r.v.) is distributed in space (and/or time) with location information attached to each measurement; in other words, the measurement variable (z) no longer represents a statistically independent univariate population, but is part of a multivariate population (x, y, zxy) in which the values of zxy may no longer be strictly independent in a statistical sense; that is, zxy may be correlated to x and y because of the physical process which generated it - in general, any measurement which is associated with spatial or temporal coordinates is a r.v. - denote this variable as z(x), where x designates spatial coordinates (in 1-D, x is x; in 2-D, x is x, y; etc.); e.g. if the regionalized variable is rainfall amount, each data point is denoted with coordinates x = (x, y), and z(x) is rainfall amount - key concept: the variability of any regionalized measurement over/in the earth can be viewed as but one possible realization or outcome of a hypothetical random process (a God who throws dice?) which has distributed values of z(x) in just one of an infinite number of possible ways - key concept: in geostatistics, the r.v. is assumed to be the outcome of a physical process (or multiple processes) whose spatial form represents a combination of a structured aspect (eg: lead concentration in contaminated soil due to the contamination process/history) and a random, unstructured aspect (eg: the natural lead content in, and proportions of, feldspar and limestone detritus in the soil); local 'trends' such as contaminant hot spots can be handled within the geostatistical modeling process, but significant regional trends are removed prior to analysis and modeling to ensure that z(x) represents a stationary r.v.; ie. the analysis, estimation and simulation of variation is done with the trend subtracted from the raw data,

then the trend is added back in the final estimation (use Surfer or other software to model the trend, then remove it from the raw data); in this course, we will not focus on trend removal (see Koch and Link, 1971, chapt. 9); what constitutes a 'significant' trend is usually a matter of judgement and can be identified in exploratory data analysis or during variogram analysis

3.6. The Walker Lake Data Set - see Isaaks and Srivastava, 1989
- be able to outline the phases / steps in a geostatistical analysis
- the purpose of exploratory analysis: get to know your data, identify sampling patterns, sampling history and possible sampling biases, patterns of variability, bivariate correlations among multiple r.v.'s, etc.
- examine the exhaustive Walker Lake data set, its main features of spatial variability, evidence of heteroscedasticity, correlations of V, U, T
- compare the features of the sample data set; plot the sample data, get to know it, identify patterns, sampling bias, etc. (note that in a real analysis, you will not have access to the exhaustive data set; only God knows what the real situation looks like and you are trying to reconstruct that situation from a few paltry measurements)
- key concept: different populations of the r.v.'s may be present (e.g. the T variable may represent rock type); the ability to segregate V, U values into two possible classes (T = 0 or 1) raises the question of when population splitting should occur. There are no hard rules, but some guidelines are: 1) Is the distinction physically meaningful? One should have good reasons for splitting, even if the segregation is done subjectively; 2) After splitting into subpopulations, do sufficient data remain within all the subpopulations to justify statistical inference based on the numbers of data points? If some subpopulations have too few data to justify meaningful statistical measures, their segregation may not be useful; and 3) What is the goal of the study? Does population splitting contribute to the goal? For example, estimating the spatial distribution of species proportions from fifty calibration locales in an area with two very different types of land cover: if the distribution of species counts in the calibration sites is statistically indistinguishable between the two land cover types, splitting into subpopulations is unnecessary; if statistically distinct, then splitting must be performed prior to geostatistical analysis.

3.7. Problem Set II. Exploratory Data Analysis
- preliminary spatial analysis (the Walker Lake data file, sample locations, clustering, spatial sampling bias, contour maps, spatial trends, coordinate outliers, attribute value outliers, interval [hi/lo] maps, indicator maps)
- clean up the raw Walker Lake data file, calculate declustering weights, and calculate the normal-score transform of the V data using NSCORE

S.3.1. Hypothesis Testing of Regression Parameters:

A hypothesis test is often used to evaluate the significance of a linear regression fitted to a scatter plot of x, y data. To make a parametric test, the x and y variables are assumed (or known) to be normally distributed. An alternative t-test to the one discussed in Equation (3.3) for the correlation coefficient involves a test of the significance of the calculated regression slope. The Null Hypothesis is defined as b = 0, where b is the calculated slope of the regression line; in other words, Ho posits that y varies independently of x and that the x and y values are not correlated. This is a two-tailed test because the Alternate Hypothesis is that b is not zero - i.e., b could be either greater than or less than zero. If Ho were rejected, the Alternate Hypothesis, Ha, would be accepted: that x and y are correlated at the level of significance of the hypothesis test. In this test, the t-statistic is defined by the formula:

(S.3.1)  t = (b - b_o) · s_x · √n / s_e

where b_o is the specified slope (zero in this case), s_x is the standard deviation of the independent variable, and

(S.3.2)  s_e = √[ n² · s_y² · (1 - r²) / (n(n-2)) ]

is the standard error of the correlated data.

The significance level defining the critical t-statistic is α/2 (two-tailed test) and the test has n-2 degrees of freedom. The Null Hypothesis is rejected if the absolute value of the t-statistic calculated in Equation (S.3.1) exceeds the tabulated critical t-statistic, t_(n-2),α/2, and the regression slope is then said to be significant at the 100(1-α)% level.

Once the regression has been deemed significant, a confidence interval about the least-squares value of the slope can also be determined with the calculated t-statistic. In this case, we wish to determine the values of b_o in Equation (S.3.1) that produce a calculated value of t equal to t_(n-2),α/2. In other words:

(S.3.3)  C.I. for slope = b ± t_(n-2),α/2 · (s_e/s_x) · √(1/n)   at the 100(1-α)% level

where b is the calculated least-squares regression slope.
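A minimal sketch of the slope-significance test and confidence interval, assuming scipy is available; scipy's linregress supplies the slope and its standard error, and the x, y data are synthetic:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = rng.uniform(0.0, 100.0, 50)
    y = 0.4 * x + rng.normal(0.0, 12.0, 50)

    fit = stats.linregress(x, y)                 # least-squares slope, intercept, r, p, stderr
    t = (fit.slope - 0.0) / fit.stderr           # t-statistic for Ho: b = 0
    tcrit = stats.t.ppf(0.975, x.size - 2)       # two-tailed, alpha = 0.05, D.F. = n-2
    ci = (fit.slope - tcrit * fit.stderr, fit.slope + tcrit * fit.stderr)   # 95% C.I. on the slope
    print(t, fit.pvalue, ci)                     # reject Ho if |t| > tcrit (equivalently, p < 0.05)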

4. Autocorrelation and Spatial Continuity

4.1. One-dimensional Autocorrelation:
- the correlation function defined in equation (3.2) for bivariate data measures the degree of correlation between two related variables (not necessarily regionalized variables)
- within a single 1-D series of spatial measurements, an autocorrelation function can be defined that is a measure of the internal correlation between successive measurements
- for a 1-D series of measurements, the concept of autocorrelation is analogous to the covariance of bivariate data, where the variable z_bi becomes z_a(i+L), with L the offset from position i in the data series; thus, the autocorrelation function is defined as:

(4.1)  r_L = cov(z, z+L)/s_z² = [1/(n-1)] Σ_{i=1}^{n} (z_i - m_i)(z_{i+L} - m_{i+L}) / s_z²

where m represents the mean of the values defined for zero offset and for an offset of L (note that the definition of covariance contained in the numerator of equation 4.1 equals the population variance of z when L = 0)
- this function is calculated for various offsets or separations, L, called "lags", and the value of r_L is plotted at each value of L to form the correlogram (equivalent to the standardized spatial covariance function defined below)
- conceptually, equation 4.1 compares the degree of correlation or similarity between the time-series and its copy, where the copy is shifted by L units and a standardized covariance for the region of overlap is computed; note that at L = 0, the covariance term equals the sample variance and r_L equals 1.0; as L increases, the amount of overlap decreases until the length of record compared is too small to produce reliable estimates of r_L; n is therefore the number of common data values in the overlapped portion of the data series and its shifted copy
- since the degree of correlation is symmetric for positive and negative lag shifts, only the absolute value of L is plotted
- see Davis, p.235 for examples of different autocorrelative behavior
- a cross-correlation function can be defined for the comparison of two different 1-D time-series using the cross-covariance; the equation is identical, with the appropriate superscripts (za and zb denoting the two different variables) added to the z_i and z_{i+L} terms in equation (4.1) (see Davis, p.240-243)
- Note: for nominal data (e.g. sediment types in stratigraphic sequences), use cross-association (eg: correlation between two sequences of rock types) and a non-parametric test (such as a χ²-test) for significance of match (see Davis, p. 247-250)
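A minimal sketch of the correlogram of equation (4.1), assuming numpy is available; the moving-average series is a synthetic example of autocorrelated 1-D data, and the function name is illustrative:

    import numpy as np

    def correlogram(z, max_lag):
        """Autocorrelation r_L of a 1-D series for lags 0..max_lag."""
        r = []
        for L in range(max_lag + 1):
            a, b = z[: len(z) - L], z[L:]           # overlapping portions of the series and its shifted copy
            num = np.mean((a - a.mean()) * (b - b.mean()))
            r.append(num / z.var())                 # standardize by the sample variance
        return np.array(r)

    noise = np.random.default_rng(8).normal(size=300)
    z = np.convolve(noise, np.ones(5) / 5.0, mode='valid')   # moving average => autocorrelated series
    print(np.round(correlogram(z, 8), 2))   # r_0 = 1.0, decaying toward ~0 beyond the 5-point window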

4.2. Regionalized Variable vs. Random Function - aside from random sampling and analytical errors, a geospatial measurement (the regionalized variable, e.g. copper content) is considered to be essentially deterministic (non-random), ie: there exists a single value of porosity, or one possible copper concentration at a point, or a unique water level at any given time in a given water well

- key concept: in order to develop spatial correlation statistics from such a variable from which to make geostatistical estimates at unsampled locations, the r.v. is assumed to represent one statistical sample drawn from an infinite number of possible samples all having identical statistical characteristics; for example, a hydrograph of water level vs. time represents only one possible statistical sample of the distribution of water levels vs. time drawn from an infinite number of possible distributions with the same statistical characteristics
- the fictitious domain of all such possible distributions is known as a Random Function (R.F.), and a single sample of the regionalized distribution of possible porosities or copper contents drawn from it is called a realization of the R.F.
- a R.F. is a function from which values can be drawn which have a variance about a mean, together with skewness, kurtosis, etc.; each of these statistical measures also depends on spatial position; therefore, to fully specify a R.F.'s statistical properties is a theoretical nightmare, and so in practice many simplifications are used to represent the R.F.
- our task in geostatistics is to infer the nature of the R.F. controlling the spatial distribution of the regionalized variable (eg: the available water level or porosity measurements) so that this function can be used to estimate values of the variable of interest at unsampled locations or times
- this is analogous to the task of estimating a univariate population distribution (analogous to the R.F.) from a number of statistical samples (analogous to the realizations of the R.F.) drawn from the population; the key difference in geostatistical analysis is that we do not have multiple samples of the R.F. from which to infer its characteristics, only a single sample (the geospatial data representing the regionalized variable) consisting of a limited number of points
- for Star Trek fans: to use a crude analogy, if we could sample water level vs. time at a particular point in a river in a number of parallel universes, we'd be better able to estimate the underlying R.F. that describes water level fluctuations (in a manner analogous to the statistician who can draw red and black marbles from a bag many times to estimate the true proportion of marbles in the bag)
- key concept: the mean state of all possible samples of a r.v. would be equal to the average value of the r.v. within a single sample; this would be an example of ergodicity (from the Greek for wandering); in other words, a single sample would reflect the statistical character of the R.F.
- the assumption of ergodicity is therefore a crucial one; it is required to infer the properties of the R.F. from a single realization (the measured copper values or porosities); however, non-ergodic behavior is common in the real world, but the criteria for recognizing it are subjective (see discussion under kriging section, Week 7)
- note that the requirement of ergodicity is not unique to geostatistics: it is an implicit assumption in all inferential statistics, from estimating the proportion of red and black marbles in a bag, to inferring a population distribution from a histogram, to estimating a population mean from the sample mean

- key concept: because we only have one realization to work with in geostatistics, one more key concept must be introduced if we hope to use statistics to make estimates at unsampled locations: if homogeneous physical process(es) produced the variability in an r.v. over some area of interest, then the r.v. will demonstrate the same kind of variability over the entire area as it does within smaller subareas; in other words, the R.F. from which the r.v. is drawn is stationary, and statistical homogeneity (stationarity) can be assumed over the entire area; this is equivalent to saying that if we divide the study area into smaller parts, each part could be considered a different realization of the same R.F.; if so, then we can generate statistical estimates from each of the smaller parts as a kind of surrogate for drawing multiple samples from the R.F. (see Pannatier, p.77 and Fig.A.2)
- note that ergodicity requires stationarity, but that stationarity does not imply ergodicity
- there are various degrees of stationarity of the underlying process responsible for generating the regionalized variable; the type of stationarity assumed determines the kind of statistical inference that is permitted:
  - strict stationarity, in which all of the random function's parameters are invariant from point to point, is rarely assumed because of the formidable challenge of describing all its parameters
  - second-order stationarity exists if the R.F.'s mean and variance are independent of location and the covariance depends only on the separation or lag between measured values of the regionalized variable
  - intrinsic stationarity (or the intrinsic hypothesis) is the weakest assumption; certain physical processes (eg: Brownian motion) do not have a definable variance or covariance, but the variance of their increments does exist (see definition of the semivariogram below), in which case the semivariogram can be defined but other measures of spatial correlation (e.g. covariance) cannot

4.3. Spatial Statistical Moments (not to be confused with special romantic moments) - the expected value of a Random Function Z(x) at any location x is equal to its mean: m(x) = E{Z(x)} , assumed constant for a stationary R.F. (in other words, the R.F. is assumed to have Gaussian pdf characteristics)

- in linear (two-point) geostatistics, there exist three second-order moments:

  variance:       Var{Z(x)} = E{[Z(x) - m(x)]²} = σ² = (1/n) Σ [Z(x) - m(x)]²

  covariance:     C(x, x+h) = E{[Z(x) - m(x)][Z(x+h) - m(x+h)]} = E{Z(x)·Z(x+h)} - m(x)·m(x+h)
                            = (1/n_h) Σ [Z(x)·Z(x+h)] - m_{-h}·m_{+h}

  semivariogram:  γ(x, x+h) = ½ E{[Z(x) - Z(x+h)]²}   (valid only if no trend exists)
                            = (1/(2n_h)) Σ [Z(x) - Z(x+h)]²


- Note: these second-order moments are not a function of location but depend only on the lag separation, h

4.4. Practical Definition of Spatial Correlation Structure for a Regionalized Variable:
- based on second-order moments
- the term variogram is used here as a generic term for a spatial correlation estimator statistic
- specific second-order statistics are defined differently, and are used to better summarize non-normally distributed data or data with extreme-valued outliers
- the experimental measure of spatial covariance is defined by:

(4.2)  cov(z_i, z_{i+h}) = C(h) = (1/n_h) Σ_{i=1}^{n_h} (z_i - m_{-h})(z_{i+h} - m_{+h})
                               = (1/n_h) Σ_{i=1}^{n_h} (z_i · z_{i+h}) - m_{-h} · m_{+h}
where z_i = the i-th data value at location (x_i, y_i), h = lag, n_h is the number of data pairs separated by lag h, and m_{-h}, m_{+h} are the means of the two endpoints (tail and head values) of the lag pairs
- the covariance is equal to the sample variance when h = 0, i.e., at zero lag offset the values of z_i and z_{i+h} are equal and equation (4.2) reduces to the definition of variance, thus C(0) = σ²; at large values of h, the values of z_i and z_{i+h} are poorly correlated and C(h) -> 0
- [Figure: the theoretical covariance function C(h) vs. lag h, starting at C(0) = σ² and decaying toward zero at large lags]

- typically, if the experimental values of C(h) level off at large lags, the underlying random function is assumed to be second-order stationary (i.e., its mean and variance are independent of location, and covariance depends only on the lag separation, h)
- the spatial covariance can also be expressed as the inverted covariance (sometimes referred to as the non-ergodic covariance, because it does not assume that m_{-h} = m_{+h} in equation 4.2):

(4.3)  C'(h) = C(0) - C(h) ,  or  C'(h) = σ² - C(h)

- note that in the presence of a trend, the covariance does not level off and can take on negative values; in that case, second-order stationarity does not exist

- the semivariogram definition can be derived graphically from an h-scatterplot (Hohn, p.91-92)
- plot all values of z_{i+h} vs. z_i that are separated by a given value of h (in practical terms, a range of h values)
- the moment of inertia, I_m, of the cloud of points about the 45° (1:1) line is defined as:

    I_m = (1/n_h) Σ_{i=1}^{n_h} d_i²

  [Figure: h-scatterplot of z(x+h) vs. z(x), showing the 1:1 line and the perpendicular distance d of each point from that line; the offset of a point from the line is z(x) - z(x+h)]

- since the moment of inertia is defined about a 1:1 relationship, a right triangle defines the relationship between d and [z(x) - z(x+h)]; therefore, 2d_i² = (z_i - z_{i+h})², and the semivariogram is defined as:

(4.4)  γ(h) = I_m = (1/(2n_h)) Σ_{i=1}^{n_h} (z_i - z_{i+h})²

- the standardized semivariogram is defined as γ_s(h) = γ(h)/C(0), where C(0) is the sample variance
- similarly, the correlogram is the spatial covariance standardized by (divided by) the sample variance, and is exactly analogous to the autocorrelation function under second-order stationarity: ρ(h) = C(h)/C(0); it takes on a form similar to the other variogram measures when expressed as the inverted correlogram: ρ'(h) = 1 - C(h)/C(0)
- the madogram is defined as the mean (sometimes median) absolute difference: mad(h) = (1/n_h) Σ_{i=1}^{n_h} |z_i - z_{i+h}|
  (for computational definitions of these various autocorrelation statistics, see Deutsch and Journel, p. 45 and Isaaks and Srivastava, p. 59)
- these various estimators of spatial correlation are all defined differently, but all can be used to inform the parameters of spatial correlation structure during modeling
- except for the madogram, these various estimators are exactly equivalent representations of the underlying R.F. if second-order stationarity exists:

45 (4.5) '(h) = C'(h)/C(0) = 1 - C(h)/C(0) = (h)/C(0) = s(h)
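The following sketch (variable and function names are my own, not from GSLIB or any other package) computes the suite of estimators from one set of lag pairs; the pairs themselves could be gathered with the same pair-selection logic used in the covariance sketch above:

```python
import numpy as np

def correlation_estimators(tail, head, C0):
    """Experimental spatial-correlation statistics for one set of lag pairs.

    tail, head : paired values z_i and z_{i+h} for a given lag h
    C0         : sample variance, used for standardization
    """
    diff = tail - head
    C_h = np.mean(tail * head) - tail.mean() * head.mean()   # covariance, eq. (4.2)
    gamma = 0.5 * np.mean(diff**2)                            # semivariogram, eq. (4.4)
    return {
        "C(h)":       C_h,
        "C'(h)":      C0 - C_h,                # inverted covariance, eq. (4.3)
        "gamma(h)":   gamma,
        "gamma_s(h)": gamma / C0,              # standardized semivariogram
        "rho(h)":     C_h / C0,                # correlogram
        "rho'(h)":    1.0 - C_h / C0,          # inverted correlogram
        "mad(h)":     np.mean(np.abs(diff)),   # madogram (mean absolute difference)
    }
```

Note that equation (4.5) holds exactly for the underlying random function; experimental values of ρ'(h), C'(h)/C(0), γ(h)/C(0) and γ_s(h) agree only approximately, because the lag means of the tail and head values differ somewhat from the global mean.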

- i.e. if the various variogram estimators level off beyond a certain lag, then second-order stationarity can be assumed, and the inverted covariance, the semivariogram, the standardized semivariogram, and the correlogram are all equivalent estimators of spatial continuity
- if the semivariogram does not level off but does not rise faster than the square of h, the random function is not second-order stationary and is said to obey the intrinsic hypothesis; in that case, only the semivariogram is a valid estimator of spatial continuity, and the other estimators cannot be fitted with a model random function
- if experimental semivariogram values increase as fast as or faster than h², the intrinsic hypothesis is invalid and the presence of a regional trend is indicated; in order to proceed with variogram analysis, the trend would have to be removed and variogram analysis and modeling performed on the residuals
- become familiar with the following nomenclature: nugget; sill; transition region; range of influence; types of variogram shapes: linear, parabolic, spherical, exponential, gaussian
- Note: coordinate transformations are applied in order to analyze variogram structure and to utilize the resulting correlation structure in subsequent kriging or simulation of a geometrically-deformed body (such as a stratified and folded reservoir or ore body, a stratified aquifer of variable thickness, or a dipping geologic formation), or to avoid numerical instability problems associated with matrices built from coordinate data of very different number size (such as x,y in millions of meters vs. z in tens of meters)
- one common type of transformation, utilized for folded or variable-thickness geologic bodies, represents the original x,y,z coordinates in the stratigraphic coordinates of an equivalent, simpler tabular body:

(4.6)   z'(x, y) = \frac{top(x,y) - z(x,y)}{thickness(x,y)}

  where z and z' are the original and stratigraphic coordinates, respectively, at location (x,y); top(x,y) is the elevation of the top of the original stratified body; and thickness(x,y) is the thickness of the geologic body at location (x,y). This transformation "straightens out" a contorted or variable-thickness geologic body and represents it, for purposes of correlation structure analysis and modeling, as an equivalent tabular body.
- another common transformation is to change the number size of the x,y coordinates to match the number size of the z coordinates; for example, if the range of z is from 2500 to 3000 ft above sea level, but x,y are in state plane feet with values of the order of 500,000, transform the x,y values as:

(4.7)   x' = x - x_{min} ,    y' = y - y_{min}

  where x,y and x',y' are the original and transformed coordinates, respectively, and x_min, y_min are the minimum values of the x,y ranges; for 2D data, transformation (4.7) may be necessary for programs such as VarioWin which cannot handle x,y values larger than 5 digits (both transformations are sketched below)
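A minimal sketch of the two transformations, assuming top(x,y) and thickness(x,y) have already been interpolated at the data locations (in practice they would come from gridded structural surfaces); the function names are illustrative:

```python
import numpy as np

def to_stratigraphic(z, top, thickness):
    """Relative stratigraphic coordinate of eq. (4.6):
    z' = (top(x,y) - z(x,y)) / thickness(x,y), so z' runs from 0 at the top
    of the unit to 1 at its base, regardless of folding or thickness changes."""
    return (np.asarray(top, float) - np.asarray(z, float)) / np.asarray(thickness, float)

def offset_coordinates(x, y):
    """Reduce the number size of map coordinates, eq. (4.7): x' = x - x_min, y' = y - y_min."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x - x.min(), y - y.min()

# example: state-plane coordinates near 500,000 ft reduced to a 0-based range
x = np.array([498_750.0, 500_120.0, 501_480.0])
y = np.array([1_210_300.0, 1_211_050.0, 1_212_400.0])
xp, yp = offset_coordinates(x, y)     # xp, yp now span only a few thousand feet
```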


4.5. Computation of Experimental Variograms
- see reading handout (VarioWin's Chapter 2 short tutorial)
- concepts: lag bins, mean lags, overlapping bins, variable bins, directional search parameters
- rules of thumb: a minimum of ca. 20-30 data pairs per lag bin; maximum lag ca. 1/2 of the maximum separation (a minimal binning routine is sketched below)
- see hand-out of variography process
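For illustration, a minimal omnidirectional binning routine in the spirit of the rules of thumb above; the function name, defaults, and binning choices are assumptions of this sketch, not the behavior of any particular program:

```python
import numpy as np

def experimental_variogram(xy, z, lag_width, max_lag=None, min_pairs=20):
    """Omnidirectional experimental semivariogram with equal-width lag bins.

    xy        : (n, 2) array of x,y coordinates
    z         : (n,) array of data values
    lag_width : width of each lag bin
    max_lag   : maximum lag used; defaults to half the maximum separation
    min_pairs : bins with fewer pairs than this are dropped (rule of thumb 20-30)
    Returns (mean_lag, gamma, n_pairs) per retained bin; variogram values are
    reported at the pair-weighted mean lag of each bin, not at the bin center.
    """
    xy, z = np.asarray(xy, float), np.asarray(z, float)
    d = np.sqrt(((xy[:, None, :] - xy[None, :, :])**2).sum(-1))
    iu = np.triu_indices(len(z), k=1)                 # each pair counted once
    dist, diff2 = d[iu], (z[iu[0]] - z[iu[1]])**2

    if max_lag is None:
        max_lag = 0.5 * dist.max()                    # half the maximum separation

    mean_lags, gammas, counts = [], [], []
    edges = np.arange(0.0, max_lag + lag_width, lag_width)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (dist > lo) & (dist <= hi)
        n = int(in_bin.sum())
        if n < min_pairs:
            continue
        mean_lags.append(dist[in_bin].mean())
        gammas.append(0.5 * diff2[in_bin].mean())
        counts.append(n)
    return np.array(mean_lags), np.array(gammas), np.array(counts)
```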

4.6. Software Comparison
- the various variogram programs differ slightly in the manner in which lags are represented for each lag interval, and in the flexibility with which lag intervals can be specified; programs that plot lags at their weighted mean are preferable because the effects of clustered data on variogram shape can be identified visually (see the sketch following this list)
- GeoEas plots variogram statistics at the weighted mean lags and allows specification of unequal lag intervals
- VarioWin plots weighted mean lags but allows only equally-spaced lag intervals
- GSLIB's GamV3 also plots weighted mean lags and equal lag intervals
- commercial software such as GeoPack plots centered lags, in which the contribution of clustered data cannot be discerned from the correlation function plots
- variogram plots in ArcGIS's Geostatistical Analyst are a cross between a conventional variogram and a partial variogram cloud, showing the spread of variogram statistics in the cardinal directions within each lag bin
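The practical difference between centered lags and weighted mean lags is easy to see with deliberately clustered data; this sketch is my own illustration and is not taken from any of the packages listed above:

```python
import numpy as np

rng = np.random.default_rng(2)
# strongly clustered 1-D sampling: many points near x = 0, a few near x = 100
x = np.concatenate([rng.uniform(0.0, 10.0, 80), rng.uniform(90.0, 100.0, 5)])

sep = np.abs(x[:, None] - x[None, :])[np.triu_indices(x.size, k=1)]

lag_width = 20.0
edges = np.arange(0.0, 100.0 + lag_width, lag_width)
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (sep > lo) & (sep <= hi)
    if in_bin.any():
        print(f"bin {lo:5.1f}-{hi:5.1f}  center {0.5 * (lo + hi):6.1f}  "
              f"weighted mean lag {sep[in_bin].mean():6.1f}  pairs {int(in_bin.sum()):4d}")
# with clustered sampling the pair-weighted mean lag can sit far from the bin
# center, which is only apparent when variogram points are plotted at mean lags
```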

4.7. Problem Set III. Spatial Correlation 1 - Experimental Variography
