Error structure of spectroscopic data (NIR, FTIR etc)
- and how to deal with them
Harald Martens and Achim Kohler
Centre for Biospectroscopy and Data Modelling, Nofima Food, Ås, Norway
CIGENE Center for Integrative Genetics, University of Life Sciences, Ås
Department of Mathematical Sciences and Technology (IMT), Norwegian University of Life Sciences, Ås, Norway

2 My own field: Measurements and modelling in systems biology
Diagram: DNA, mRNA, proteome, metabolome, biological structure, other phenotypes; environment, human activity
Methods shown: 1D and 2D electrophoresis, MALDI-TOF, LC-MS, GC and LC (-MS), sequencing, SNP, AFLP, NIR, FT-IR, Raman, fluorescence, serotyping, real-time PCR, micro-array
Outcomes shown: disease incidence, virulence, drug sensitivity, biofilm formation, sensory science, economy
Data analysis:
Integrating different types of bio-data
Look for common variation patterns
Make quantitative prediction and forecasting
Identify outliers

3 Now the real fun starts: feed-back!
[Same diagram as slide 2]
High-dimensional dynamic, non-linear ODEs
Spatial PDEs
Possible, since we are now getting relevant and reliable high-throughput, high-dimensional instrumentation

4 Biospectroscopy
Wavelength ranges:
UV-Vis (<750 nm)
Near Infra Red (NIR): 750-2500 nm
Fourier Transform Infra Red (FTIR): >2500 nm
Raman scattering
Fluorescence (mainly <750 nm)
Modes of measurement:
Raman, fluorescence: measure the light reaching the detector. Measured signal is 0% at analyte level 0. Analyte ~ measurement. Noisy.
UV-Vis, NIR, FTIR: transmittance and/or reflectance measured. Measured signal is 100% at analyte level 0. Analyte ~ log(1/measurement). Precise.

5 Biospectroscopy
Errors in measurements:
White noise: random measurement errors (usually heteroscedastic: higher numbers have higher errors)
Coloured noise: systematic errors
  Several undesired, but unavoidable interferants
    From measurement: sample thickness, temperature effects
    From samples: light scattering (simple, complicated), constituent interactions
  Several analytes, with overlapping spectra
Model-based pre-processing:
Identify and correct for systematic errors.
Turn systematic errors into valuable sources of information.

6 Examples of undesired phenomena in FTIR
[Figure: absorption vs. wavenumber (cm^-1), 1000-3500, annotated with: water variations in tissues, Mie scattering, dispersive artefact, wavenumber-dependent effects, baseline shift, multiplicative effect]

7 Principle of model-based pre-processing: Mie scattering of individual liver cancer cells in synchrotron FTIR
[Diagram labels: measured spectra; pre-processing model; chemical absorption; physical contribution]

8 Example: Light microscopy of muscle, one wavelength in visible range

9 Hyperspectral FTIR microscopy of same sample: traditional chemical image at the best wavelength (1240 cm^-1) - the UNIVARIATE TRADITION!
Like playing complex music on a grand piano with one finger at a time

10 Hyperspectral FTIR microscopy of same sample: chemical image at same wavelength after pre-processing
Like playing SIMPLE music on a grand piano with one finger at a time

11 Hyperspectral FTIR microscopy of same sample: chemical image from pre-processing parameters, based on all wavelengths
Like playing complex music on a grand piano with all fingers and toes (+ nose)

12 Analysing/visualising estimated parameters/scatter effects
Estimated parameters can be used for making physical images: b, proportional to the effective optical path length, is estimated for each pixel spectrum
Kohler A, Bertrand D, Martens H, Hannesson K, Kirschner K, and Ofstad R (2007) Multivariate image analysis of a set of FTIR microspectroscopy images of aged bovine tissue combining image and design information. Analytical and Bioanalytical Chemistry 389, 1143-1153.

13 Pre-processing
Model-based pre-processing: parameterize the problems
Combine knowledge-driven and data-driven modelling
Use linear data models (fast, simple, robust), but use both additive and multiplicative operators
Complicated non-linear mathematical models are replaced by bilinear, compressed summaries of model behaviour

14 Abstract
Short version, overview of what to do to handle various error types: Error structure of spectroscopic data (NIR, FTIR etc) (www.specmod.org)
Measurements are usually done in order to quantify an analyte, a chemical or physical property of the samples analyzed. But measured data usually reflect several sources of variation, not only the desired chemical or physical property. This is clearly true for spectroscopic measurements. By proper design of the measurements and proper modelling of the data, it is possible to separate the measured signals into the various sources of variation. Of course, different measurement types and different sample types offer different error structures.
But most systematic errors have a surprisingly systematic nature, from a mathematical point of view, and can therefore be discovered and corrected for by the same procedures. This lecture addresses some psychological, technical, mathematical and statistical issues in how to reduce undesired variations in measured spectra.
Keywords: Remember to address also the undesired phenomena, the interferants, not only the desired analytes! Usually, this requires multi-channel profiling. Increase the number and type of measurement channels in the profile, so that even the interferants can be quantified and subtracted. This may increase the cost of the measurements a little, but saves you a lot of problems later. If possible, measure each sample under a set of different conditions; this increases the information value of a given instrument type. Study the raw spectra graphically, in 1-way, 2-way and 3-way plots, to detect unexpected phenomena. Check them also by PCA etc. to look for more hidden structures. Model-based pre-processing should then be applied, building on a combination of approximate causal understanding and empirical covariations. The purpose is to identify and correct for various signal contribution types: random noise, wavelength shifts, multiplicative amplifications (e.g. instrument amplification, sample thickness, effective optical path length), additive contributions (analyte and interferant concentrations), log-additive contributions (stray light) and response non-linearities of various kinds. Interference effects should be removed, but stored for later studies, not just filtered away and lost. Multivariate calibration models should finally be developed, in order to enhance the selectivity and provide graphical insight into the main structures in the pre-processed data. Analyte predictions and multivariate scores are in turn obtained by passing new spectra through the calibration models, and can in turn be related to other types of information, e.g. from genetics and genomics. Automatic outlier warnings of various kinds should be used in order to detect anomalies. Selectivity enhancement by multivariate analysis has been well known for 25 years in Chemometrics. But the field is still developing, as the lecture will illustrate, e.g. based on high-throughput instrument types (FTIR robotics) and various new uses of the Extended Multiplicative Signal Correction (EMSC).

15 Notation for model-based pre-processing:
ref = a reference spectrum
z = an input sample spectrum (EXAMPLE: z ≠ z_True! But z_True = ref)
m = mean of z, ref (and possibly some others)
Error model:
1) m ≈ z_True
2) z = f(m) + random noise
f() is estimated from input spectra z and m
Error correction:
z_Corr = estimate of z_True = f^-1(z)

16 Simple error types; assume z_True = ref
z = ref + a
z_corr = z - a
[Figure panels: input spectra (spectra z and ref; mean and diff.), visualization tools (z vs ref: absorbance of sample against absorbance of ref), corrected spectra (z corr. and ref); absorbance against wavelength]

19 Simple error types
z = ref + a          z_corr = z - a
z = ref * b          z_corr = z / b
z = ref * b + a      z_corr = (z - a) / b
[Figure panels as on slide 16]
Method: Multiplicative Signal Correction (MSC) or Standard Normal Variates (SNV)

20 Multiplicative Signal Correction and its extension (EMSC)
Model: z = b * z_True + a + noise
MSC:
  Assumption: z_True = m
  i.e. z = b * m + a + noise
  Regression gives b, a
  z_corr = (z - a) / b
EMSC:
  Assumption: z_True = m + c * K_analytes + d * G_interferants
  i.e. z = b * (m + c * K_analytes + d * G_interferants) + a + noise
  i.e. z = b * m + (b*c) * K_analytes + (b*d) * G_interferants + a + noise
  Regression gives b, b*c, b*d, a
  z_corr = (z - a - (b*d) * G_interferants) / b

22 H. Martens is co-owner of the EMSC patent, but academic use is of course free. Algorithms for EMSC are available in Matlab Toolbox etc. and in The Unscrambler, for free research use.
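The MSC and EMSC regressions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented or toolbox implementation: it assumes NumPy, a synthetic reference spectrum m, and a hypothetical interferant matrix G holding a linear and a quadratic baseline. The function names msc and emsc and all numbers are invented for the example, and the coefficient called g in the code plays the role of (b*d) in the model.

```python
import numpy as np

def msc(Z, m):
    """MSC: fit z = a + b*m for each spectrum (row of Z), return (z - a) / b."""
    X = np.column_stack([np.ones_like(m), m])        # regressors: offset and reference
    coef, *_ = np.linalg.lstsq(X, Z.T, rcond=None)   # shape (2, n_spectra)
    a, b = coef
    return (Z - a[:, None]) / b[:, None], a, b

def emsc(Z, m, G):
    """EMSC: fit z = a + b*m + G @ g, return (z - a - G @ g) / b."""
    X = np.column_stack([np.ones_like(m), m, G])     # add interferant spectra as regressors
    coef, *_ = np.linalg.lstsq(X, Z.T, rcond=None)
    a, b, g = coef[0], coef[1], coef[2:]
    return (Z - a[:, None] - (G @ g).T) / b[:, None], a, b, g

# Synthetic demo data (all values hypothetical).
x = np.linspace(0.0, 1.0, 200)
m = np.exp(-((x - 0.4) / 0.05) ** 2) + 0.5 * np.exp(-((x - 0.7) / 0.08) ** 2)
G = np.column_stack([x, x ** 2])                     # assumed baseline interferants
a_true = np.array([0.10, -0.05, 0.20])
b_true = np.array([0.8, 1.0, 1.3])
g_true = np.array([[0.3, -0.2, 0.1],
                   [-0.1, 0.2, 0.4]])
Z = b_true[:, None] * m + a_true[:, None] + (G @ g_true).T

Z_msc, _, _ = msc(Z, m)
Z_emsc, a_hat, b_hat, g_hat = emsc(Z, m, G)
err_msc = np.abs(Z_msc - m).max()    # MSC leaves the curved baseline behind
err_emsc = np.abs(Z_emsc - m).max()  # EMSC models it and removes it
print(err_msc, err_emsc)
```

In this noise-free demo the EMSC-corrected spectra coincide with m, while plain MSC cannot absorb the curved baseline. The estimated a_hat, b_hat and g_hat are kept as return values, in line with the advice to store interference effects for later study rather than discard them.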
23 Example: Model FTIR effects of varying sample temperature in aqueous samples
Input spectra: water at different temperatures
Simple EMSC: G_interferants = wavelength-dependent baseline
EMSC with model of water, K_analytes, and its temperature effects, G_interferants
[Figure annotation: outside instrument range]

26 Example of EMSC: Pre-processing of NIR spectra of powder mixtures
Mixtures of protein and starch powders, absorbance log(1/T), 850-1050 nm
[Figure panels, response vs. channel # (0-100): "Input, EMSC Z.MAT"; "Output, DataCase=155, EMSC, opt. an extra Bad spectrum, in addition to input B"; and the mean-centred response for both]

27 No preprocessing of log(1/T) spectra: standard model output from a multivariate calibration program (The Unscrambler)
[Figure, model YGlutenFromXOD: scores PC1 vs PC2 (X-expl: 42%, 58%; Y-expl: 74%, 21%), samples labelled 000, 025, 050, 075, 100 with L or H; regression coefficients vs. X-variables (Gluten, 5 PCs); explained Y-variance vs. number of PCs (PC_00 to PC_08); predicted Y vs. measured Y (Gluten, 5 PCs)]

28 Mixtures of protein and starch powders, absorbance log(1/T), 850-1050 nm
BEFORE PRE-PROCESSING
AFTER EMSC PRE-PROCESSING, G_interferants found by Simplex optimization of prediction ability
[Figure panels as on slide 26]

29 More nasty error types
Response curvature, e.g. stray light or detector saturation: z = f(z_true), z_corr = f^-1(z)
Sideways shift (from instrument or sample): z_corr = f^-1(z)
Random noise, heteroscedastic: z_corr = filt(z)
Method: Non-linear parameter estimation or Extended Multiplicative Signal Correction (EMSC)
[Figure panels: z = ref and non-linear stray light; mean and diff.; z vs ref; z corr. and ref; absorbance against wavelength]

32 Estimating baseline and multiplicative effect and pre-processing
[Figure panels: raw spectra (absorbance vs. wavenumber, cm^-1); MSC/EMSC (basic); raw spectra vs. mean; corrected spectra vs. mean]

33 Examples for EMSC replicate correction (Ed Stark)
[Figure panels: Raw; EMSC (basic); EMSC rep.; each shown as absorbance vs. wavenumber (cm^-1) and as absorbance vs. absorbance]
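The "EMSC rep." panels suggest estimating the inter-replicate variation from the data itself and adding it to the EMSC model as extra interferant spectra. The Python sketch below shows one way this could be done, and is not necessarily the exact procedure behind the figures. It assumes NumPy, synthetic spectra, a single disturbance pattern g_rep, and one principal component of the within-sample deviations. All function and variable names are hypothetical.

```python
import numpy as np

def emsc(Z, m, G=None):
    """Fit z = a + b*m (+ G @ g) per spectrum; return (z - a - G @ g) / b."""
    cols = [np.ones_like(m), m] + ([] if G is None else [G])
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, Z.T, rcond=None)
    a, b = coef[0], coef[1]
    interf = 0.0 if G is None else (G @ coef[2:]).T
    return (Z - a[:, None] - interf) / b[:, None]

def replicate_deviations(Z, groups):
    """Stack the deviations of each replicate from the mean of its sample."""
    return np.vstack([Z[groups == k] - Z[groups == k].mean(axis=0)
                      for k in np.unique(groups)])

def replicate_patterns(Z, groups, n_comp):
    """PCA loadings of the replicate deviations, shape (n_channels, n_comp)."""
    _, _, Vt = np.linalg.svd(replicate_deviations(Z, groups), full_matrices=False)
    return Vt[:n_comp].T

def within_spread(Z, groups):
    """Root-mean-square deviation of replicates from their sample mean."""
    return float(np.sqrt((replicate_deviations(Z, groups) ** 2).mean()))

# Synthetic demo: 5 samples x 4 replicates (all values hypothetical).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 150)
m = np.exp(-((x - 0.3) / 0.06) ** 2) + 0.6 * np.exp(-((x - 0.7) / 0.10) ** 2)
k_peak = np.exp(-((x - 0.5) / 0.04) ** 2)      # analyte signature
g_rep = np.sin(2 * np.pi * x)                  # assumed replicate disturbance pattern
groups = np.repeat(np.arange(5), 4)
c = np.repeat(np.linspace(0.0, 0.4, 5), 4)     # analyte level, equal within a sample
r = rng.normal(0.0, 0.05, groups.size)         # replicate-specific disturbance
b = rng.uniform(0.8, 1.2, groups.size)         # multiplicative effect
a = rng.normal(0.0, 0.05, groups.size)         # additive offset
Z = (b[:, None] * (m + np.outer(c, k_peak) + np.outer(r, g_rep)) + a[:, None]
     + rng.normal(0.0, 0.001, (groups.size, x.size)))

Z_basic = emsc(Z, m)                                   # basic EMSC (offset + scaling)
G_rep = replicate_patterns(Z_basic, groups, n_comp=1)  # learn the replicate pattern
Z_rep = emsc(Z, m, G_rep)                              # EMSC with replicate correction
spread_basic = within_spread(Z_basic, groups)
spread_rep = within_spread(Z_rep, groups)
print(spread_basic, spread_rep)
```

The number of replicate components is a modelling choice. Too many components risk removing analyte information that happens to resemble the replicate pattern, so the estimated patterns are worth plotting before they are used.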
36 Examples for EMSC replicate correction
[Figure panels: Raw; EMSC (basic); EMSC rep.; absorbance vs. wavenumber (cm^-1), 1000-3000]
Kohler A, Böcker U, Warringer J, Blomberg A, Omholt SW, Stark E, Martens H (2008) Reducing inter-replicate variation in FTIR spectroscopy by extended multiplicative signal correction (EMSC). Applied Spectroscopy.

37 How to obtain more advanced pre-processing models
1. By estimating unwanted variation from the data itself
2. By estimating unwanted variation from mathematical models about known scatter effects, instrumental information etc.
But how to mix complicated mathematical models and simple, linear pre-processing models?
Solution, e.g. for Mie light scattering ("lens effects") of individual cells in synchrotron FTIR microscopy

38 Estimating Mie scattering
[Diagram labels: theory; EMSC subspace model; corrected spectra; Mie scattering]
Kohler A, Sulé-Suso J, Sockalingum GD, Tobin M, Bahrami F, Yang Y, Pijanka J, Dumas P, Cotte M, Martens H (2008) Estimating and correcting Mie scattering in synchrotron based microscopic FTIR spectra by extended multiplicative signal correction (EMSC). Applied Spectroscopy, 62, 259-266.

39 Using Mie scattering model for new samples
[Diagram labels: measured spectra; pre-processing model; chemical absorption; physical contribution]

40 More nasty error types
Response curvature, e.g. stray light or detector saturation: z = f(z_true), z_corr = f^-1(z)
Sideways shift (from instrument or sample): z_corr = f^-1(z)
Random noise, heteroscedastic: z_corr = filt(z)

41 Milk FTIR spectra
[Figure panels, absorbance vs. variables (wavenumber, about 3993 to 1107 cm^-1): dried samples, lab instrument; wet samples minus water, routine instrument; useful spectrum, wet samples minus water, routine instrument]
42 Milk FTIR spectra and functional genomics for optimized milk and meat quality
Large-scale FTIR-bioscreening project in Norway
Routine milk analysis: 6 million milk spectra/year
Calibration milk samples: reference measurements, fatty acids (GC-MS)
Cal. models: predicted fatty acids etc., FA combinations, other components
Feeding experiments
Background knowledge: QTLs etc.? 20K SNPs
Heritability, feeding effects etc.
[Figure: spectra vs. wavenumber]

43 Estimated effect on human total cholesterol level (assuming 20% of energy intake from milk fat)
[Figure panels: prediction error RMSEP_CV vs. PLSR model rank (comp. no. 0-25); prediction ability R^2_CV vs. PLSR model rank (R^2 at AOpt = 0.81913); PPLS exponents vs. PC #; regression coefficients vs. wavenumber; analyte predictions vs. analyte conc., fit and cross-validation (CV), for AMinCV = 24 and AOpt = 24]
R^2_CV = 0.82

44 Now the real fun starts: feed-back!
[Same diagram as slide 3]
Models: dynamic, non-linear ODEs; spatial PDEs
Different feedback control (Jacobi matr.) in different parts of state space
10000-dimensional input data
Eigenvalues vs singular values of the Jacobi matr.
Identify outliers
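The cross-validated prediction error (RMSECV) and R^2_CV curves reported against PLSR model rank for the milk-fat calibration can be computed along the following lines. This is a hedged Python sketch assuming NumPy and synthetic two-constituent spectra. It uses ordinary single-response PLS (PLS1, NIPALS form) with 5 interleaved cross-validation segments, not the powered PLS (PPLS) variant named in the figure. The function names and all data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """PLS1 regression (NIPALS); returns coefficients b, x-mean and y-mean."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = E.T @ f
        w /= np.linalg.norm(w)        # loading weights
        t = E @ w                     # scores
        tt = t @ t
        p = E.T @ t / tt              # X-loadings
        q_a = f @ t / tt              # y-loading
        E = E - np.outer(t, p)        # deflate X
        f = f - q_a * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return b, x_mean, y_mean

def cross_validate(X, y, max_comp, n_segments=5):
    """RMSECV and R2_CV for models with 1..max_comp components."""
    seg = np.arange(len(y)) % n_segments          # interleaved CV segments
    rmsecv, r2cv = [], []
    for n_comp in range(1, max_comp + 1):
        y_cv = np.empty_like(y)
        for s in range(n_segments):
            train, test = seg != s, seg == s
            b, xm, ym = pls1_fit(X[train], y[train], n_comp)
            y_cv[test] = (X[test] - xm) @ b + ym  # predict the left-out samples
        press = np.sum((y - y_cv) ** 2)
        rmsecv.append(np.sqrt(press / len(y)))
        r2cv.append(1.0 - press / np.sum((y - y.mean()) ** 2))
    return np.array(rmsecv), np.array(r2cv)

# Synthetic demo: analyte y and one interferant d with overlapping peaks.
rng = np.random.default_rng(1)
ch = np.arange(50)
k = np.exp(-((ch - 20) / 6.0) ** 2)               # analyte spectrum
g = np.exp(-((ch - 28) / 6.0) ** 2)               # interferant spectrum
y = rng.uniform(0.0, 1.0, 40)
d = rng.uniform(0.0, 1.0, 40)
X = np.outer(y, k) + np.outer(d, g) + rng.normal(0.0, 0.01, (40, 50))

rmsecv, r2cv = cross_validate(X, y, max_comp=4)
for n_comp, (e, r2) in enumerate(zip(rmsecv, r2cv), start=1):
    print(n_comp, round(float(e), 4), round(float(r2), 4))
```

With two overlapping constituents the error drops sharply from one to two components and then levels off. Choosing the model rank where the RMSECV curve flattens is the usual reading of such a plot.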
45 Monitoring dynamic processes by biospectroscopy
A fermentation process in dairy industry monitored by FTIR (ATR) for 26 hours
[Figure: input spectra, FTIR light absorbance vs. wavenumber of the FTIR light, 1000-1600]

46 Three first principal component scores
[Figure: 3D score plot, PC 1 (89.6 % variance), PC 2 (8.7 % variance), PC 3 (0.9 % variance); time points t = 0, 6 hrs, 19 hrs, 21.5 hrs, 26 hrs; labels k1 to k5]

47 Semi-soft modelling of the process
State fingerprints (vs. wavenumber, cm^-1): s1, s2 - s1, s3 - s2, s4 - s3, s5 - s4
State amounts (vs. time, hrs): c1, c2, c3, c4, c5

48 Non-linear dynamic model identification
My other activity in CIGENE:
Cell differentiation model: computer simulation, sensory analysis of mathematical solutions
The Physiome Project: human heart
  Individual heart muscle cell, 36 state variables, 72 param.
  Sets of adjacent, interacting cells
Assessing large non-linear dynamic models too complex for theory
  Nominal-level (Leiden-school!) PLSR of rates vs states
  Study local Jacobians and their eigenvalues vs singular values
  Represent/replace a mathematical form by its behavioural repertoire, by exhaustive simulation (factorial designs to chosen resolution), in compressed Data Base.

49 Conclusions
Many error-types are in fact sources of valuable information.
Model-based pre-processing: identify, quantify and separate out systematic error-types.
Model-based pre-processing in biospectroscopy requires an understanding of the different errors that create the unwanted variation.
As usual:
It is better to be approximately right than precisely wrong
It is better to be aggressive/humble, than to be passive/arrogant.

50 Acknowledgements
People who contributed:
Centre for Integrative Genetics (CIGENE), Norw. U. Life Sci.: Stig Omholt, Erik Plahte, Arne Gjuvsland, Sigbjørn Lien, Hanne Gro Olsen, Åshild Randby
NOFIMA/Matforsk: Achim Kohler, Ulrike Bdtker, Nils Kristian Afseth, Martin Høy
TINE: Kjetil Jørgensen
GENO: Morten Svendsen