Statistical Hydrology, Fall 2012

Carlos Serrano Moreno

Principal Component Analysis of Precipitation in Spain


Abstract
Principal Component Analysis (PCA) is one of the most common techniques for dealing with large data sets. It is a statistical method frequently used in the geophysical sciences to explain the correlations in a large set of variables by means of a smaller number of uncorrelated components. To become familiar with the technique, this project uses data registered by 19 weather stations in Spain to look for the relationship among the registered variables that yields a good estimator of rainfall. The PCA showed that the best way to predict precipitation as a function of the other registered variables is a regional approach, so a separate expression was fitted for every location. Most research centres that work on climate change modelling do not provide estimates of precipitation, but they do offer predictions of other variables, such as temperature, atmospheric pressure and geopotential height, that are easier to predict. Taking into account which variables are available for future scenarios, a multiple regression approach was used to predict precipitation at every station.

Keywords: Principal Component Analysis, multiple regression, precipitation estimation, climate change scenario, Mediterranean areas.


1 Introduction
Flood forecasting is one of the most important challenges in hydrological sciences today. Issuing alerts with adequate lead time before a flood event mitigates its impact and brings enormous social benefits. Mediterranean areas are especially vulnerable to flash floods because of their steep slopes and the large volume of runoff that drains over the impermeable surfaces of the catchments. To provide sufficient lead time for mitigating these hazardous events, researchers use products such as Numerical Weather Predictions (NWPs) or rainfall observations from weather radars. However, although these products provide rainfall estimates at fine resolution, the large uncertainty embedded in them means that the estimates have to be pre-processed and corrected before they can be used as input for hydrologic models.

To correct these estimates, researchers try to use all the information available, so it is also important to work with the data registered directly by weather stations. Whereas NWPs and weather radars provide rainfall estimates over the whole study area, weather stations provide a direct measurement of the variables at a single location. Because of the large number of weather stations available in Mediterranean areas, it also becomes important to learn how to deal with large data sets obtained at different positions, especially when working over a large domain (national or continental scale). It is very important to identify the stations that provide relevant data and to be able to reject those that provide redundant information.

Weather stations give rise to another typical situation in which being able to prioritize data matters. Normally, when looking for long-term rainfall predictions, one finds agencies or organizations that provide estimates of variables that can be related to rainfall, such as temperature or pressure, but no direct estimate of rainfall. In this case too, studying the data available from local stations can help to find an accurate relation between rainfall and temperature, pressure or any other variable.

One of the common techniques used to deal with large data sets is Principal Component Analysis (PCA). It is a statistical method frequently used in the geophysical sciences to explain the correlations in a large set of variables by means of a smaller number of uncorrelated components.

To become familiar with the PCA technique, this project uses data registered by 19 weather stations in Spain to look for the relationship among the registered variables that yields a good estimator of rainfall. If a common relationship between the variables is observed at all the stations, it will be used to characterize precipitation over Spain. However, because of the large climatic differences between the regions of the country, each meteorological variable is expected to play a different role in each climatic area.

1.1 Available data and objectives

Thanks to the Spanish Meteorological Agency (AEMET), monthly data registered at 19 weather stations located in different provinces are available for this project. At most locations the records run from January 1920 to August 2012. The variables registered at the weather stations include: Month, Year, Av. Temperature, Max. Temp, Min. Temp, Total Precipitation, Max. Daily Precipitation, Rainy days, Snowy days, Hail days, Atmospheric Pressure and average insolation.

Because of the large number of variables, PCA is needed to identify which of them are closely related to rainfall and to look for a way of predicting monthly rainfall from a combination of them. PCA will make it possible not only to identify which variables carry the most weight, but also to obtain new variables that are uncorrelated with each other. The ones that explain the highest percentage of the variance will then be chosen to predict the monthly rainfall. PCA thus reduces the complexity of the problem, because a smaller number of variables is involved in the estimation of rainfall.

1.2 Procedure

The PCA will be performed for every weather station separately as well as for the whole sample. The first goal is to identify whether the vectors of the PCA basis are the same (or involve the same variables in the same way) for each station. Depending on the results, it will be interesting to find out whether the same PCA basis can be used to understand the problem all over Spain or whether, on the other hand, some climatic regions inside the country follow different patterns (some variables may be strongly correlated with rainfall in some areas but not in others).

2 Method of analysis
2.1 Principal Component Analysis (PCA)

As stated above, the statistical technique chosen to analyze the available data set is Principal Component Analysis (PCA), which is often regarded as the most common form of factor analysis. This technique yields new variables (also known as dimensions or principal components) that are linear combinations of the original variables registered by each weather station. By construction, these new variables are uncorrelated with one another and capture as much of the original variance in the data as possible. The first principal component gives the direction of greatest variability in the data; the second is the orthogonal (uncorrelated) direction with the next greatest variability, and the procedure continues in the same way until all the principal components have been found. The number of principal components that can be obtained equals the number of variables in the data set (10 in this study: apart from the registered atmospheric variables, the month to which each record corresponds is also included).

However, PCA is not only a way of transforming a set of correlated variables into a set of uncorrelated ones, but also a method for reducing the number of parameters of the problem. If the initial variables are highly correlated, it is possible to work with only the first few principal components, because a small number of them explains almost all the variability of the data. This possibility of reducing the dimension of the problem is one of the most widely used properties of PCA. Deciding how many principal components to retain is not easy: depending on the needs of the research, it may be acceptable to lose some information in order to work with fewer variables.
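The steps described above (standardize the variables, build their correlation matrix, extract its eigenvectors and project the data onto them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the code used in this study, which was carried out in R. The three input series are synthetic, the function names are hypothetical, and the eigenvectors are obtained with a basic Jacobi rotation scheme chosen only to keep the example free of external libraries.

```python
import math

def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def covariance(data):
    """Sample covariance matrix (the correlation matrix if data is standardized)."""
    n, p = len(data), len(data[0])
    means = [sum(r[j] for r in data) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in data) / (n - 1)
             for j in range(p)] for i in range(p)]

def jacobi_eigen(sym, max_sweeps=50, tol=1e-20):
    """Eigenvalues (descending) and eigenvectors (as columns) of a symmetric matrix."""
    p = len(sym)
    a = [row[:] for row in sym]
    v = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(i + 1, p))
        if off < tol:
            break
        for i in range(p):
            for j in range(i + 1, p):
                if a[i][j] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal element a[i][j].
                diff = a[j][j] - a[i][i]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[i][j] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # a <- a J and v <- v J (columns i, j)
                    aki, akj = a[k][i], a[k][j]
                    a[k][i], a[k][j] = c * aki - s * akj, s * aki + c * akj
                    vki, vkj = v[k][i], v[k][j]
                    v[k][i], v[k][j] = c * vki - s * vkj, s * vki + c * vkj
                for k in range(p):  # a <- J^T a (rows i, j)
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k], a[j][k] = c * aik - s * ajk, s * aik + c * ajk
    order = sorted(range(p), key=lambda k: a[k][k], reverse=True)
    values = [a[k][k] for k in order]
    vectors = [[v[r][k] for k in order] for r in range(p)]
    return values, vectors

def pca(rows):
    """PCA on the correlation matrix: eigenvalues, loadings and component scores."""
    z = standardize(rows)
    values, vectors = jacobi_eigen(covariance(z))
    p = len(values)
    scores = [[sum(zr[j] * vectors[j][k] for j in range(p)) for k in range(p)]
              for zr in z]
    return values, vectors, scores

# Synthetic monthly series (NOT the AEMET data): temperature, pressure, precipitation.
rows = [(15 + 8 * math.sin(2 * math.pi * t / 12),
         1013 + 4 * math.cos(2 * math.pi * t / 12) + math.sin(1.7 * t),
         40 - 20 * math.sin(2 * math.pi * t / 12) + 10 * math.cos(0.9 * t))
        for t in range(36)]
values, vectors, scores = pca(rows)
```

In this sketch the eigenvalues add up to the number of variables, because the analysis is done on the correlation matrix, and the covariance matrix of `scores` is diagonal, which is the sense in which the principal components are uncorrelated.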
Understanding the physical meaning of the principal components can be very interesting, but it is not always possible. For this reason, when the problem cannot be reduced to a small number of variables it can be better to work with the initial set: although the problem is more complex, the result is easier to interpret if all the variables have a clear physical meaning.

2.2 Application of PCA to the data set

In this study PCA is applied to the data registered at 19 meteorological stations located in the main cities of Spain. Unlike the usual situations in which PCA is used, the variables studied here are not highly correlated with each other, as Table 1 shows. Applying PCA in this situation may therefore seem to be the wrong decision. However, another interesting feature of PCA is that it also reveals insights and hidden relations between the variables that cannot be seen from a simple analysis of the registered values.
Variable                       AP     AT     Fr     He     In    MDR     Mo     RD     TP     WD     WV
Atmos. Pressure (AP)         1.00  -0.43  0.338  -0.03  -0.03  -0.13  -0.12   -0.2   -0.2  -0.05  -0.02
Av. Temp (AT)               -0.43   1.00  -0.59   -0.3  -0.08  -0.14  -0.24  -0.38  -0.24   0.04  -0.05
Frost (Fr)                  0.338  -0.59   1.00   0.33   0.12   -0.1  -0.13  -0.03  -0.08  -0.04  -0.13
Height (He)                 -0.03   -0.3   0.33   1.00   0.45  -0.13      0      0  -0.11  -0.11  -0.38
Insolation % (In)           -0.03  -0.08   0.12   0.45   1.00  -0.06      0  -0.04  -0.06  -0.03  -0.53
Max. Daily Rainfall (MDR)   -0.13  -0.14   -0.1  -0.13  -0.06   1.00    0.1   0.46   0.79  -0.03   0.16
Month (Mo)                  -0.12  -0.24  -0.13      0      0    0.1   1.00   0.01   0.09  -0.01  -0.05
Rainy Days (RD)              -0.2  -0.38  -0.03      0  -0.04   0.46   0.01   1.00   0.68   0.01  0.107
Total Precip. (TP)           -0.2  -0.24  -0.08  -0.11  -0.06   0.79   0.09   0.68   1.00  -0.04   0.15
Wind dir (WD)               -0.05   0.04  -0.04  -0.11  -0.03  -0.03  -0.01   0.01  -0.04   1.00   0.01
Wind vel (WV)               -0.02  -0.05  -0.13  -0.38  -0.53   0.16  -0.05  0.107   0.15   0.01   1.00

Table 1: Correlation matrix of the variables registered at the weather station in Barcelona.

The PCA was performed with the R software and the R Commander package. The results obtained for the data from Barcelona are shown in the following figure and table:


Figure 1: Scree plot showing the % of explained variance.



Component   Eigenvalue   % of variance   Cumulative % of variance
1                 2.99           29.90                      29.90
2                 1.83           18.35                      48.25
3                 1.11           11.10                      59.35
4                 0.96            9.61                      68.96
5                 0.92            9.20                      78.16
6                 0.77            7.67                      85.83
7                 0.67            6.73                      92.56
8                 0.35            3.52                      96.08
9                 0.29            2.90                      98.98
10                0.11            1.02                     100

Table 2: Eigenvalue and percentage of variance explained by every principal component.

As Figure 1 shows, 10 principal components are needed to represent the whole variance contained in the Barcelona data set. If the main objective were to reduce the complexity of the problem, the first 6 principal components could be used to reproduce around 86 % of the variability, as Table 2 shows. It is also interesting to analyze the contribution of each registered variable to the principal components. Figure 2 shows the role that each variable plays in the first 2 principal components, which together represent almost 50 % of the variance of the data set.
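The percentages in Table 2 follow directly from the eigenvalues: each eigenvalue divided by the sum of all of them gives the share of variance explained by that component. A minimal Python sketch of this arithmetic is given below. The eigenvalues are the rounded values printed in Table 2, so the results agree with the table only up to rounding; the function names and the 85 % target are illustrative assumptions.

```python
# Eigenvalues as printed (rounded) in Table 2 for the Barcelona station.
eigenvalues = [2.99, 1.83, 1.11, 0.96, 0.92, 0.77, 0.67, 0.35, 0.29, 0.11]

def explained_variance(eigs):
    """Return (% of variance, cumulative % of variance) for each component."""
    total = sum(eigs)
    pct = [100.0 * e / total for e in eigs]
    cumulative = []
    running = 0.0
    for value in pct:
        running += value
        cumulative.append(running)
    return pct, cumulative

def components_needed(eigs, target_pct):
    """Smallest number of leading components whose cumulative % reaches target_pct."""
    _, cumulative = explained_variance(eigs)
    for k, value in enumerate(cumulative, start=1):
        if value >= target_pct:
            return k
    return len(eigs)

pct, cumulative = explained_variance(eigenvalues)
n_components = components_needed(eigenvalues, 85.0)
```

With these inputs the first two components account for about 48 % of the variance and six components are needed to reach 85 %, in line with the values discussed above.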


Figure 2: Variables factor map for the first 2 principal components of the data registered in Barcelona.

It is difficult to draw any conclusion from Figure 2, so guessing the physical meaning of the principal components is not trivial in this case, if it is possible at all. Nevertheless, it is important to keep in mind that the main objective of this analysis is to find the variables most strongly related to precipitation, so as to be able to estimate it for future scenarios. The same PCA was performed for the remaining 18 stations, with similar results. For all the stations the principal components explain practically the same % of variance, and the loadings of the registered variables are also practically the same. Another example, for the variables registered in Guadalajara, can be seen in Figure 3.


Figure 2: Variables factor map for the first 2 principal components of the data registered in Guadalajara.

From a visual analysis of these variables factor maps alone, one may be tempted to think that the data set does not depend on the station where it was registered, so that all the meteorological variables play a similar role around the country.

2.3 Discussion of the results obtained after performing the PCA

After performing the PCA for each meteorological station, a decision had to be taken between finding a single relation to estimate precipitation for the whole country and finding an individual relation for each station. The first idea was to find a single relationship, but analyzing together the contribution of each variable to the first components for all the stations revealed some interesting points.


Figure 3: Contribution that each of the registered variables has in the first principal component for each station.

As Figure 3 shows, most of the stations follow a main trend. However, some variables show singularities that may play a key role when trying to estimate precipitation. The average temperature shows 3 different behaviors: in most of the cities the coefficient is negative; it gets close to zero for the stations in Barcelona, Girona, Tarragona and Castellón (4 cities along the Mediterranean coast, in the north-east of Spain); and it becomes positive in Zaragoza and Palencia (Zaragoza is about 300 km west of Barcelona, and Palencia is in Castilla y León, in the northern interior). Figure 4 shows a climatic map of Spain with the position of these cities. The cities along the Mediterranean coast, as well as Zaragoza and Palencia, have particular climatic conditions that make them different from the rest of the cities used in this study, whose climate could be defined as continental.

More differences appear when the individual variables are examined in detail. For most of the variables, the cities mentioned above have coefficients that differ from those of the majority of the cities. For some variables, however, other cities also depart from the main trend, such as Cadiz for wind velocity. Cadiz is very close to the Strait of Gibraltar and is one of the most important places in Europe for surfing. It seems, then, that the singularities of every city can also be observed by taking a detailed look at the PCA results.


Something similar happens with Jaen for the insolation %. Jaen is one of the Spanish provinces where most olive oil is produced, and the quality of the olives is closely related to the insolation they receive. Although these conclusions are only qualitative and seem hard to prove, it is very interesting to see how PCA points out insights that would be very difficult to see from a simple observation of the registered values.


Figure 4: Climate in Spain and position of the cities where the coefficient of the average temperature in the first principal component differs from the one obtained for most of the stations.

In conclusion, the PCA for every station suggests that the estimation of precipitation will perform better at a regional scale than for the whole country.


3 Analysis and Results

Thanks to the PCA it was possible to decide that the best way to predict precipitation as a function of the other registered variables was a regional approach, so a separate expression was fitted for every location. Most research centres that work on climate change modelling are not able to provide reliable estimates of rainfall, because rainfall is a highly random phenomenon. However, these institutions do provide estimates of other variables, such as temperature, atmospheric pressure and geopotential height, that are easier to predict. Taking into account which variables are available for future scenarios, the following multiple regression approach was used to predict precipitation for every station:

Although the approach suggested here is very simple and, as the correlation matrix in the previous section anticipated, does not offer a good prediction of precipitation, an interesting result can be seen in Figure 5.
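As an illustration of how a multiple linear regression of this kind can be fitted by least squares and then scored with a correlation coefficient, a minimal Python sketch follows. It is not the model of this study: the two predictors (temperature and pressure anomaly), their values, the coefficients and all function names are hypothetical, and the precipitation series is generated without noise only so that the fitted coefficients can be checked against known values.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_multiple_regression(predictors, y):
    """Least-squares coefficients [b0, b1, ...] of y = b0 + b1*x1 + b2*x2 + ..."""
    design = [[1.0] + list(row) for row in predictors]  # intercept column first
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)  # normal equations (X^T X) b = X^T y

def predict(coeffs, row):
    """Evaluate the fitted linear model for one set of predictor values."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))

def pearson_r(u, v):
    """Pearson correlation between observed and fitted values."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

# Hypothetical monthly values for one station (NOT the AEMET records).
temperature = [10.0, 14.0, 18.0, 22.0, 26.0, 12.0]    # degrees C
pressure_anomaly = [2.0, -1.0, 5.0, -3.0, 3.0, -5.0]  # hPa relative to a reference
predictors = list(zip(temperature, pressure_anomaly))
precipitation = [60.0 - 1.5 * t - 2.0 * p for t, p in predictors]  # synthetic, noise-free

coeffs = fit_multiple_regression(predictors, precipitation)
fitted = [predict(coeffs, row) for row in predictors]
r = pearson_r(precipitation, fitted)
```

Fitting one such model per station and comparing the resulting correlation coefficients is one way to obtain a station-by-station comparison of the kind shown in Figure 5. With real, noisy records the correlation would of course be lower than in this noise-free example.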


Figure 5: Correlation obtained after fitting the suggested multiple regression model to estimate precipitation.


It is interesting to see that the model is not able to predict precipitation at all at the stations closest to the Mediterranean. Clearly, at least in these areas, the problem should be approached using different variables. The areas affected by the Atlantic and continental climates offer a better result, even though the model suggested here is not valid there either. This result is nevertheless very logical: rainfall events in the Atlantic areas are easier to predict because they originate mainly from oceanic storms, large-scale phenomena that always develop under certain atmospheric pressure and wind conditions. The Mediterranean climate, in contrast, is known for convective rainfall events that happen at a smaller scale and are much more difficult to predict.

4 Conclusions

As described in the previous sections, PCA has proved to be a useful technique for studying the problem posed here. In this case PCA was not used to reduce the dimension of the problem, because the variables of the data set showed very low correlation with each other. However, PCA was useful for discovering relations between the variables and the location that could not have been observed directly. It is very important to analyze the output of the PCA from different points of view. Thanks to that output, it was possible to identify the most suitable way of dealing with the problem: using a simple approach to estimate precipitation at each station instead of doing it for the whole country. The linear model used to estimate precipitation does not give a good result, but it can be helpful as a first approach for further research. Even though the linear regression model performs very poorly, it gives another interesting and consistent result: precipitation is harder to predict in the Mediterranean areas, while in the Atlantic climate areas the correlation between the atmospheric variables and precipitation is higher.


5 References
Reference paper:
Kahya E., S. Kalayci, and T.C. Piechota, 2008: Streamflow Regionalization: Case Study of Turkey. Journal of Hydrologic Engineering, Vol 13, No 4, pp 205-214.

Other papers:
Basalirwa C.P.K., J.O. Odiyo, R.J. Mngodo, and E.J. Mpeta, 1999: The Climatological Regions of Tanzania Based on the Rainfall Characteristics. International Journal of Climatology, Vol 19, pp 69-80.
Stathis D., and D. Myronidis, 2009: Principal Component Analysis of Precipitation in Thessaly Region (Central Greece). Global NEST Journal, Vol 11, No 4, pp 467-476.
Lê S., J. Josse, and F. Husson, 2008: FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software, Vol 25, No 1.
