FRANCISCO MARTÍNEZ-ÁLVAREZ* Pablo de Olavide University of Seville, Spain, fmaralv@upo.es ALICIA TRONCOSO Pablo de Olavide University of Seville, Spain, ali@upo.es JOSÉ C. RIQUELME University of Seville, Spain, riquelme@us.es
Data mining has become an essential tool during the last decade to analyze large sets of data. The variety of techniques it includes and the successful results obtained in many application fields make this family of approaches powerful and widely used. In particular, this work explores the application of these techniques to time series forecasting. Although classical statistical-based methods provide reasonably good results, the results of the application of data mining outperform those of classical methods. Hence, this work faces two main challenges: i) to provide a compact mathematical formulation of the most widely used techniques; ii) to review the latest works on time series forecasting and, as a case study, those related to electricity price. Categories and Subject Descriptors: A.1 [Introductory and Survey]; G.3 [Probability and Statistics]: Time series analysis; H.2.8 [Database Management]: Database Applications: Data mining Additional Key Words and Phrases: Time series, forecasting, data mining
1. INTRODUCTION
The prediction of the future has fascinated human beings since their early existence. Actually, many of these efforts can be noticed in everyday events such as weather, currency exchange rates, pollution, and so forth. Accurate predictions are essential in economic activities, as remarkable forecasting errors in certain areas may involve large losses of money. Given this situation, the successful analysis of temporal data has been a challenging task for many researchers during the last decades and, indeed, it is difficult to figure out any scientific branch with no time-dependent variables. A thorough review of the existing techniques devoted to forecasting time series is provided in this survey. Although a description of the classical Box-Jenkins methodology is also discussed, this text is particularly focused on those methodologies that
*Corresponding author Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 2011 ACM 0000-0000/2011/0000-0001 $5.00
ACM Computing Surveys, Vol. V, No. N, Month 2011, Pages 10??.
F. Martínez-Álvarez et al.
make use of data mining techniques. Moreover, a family of energy-related time series is examined due to the scientific relevance it has shown during the last decade: electricity price time series. These series have been chosen since they present some peculiarities, such as non-constant mean and variance, high volatility or the presence of outliers, which turn the forecasting process into a particularly difficult task to fulfil.

Actually, the electric power markets have become competitive markets due to the deregulation carried out in recent years, allowing the participation of all producers, investors, traders or qualified buyers. Thus, the price of electricity is determined on the basis of this buying/selling system. Consequently, a desire to obtain optimized bidding strategies has arisen in electricity-producer companies [Plazas et al. 2005], which need both insight into future electricity prices and an assessment of the risk of trusting predicted prices.

Some works have already reviewed electricity price time series forecasting techniques [Aggarwal et al. 2009]. However, none of them is focused on data mining techniques, as they mainly explored the accuracy of linear models. Hence, this review aims to fill this gap and to highlight all the capabilities these techniques exhibit nowadays. In short, this survey provides the reader with a general overview of current data mining techniques used in time series analysis and, as a case study, their application to a real-world energy-related set of series.

The remainder of this work is structured as follows. Section 2 provides a formal description of a time series and describes its main features. As for the rest of the sections, every section is devoted to exploring one forecasting technique. However, the reader must consider that these techniques are divided into two main groups: linear regressions and non-linear approaches.
Therefore, all sections consist of a brief mathematical description of the technique analyzed and a review of the five most representative works. In particular, Section 3 describes the approaches based on linear methods. Classical Box and Jenkins-based methods such as AR, MA, ARMA, ARIMA or VAR are thus reviewed. In Section 4, rule-based forecasting methods are analyzed, providing a brief explanation of what a decision rule is and revisiting the latest and most relevant works in this domain. As for Section 5, it is a compendium of the non-linear forecasting techniques currently in use in the data mining domain. In this way, artificial neural networks, support vector machines, genetic programming and wavelet transforms are described in this section, as well as the most relevant improvements achieved by means of these techniques. Section 6 provides an overview of the local forecasting methods, focusing on the nearest neighbors technique and its variants. A compilation of several works that cannot be classified into any of the aforementioned groups is described in Section 7. Thus, forecasting approaches based on Markov processes, on Grey models or on manifold dimensionality reduction are there detailed. Finally, Figure 1 depicts the methods studied along with the section where they are introduced.
2. TIME SERIES

This section describes the features of temporal data as well as providing a mathematical description for such a kind of data. Thus, a time series can be understood as a sequence of values observed over time and chronologically ordered. Time is a continuous variable; however, in practice, samples are recorded at constant intervals. When time is considered as a continuous variable, the discipline is commonly referred to as functional data analysis [Ramsay and Silverman 2005]. The description of this category is out of the scope of this survey.

Let y_t, t = 1, 2, ..., T, be the historical data of a given time series. This series is thus formed by T samples, where each y_i represents the recorded value of the variable y at time i. Therefore, the forecasting process consists in estimating the value ŷ_{T+1} and the goal is to minimize the error, which is typically represented as a function of ŷ_{T+1} − y_{T+1}. This estimation can be extended when the horizon of prediction is greater than one, that is, when the objective is to predict a sample at a time T + h, y_{T+h}. In this situation, the best prediction is reached when a function of Σ_{i=1}^{h} (ŷ_{T+i} − y_{T+i}) is minimized.

Time series can be graphically represented. In particular, the x-axis identifies the time (t = 1, 2, ..., T) whereas the y-axis shows the values recorded at particular time stamps (y_t). This representation allows the visual detection of the most salient features of a series, such as the amplitude of the oscillations, the existing seasons and cycles, or the existence of anomalous data or outliers. Figure 2 illustrates, as an example, the price evolution for a particular period of 2006 in the Spanish electricity market.

A usual strategy to analyze time series is to decompose them into three main components [Brockwell and Davis 2002; Shumway and Stoffer 2011]: trend, seasonality and irregular components, also known as residuals.

(1) Trend. It is the general movement that the variable exhibits during the observation period, without considering seasonality and irregulars. Some authors prefer to refer to the trend as the long-term movement that a time series shows. Trends can present different profiles, such as linear, exponential or parabolic.

(2) Seasonality. This component typically represents the periodical fluctuations of the variable subjected to analysis. It consists of effects that are reasonably stable along the observation period.
Fig. 2. Electricity price evolution, with the price on the y-axis and time (hours) on the x-axis.
Fig. 3. Decomposition of a time series into trend, seasonality and residuals (magnitude versus time).
(3) Residuals. This is the irregular component that remains once the trend and seasonality have been removed from the series.
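As a rough sketch of this additive decomposition (plain Python with hypothetical toy data, not an example from the survey), a centered moving average can estimate the trend, per-position means of the detrended series can estimate the seasonality, and whatever remains is the residual:

```python
# Additive decomposition sketch: y_t = trend_t + season_t + residual_t
# Toy series: a linear trend plus a period-4 seasonal pattern (illustrative only).
period = 4
y = [t * 0.5 + [2.0, -1.0, 0.5, -1.5][t % period] for t in range(24)]

def moving_average_trend(series, window):
    """Centered moving average; None where the window does not fit."""
    half = window // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        trend[t] = sum(series[t - half:t + half + 1]) / window
    return trend

trend = moving_average_trend(y, period + 1)  # odd window centered on t

# Detrend, then average the detrended values at each seasonal position
detrended = [y[t] - trend[t] for t in range(len(y)) if trend[t] is not None]
positions = [t % period for t in range(len(y)) if trend[t] is not None]
season = [
    sum(d for d, p in zip(detrended, positions) if p == s)
    / max(1, sum(1 for p in positions if p == s))
    for s in range(period)
]

# Residual: what neither trend nor seasonality explains
residual = [y[t] - trend[t] - season[t % period]
            for t in range(len(y)) if trend[t] is not None]
```

By construction the residuals average out to zero at each seasonal position; for a real series they would carry the irregular component discussed above.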
Figure 3 depicts how a time series can be decomposed into the components described above.

Obviously, real-world time series present a meaningful irregular component, which makes their prediction an especially hard task to fulfil. Some forecasting techniques are focused on detecting trend and seasonality (especially traditional classical methods); however, the residuals are the most challenging component to be predicted. The effectiveness of one technique or another is assessed according to its capability of forecasting this particular component. It is in the analysis of this component where data mining-based techniques have been shown to be particularly powerful, as this survey will attempt to show in the next sections.

There exist real complex phenomena that cannot be represented by means of linear difference equations since they are not fully deterministic. Therefore, it may be desirable to insert a random component in order to allow a higher flexibility in their analysis.

3. LINEAR METHODS

Linear forecasting methods are those that try to model a time series behavior by means of a linear function. From all the existing techniques, five of them are quite popular: AR, MA, ARMA, ARIMA and VAR. These models follow a common methodology, whose application to time series analysis was first introduced by Box and Jenkins. The original work has been extended and republished many times since its first appearance in 1970, but the newest version can be found in [Box and Jenkins 2008]. Autoregressive AR(p), moving average MA(q), mixed ARMA(p, q) and autoregressive integrated moving average ARIMA(p, d, q) models were described following this idea, where p is the number of autoregressive parameters, q is the number of moving average parameters and d is the number of differentiations for the series to be stationary. Vector autoregressive models VAR(p) are the natural extension of AR models to multivariate time series, where p denotes the number of lags considered in the system.
There are some other models, such as ARCH (autoregressive conditional heteroscedasticity, first presented in [Engle 1982]) or GARCH (generalized ARCH, introduced in [Bollerslev 1986]), which follow a similar methodology. However, these two methods are especially designed to cope with volatile time series, that is, with series that exhibit high volatility and outlying data. As this particular kind of data deserves a thorough study, the authors have decided not to include any further information about them (for detailed information refer to [Xekalaki and Degiannakis 2010; Francq and Zakoian 2010]).

An autoregressive process (AR) is denoted by AR(p), where p is the order of the AR process. This process assumes that every y_t can be expressed as a linear combination of some past values. It is a simple model, but one that adequately describes many real complex phenomena. The generalized AR model of order p is described by:

y_t = Σ_{i=1}^{p} φ_i · y_{t−i} + ε_t    (1)

where φ_i are the coefficients that model the linear combination, ε_t the adjustment error, and p the order of the model.
When the error is small compared to the actual values, a future value can be estimated as follows:

ŷ_t = Σ_{i=1}^{p} φ_i · y_{t−i}    (2)
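As a minimal sketch (plain Python over synthetic data; not code from the survey), the coefficient of an AR(1) model can be estimated by least squares and then plugged into Equation (2) for a one-step-ahead forecast:

```python
import random

random.seed(0)

# Simulate an AR(1) process: y_t = phi * y_{t-1} + eps_t
phi_true = 0.7
y = [0.0]
for _ in range(500):
    y.append(phi_true * y[-1] + random.gauss(0.0, 1.0))

# Least-squares estimate of phi: minimizes sum of (y_t - phi * y_{t-1})^2,
# whose closed-form solution is the ratio below.
num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
phi_hat = num / den

# One-step-ahead forecast, Equation (2) with p = 1
forecast = phi_hat * y[-1]
```

For p > 1 the same idea leads to a p-dimensional linear system (the normal or Yule-Walker equations) instead of a single ratio.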
Vector autoregressive models (VAR) are the natural extension of the univariate AR to multivariate time series. VAR models have been shown to be especially useful to describe dynamic behaviors in time series and, therefore, to forecast. In a VAR process of order p with N variables, VAR(p), N different equations are estimated. In each equation, a regression of the target variable over p lags is carried out. Unlike the univariate case, VAR allows each series to be related to its own lags and to the lags of the other series that form the system. For instance, in a two-time-series system there are two equations, one for each variable. This two-series system (VAR(1), N = 2) can be mathematically expressed as follows:

y_{1,t} = φ_{11} · y_{1,t−1} + φ_{12} · y_{2,t−1} + ε_{1,t}    (3)

y_{2,t} = φ_{21} · y_{1,t−1} + φ_{22} · y_{2,t−1} + ε_{2,t}    (4)

where y_{i,t} for i = 1, 2 are the series to be modeled, ε_{i,t} the error terms, and the φ's the coefficients to be estimated. Note that the selection of an optimum length of the lag is a critical task for VAR processes and, for this reason, it has been widely discussed in the literature [Yang 2002].
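A one-step VAR(1) forecast for the two-series system of Equations (3) and (4) reduces to a small matrix-vector product once the coefficients are known. The coefficients below are illustrative placeholders, not values estimated from real data:

```python
# Illustrative VAR(1) coefficients phi[i][j] for a two-series system
phi = [[0.5, 0.2],
       [0.1, 0.4]]
y_prev = [1.0, 2.0]  # [y1_{t-1}, y2_{t-1}]

# y_{i,t} = phi_{i1} * y1_{t-1} + phi_{i2} * y2_{t-1}  (noise term omitted)
y_t = [sum(phi[i][j] * y_prev[j] for j in range(2)) for i in range(2)]
```

Each forecast thus mixes both series, which is exactly what distinguishes VAR from fitting two independent univariate AR models.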
When the error cannot be assumed to be negligible, AR processes are not valid. In this situation it is practical to use the moving average (MA) process, where the series is represented as a linear combination of the error values:

y_t = ε_t + Σ_{i=1}^{q} θ_i · ε_{t−i}    (6)

where q is the order of the MA model and θ_i are the coefficients of the linear combination. As observed, it is not necessary to make explicit use of past values of y to estimate its future value. Finally, MA processes are seldom used alone in practice.
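For illustration (plain Python with a hypothetical coefficient), an MA(1) series in the sense of Equation (6) is generated purely from a linear combination of error terms, with no lagged values of y itself:

```python
import random

random.seed(1)

theta = 0.6  # illustrative MA(1) coefficient
eps = [random.gauss(0.0, 1.0) for _ in range(200)]

# y_t = eps_t + theta * eps_{t-1}   (Equation (6) with q = 1)
y = [eps[t] + theta * eps[t - 1] for t in range(1, len(eps))]
```

A quick check of such a series is its lag-1 autocorrelation, which for MA(1) should be close to theta / (1 + theta^2) and near zero at all larger lags.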
Autoregressive and moving average models are combined in order to generate better approximations than those of Wold's representation [Wold 1954]. This hybrid model is called the autoregressive moving average process (ARMA) and is denoted by ARMA(p, q). Formally:

y_t = Σ_{i=1}^{p} φ_i · y_{t−i} + Σ_{i=1}^{q} θ_i · ε_{t−i} + ε_t    (7)

Again, ARMA assumes that ε_t is small compared to y_t in order to estimate future values ŷ_t. The estimates of the past errors at time t−i can be obtained from the past actual values y_t and their estimates ŷ_t:

ε̂_{t−i} = y_{t−i} − ŷ_{t−i}    (8)

Thus, ŷ_t is calculated as follows:

ŷ_t = Σ_{i=1}^{p} φ_i · y_{t−i} + Σ_{i=1}^{q} θ_i · ε̂_{t−i}    (9)

where ε̂_{t−i} is the estimate of the error ε_{t−i} and is one of the terms in the prediction equation.

These models follow a common methodology, whose application to time series analysis was first introduced by Box and Jenkins [Box and Jenkins 2008]. Thus, this methodology proposes an iterative process formed by four main steps, as illustrated in Figure 4.
(1) Identification of the model. A tentative structure is selected for the series under study, that is, the orders p, d and q are determined.

(2) Estimation of the parameters. Once the model has been identified, its parameters are estimated by following non-linear strategies. From all of them, three stand out: the Gauss-Newton method, the least squares minimization and the maximum likelihood.

(3) Validation of the model. Once the ARIMA model has been estimated, several hypotheses have to be validated. Thus, the fitness of the model, the residual values or the significance of the coefficients forming the model are forced to agree with some requirements. In cases in which this step is not fulfilled, the process begins again and the parameters are recalculated.

(4) Forecasts. Finally, if the parameters have been properly determined and validated, the system is ready to perform forecasts.
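The recursion implied by Equations (8) and (9) can be sketched directly: run through the series computing residual estimates, then issue the next one-step forecast. The coefficients and data below are illustrative placeholders, not values fitted with the methodology above:

```python
# Illustrative ARMA(1,1) one-step forecasting via Equations (8) and (9)
phi, theta = 0.5, 0.3   # hypothetical, fixed coefficients (not estimated)
y = [1.2, 0.8, 1.5, 0.9, 1.1, 1.4, 1.0]

eps_hat = [0.0]      # the error before the first sample is unknown; start at 0
forecasts = [y[0]]   # no information before t = 0; bootstrap with the actual value
for t in range(1, len(y)):
    # Equation (9): prediction from the previous value and previous residual
    y_hat = phi * y[t - 1] + theta * eps_hat[t - 1]
    forecasts.append(y_hat)
    # Equation (8): residual estimate = actual minus predicted
    eps_hat.append(y[t] - y_hat)

# Forecast for the sample following the observed series
next_forecast = phi * y[-1] + theta * eps_hat[-1]
```

In a full Box-Jenkins cycle, phi and theta would come from step (2) and the eps_hat sequence would feed the residual checks of step (3).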
Ghosh presented an ARIMA variation in [Ghosh and Das 2004], which was used to predict the maximum monthly demand in the state of Maharashtra, India. He used a database that gathers data from April 1980 to June 1999 and made a prediction for the eighteen subsequent months. The reported results were good. A similar approach was applied in another work, but this time the variable to be predicted was the inbound air travel arrivals to Taiwan.

The authors in [Conejo et al. 2005] used the wavelet transform and ARIMA models to predict the day-ahead electricity price. Indeed, they first used the wavelet transform to split the available historical data into constitutive series. Then, specific ARIMA models were applied to these series and the forecasts were obtained by applying the inverse wavelet transform to the forecasts of these constitutive series. This approach improved former strategies that they had also published [Jiménez and Conejo 1999; Nogales et al. 2002; Contreras et al. 2003].

In [García-Martos et al. 2007], ARIMA models, selected by means of the Bayesian Information Criterion, were proposed to obtain the forecasts of electricity prices in the Spanish market. In addition, the work analyzed the optimal number of samples used to build the prediction models.

Weron et al. [Weron and Misiorek 2008] presented twelve parametric and semi-parametric time series models to predict electricity prices for the next day. Moreover, in this work forecasting intervals were provided and evaluated taking into account the conditional and unconditional coverage. They concluded that the intervals obtained by semi-parametric models are better than those of parametric models.
4. RULE-BASED FORECASTING
Prediction based on decision rules usually makes reference to the expert system developed by Collopy and Armstrong in 1992 [Collopy and Armstrong 1992]. The initial approach consisted of 99 rules that combined four extrapolation-based forecasting methods: linear regression, Holt-Winters' exponential smoothing, Brown's exponential smoothing and random walk. During the prediction process, 28 features were extracted in order to characterize the time series. Consequently, this strategy assumed that a time series can be reliably identified by some features. Nevertheless, just eight features were obtained by the system itself, since the remaining ones were selected by means of experts' inspections. This fact implies high inefficiency insofar as too much time is taken, the ability of the analyst plays an important (and subjective) role, and it shows a medium reliability.

4.1 Fundamentals

Formally, an association rule (AR) can be expressed as a sentence such that: If A Then B, with A a logic predicate over the attributes whose fulfillment involves classifying the elements with the label B. Rule-based learning tries to find rules involving the highest number of attributes and samples. ARs were first defined by Agrawal et al. [Agrawal et al. 1993] as follows. Let I = {i_1, i_2, ..., i_n} be a set of n items, and D = {tr_1, tr_2, ..., tr_N} a set of N transactions, where each tr_j contains a subset of items. Thus, a rule can be defined as X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. Finally, X and Y are called the antecedent (or left side of the rule) and the consequent (or right side of the rule), respectively.

When the domain is continuous, the association rules are known as quantitative association rules (QAR). In this context, let F = {F_1, ..., F_n} be a set of features with values in R. Let A and C be two disjoint subsets of F, that is, A ⊂ F, C ⊂ F, and A ∩ C = ∅. A QAR is a rule X ⇒ Y, in which the features in A belong to the antecedent X, and the features in C belong to the consequent Y, such that:

X = ⋀_{F_i ∈ A} F_i ∈ [l_i, u_i]    (10)
Y = ⋀_{F_j ∈ C} F_j ∈ [l_j, u_j]    (11)

where l_i and l_j represent the lower limits of the intervals for F_i and F_j, respectively, and u_i and u_j the upper ones. For instance, a QAR could be numerically expressed as:
F_1 ∈ [l_1, u_1] and F_3 ∈ [l_3, u_3] ⇒ F_2 ∈ [l_2, u_2] and F_5 ∈ [l_5, u_5]    (12)
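A quantitative association rule of the form in Equation (12) can be evaluated over a table of feature values by checking interval membership. The intervals and data rows below are entirely illustrative:

```python
# Rows of feature values F1..F5 (hypothetical data, for illustration only)
rows = [
    {"F1": 13.0, "F2": 4.1, "F3": 25.0, "F4": 0.0, "F5": 14.0},
    {"F1": 10.0, "F2": 3.5, "F3": 26.0, "F4": 1.0, "F5": 13.0},
    {"F1": 14.5, "F2": 6.2, "F3": 27.0, "F4": 2.0, "F5": 20.0},
]

# Antecedent X and consequent Y as {feature: (lower, upper)} intervals
antecedent = {"F1": (12.0, 15.0), "F3": (24.0, 28.0)}
consequent = {"F2": (3.0, 5.0), "F5": (12.0, 18.0)}

def holds(row, intervals):
    """True when every feature lies inside its [lower, upper] interval."""
    return all(lo <= row[f] <= hi for f, (lo, hi) in intervals.items())

# Support: fraction of rows satisfying both sides of the rule;
# confidence: among rows satisfying X, the fraction also satisfying Y.
matches_x = [r for r in rows if holds(r, antecedent)]
matches_xy = [r for r in matches_x if holds(r, consequent)]
support = len(matches_xy) / len(rows)
confidence = len(matches_xy) / len(matches_x) if matches_x else 0.0
```

Support and confidence are the usual quality measures for such rules; a rule-mining algorithm searches for interval combinations that maximize them.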
In addition, the system was implemented in order to deal with stationary and trended time series.

5. NON-LINEAR METHODS

5.1 Artificial neural networks

An artificial neural network (ANN) is formed by interconnected neurons, in which every input is multiplied by a weight. Inputs can excite the neuron, if a positive weighted synapsis is carried out, or, on the contrary, they can inhibit it if the weight is negative. Finally, if the sum of the weighted inputs is equal to or greater than a certain threshold, the neuron is activated. Neurons present, consequently, binary results: activation or no activation. Figure 5 illustrates a usual structure of an ANN.

There are three main features that characterize a neural network: topology, learning paradigm and the representation of the information. A brief description of them is now provided.

(1) Topology of the ANN. The architecture of a neural network consists in the organization and position of the neurons with regard to the input or output of the network. In this sense, the fundamental parameters of the network are the number of layers, the number of neurons per layer, the connection grade and the type of connections among neurons. With reference to the number of layers, ANNs can be classified into monolayer or multilayer networks (MLP). The first ones only have one input layer and one output layer, whereas the multilayer networks [Rosenblatt 1958] are a generalization of the monolayer ones, which add intermediate or hidden layers between the input and the output. With reference to the type of connections, networks are called feedforward if the propagation is produced in just one way and, therefore, they do not have a memory, or recurrent if they keep feedback links between neurons in different layers, neurons in the same layer or within the same neuron. Finally, regarding the connection grade, networks can be totally connected, if all neurons in a layer are connected with the neurons in the next layer (feedforward networks) or with the neurons in the previous layer (recurrent networks), or partially connected, in cases where there is no total connection among neurons from different layers.

(2) Learning paradigm. Learning is a process that consists in modifying the weights of the ANN according to the input information. The changes that can be carried out during the learning process are removing (the weight is set to zero), adding (conversion of a weight equal to zero into a weight different from zero) or modifying connections between neurons. The learning process is said to be finished or, in other words, the network is said to have learnt, when the values assigned to the weights remain unchanged. The learning in ANN can be either supervised (perceptron and backpropagation [Rumelhart et al. 1986] techniques) or unsupervised, from which the self-organized Kohonen maps (SOM) [Kohonen 1986] stand out.

(3) Representation of the input/output information. ANNs can also be classified according to the way in which information relative to both input and output data is represented. Thus, in a great number of networks, input and output data are analog, which entails activation functions that are also analog, either linear or sigmoidal. In contrast, there are some networks that only allow discrete or even binary values as input data. In this situation, the neurons are activated by means of a step function. Finally, hybrid ANNs can be found in which input data may accept continuous values and output data would provide discrete values, or vice versa.
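The neuron activation described above, a weighted sum compared against a threshold, fits in a few lines of plain Python; the weights and threshold here are illustrative, not taken from any network in the survey:

```python
def neuron(inputs, weights, threshold):
    """Binary threshold unit: fires (1) when the weighted sum of the
    inputs reaches the threshold, stays inactive (0) otherwise.
    Positive weights excite the neuron; negative weights inhibit it."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0

# Illustrative case: two excitatory inputs and one inhibitory input
out_active = neuron([1.0, 1.0, 0.0], [0.6, 0.5, -1.0], threshold=1.0)
out_inhibited = neuron([1.0, 1.0, 1.0], [0.6, 0.5, -1.0], threshold=1.0)
```

With the third (negatively weighted) input switched on, the weighted sum drops below the threshold and the neuron stays inactive; an MLP stacks layers of such units, usually replacing the hard threshold with a sigmoid.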
Many references proposing the use of ANNs, or a variation of them, as a powerful tool to forecast time series can be found in the literature. The most important works are detailed below. Furthermore, the creation of hybrid methods that highlight most of the strengths of each technique is currently one of the most popular lines of work among researchers and, from all of them, the combination of ANN and fuzzy logic stands out.
In [Abraham and Nath 2001] the performance of ANNs, fuzzy networks and ARIMA models was evaluated to forecast the electricity demand time series in the State of Victoria (Australia). The results showed that the fuzzy neural network outperformed the plain ANN and ARIMA models.

The detection of spike values is an arduous task usually faced by researchers. Hence, Saini and Soni [Saini and Soni 2003] implemented a multilayer perceptron with two hidden layers in order to predict peaks of demand in the Indian energy system, by using conjugate gradient methods. In order to avoid the trapping of the network into a state of local minima, the optimization of the learning rate and error was performed.

Rodríguez and Anders [Rodríguez and Anders 2004] presented a method to predict electricity prices by means of an ANN and fuzzy logic, as well as a combination of both. The basic selected network configuration consisted of a back propagation neural network with one hidden layer that used a sigmoid transfer function and a one-neuron output layer with a linear transfer function. They also reported the results of applying different regression-based techniques over the Ontario market.

A hybrid model that combines ANNs and fuzzy logic was introduced in [Amjady 2006]. As regards the neural network presented, it had a feed-forward architecture and three layers, where the hidden nodes of the proposed fuzzy neural network performed the fuzzification process. The approach was tested over the Spanish electricity price market and was shown to be better than many other techniques such as ARIMA, wavelet or MLP models.

Taylor et al. [Taylor 2006] compared six univariate time series methods to forecast electricity load for the Rio de Janeiro and the England and Wales markets. These methods were an ARIMA model and an exponential smoothing model (both for double seasonality), an artificial neural network, a regression model with a previous principal component analysis and two naive approaches as reference methods. The best method was the proposed exponential smoothing, and the regression model showed a good performance for the England and Wales demand.

Another neural network-based approach was introduced in [Catalao et al. 2007], in which multiple combinations were considered. These combinations consisted of networks with different numbers of hidden layers, different numbers of units in each layer and several types of transfer functions. The authors evaluated the accuracy of the approach, reporting the results from the electricity markets of mainland Spain and California.

The use of ANN for forecasting electricity prices in the Spanish market was also proposed in [Pino et al. 2008]. The main novelty of this work lies in the proposed training method for the ANN, which is based on making a previous selection of the MLP training samples using an ART-type [Zurada 1992] neural network.

In [El-Telbany and El-Karmi 2008], the authors discussed and presented results obtained by using an ANN, trained by a particle swarm optimization technique, to forecast the Jordanian electricity demand. They also showed the performance obtained by using a back propagation algorithm and autoregressive moving average models.

In 2009, the authors in [Yao 2009] integrated generalized linear auto-regression
and ANN for forecasting the coal demand of China. To evaluate the quality of that hybrid approach, they compared the performance of their own method with those of the two models used separately.

5.2 Genetic programming

A genetic algorithm (GA) [Goldberg 1989] is a kind of stochastic search algorithm based on natural selection procedures. Such algorithms try to imitate the biological evolutionary process, since they combine the survival of the best individuals in a set by means of a structured and random process of information exchange. Every time the process iterates, a new set of data structures is generated, gathering just the best individuals of older generations. Thus, GAs are evolutionary algorithms due to their capacity to efficiently exploit the information relating to past generations. This fact allows the speculation about new search points in the solution space, trying to obtain better models thanks to this evolution.

Many genetic operators can be defined. However, selection, crossover and mutation are the most relevant and most used, and they are now briefly described.

(1) Selection. During each successive generation, a proportion of the existing population is selected to breed a new generation. Individual solutions are selected through a fitness-based process, where fitter solutions (as measured by a fitness function) are typically more likely to be selected. Certain selection methods rate the fitness of each solution and preferentially select the best solutions. Other methods rate only a random sample of the population, as this process may be very time-consuming. Most functions are stochastic and designed so that a small proportion of less fit solutions are selected. This helps keep the diversity of the population large, preventing premature convergence on poor solutions. Popular and well-studied selection methods include roulette wheel selection and tournament selection.

(2) Crossover. Just after two parents are selected by any selection method, crossover takes place. Crossover is an operator that mates these two parents to produce offspring. The newborn individuals may be better than their parents, and the evolution process may thus continue. In most crossover operators, two individuals are randomly selected and recombined with a crossover probability, p_c. That is, a uniform random number r is generated and, if r ≤ p_c, the two randomly selected individuals undergo recombination. Otherwise, the offspring are exact copies of their parents. The value of p_c can either be set experimentally or be set based on schema-theorem principles [Goldberg 1989].

(3) Mutation. Mutation is the genetic operator that randomly changes one or more of the individuals' genes. The purpose of the mutation operator is to prevent the genetic population from converging to a local minimum and to introduce new possible solutions to the population.

Genetic programming (GP) is a natural evolution of GA and its first appearance in the literature dates from 1992 [Koza 1992]. It is a specialization of genetic algorithms where each individual is a computer program. Therefore, it is used to optimize a population of computer programs according to a fitness landscape determined by a program's ability to perform a given computational task. Hence, specialized
genetic operators that generalize crossover and mutation are used for tree-structured programs. The main steps to be followed when using GP are now summarized. Obviously, depending on the type of application, these steps may change in order to be adapted to the particular problem to be dealt with.

(1) Random generation of an initial population, that is, of programs.

(2) Iterative execution, until the stop condition to be determined in each situation is fulfilled, of the following steps:
  (a) To execute each program of the population and to assign it an aptitude value, according to its behavior in relation to the problem.
  (b) To create new programs by applying different primary operations to the programs:
    i. To copy an existing program into the new generation.
    ii. To create two programs from two existing ones, genetically and randomly recombining some chosen parts of both programs, making use of the crossover operator, which will also be randomly applied for each program.
    iii. To create a program from another randomly chosen one by randomly changing a gene.

(3) The program identified as possessing the best aptitude (the best of the last generation) is the designated result of the GP run.

The viability of forecasting the electricity demand via linear GP is analyzed in [Bhattacharya et al. 2002]. Hence, the authors considered load demand patterns for ten consecutive months, observed every thirty minutes, for the Victoria State of Australia. The performance was compared with an ANN and a neuro-fuzzy system, and the proposed system delivered the best results in terms of accuracy and computational cost.

When applying forecasting techniques, static environments are usually assumed, which makes them unsuitable for many real-world time series. In [Wagner et al. 2007] a new model of dynamic GP is developed, specifically devoted to non-static environments. It incorporates features that allow it to adapt to changing environments automatically, as well as to retain knowledge learned from previously encountered environments.
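The selection, crossover and mutation operators described earlier can be sketched over bit-string individuals. Everything here, including the one-max fitness function and the probability settings, is an illustrative toy, not a forecasting GA from the reviewed works:

```python
import random

random.seed(42)

def fitness(bits):
    """Illustrative fitness: number of ones (the 'one-max' toy problem)."""
    return sum(bits)

def tournament_select(pop, k=2):
    """Tournament selection: pick the fittest of k randomly sampled individuals."""
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b, pc=0.9):
    """One-point crossover with probability pc; otherwise copy the parents."""
    if random.random() < pc:
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a[:], b[:]

def mutate(bits, pm=0.05):
    """Flip each gene independently with probability pm."""
    return [1 - g if random.random() < pm else g for g in bits]

# One generational step over a small random population
pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
new_pop = []
while len(new_pop) < len(pop):
    c1, c2 = crossover(tournament_select(pop), tournament_select(pop))
    new_pop.extend([mutate(c1), mutate(c2)])
```

GP replaces the bit strings with program trees and adapts the operators accordingly, but the generational loop keeps this same shape.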
The authors tested the accuracy over the USA's gross domestic product (GDP) and consumer price inflation (CPI).

The work in [Xie et al. 2007] analyzed the ability of GP to forecast the price inflation in New Zealand. By using non-preprocessed temporal data in short intervals, the reported results showed that GP can perform accurate forecasts of price inflation over short and medium terms.

Another price forecasting strategy was proposed in [Amjady and Keynia 2009]. In fact, the authors presented a mutual information-based feature selection technique in which the prediction part was a cascaded neuro-evolutionary algorithm. The accuracy was extensively evaluated, since they compared their results, obtained from the Pennsylvania-New Jersey-Maryland and Spanish electricity markets, with those of seven different models.

Another linear GP was used in [Guven 2009] with the aim of predicting daily flow rate time series at an American station. In addition, the authors proposed two
ANN models that were calibrated with some training sets. The study showed that the performance of the linear GP was moderately better than that of the ANN. An evolutionary technique applied to the optimal short-term scheduling of electric energy production was presented in [Troncoso et al. 2004]. The equations that define the problem lead to a nonlinear mixed-integer programming problem with a large number of real and integer variables. The heuristics introduced to ensure the feasibility of the constraints are analyzed, along with a brief description of the proposed GA. Results from the Spanish power system were reported. An integrated algorithm for forecasting oil production based on a GA with variable parameters was presented in [Azadeh et al. 2009]. The algorithm identified the best model based on minimum absolute percentage error and variance analysis. In addition, the dynamic structure of the proposed GA identified conventional time series as the best model for oil production prediction. The method was validated on Russian, Brazilian, Indian and USA oil production over the period from 2001 to 2006.

5.3 Support vector machines

The support vector machine (SVM) model, as it is understood nowadays, initially appeared in 1992 at the Computational Learning Theory (COLT) Conference and has been subsequently studied and extended [Cortes and Vapnik 1995; Vapnik 1998]. The interest in this learning model is continuously increasing, and it is considered an emerging and successful technique; indeed, it has become a widely accepted standard in the machine learning and data mining disciplines. The learning process in SVM is a constrained optimization problem that can be solved by means of quadratic programming. Its convexity guarantees a unique solution, which is an advantage with respect to the classical ANN model. Furthermore, current implementations are reasonably efficient for real-world problems with thousands of samples and attributes.

Support vector machines aim at separating points by means of a hyperplane, that is, a linear separator in a high-dimensional space whose functions are defined according to different kernels. Formally, a hyperplane in a D-dimensional space is defined as follows:

h(x) = <w, x> + b    (13)

where x is the sample, w ∈ R^D is the vector orthogonal to the hyperplane (the weight vector), b ∈ R is the bias or decision threshold, and <w, x> denotes the scalar product in R^D. If a binary classifier is required, the equation can be reformulated as:

f(x) = sign(h(x))    (14)

where the sign function is defined as:

sign(x) = +1 if x ≥ 0; −1 if x < 0    (15)
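The decision rule in equations (13)–(15) can be sketched in a few lines. The weight vector and bias below are arbitrary illustrative values, not the result of any training procedure:

```python
import numpy as np

def h(x, w, b):
    """Hyperplane function h(x) = <w, x> + b (equation 13)."""
    return np.dot(w, x) + b

def classify(x, w, b):
    """Binary decision f(x) = sign(h(x)) (equations 14-15), with sign(0) = +1."""
    return 1 if h(x, w, b) >= 0 else -1

# Illustrative hyperplane in D = 2: w is orthogonal to the separator, b shifts it.
w = np.array([1.0, 1.0])
b = -1.0

print(classify(np.array([2.0, 2.0]), w, b))    # point on the positive side -> +1
print(classify(np.array([-1.0, -1.0]), w, b))  # point on the negative side -> -1
```

Training consists precisely of choosing (w, b); the next paragraphs describe the criterion SVM uses for that choice.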
There exist many algorithms aimed at creating hyperplanes (w, b) for a linearly separable dataset. These algorithms guarantee convergence to a solution hyperplane, although the particularities of each of them lead to slightly different solutions. Note that infinitely many hyperplanes may perform an adequate separation. The key problem for the SVM is therefore to choose the best hyperplane, in other words, the hyperplane that maximizes the minimum distance (or geometric margin) between the samples in the dataset and the hyperplane itself. Another peculiarity of SVM is that it only takes into consideration the points lying on the frontier of the decision region, that is, the points that do not clearly belong to one class or the other. Such points are called support vectors. Figure 6 illustrates a bidimensional representation of a hyperplane equidistant to two classes, as well as the support vectors and the existing margin.

Fig. 6. Hyperplane (w, b) equidistant to two classes, margin and support vectors.

If a nonlinear transformation is carried out from the input space to the feature space, learning based on nonlinear separators is achieved with SVM. Kernel functions are then used to compute the scalar product of two vectors in the feature space. Consequently, the choice of an adequate kernel function is crucial, and a priori knowledge of the problem is required for a proper application of SVM. Nevertheless, the samples may not be linearly separable even in the feature space (see Figure 7). Trying to classify all the samples correctly can seriously compromise the generalization ability of the classifier. This problem is known as overfitting. In such situations it is desirable to accept that some samples will be misclassified in exchange for more promising and general separators. This behavior is achieved by introducing a soft margin in the model, whose objective function is the sum of two terms: the geometric margin and the regularization term. The relative importance of both terms is weighted by means of a parameter typically called C.
ACM Computing Surveys, Vol. V, No. N, Month 2011.
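A minimal sketch of the role of the soft-margin parameter C, assuming full-batch subgradient descent on the objective 0.5·||w||² + C·Σ hinge losses; real SVM solvers use quadratic programming as described above, and all data, step sizes and iteration counts here are illustrative:

```python
import numpy as np

def train_soft_margin(X, y, C=1.0, lr=0.01, iters=1000):
    """Minimize 0.5*||w||^2 + C * sum of hinge losses by subgradient descent.
    Larger C penalizes margin violations more; smaller C favors a wider margin."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1  # samples inside the margin or misclassified
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two linearly separable illustrative clusters.
X = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.0],
              [-2.0, -2.0], [-3.0, -2.5], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w, b = train_soft_margin(X, y, C=1.0)
pred = np.sign(X @ w + b)
print(pred)
```

Only the samples with margin below one contribute to the gradient, which mirrors the fact that the solution depends on the support vectors alone.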
This model appeared in 1999 [Vapnik 1999], and it is the model that really enabled the practical use that SVMs have nowadays, since it provided robustness against noise. Many works have focused on forecasting time series by applying SVM.
F. Martínez-Álvarez et al.
Hence, the authors in [Huang et al. 2005] investigated the predictability of financial movement direction with SVM by forecasting the weekly movement direction of the NIKKEI 225 index. In order to evaluate the accuracy of the proposed SVM, they compared the reported results with those of linear and quadratic analysis, as well as with a back-propagation ANN. Further, they proposed a mixed model composed of a SVM and other classifiers, weighting each classifier according to expert experience. The study carried out in [Hong 2005] analyzed the suitability of applying SVM to forecast the electric load of the Taiwanese market. The results were compared with those of linear regressions and ANN. The same type of time series, but related to the Chinese market, was forecasted in [Guo et al. 2006], in which the authors reached a globally optimized prediction by applying a SVM. Fan et al. [Fan et al. 2006] proposed a hybrid machine learning model based on Bayesian classifiers and SVM. First, Bayesian clustering techniques were used to split the input data into 24 subsets. Then, SVM methods were applied to each subset to obtain the forecasts of the hourly electricity load. The occurrence of outliers (also called spike prices), or prices significantly larger than the expected values, is a usual feature found in these time series. With the aim of dealing with this feature, the authors in [Zhao et al. 2007] proposed a data mining framework based on both SVM and probability classifiers. The research published in [Wang and Wang 2008] proposed a new prediction approach based on SVM and rough set techniques, with a previous selection of features from the datasets by using an evolutionary method. The approach improved the forecasting quality and reduced the convergence time and the computational cost with regard to a conventional SVM and a hybrid model formed by a SVM and simulated annealing algorithms. The use of SVM to forecast the ratios of key gases in power transformer oil to detect failures early was proposed in [Fei et al. 2009]. In that research, a GA was used to determine the free parameters of the SVM. The experimental results showed that the new method achieved greater accuracy than the Grey model and ANN when the training set was small.
The continuous wavelet transform (CWT) of a discrete series x_{n'} at scale s can be written as:

CWT_x(n, s) = (1/√s) Σ_{n'=0}^{N−1} x_{n'} ψ*((n' − n)δt / s),  with n = 0, . . . , N − 1    (16)

where ψ* denotes the complex conjugate of the mother wavelet and δt the sampling interval.
As this product has to be computed N times for each scale s considered, if N is too large it is faster to estimate the result using the FFT than by means of the definition. By the convolution theorem [Proakis and Manolakis 1998], the CWT can be obtained as the inverse fast Fourier transform (IFFT) of the product of the Fourier transforms of the time series and of the wavelet:
CWT_x(n, s) = IFFT[ x̂_k ψ̂*(s ω_k) ]    (17)

where x̂_k is the discrete Fourier transform of the series and ψ̂ the Fourier transform of the wavelet at angular frequency ω_k.
Since s is the only parameter on which the transform depends, the estimation of the CWT can be carried out by means of FFT algorithms for each scale, simultaneously for all the points of the series. Conejo et al. [Conejo et al. 2005] proposed a new approach to predict day-ahead electricity prices based on the wavelet transform and ARIMA models. Thus, they decomposed the time series into a set of better-behaved constitutive series by applying
the wavelet transform. Then, the future values of these new series were forecast using ARIMA models, followed by the application of the inverse wavelet transform. A time series segmentation methodology was proposed in [Tseng et al. 2006]. It combined clustering techniques, wavelet transforms and GP in order to automatically search for segments and patterns in time series. To be precise, the GP was used to find the most suitable segmentation points from which patterns may be found. Aggarwal et al. [Aggarwal et al. 2008] also forecasted electricity prices. For this purpose, they divided each day into segments and applied a multiple linear regression to the original series or to the constitutive series obtained by the wavelet transform, depending on the segment. Moreover, the regression model used different input variables for each segment. Pindoriya et al. [Pindoriya et al. 2008] proposed an adaptive wavelet-based neural network for short-term electricity price forecasting in the Spanish and Californian markets. As for the neural network, the output of the hidden-layer neurons was based on wavelets that adapted their shape to the training data. The authors concluded that their approach converged at a higher rate and outperformed other methods in forecasting electricity prices, due to its ability to model non-stationary and high-frequency signals. A new method devoted to analyzing fatigue road loading can be found in [Zaharim et al. 2009]. In it, wavelet transforms and GA were combined, clustering the data into a sequence of nested partitions. The main goal was to address numerically bounded variables, dynamically focalized search, dynamic control of diversity and feasible region analysis, which are very common issues in evolutionary computation.
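The decompose-forecast-reconstruct scheme discussed above rests on an invertible decomposition. A minimal sketch with a one-level Haar transform (the surveyed works use other wavelet families, and forecast each constitutive series before inverting):

```python
import numpy as np

def haar_decompose(x):
    """One-level Haar DWT: split x (even length) into a smooth approximation
    series and a detail series, each half the original length."""
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_reconstruct(approx, detail):
    """Inverse of one Haar level: perfect reconstruction of the original series."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x
```

The approximation series is smoother (better behaved) than the original, which is why models such as ARIMA are fitted to the components rather than to the raw prices.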
this rule can be extended by weighting the importance of the neighbors, giving a larger weight to the truly nearest ones. The nearest neighbor search process can be defined as follows:
Definition. Given a dataset P in a metric space X equipped with a distance d, two different types of queries are to be answered:
Nearest neighbor: given a query point q ∈ X, find the point p ∈ P closest to q.
Range: given a query point q ∈ X and a radius r, find all the points p ∈ P that satisfy d(p, q) ≤ r.
Figure 8 illustrates an example in which k is set to three (the three nearest neighbors are searched for) and a Euclidean metric is used.
Formally, the classification rule is formulated as follows:

Definition. Let D = {e1, . . . , eN} be a dataset with N labeled examples, in which each example ei has m attributes (ei1, . . . , eim) belonging to the metric space E^m and a class Ci ∈ {C1, . . . , Cd}. The classification of each new example e' fulfills:

e' ∈ Ci  ⟺  d(e', ei) < d(e', ej) for all j ≠ i    (18)
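A minimal sketch of the nearest-neighbor rule with a Euclidean metric; the training points and labels below are purely illustrative, and the rule extends to k > 1 by majority vote:

```python
import numpy as np

def nn_classify(query, examples, labels):
    """1-NN rule: assign the label of the training example nearest to `query`
    under the Euclidean metric (the choice of metric matters, as discussed)."""
    dists = np.linalg.norm(examples - query, axis=1)
    return labels[np.argmin(dists)]

# Illustrative 2-dimensional training set with two classes.
examples = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array(["A", "A", "B", "B"])

print(nn_classify(np.array([0.1, 0.2]), examples, labels))  # -> A
print(nn_classify(np.array([4.8, 5.2]), examples, labels))  # -> B
```

Replacing the norm with a weighted distance, as some of the works below do, only changes the `dists` line.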
where e' ∈ Ci indicates the assignment of the class label Ci to the example e', and d expresses a distance defined in the m-dimensional space E^m. An example is thus labeled according to its nearest neighbor's class. This closeness is defined by means of the distance d, which makes the choice of this metric essential, since different metrics will most likely generate different classifications. As a consequence, the choice of the metric is widely discussed in the literature, as shown in [Wang et al. 2007]. Note that the other main drawback of this technique is the selection of the number of neighbors to consider [Wang et al. 2006]. In [Troncoso et al. 2002] a forecasting algorithm based on nearest neighbors was introduced. The selected metric was the weighted Euclidean distance, and the weights were calculated by means of a GA. The authors forecasted electricity demand time series in the Spanish market, and the reported results were compared to those of an ANN. The same algorithm was tested on electricity price time series in [Troncoso et al. 2002]. In [?], the authors proposed a methodology based on weighted nearest neighbors (WNN) techniques. The proposed approach was applied to the 24-hour load forecasting problem, and an alternative model was built by means of a conventional dynamic regression technique, in which the parameters are estimated by solving a least squares problem, in order to perform a comparative analysis. A modification of the WNN methodology was proposed in [Troncoso et al. 2007]. To be precise, they explained how the relevant parameters, namely the window length of the time series and the number of neighbors, are chosen. Then, the approach weighted the nearest neighbors in order to improve the prediction
accuracy. The methodology was evaluated on the Spanish electricity price time series. The study presented in [Dimri et al. 2008] proposed a simple but effective NN-based method to forecast the probability of precipitation occurrence and its quantity in the western Himalayas. The model was able to give projections three days in advance with an accuracy verging on 85%. Although the forecast of the probability of precipitation occurrence was well performed, the quantitative precipitation forecast remained subject to large error rates.

7. OTHER MODELS

Despite the vast description of methods provided in the prior sections, some authors have proposed forecasting approaches that cannot be classified into any of the aforementioned categories. For this reason, this section is devoted to introducing these works. Hence, transfer function models (known as dynamic econometric models in the economics literature) based on past electricity prices and demand were proposed to forecast day-ahead electricity prices by Nogales et al. in [Nogales and Conejo 2006], although the prices of all 24 hours of the previous day were not known. They used the median as error measure due to the presence of outliers, and they stated that the model in which the demand was considered yielded better forecasts. The authors in [Pezzulli et al. 2006] focused on one-year-ahead electricity demand prediction for winter seasons by defining a new Bayesian hierarchical model. They provided the marginal posterior distributions of demand peaks. Also in [Cottet and Smith 2003], Bayesian models were used to forecast electricity demand. Moreover, a multiple linear regression model to forecast electricity consumption using input variables such as the gross domestic product, the price of electricity and the population was proposed in [Mohamed and Bodger 2005]. A novel non-parametric model using the manifold learning methodology was proposed in [Chen et al. 2007] in order to predict electricity price time series.
For this purpose, the authors used cluster analysis based on the embedded manifold of the original dataset. To be precise, they applied manifold-based dimensionality reduction to curve modeling, showing that the day-ahead curve can be represented by a low-dimensional manifold. Recently, a fuzzy inference system, adopted due to its transparency and interpretability, combined with traditional time series methods was proposed for day-ahead electricity price forecasting [Li et al. 2007]. Another different proposal can be found in [Wang and Meng 2008], where a forecasting algorithm based on Grey models was introduced to predict the load of Shanghai. In the Grey model, the original data series was transformed to reduce its noise, and the accuracy was improved by using Markov chain techniques.

8. CONCLUSIONS
ACKNOWLEDGMENTS
This research has been partially supported by the Spanish Ministry of Science and Technology under project TIN2007-68084-C-00, and by the Junta de Andalucía under
REFERENCES

Aggarwal, S. K., Saini, L. M., and Kumar, A. 2008. Price forecasting using wavelet transform and LSE based mixed model in Australian electricity market. International Journal of Energy Sector Management 2, 4, 521–546.
Aggarwal, S. K., Saini, L. M., and Kumar, A. 2009. Electricity price forecasting in deregulated markets: A review and evaluation. International Journal of Electrical Power and Energy Systems 31, 1, 13–22.
Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. 207–216.
Amjady, N. 2006. Day-ahead price forecasting of electricity markets by a new fuzzy neural network. IEEE Transactions on Power Systems 21, 2, 887–896.
Amjady, N. and Keynia, F. 2009. Day-ahead price forecasting of electricity markets by mutual information and cascaded neuro-evolutionary algorithm. IEEE Transactions on Power Systems 24, 1, 306–318.
Azadeh, A., et al. 2009. An integrated GA-time series algorithm for forecasting oil production estimation: USA, Russia, India, and Brazil. International Journal of Industrial and Systems Engineering 4, 4, 368–387.
Aznarte, J. L., Benítez, J. M., and Castro, J. L. 2008. Smooth transition autoregressive models and fuzzy rule-based systems: functional equivalence and consequences. Fuzzy Sets and Systems 158, 2734–2745.
Bhattacharya, M., Abraham, A., and Nath, B. 2002. A linear genetic programming approach for modelling electricity demand prediction in Victoria. In Proceedings of the International Workshop on Hybrid Intelligent Systems. 379–394.
Bollerslev, T. 1986. Modelling the coherence in short-run nominal exchange rates: A multivariate generalized ARCH model. Review of Economics and Statistics 72, 498–505.
Box, G. and Jenkins, G. 2008. Time Series Analysis: Forecasting and Control. John Wiley and Sons.
Brockwell, P. J. and Davis, R. A. 2002. Introduction to Time Series and Forecasting. Springer.
Catalao, J. P. S., Mariano, S. J. P. S., Mendes, V. M. F., and Ferreira, L. A. F. M. 2007. Short-term electricity prices forecasting in a competitive market: a neural network approach. Electric Power Systems Research 77, 1297–1304.
Chen, C. F., Chang, Y. H., and Chang, Y. W. 2009. Seasonal ARIMA forecasting of inbound air travel arrivals to Taiwan. International Journal of Forecasting 5, 2, 25140.
Chen, J., Deng, S. J., and Huo, X. 2007. Electricity price curve modeling by manifold learning. IEEE Transactions on Power Systems 15, 723–736.
Cheong, C. W. 2009. Modeling and forecasting crude oil markets using ARCH-type models. Energy Policy 37, 6, 2346–2355.
Collopy, F. and Armstrong, J. S. 1992. Rule-based forecasting: Development and validation of an expert systems approach to combining time series extrapolations. Management Science 38, 1392–1414.
Conejo, A. J., Plazas, M. A., Espínola, R., and Molina, B. 2005. Day-ahead electricity price forecasting using the wavelet transform and ARIMA models. IEEE Transactions on Power Systems 20, 2, 1035–1042.
Contreras, J., Espínola, R., Nogales, F. J., and Conejo, A. J. 2003. ARIMA models to predict next-day electricity prices. IEEE Transactions on Power Systems 18, 3, 1014–1020.
Cortes, C. and Vapnik, V. 1995. Support-vector networks. Machine Learning 20, 3, 273–297.
Cottet, R. and Smith, M. 2003. Bayesian modeling and forecasting of intra-day electricity load. Journal of the American Statistical Association 98, 464, 839–849.
Cover, T. M. and Hart, P. E. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 21–27.
El-Telbany, M. and El-Karmi, F. 2008. Short-term forecasting of Jordanian electricity demand using particle swarm optimization. Electric Power Systems Research 78, 425–433.
Engle, R. F. 1982. Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50, 987–1008.
Fan, S., Mao, C., Zhang, J., and Chen, L. 2006. Forecasting electricity demand by hybrid machine learning model. Lecture Notes in Computer Science 4233, 952–963.
Fei, S. W., Liu, C. L., and Miao, Y. B. 2009. Support vector machine with genetic algorithm for forecasting of key-gas ratios in oil-immersed transformer. Expert Systems with Applications 36, 3, 6326–6331.
Francq, C. and Zakoian, J. M. 2010. GARCH Models: Structure, Statistical Inference and Financial Applications. Wiley.
García, R. C., Contreras, J., van Akkeren, M., and García, J. B. 2005. A GARCH forecasting model to predict day-ahead electricity prices. IEEE Transactions on Power Systems 20, 2, 867–874.
García-Martos, C., Rodríguez, J., and Sánchez, M. J. 2007. Mixed models for short-run forecasting of electricity prices: application for the Spanish market. IEEE Transactions on Power Systems 22, 2, 544–552.
Ghosh, S. and Das, A. 2004. Short-run electricity demand forecasts in Maharashtra. Applied Economics 34, 8, 1055–1059.
Goldberg, D. E. 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Massachusetts, USA.
Guirguis, H. S. and Felder, F. A. 2004. Further advances in forecasting day-ahead electricity prices using time series models. KIEE International Transactions on PE 4-A, 3, 159–166.
Guo, Y., Niu, D., and Chen, Y. 2006. Support-vector machine model in electricity load forecasting. In International Conference on Machine Learning and Cybernetics. 2892–2896.
Guven, A. 2009. Linear genetic programming for time-series modelling of daily flow rate. Journal of Earth System Science 118, 2, 137–146.
Hong, W. C. 2005. Electricity load forecasting by using SVM with simulated annealing algorithm. In Proceedings of the World Congress of Scientific Computation, Applied Mathematics and Simulation. 113–120.
Huang, W., Nakamori, Y., and Wang, S. Y. 2005. Forecasting stock market movement direction with support vector machine. Computers and Operations Research 32, 10, 2513–2522.
Jiménez, N. and Conejo, A. J. 1999. Short-term hydro-thermal coordination by Lagrangian relaxation: Solution of the dual problem. IEEE Transactions on Power Systems 14, 1, 89–95.
Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 59–69.
Koza, J. R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA.
Li, G., Liu, C. C., Mattson, C., and Lawarree, J. 2007. Day-ahead electricity price forecasting in a grid environment. IEEE Transactions on Power Systems 22, 1, 266–274.
Liu, H. T. 2009. An integrated fuzzy time series forecasting system. Expert Systems with Applications 36, 6, 10045–10053.
Malo, P. and Kanto, A. 2006. Evaluating multivariate GARCH models in the Nordic electricity markets. Communications in Statistics: Simulation and Computation 35, 1, 117–148.
Maronna, R. A., Martin, R. D., and Yohai, V. J. 2007. Robust Statistics: Theory and Methods. Wiley.
Daubechies, I. 1992. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics.
Dimri, A. P., Joshi, P., and Ganju, A. 2008. Precipitation forecast over western Himalayas using k-nearest neighbour method. International Journal of Climatology 28, 1, 21–27.
Martínez-Ballesteros, M., Martínez-Álvarez, F., Troncoso, A., and Riquelme, J. C. ...
McCulloch, W. S. and Pitts, W. 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115–133.
Mohamed, Z. and Bodger, P. 2005. Forecasting electricity consumption in New Zealand using economic and demographic variables. Energy 30, 1833–1843.
Nam, H., Lee, K., and Lee, D. 2009. ... microarray data sets. BMC Bioinformatics 10, 3, 21–28.
Nogales, F. J. and Conejo, A. J. 2006. Electricity price forecasting through transfer function models. Journal of the Operational Research Society 57, 350–356.
Nogales, F. J., Contreras, J., Conejo, A. J., and Espínola, R. 2002. Forecasting next-day electricity prices by time series models. IEEE Transactions on Power Systems 17, 2, 342–348.
Pezzulli, S., Frederic, P., Majithia, S., Sabbagh, S., Black, E., Sutton, R., and Stephenson, D. 2006. The seasonal forecast of electricity demand: a hierarchical Bayesian model with climatological weather generator. Applied Stochastic Models in Business and Industry 22, 113–125.
Pindoriya, N. M., Singh, S. N., and Singh, S. K. 2008. An adaptive wavelet neural network-based energy price forecasting in electricity markets. IEEE Transactions on Power Systems 23, 3, 1423–1432.
Plazas, M. A., Conejo, A. J., and Prieto, F. J. 2005. Multimarket optimal bidding for a power producer. IEEE Transactions on Power Systems 20, 4, 2041–2050.
Proakis, J. G. and Manolakis, D. G. 1998. Digital Signal Processing. Prentice Hall.
Ramsay, J. O. and Silverman, B. W. 2005. Functional Data Analysis. Springer.
Rodríguez, C. P. and Anders, G. J. 2004. Energy price forecasting in the Ontario competitive power system market. IEEE Transactions on Power Systems 19, 1, 366–374.
Rosenblatt, F. 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65, 386–408.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning Internal Representations by Error Propagation. MIT Press.
Saini, L. M. and Soni, M. K. 2002. ... conjugate gradient methods. IEEE Transactions on Power Systems 18, 1, 99–105.
Shumway, R. H. and Stoffer, D. S. 2011. Time Series Analysis and Its Applications (with R Examples). Springer.
Stepnicka, M., Pavliska, V., Novak, V., Perfilieva, I., Vavrickova, L., and Tomanova, I. 2009. Time series analysis and prediction based on fuzzy rules and the fuzzy transform. In Proceedings of the Fuzzy Systems Association World Congress. 483–488.
Taylor, J. 2006. Density forecasting for the efficient balancing of the generation and consumption of electricity. International Journal of Forecasting 22, 707–724.
Troncoso, A., Riquelme, J. C., Riquelme, J. M., Martínez, J. L., and Gómez, A. 2002. Electricity market price forecasting: Neural networks versus weighted-distance k nearest neighbours. In Lecture Notes in Computer Science.
Troncoso, A., et al. 2004. ... Lecture Notes in Computer Science.
Troncoso, A., et al. 2007. Electricity market price forecasting based on weighted nearest neighbours techniques. IEEE Transactions on Power Systems 22, 3.
Vapnik, V. 1998. Statistical Learning Theory. Wiley.
Vapnik, V. 1999. ...
Wang, J., Neskovic, P., and Cooper, L. N. 2006. Neighborhood size selection in the k-nearest-neighbor rule using statistical confidence. Pattern Recognition 39, 417–423.
Wang, J., Neskovic, P., and Cooper, L. N. 2007. Improving nearest neighbor rule with a simple adaptive distance measure. Pattern Recognition Letters 28, 207–213.
Wang, J. M. and Wang, L. P. 2008. ... In Proceedings of the 7th International Conference on Machine Learning and Cybernetics. 1244–1248.
Wang, X. and Meng, M. 2008. ...
Wold, H. 1954. A Study in the Analysis of Stationary Time Series.
Xekalaki, E. and Degiannakis, S. 2010. ARCH Models for Financial Applications. Wiley.
Xie, H., Zhang, M., and Andreae, P. 2007. ... In Proceedings of the IEEE Congress on Evolutionary Computation. 2538–2545.
Yang, M. 2002. Lag length and mean break in stationary VAR models. The Econometrics Journal 5, 374–386.
Yao, P. 2009. ... prediction. Lecture Notes in Computer Science 5551, 993–1001.
Zurada, J. M. 1992. Introduction to Artificial Neural Systems. West Publishing Company.