1 Introduction
Stock price forecasts are valuable to investors because they reveal investment
opportunities. Toward this goal, considerable research effort has gone into finding
superior forecasting methods. Financial time series forecasting is regarded as one of
the most challenging applications of modern time series forecasting. Yet numerous
studies have found that univariate time series models, such as Box-Jenkins ARIMA,
are as accurate as more elaborate linear regression or vector autoregressive models
[1,2,3]. The success of linear models, however, is conditional upon the underlying
data generating process being linear and not being random. One view in financial
economics is that market prices are random and that past prices cannot be used as a
guide for the price behavior in the future. Chaos theory, however, suggests that a
seemingly random process may in fact have been generated by a deterministic,
non-random function [4,5]. In such a case, ARIMA methods are no longer a
C. Bussler and D. Fensel (Eds.): AIMSA 2004, LNAI 3192, pp. 295–303, 2004.
© Springer-Verlag Berlin Heidelberg 2004
296 Y. Bao, Y. Lu, and J. Zhang
useful tool for estimation and forecasting. Research efforts have therefore turned to
new methods, one of which is the study of neural networks.
Neural networks have been successfully used for modeling financial time series [6,7].
In particular, several researchers report modest, positive results in predicting
market prices with neural networks [8,9,10], though not with price and volume
histories alone, and none uses technical analysis pattern heuristics. Neural networks
are universal function approximators that can map any non-linear function without a
priori assumptions about the properties of the data [11]. Unlike traditional statistical
models, neural networks are data-driven, non-parametric weak models that let
“the data speak for themselves”. Consequently, neural networks are less susceptible
to the problem of model misspecification than most parametric models.
Neural networks are also more noise tolerant, having the ability to learn complex
systems with incomplete and corrupted data. In addition, they are more flexible,
having the capability to learn dynamic systems through a retraining process using new
data patterns. So neural networks are more powerful in describing the dynamics of
financial time series in comparison to traditional statistical models [12,13,14,15].
Recently, a novel neural network algorithm, called support vector machines (SVMs),
was developed by Vapnik and his co-workers [16]. Unlike most of the traditional
neural network models that implement the empirical risk minimization principle,
SVMs implement the structural risk minimization principle which seeks to minimize
an upper bound of the generalization error rather than minimize the training error.
This induction principle is based on the fact that the generalization error is bounded
by the sum of the training error and a confidence interval term that depends on the
Vapnik–Chervonenkis (VC) dimension. Based on this principle, SVMs achieve an
optimum network structure by striking the right balance between the empirical error
and the VC confidence interval, which eventually yields better generalization
performance than other neural network models. Another merit of SVMs is that
training an SVM is equivalent to solving a linearly constrained quadratic
programming problem, so the solution is unique, optimal, and free of local minima,
unlike the training of other networks, which requires non-linear optimization and
thus risks getting stuck in a local minimum. Originally,
SVMs were developed for pattern recognition problems [17]. With the introduction
of Vapnik’s ε-insensitive loss function, however, SVMs have been extended to
solve non-linear regression estimation problems, where they have been shown to
exhibit excellent performance [18,19].
This paper consists of five sections. Section 2 presents the principles of SVMs
regression and the general procedure for applying it. Using an example from the
stock market in China, Section 3 details this procedure, covering data set selection,
data preprocessing and scaling, kernel function selection, and so on. Section 4
discusses the experimental results, followed by the conclusions drawn from this
study and hints for further research in the last section.
Forecasting Stock Price by SVMs Regression 297
2 SVMs Regression

Given a set of data points G = {(x_i, d_i)}_{i=1}^{n} (x_i is the input vector, d_i is
the desired value and n is the total number of data patterns), SVMs approximate the
function using the following:
y = f(x) = wφ(x) + b   (1)

where φ(x) is the high-dimensional feature space which is non-linearly mapped from
the input space x. The coefficients w and b are estimated by minimizing
R_SVMs(C) = C (1/n) Σ_{i=1}^{n} L_ε(d_i, y_i) + (1/2) ‖w‖²   (2)

L_ε(d, y) = |d − y| − ε   if |d − y| ≥ ε
          = 0             otherwise   (3)
In the regularized risk function given by Eq. (2), the first term
C (1/n) Σ_{i=1}^{n} L_ε(d_i, y_i) is the empirical error (risk), measured by the
ε-insensitive loss function given by Eq. (3). This loss function has the advantage of
allowing sparse data points to represent the decision function given by Eq. (1). The
second term (1/2) ‖w‖², on the other hand, is the regularization term. C is referred
to as the regularization
constant and it determines the trade-off between the empirical risk and the
regularization term. Increasing the value of C increases the relative importance of
the empirical risk with respect to the regularization term. ε is called the tube
size and it is equivalent to the approximation accuracy placed on the training data
points. Both C and ε are user-prescribed parameters.
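As a quick sketch, the loss of Eq. (3) and the regularized risk of Eq. (2) can be written directly in code (NumPy; the function names are ours, for illustration only):

```python
import numpy as np

def eps_insensitive_loss(d, y, eps):
    """Vapnik's eps-insensitive loss of Eq. (3): errors inside the
    eps-tube cost nothing; outside, the cost grows linearly."""
    return np.maximum(np.abs(d - y) - eps, 0.0)

def regularized_risk(w, d, y, eps, C):
    """Regularized risk of Eq. (2): C times the mean empirical loss
    plus the regularization term (1/2)||w||^2."""
    return C * np.mean(eps_insensitive_loss(d, y, eps)) + 0.5 * np.dot(w, w)
```

Setting eps = 0 recovers the ordinary mean absolute error plus the regularizer, which makes the role of the tube size explicit.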
To obtain the estimates of w and b, Eq. (2) is transformed into the primal problem
given by Eq. (4) by introducing the positive slack variables ξ_i and ξ_i* as follows:

Minimize    R_SVMs(w, ξ, ξ*) = C Σ_{i=1}^{n} (ξ_i + ξ_i*) + (1/2) ‖w‖²

Subject to  d_i − wφ(x_i) − b ≤ ε + ξ_i,
            wφ(x_i) + b − d_i ≤ ε + ξ_i*,
            ξ_i, ξ_i* ≥ 0,   i = 1, 2, …, n   (4)
Finally, by introducing Lagrange multipliers and exploiting the optimality constraints,
the decision function given by Eq. (1) has the following explicit form [18]:
f(x, a_i, a_i*) = Σ_{i=1}^{n} (a_i − a_i*) K(x, x_i) + b   (5)
In Eq. (5), a_i and a_i* are the so-called Lagrange multipliers. They satisfy
a_i · a_i* = 0, a_i ≥ 0 and a_i* ≥ 0 for i = 1, 2, …, n, and are obtained by
maximizing the dual function of Eq. (4), which has the following form:
R(a_i, a_i*) = Σ_{i=1}^{n} d_i (a_i − a_i*) − ε Σ_{i=1}^{n} (a_i + a_i*)
             − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (a_i − a_i*)(a_j − a_j*) K(x_i, x_j)   (6)

with the constraints

Σ_{i=1}^{n} (a_i − a_i*) = 0,
0 ≤ a_i ≤ C,   i = 1, 2, …, n,
0 ≤ a_i* ≤ C,  i = 1, 2, …, n.
Based on the Karush–Kuhn–Tucker (KKT) conditions of quadratic programming,
only a certain number of coefficients (ai − ai* ) in Eq. (5) will assume non-zero values.
The data points associated with them have approximation errors equal to or larger
than ε and are referred to as support vectors. These are the data points lying on or
outside the ε -bound of the decision function. According to Eq. (5), it is evident that
support vectors are the only elements of the data points that are used in determining
the decision function as the coefficients (ai − ai* ) of other data points are all equal to
zero. Generally, the larger the ε, the fewer the support vectors and thus the sparser
the representation of the solution. However, a larger ε can also reduce the
approximation accuracy on the training points. In this sense, ε trades off the
sparseness of the representation against closeness to the data.
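This trade-off is easy to observe with an off-the-shelf SVR implementation (scikit-learn here, purely as an illustration; the toy data are ours): widening the tube leaves fewer points on or outside the ε-bound, hence fewer support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 4, 200)).reshape(-1, 1)
d = np.sin(x).ravel() + rng.normal(0, 0.1, 200)

# A wider tube leaves more points strictly inside the eps-bound,
# so fewer of them become support vectors.
n_sv = [SVR(kernel="rbf", C=10.0, epsilon=e).fit(x, d).support_.size
        for e in (0.01, 0.1, 0.5)]
print(n_sv)  # typically a decreasing sequence of counts
```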
K(x_i, x_j) is defined as the kernel function. The value of the kernel equals the
inner product of the two vectors x_i and x_j in the feature space, that is,
K(x_i, x_j) = φ(x_i) · φ(x_j). The elegance of using the kernel function is that one
can deal with feature spaces of arbitrary dimensionality without having to compute
the map φ(x) explicitly. Any function satisfying Mercer’s condition [16] can be
used as the kernel function. Typical examples of kernel functions are as follows:
Linear: K(x_i, x_j) = x_i^T x_j.
Polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0.
Radial basis function (RBF): K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²), γ > 0.
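For concreteness, the three kernels above can be sketched as plain functions (NumPy; the names and default parameter values are ours):

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(xi, xj) = xi^T xj
    return float(np.dot(xi, xj))

def poly_kernel(xi, xj, gamma=1.0, r=1.0, degree=3):
    # K(xi, xj) = (gamma * xi^T xj + r)^d, gamma > 0
    return float((gamma * np.dot(xi, xj) + r) ** degree)

def rbf_kernel(xi, xj, gamma=1.0):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2), gamma > 0;
    # note the value always lies in (0, 1], with K(x, x) = 1.
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))
```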
3 Experiment Design

To enhance the forecasting ability of the model, we transform the original closing
prices into the relative difference in percentage of price (RDP) [20]. As Thomason
[20] notes, this transformation has four advantages. The most prominent is that the
distribution of the transformed data becomes more symmetrical and follows a
normal distribution more closely, which improves the predictive power of the
neural network.
The input variables are determined from four lagged RDP values based on 5-day
periods (RDP-5, RDP-10, RDP-15 and RDP-20) and one transformed closing price
(EMA15) which is obtained by subtracting a 15-day exponential moving average
from the closing price. The optimal length of the moving average is not critical, but
it should be longer than the forecasting horizon of 5 days [20]. EMA15 is used to
maintain as much information as contained in the original closing price as possible,
since the application of the RDP transform to the original closing price may remove
some useful information. The output variable RDP+5 is obtained by first smoothing
the closing price with a 3-day exponential moving average because the application of
a smoothing transform to the dependent variable generally enhances the prediction
performance of the SVMs. The calculations for all the indicators are shown in
Table 1.
Table 1. Calculations of the indicators

Indicator   Calculation
Input variables
EMA15       p(i) − EMA_15(i)
RDP-5       (p(i) − p(i−5)) / p(i−5) × 100
RDP-10      (p(i) − p(i−10)) / p(i−10) × 100
RDP-15      (p(i) − p(i−15)) / p(i−15) × 100
RDP-20      (p(i) − p(i−20)) / p(i−20) × 100
Output variable
RDP+5       (p(i+5) − p(i)) / p(i) × 100, where p(i) = EMA_3(i)
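Under the definitions in Table 1, the indicators can be computed, for example, with pandas (a sketch; the function name and the use of `ewm` for the exponential moving averages are our choices, not the paper's):

```python
import pandas as pd

def make_indicators(close: pd.Series) -> pd.DataFrame:
    """Build the Table 1 variables from a closing-price series.
    Inputs use the raw closing price; the output RDP+5 uses the
    3-day EMA-smoothed price p(i) = EMA3(i)."""
    p = close.ewm(span=3, adjust=False).mean()  # smoothed target price
    feats = pd.DataFrame({
        "EMA15": close - close.ewm(span=15, adjust=False).mean(),
        "RDP-5": close.pct_change(5, fill_method=None) * 100,
        "RDP-10": close.pct_change(10, fill_method=None) * 100,
        "RDP-15": close.pct_change(15, fill_method=None) * 100,
        "RDP-20": close.pct_change(20, fill_method=None) * 100,
        # RDP+5 looks 5 days ahead on the smoothed price
        "RDP+5": (p.shift(-5) - p) / p * 100,
    })
    # drop the warm-up rows (lags) and the last 5 rows (look-ahead)
    return feats.dropna()
```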
We use the RBF kernel as the kernel function. The RBF kernel nonlinearly maps
samples into a higher-dimensional space, so, unlike the linear kernel, it can handle
the case where the relation between class labels and attributes is nonlinear.
Furthermore, the linear kernel is a special case of the RBF kernel, as [21] shows
that the linear kernel with a penalty parameter C̃ has the same performance as the
RBF kernel with some parameters (C, γ). In addition, the sigmoid kernel behaves
like the RBF kernel for certain parameters [22].
The second reason is the number of hyper-parameters, which influences the
complexity of model selection. The polynomial kernel has more hyper-parameters
than the RBF kernel.
Finally, the RBF kernel has fewer numerical difficulties. One key point is that
0 < K_ij ≤ 1, in contrast to polynomial kernels, whose values may go to infinity
(γ x_i^T x_j + r > 1) or to zero (γ x_i^T x_j + r < 1) when the degree is large.
There are two parameters when using RBF kernels: C and γ. It is not known
beforehand which C and γ are best for a given problem; consequently some kind of
model selection (parameter search) must be done. The goal is to identify good
(C, γ) so that the model can accurately predict unknown data (i.e., testing data).
Note that achieving high training accuracy (i.e., accurately predicting training data
whose labels are known) may not be useful. Therefore, a common approach is to
separate the training data into two parts, one of which is treated as unknown when
training the model. The prediction accuracy on this held-out part more precisely
reflects the performance on unseen data. An improved version of this procedure is
cross-validation.
We use a grid search over C and γ with cross-validation: pairs of (C, γ) are tried
and the one with the best cross-validation accuracy is picked. We found that trying
exponentially growing sequences of C and γ is a practical way to identify good
parameters (for example, C = 2^−5, 2^−3, …, 2^15; γ = 2^−15, 2^−13, …, 2^3).
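A minimal version of this grid search can be sketched with scikit-learn's SVR and GridSearchCV (an illustration only; the synthetic data and the coarse grid are ours, not the paper's settings):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 120)

# Exponentially growing sequences of C and gamma, coarser here for speed.
param_grid = {"C": [2.0 ** k for k in (-3, 0, 3, 6)],
              "gamma": [2.0 ** k for k in (-7, -4, -1)]}

# 5-fold cross-validation picks the (C, gamma) pair with the best
# held-out score rather than the best training fit.
search = GridSearchCV(SVR(kernel="rbf", epsilon=0.1), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

A common refinement is a second, finer grid around the best pair found by the coarse search.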
4 Experimental Results

In Fig. 1 the horizontal axis shows the trading days of the test set and the vertical
axis shows the stock price. The actual stock price runs down from above the
predicted price to below it on the 2nd trading day, and in the following four trading
days it falls further; this signals a selling opportunity on the 2nd day. The actual
stock price runs up from below the predicted price to above it on the 6th trading
day, and in the following two trading days it stays above the closing price of the
6th trading day; this signals a buying opportunity on the 6th day. The same reading
applies to the following days. The buying and selling timing derived from Fig. 1
shows the investment value of the forecasts.
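The timing rule read off Fig. 1 amounts to detecting crossings between the actual and predicted price series; a minimal sketch (our own formulation of the heuristic, not the authors' exact rule):

```python
import numpy as np

def crossing_signals(actual, predicted):
    """Sell when the actual price crosses from above the predicted
    price to below it; buy on the opposite crossing."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    above = actual > predicted
    signals = []
    for t in range(1, len(actual)):
        if above[t - 1] and not above[t]:
            signals.append((t, "sell"))
        elif not above[t - 1] and above[t]:
            signals.append((t, "buy"))
    return signals
```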
5 Conclusion
The use of SVMs in forecasting stock price is studied in this paper. The study
concludes that SVMs provide a promising alternative for financial time series
forecasting. The strengths of SVMs regression come from the following points:
1) the use of structural risk minimization; 2) few controlling parameters; 3) a
unique global solution obtained from a quadratic programming problem.
Further research should address sharply changing stock markets, since data
fluctuation may affect the performance of this method. Another direction for
further research is the use of prior knowledge in selecting the training sample and
determining the function parameters.
References
[1] D.A. Bessler and J.A. Brandt. Forecasting livestock prices with individual and composite
methods. Applied Economics, 13, 513-522, 1981.
[2] K.S. Harris and R.M. Leuthold. A comparison of alternative forecasting techniques for
livestock prices: A case study. North Central J. Agricultural Economics, 7, 40-50, 1985.
[3] J.H. Dorfman and C.S. Mcintosh. Results of a price forecasting competition. American J.
Agricultural Economics, 72, 804-808, 1990.
[4] S.C. Blank. Chaos in future markets? A nonlinear dynamic analysis. J. Futures Markets,
11, 711-728, 1991.
[5] J. Chavas and M.T. Holt. Market instability and nonlinear dynamics. American J.
Agricultural Economics, 75, 113-120, 1993.
[6] Hall JW. Adaptive selection of U.S. stocks with neural nets. In: GJ Deboeck (Ed.),
Trading on the edge: neural, genetic, and fuzzy systems for chaotic financial markets.
New York:Wiley, 1994.
[7] Yaser SAM, Atiya AF. Introduction to financial forecasting. Applied Intelligence, 6, 205-
213, 1996.
[8] G. Grudnitski, L. Osburn, Forecasting S&P and gold futures prices: an application of
neural networks, The Journal of Futures Markets, 13 (6) 631–643, 1993.
[9] S. Kim, S. Chun, Graded forecasting using an array of bipolar predictions: application of
probabilistic neural networks to a stock market index, International Journal of Forecasting
14 (3), 323–337, 1998.
[10] E. Saad, D. Prokhorov, D. Wunsch, Comparative study of stock trend prediction using
time delay, recurrent and probabilistic neural networks, IEEE Transactions on Neural
Networks, 9 (6), 1456-1470, 1998.
[11] Cheng W, Wanger L, Lin CH. Forecasting the 30-year US treasury bond with a system
of neural networks. Journal of Computational Intelligence in Finance 1996;4:10–6.
[12] Sharda R, Patil RB. A connectionist approach to time series prediction: an empirical test.
In: Trippi, RR, Turban, E, (Eds.), Neural Networks in Finance and Investing, Chicago:
Probus Publishing Co., 1994, 451–64.
[13] Haykin S. Neural networks: a comprehensive foundation. Englewood Cliffs, NJ:
Prentice Hall, 1999.
[14] Zhang GQ, Michael YH. Neural network forecasting of the British Pound/US Dollar
exchange rate. Omega 1998;26(4):495–506.
[15] Kaastra I, Milton SB. Forecasting futures trading volume using neural networks. The
Journal of Futures Markets 1995;15(8):853–970.
[16] Vapnik VN. The nature of statistical learning theory. New York: Springer, 1995.
[17] Schmidt M. Identifying speaker with support vector networks. Interface ‘96 Proceedings,
Sydney, 1996.
[18] Muller KR, Smola A, Scholkopf B. Predicting time series with support vector
machines. Proceedings of the International Conference on Artificial Neural Networks,
Lausanne, Switzerland, 1997, p 999.
[19] Vapnik VN, Golowich SE, Smola AJ. Support vector method for function approximation,
regression estimation, and signal processing. Advances in Neural Information Processing
Systems 1996;9:281-287.
[20] Thomason M. The practitioner methods and tool. Journal of Computational Intelligence
in Finance 1999;7(3):36–45.
[21] Keerthi, S. S. and C.-J. Lin. Asymptotic behaviors of support vector machines with
Gaussian kernel. Neural Computation 15 (7), 1667–1689, 2003.
[22] Lin, H.-T. and C.-J. Lin. A study on sigmoid kernels for SVM and the training of non-
PSD kernels by SMO-type methods. Technical report, Department of Computer Science
and Information Engineering, National Taiwan University. Available at
http://www.csie.ntu.edu.tw/~cjlin/papers