
ARMA Modeling In SAS

An Example Using The Share Price Of GE


David Blankley STA9750 Software Tools
[Figure: GE Share Price plotted against Time Axis, 01JAN2010 to 28JAN2011; vertical axis from 12 to 22]

Table of Contents
ARMA Modeling In SAS
An Example Using The Share Price Of GE
Table of Contents
Summary
Methodology and Results
    Data
    Model Identification
    Model Parameterization
    Prediction
    Creating Quality Graphical Output
Conclusion
Appendix I SAS Code
Appendix II Sources

Summary
The goal of this project is to extend knowledge of how to model time series in R to SAS. For the underlying process, the project builds on the stock market homework assignments of STA9750, which developed a set of weekly returns for GE and compared them to the S&P 500. An autoregressive moving average (ARMA) model is identified for the share price of GE, and that model is then used to forecast GE's share price one, two, three and four weeks into the future. Finally, because graphical results are an important part of explaining statistical analyses, a chart with several features specific to time series forecasting is implemented.

Methodology and Results


ARMA modeling can be decomposed into several steps: identify the order (p, q) of the ARMA model, estimate the coefficients associated with the model, check the parameterization against several diagnostic tools, and finally make predictions with the model.

Data
Of course, there is an unmentioned first step: get data. For this study, the weekly closing prices for GE were downloaded from finance.yahoo.com for the period November 27th, 2000 to the present, December 6th, 2010. The adjusted close field is used rather than the actual printed price on each date. This accounts for dividends and splits and ensures that price changes represent the return a cash investor in the stock would achieve. Additionally, it is well known that market data exhibit non-constant variance. To account for this, modeling is done on the natural log of the adjusted closes.

Model Identification
An ARMA model can be described mathematically as:
$$Y_t = \sum_{i=1}^{p} \phi_i Y_{t-i} + \sum_{j=1}^{q} \theta_j e_{t-j} + e_t, \qquad \{e_t\} \sim WN(0, \sigma^2)$$
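To make the recursion in this definition concrete, here is a minimal Python sketch that simulates a series from given coefficients. It is an illustration only: the function name `simulate_arma`, the coefficient values, and the choice of Gaussian shocks are assumptions (white noise does not have to be Gaussian), and terms that would reach before the start of the series are simply dropped.

```python
import random

def simulate_arma(phi, theta, n, sigma=1.0, seed=0):
    """Simulate Y_t = sum_i phi_i*Y_{t-i} + sum_j theta_j*e_{t-j} + e_t."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    p, q = len(phi), len(theta)
    y, e = [], []
    for t in range(n):
        shock = rng.gauss(0.0, sigma)  # assumption: Gaussian white noise
        # AR part: weighted sum of the p most recent values that exist.
        ar_part = sum(phi[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        # MA part: weighted sum of the q most recent shocks that exist.
        ma_part = sum(theta[j] * e[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        y.append(ar_part + ma_part + shock)
        e.append(shock)
    return y, e

# Hypothetical ARMA(1,1) coefficients, chosen only for illustration.
series, shocks = simulate_arma(phi=[0.5], theta=[0.3], n=200)
```

With empty coefficient lists the function returns the shocks themselves, which is the white noise case p = q = 0.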

The first goal is to identify the order of the two summations in the equation, p and q. This is accomplished by examining the plots of the sample auto-correlation and partial auto-correlation.

The auto-correlation function (ACF) measures the level of correlation in the series at different lags, or set points of time apart, and is effective at identifying the order q of the moving average portion of the equation. Mathematically, the ACF at a set lag h is equal to:
$$\rho(h) = \mathrm{Corr}(Y_t, Y_{t-h})$$
The plot examines the ACF at increasing values of h.

Similarly, the partial auto-correlation function (PACF) measures the level of correlation at a lag when conditioning on the effect of the intervening variables. That is, we are looking for:
$$\phi_{kk} = \mathrm{Corr}(Y_t, Y_{t-k} \mid Y_{t-1}, Y_{t-2}, \ldots, Y_{t-k+1})$$
The PACF is useful for identifying the order p in the base ARMA equation. In both instances, the plot is used to evaluate the following hypotheses at each lag:

ACF:  $H_0: \rho(h) = 0$ versus $H_a: \rho(h) \neq 0$
PACF: $H_0: \phi_{kk} = 0$ versus $H_a: \phi_{kk} \neq 0$

Lags that reject the null are considered good candidates for the order of p or q. PROC ARIMA automatically generates the ACF and PACF charts as part of its standard output. This can be generated with the code:
proc arima;
  identify var=lnGE nlag=30 center;

In this code the var= option identifies the variable being modeled, nlag= is the number of lags out to which the ACF and PACF are calculated, and center ensures the calculations are performed on a mean-0 series. The graphical output from this is a 2x2 panel that includes the series, the ACF, the PACF, and the inverse autocorrelation function (IACF), which is used in ARIMA modeling. Alternatively, PROC TIMESERIES can be used to generate the charts independently. The code to do so is:
proc timeseries print=summary plots=pacf;
  var lnGE;
proc timeseries print=summary plots=acf;
  var lnGE;
run;
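For readers who want to see what sits behind an ACF chart, the following Python sketch computes the sample autocorrelation at each lag and the approximate 5% rejection bound of 1.96/sqrt(n) that applies under a white noise null. It is a hand-rolled illustration, not what SAS runs internally; the function name `sample_acf` and the toy data are assumptions.

```python
import math

def sample_acf(x, nlag):
    """Sample autocorrelations r(1), ..., r(nlag) of the series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]             # centered series
    c0 = sum(d * d for d in dev) / n        # lag-0 autocovariance
    acf = []
    for h in range(1, nlag + 1):
        # Lag-h autocovariance, using the usual divisor n.
        ch = sum(dev[t] * dev[t - h] for t in range(h, n)) / n
        acf.append(ch / c0)
    return acf

# Toy data (a steadily rising series), used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rho = sample_acf(x, nlag=3)
bound = 1.96 / math.sqrt(len(x))            # approximate 5% bound under white noise
significant_lags = [h + 1 for h, r in enumerate(rho) if abs(r) > bound]
```

A lag whose sample autocorrelation falls outside plus or minus `bound` rejects the null hypothesis that the correlation at that lag is zero.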

The first output from SAS for the ACF and PACF is:

What these charts tell us is that the data is not stationary. A stationary series is one whose statistical properties, such as its mean, variance and autocorrelations, do not depend on time. In this instance the data is not stationary because each time point is clearly dependent on the previous time steps. This is most clearly seen in the PACF chart, which shows a very high PACF at lag 1. The solution to this problem is to difference the data and then recalculate the ACF and PACF on the new time series. Differencing sounds complicated, but it simply creates a new series by subtracting the previous data point from the current one. Mathematically:
$$X_t = Y_t - Y_{t-1}$$
SAS will do this automatically in the IDENTIFY statement of PROC ARIMA if we adjust our earlier command:
proc arima;
  identify var=lnGE(1) nlag=30 center stationarity=(adf=(1,2));

Notice the (1) after lnGE in the IDENTIFY statement; it requests first differencing. A second point to make on this code is the addition of the stationarity option. It will not always be so obvious that the data should be differenced, and in these cases the researcher can resort to an augmented Dickey-Fuller test. In this example, the test is run with lags of one and two on the differenced data. The Dickey-Fuller test and the lnGE(1) differencing shorthand of PROC ARIMA are not available in PROC TIMESERIES. However, a new series can be created in the DATA step to handle the differencing issue as follows:
* Create a lagged version of the log GE closes with lag=1;
lnGELagged = lag(lnGE);
* Future calls to PROC TIMESERIES should use this variable;
GEReturn = lnGE-lnGELagged;

Either way, the result will be a model based on the continuously compounded rate of return.
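The link between differenced logs and continuously compounded returns can be checked with a few lines of Python. This is a sketch with made-up prices; the function name `log_returns` and the price values are assumptions, not the GE data.

```python
import math

def log_returns(prices):
    """First difference of log prices: ln(P_t) - ln(P_{t-1})."""
    logs = [math.log(p) for p in prices]
    return [logs[t] - logs[t - 1] for t in range(1, len(logs))]

# Hypothetical weekly closes, for illustration only.
prices = [16.00, 16.40, 16.10, 17.75]
returns = log_returns(prices)

# Each difference equals the log of the price ratio, and the differences
# add up to the log return over the whole period.
first_ratio = math.log(prices[1] / prices[0])
total = math.log(prices[-1] / prices[0])
```

Because the differences sum to the total log return, a forecast made on the differenced log series can be accumulated and exponentiated to get back to a price.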

The output for the updated plots is:

From these it looks like an MA(3), MA(4), AR(3), AR(4), or possibly a simpler combination of the two such as ARMA(1,1), may all work. More quantitative methods of determining the order have been developed by Tsay and Tiao. These include the Extended Sample Autocorrelation Function (ESACF) and the Smallest Canonical Correlation Method (SCAN). SAS implements both through the SCAN and ESACF options of the IDENTIFY statement in PROC ARIMA, as follows:
identify var=lnGE(1) scan esacf;

The output from this additional command is:


ARMA(p+d,q) Tentative Order Selection Tests

   ---SCAN---      ---ESACF---
   p+d     q       p+d     q
    1      1        2      2
    4      0        0      4
    0      4        1      4

(5% Significance Level)

SCAN proposes an ARMA(1,1) as a first choice while ESACF proposes an ARMA(2,2). They differ in their second-choice candidate as well: SCAN prefers the autoregressive model with order p=4, abbreviated AR(4), while ESACF prefers the moving average model with order q=4, abbreviated MA(4). In general, simpler models are preferred over more complex ones, so the initial model to explore is the ARMA(1,1), and the remainder of this paper focuses on that model.

Model Parameterization
The equation for an ARMA(1,1) model is:
$$Y_t = \phi_1 Y_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t$$

where $\{\epsilon_t\} \sim WN(0, \sigma_e^2)$. Given this model, the next stage of the analysis is to determine estimates for the parameters $\phi_1$ and $\theta_1$, which are written $\hat{\phi}_1$ and $\hat{\theta}_1$ when denoting the estimates.

To estimate the parameters, SAS uses the ESTIMATE statement as part of PROC ARIMA. The important options used here are: noint to set the intercept to 0, method to select maximum-likelihood estimation, and p and q to tell SAS the order of the ARMA equation.

In this study noint is used because the series has already been transformed to a mean-0, untrended process, so the intercept should be 0. Maximum-likelihood estimation (MLE) is a common method for calculating parameter estimates and has been shown to have good properties as an estimator. For this study, there's no strong reason to move away from MLEs.

The final two options specify the AR order p and the MA order q. In order to give the researcher control over the specific $\phi_i$ and $\theta_j$ estimated, each is actually a list of lags in parentheses. For example, to model an AR(5) with no $\phi_2$ or $\phi_4$ term the option would be: p=(1,3,5)

Putting all this information together, the final form of the ESTIMATE statement used for an ARMA(1,1) model is:
ESTIMATE p=(1) q=(1) noint method=ml;

And when used in conjunction with PROC ARIMA:


PROC ARIMA DATA=GEData;
  IDENTIFY var=lnGE(1) nlag=30 center;
  ESTIMATE p=(1) q=(1) noint method=ml;
run;

As an aside, notice the use of the DATA= option. While not required, many SAS PROCs output new data sets and thus, unbeknownst to the programmer, change which data set was created most recently, which is the one a PROC uses by default. Thus, as a defensive measure, it is good software engineering practice to always state explicitly which data set is being used. This also benefits the future maintainability of the code.

The result of the above SAS statements is $\hat{\phi}_1 = -.89396$ with a standard error of .08490 and $\hat{\theta}_1 = .83655$ with a standard error of .10416. At this point care must be taken to establish what convention the SAS system is using to define these estimates. Some programs use the convention $Y_t + \sum_i \phi_i Y_{t-i} = \theta_1 \epsilon_{t-1} + \epsilon_t$ while others use the alternate form $Y_t = \sum_i \phi_i Y_{t-i} + \theta_1 \epsilon_{t-1} + \epsilon_t$. SAS uses the first form, so to convert to the same form as originally stated we have to change the sign of $\hat{\phi}_1$. The resulting model for ln(GE) through December 6th is:
$$Y_t = .894 Y_{t-1} + .837 \epsilon_{t-1} + \epsilon_t$$
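To show how the two coefficients of an ARMA(1,1) act once they are expressed in the form $Y_t = \phi_1 Y_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t$, the Python sketch below rolls that recursion forward from the last observed value and residual. The function name `arma11_forecast` and every input value are hypothetical; they are not the GE estimates, and the sketch assumes a zero-mean series.

```python
def arma11_forecast(phi, theta, last_value, last_residual, lead):
    """Forecast a zero-mean ARMA(1,1): Y_t = phi*Y_{t-1} + theta*e_{t-1} + e_t."""
    if lead < 1:
        return []
    forecasts = []
    # One step ahead: the latest value and the latest residual are both known.
    current = phi * last_value + theta * last_residual
    forecasts.append(current)
    # Further ahead: unobserved shocks are replaced by their expectation, 0,
    # so only the AR coefficient keeps acting on the previous forecast.
    for _ in range(lead - 1):
        current = phi * current
        forecasts.append(current)
    return forecasts

# Hypothetical coefficients and starting values, for illustration only.
path = arma11_forecast(phi=0.5, theta=0.3, last_value=0.02, last_residual=0.01, lead=4)
```

Flipping the sign of either coefficient changes the whole forecast path, which is why the sign convention has to be settled before the estimates are used.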

Prediction
Once we have the model, forecasting becomes an exercise in conditional expectation. For the ARMA(1,1) the resulting forecast is:
$$\hat{Y}_t = \hat{\phi}_1 Y_{t-1}$$
Notice that the MA term has dropped off as it is multiplied by a term with an expected value of 0. (This holds for shocks that have not yet occurred; for the first step ahead the most recent residual is already known, so the MA term still contributes there.) We can use SAS to generate this prediction with the FORECAST statement of PROC ARIMA. Two useful options to this statement are LEAD and OUT. LEAD sets the number of time steps into the future to forecast. OUT specifies the data set to put the results into. As this model is working with a transformed time series, this option is necessary to enable translation of the forecasts back into the original terms. The complete set of statements is:
proc arima data=GEData;
  identify var=lnGE(1) scan esacf nlag=30 center;
  estimate p=(1) q=(1) noint method=ml;
  forecast lead=4 out=predictOut;
run;

Note that the FORECAST statement must come after the ESTIMATE statement, as the results of ESTIMATE are used to specify the model. Using this command produces both a point forecast and a confidence interval for each forecast step. However, these still need to be transformed back to meaningful units. This is accomplished with the following DATA step:
data predictOut;
  set predictOut;
  l95 = exp( l95 );
  u95 = exp( u95 );
  forecast = exp( forecast );

This results in a point estimate of next week's closing price of $17.75 with a 95% confidence interval of ($16.23, $19.42). The actual closing price of GE on December 13th was $17.62.
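One consequence of exponentiating the bounds is that the interval is symmetric on the log scale but lopsided in dollars. The Python sketch below checks this using only the three figures reported above; the implied standard error assumes the usual normal-theory 95% interval (multiplier 1.96), which is an assumption about how the bounds were formed.

```python
import math

# Point forecast and 95% bounds as reported in the text, in dollars.
point, lower, upper = 17.75, 16.23, 19.42

# Move back to the log scale on which the model was fitted.
log_point, log_lower, log_upper = (math.log(v) for v in (point, lower, upper))

below = log_point - log_lower      # distance to the lower bound, in logs
above = log_upper - log_point      # distance to the upper bound, in logs

# Assumption: bounds are forecast +/- 1.96 standard errors on the log scale.
implied_se = (log_upper - log_lower) / (2 * 1.96)

# In dollars the upper side is wider than the lower side.
dollars_below = point - lower
dollars_above = upper - point
```

The two log-scale distances agree to within the rounding of the published figures, while the dollar distances differ, which is what a log transform is expected to produce.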

Creating Quality Graphical Output


A plot of the prediction is also of value; however, a significant amount of data manipulation is required to achieve a professional look. Among the challenges: the output of FORECAST does not provide a date for each predicted time step, the plot should include both the original GE data and the forecast, and a shaded region is needed to depict the prediction interval. Accomplishing these requires some work in the DATA step. The following DATA step merges the two data sets, fixes the timestamp problem and creates two new variables, FL95 and FU95, for the forecast in the prediction time period. Additionally, it creates an extra row that holds dummy values of FL95 and FU95 so that a shaded region can be drawn:
data allData;
  MERGE GEData predictOut;
  *by TradeDate;
  IF TradeDate EQ . THEN TradeDate='06DEC2010'D + (_n_-523)*7;
  IF TradeDate EQ '06DEC2010'D THEN DO;
    FL95=AdjClose;
    FU95=AdjClose;
  END;
  IF TradeDate GE '08DEC2010'D THEN FL95=L95;
  IF TradeDate GE '08DEC2010'D THEN FU95=U95;
  FORMAT TradeDate Date9.;
  IF TradeDate = '03JAN2011'D THEN DO;
    *Create extra row for shading;
    Output;
    TradeDate = '03JAN2011'D;
    FL95=17.72;
    FU95=17.72;
  END;
  OUTPUT;

Most of this is straightforward. The lone exception is the creation of the extra row. The reason for creating this is so the drawing algorithm can accurately determine the bounds of the polygon it is drawing. The start of the code to create the extra row begins on the line with the associated comment. The output statement saves the current copy, and a new one is also created. Then the values for that new row are adjusted as needed, specifically setting the second set of bounds for FL95 and FU95.

[Figure: GE Share Price plotted against Time Axis, 01JAN2010 to 28JAN2011; vertical axis from 12 to 22]

Finally, it is important to note that this code is not particularly maintainable. First, the use of 06DEC2010, 08DEC2010, and 523 tie the command specifically to the current data set. Second, the timeshift is specified in days and always seven. Production code should seek to address these issues. A good starting point would be the article by Croker referenced in the bibliography.
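As a language-neutral illustration of the first maintainability point, the Python sketch below derives the forecast dates from the last observed date and the forecast horizon instead of hard-coding them. The function name `forecast_dates` and its parameters are hypothetical; the same idea would carry over to the DATA step by reading the last date and the observation count from the data set.

```python
from datetime import date, timedelta

def forecast_dates(last_observed, lead, step_days=7):
    """Dates of the next `lead` observations, `step_days` apart."""
    return [last_observed + timedelta(days=step_days * k) for k in range(1, lead + 1)]

# The last observation in this study is 06DEC2010 and the horizon is four weeks.
dates = forecast_dates(date(2010, 12, 6), lead=4)
```

With these inputs the final date is 03JAN2011, the date at which the extra shading row is created in the DATA step above.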

Conclusion
Implementing Time Series analysis in SAS is surprisingly easy. The commands are relatively straightforward and implement the core algorithms needed for ARMA modeling, as well as providing the accompanying plots.

Appendix I SAS Code


data GEData;
  infile 'F:\SASAssignmentNotes\Project\ge_data.csv' DLM=',' FIRSTOBS=2;
  input TradeDate :MMDDYY10. Open High Low Close Volume AdjClose;
  format TradeDate Date9.;
  lnGE = log(AdjClose);
  lnGELagged = lag(lnGE);
  GEReturn = lnGE-lnGELagged;
  GELagged = lag(AdjClose);
  DiffedGE = AdjClose - GELagged;
  output;
run;

proc print;run;

ods graphics on;

proc timeseries print=summary plots=pacf data=GEData;
  var lnGE;
proc timeseries print=summary plots=acf data=GEData;
  var lnGE;
run;

proc arima data=GEData;
  identify var=lnGE(1) scan esacf nlag=30 center;
  estimate p=(1) q=(1) noint method=ml;
  forecast lead=4 out=predictOut;
run;

data predictOut;
  set predictOut;
  l95 = exp( l95 );
  u95 = exp( u95 );
  forecast = exp( forecast );
proc print data=predictOut;
run;

data allData;
  MERGE GEData predictOut;
  IF TradeDate EQ . THEN TradeDate='06DEC2010'D + (_n_-523)*7;
  IF TradeDate EQ '06DEC2010'D THEN DO;
    FL95=AdjClose;
    FU95=AdjClose;
  END;
  IF TradeDate GE '08DEC2010'D THEN FL95=L95;
  IF TradeDate GE '08DEC2010'D THEN FU95=U95;
  FORMAT TradeDate Date9.;
  IF TradeDate = '03JAN2011'D THEN DO;
    Output;
    TradeDate = '03JAN2011'D;
    FL95=17.72;
    FU95=17.72;
  END;
  OUTPUT;

proc print data=allData;
run;

goptions reset=all;
symbol1 value=none i=join line=1 c=black co=libgr;
symbol2 value=none i=join line=3 c=blue co=libgr;
*symbol2 value=none i=join line=3 c=CX803009 co=libgr; *if you prefer orange;
symbol3 value=none I=ms co=libgr c=gwh;
symbol3 value=none I=ms co=libgr c=CXD9A465;
symbol3 value=none I=ms co=libgr c=CXE5C5C2; * or pink...;
axis1 label=("Time Axis" ) order=('01JAN2010'D to '29JAN2011'D by 56) value=(h=1 angle=0 rotate=0) ;
* angle MUST come before the text or the text will not be rotated;
axis2 label=(angle=90 rotate=0 "GE Share Price") order=(12 to 22);

Proc Gplot data=allData;
  PLOT FL95*TradeDate=3 FU95*TradeDate=3 AdjClose*TradeDate=1 Forecast*TradeDate=2
       /overlay haxis=axis1 vaxis=axis2;
run;
quit;

Appendix II Sources
Presentation Quality Forecast Visualization with SAS/Graph by Samuel T. Croker, http://www.nesug.org/proceedings/nesug07/np/np04.pdf
Time Series Analysis: With Applications In R by Jonathan D. Cryer and Kung-Sik Chan
Time Series Analysis and Its Applications: With R Examples by Robert Shumway and David Stoffer
