
This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2018.2850798, IEEE
Transactions on Knowledge and Data Engineering
MANUSCRIPT FOR IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1
Learning Customer Behaviors for Effective Load Forecasting
Xishun Wang, Minjie Zhang, Senior Member, IEEE, and Fenghui Ren, Member, IEEE
Abstract—Load forecasting has been deeply studied because of its critical role in Smart Grid. In current Smart Grid, there are various
types of customers with different energy consumption patterns. Customers’ energy consumption patterns are referred to as customer
behaviors. It would significantly benefit load forecasting in a grid if customer behaviors could be taken into account. This paper
proposes an innovative method that aggregates different types of customers by their identified behaviors, and then predicts the load of
each customer cluster, so as to improve load forecasting accuracy of the whole grid. Sparse Continuous Conditional Random Fields
(sCCRF) is proposed to effectively identify different customer behaviors through learning. A hierarchical clustering process is then
introduced to aggregate customers according to the identified behaviors. Within each customer cluster, a representative sCCRF is
fine-tuned to predict the load of its cluster. The final load of the whole grid is obtained by summing the loads of each cluster.
The proposed method for load forecasting in Smart Grid has two major advantages. 1. Learning customer behaviors not only improves
the prediction accuracy but also has a low computational cost. 2. sCCRF can effectively model the load forecasting problem of one
customer, and simultaneously select key features to identify its energy consumption pattern. Experiments conducted from different
perspectives demonstrate the advantages of the proposed load forecasting method. Further discussion is provided, indicating that the
approach of learning customer behaviors can be extended as a general framework to facilitate decision making in other market
domains.
Index Terms—Load Forecasting, Customer Behaviors, Continuous Conditional Random Fields, Sparse CCRF, Demand Prediction.
1 INTRODUCTION
Load forecasting aims to predict the energy demand of customers under the influence of a series of factors, such as time, price and weather conditions. Load forecasting can benefit Smart Grid in several aspects. Accurate load forecasting helps to determine the amount of energy to produce, and thus to improve the efficiency of energy usage and keep the grid away from the risk of too much surplus energy. Brokers in Smart Grid markets rely heavily on load forecasting to make decisions on how much energy to purchase, in order to keep a good supply-demand balance and make more profit.

This study focuses on short-term load forecasting, i.e. prediction of the hourly power demand over the next 24 hours of a smart grid with various types of customers. Formally, the input data X = [x_1, x_2, ..., x_n] is an n × D matrix, representing n steps and D features in each step. The output y is an n-dimensional vector, corresponding to n hourly power usages. The input feature X is shared by all customers, and y is predicted by the learned model. The most widely used form of short-term load forecasting is to predict the hourly power usage in the coming 24 hours [24]. Therefore, the time step n is set to 24 in this study.

In current Smart Grid, there are various types of customers with different energy consumption patterns, which brings great challenges to accurate load forecasting of a grid system. Customers’ energy consumption patterns under the influence of a range of factors (such as time and weather conditions) are defined as customer behaviors. The complexity of customer behaviors comes from two aspects: vast types of customers and irregular behaviors of each customer type. In Smart Grid, the concept of “customer” has been extended to include not only general energy consumers, but also interruptible consumers, consumers with storage capacity and even small renewable energy producers. We give two instances to illustrate the irregular customer behaviors. Example 1: more and more householders have acquired photovoltaic power generation systems, which may lead to variable power usages under the influence of weather factors [10], [40], such as cloudiness and humidity. Example 2: some customers with storage capacity may recharge or supply power according to varying prices at different times of the day (Time-of-Use [33], a pricing mechanism used in Smart Grid markets).

Due to complex customer behaviors, traditional load forecasting methods, which model the whole grid or a particular customer, face challenges in precisely forecasting the load of a grid. Intuitively, if customers with similar behaviors could be aggregated into groups, the predictions for customer groups would improve the accuracy of the final load forecast. We therefore propose a method that identifies customer behaviors through learning in order to aggregate similar customers. This method is called Load Forecasting through Learning Customer Behaviors, LF-LCB for short. In LF-LCB, sparse Continuous Conditional Random Fields (sCCRF) is proposed to identify customer behaviors through supervised learning. Then all customers are hierarchically clustered according to the identified customer behaviors. For each customer cluster, a representative sCCRF is fine-tuned to predict its load. Finally, the load of the grid system is obtained by summing the loads of all customer clusters.

• X. Wang, M. Zhang and F. Ren are with the School of Computing and Information Technology, University of Wollongong, Australia. E-mail: xw357@uowmail.edu.au, minjie@uow.edu.au and fren@uow.edu.au
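As a rough illustration of the formulation in this section (a shared n × D feature matrix with n = 24, hourly outputs, and a grid forecast obtained by summing cluster forecasts), the following Python sketch may help. It is only a sketch under stated assumptions: the cluster labels are given by hand, and the linear predictor is a placeholder standing in for a fine-tuned sCCRF, not the model proposed in this paper. The names `aggregate_by_cluster`, `linear_forecast` and `forecast_grid_load` are hypothetical.

```python
N_STEPS = 24  # n = 24 hourly steps, as set in the text


def aggregate_by_cluster(customer_loads, labels, num_clusters):
    """Sum per-customer hourly loads into per-cluster hourly loads."""
    cluster_loads = [[0.0] * N_STEPS for _ in range(num_clusters)]
    for load, label in zip(customer_loads, labels):
        for t in range(N_STEPS):
            cluster_loads[label][t] += load[t]
    return cluster_loads


def linear_forecast(X, w):
    """Placeholder predictor (NOT sCCRF): y_t = sum_d X[t][d] * w[d]."""
    return [sum(x_td * w_d for x_td, w_d in zip(row, w)) for row in X]


def forecast_grid_load(X, cluster_weights):
    """Grid forecast = element-wise sum of the cluster-level forecasts."""
    cluster_forecasts = [linear_forecast(X, w) for w in cluster_weights]
    grid = [sum(f[t] for f in cluster_forecasts) for t in range(N_STEPS)]
    return grid, cluster_forecasts


# Toy shared feature matrix X (n x D with D = 2): a bias term and the hour fraction.
X = [[1.0, t / (N_STEPS - 1)] for t in range(N_STEPS)]

# Four toy customers with constant hourly loads 1, 2, 3, 4 and assumed cluster labels.
customer_loads = [[float(c + 1)] * N_STEPS for c in range(4)]
labels = [0, 1, 0, 1]
cluster_loads = aggregate_by_cluster(customer_loads, labels, num_clusters=2)

# One hand-picked weight vector per cluster (illustrative values only).
cluster_weights = [[4.0, 0.5], [6.0, -1.0]]
grid, cluster_forecasts = forecast_grid_load(X, cluster_weights)

print(grid[0], grid[-1])  # 10.0 9.5
```

The point of the sketch is the data flow only: every cluster model reads the same X, and the grid-level forecast is the sum of the cluster-level forecasts, so the number of models to maintain equals the number of clusters rather than the number of customers.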
1041-4347 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

The prominent novelty in LF-LCB is the aggregation of various customers through learning. It is challenging to effectively cluster different customers due to complex customer behaviors. LF-LCB introduces a sparse learning model (sCCRF) to select and weigh the features related to the customer’s energy consumption, and consequently uses a hierarchical clustering method to aggregate different customers. The hierarchical clustering circumvents the “curse of dimensionality” and obtains stable customer clusters. Customer aggregation can achieve two advantages. The first is an improvement in the accuracy of prediction. Intuitively, it is not feasible to precisely predict individual customers’ power usages because some behaviors might be random. When similar customers are clustered into the same group, their random behaviors seem to be “averaged”, making it possible to predict the load more accurately. The second advantage is a reduction in the computational cost. LF-LCB in the end maintains a small number of learned sCCRF models, which equals the number of customer clusters. Compared to load prediction for each customer, LF-LCB saves a considerable computational cost.

sCCRF is a fundamental algorithm in LF-LCB. We introduce sCCRF for two successive reasons. Reason 1: why we use CCRF to model the load forecasting problem. In short-term load forecasting, the sequence of load variables to be predicted is influenced by two major factors. 1) Individual load variables are influenced by observed external features. 2) Adjacent output variables show strong correlations revealed by partial autocorrelation [22]. CCRF can simultaneously take the above two factors into account. Compared to nonlinear regressions [8], [17], [37] and deep neural networks [12] that can only model the first factor, CCRF gains the advantage of modeling correlations in the output variables. Reason 2: why we introduce sparse CCRF. sCCRF applies the L1 norm penalty on the parameters of CCRF. The L1 norm penalty is capable of selecting features and thus results in a sparse model. Due to the sparsity, we can design a subsequent two-layer clustering process, which avoids clustering in high dimension. Besides, the L1 norm shows good generalization when training samples are limited [32]. The real-world data in Smart Grid must be accumulated over time, and are often expensive and limited.

Our preliminary work [44] studied learning customer behaviors and achieved promising results. In this paper, LF-LCB aims to provide more accurate descriptions of customer behaviors, in order to improve the prediction accuracy of the previous work [44]. In LF-LCB, sCCRF is proposed to identify customer behaviors with the consideration of multiple neighboring variables rather than a single neighboring variable as in L1-CCRF [44]. Therefore, sCCRF can give more accurate descriptions of customer behaviors, with which the whole load forecasting process can be improved, and thus produce more accurate predictions.

To be specific, this paper technically improves our previous work [44] in four aspects. Firstly, in the previous L1-CCRF, the unconstrained parameters could work in practice, but suffer some theoretical limits [43]. In this paper, we propose Sparse Continuous Conditional Random Fields (sCCRF) that considers the theoretical constraints on parameters. Secondly, L1-CCRF [43] only models the closest neighboring variables in load sequence data to analyze customer behaviors. In this paper, we extend sCCRF to model multiple close neighboring variables to provide more accurate descriptions of customer behaviors. Thirdly, we improve the fine-tuning step in LF-LCB to achieve fast convergence. Fourthly, we additionally provide load forecasting in uncertain environments to extend the application of LF-LCB. In experiments, we explore more external features to improve the accuracy of load forecasting. We also conduct new experiments to compare LF-LCB with state-of-the-art methods. In the end, we further discuss the potential to apply learning customer behaviors to wide market domains.

In summary, this paper has two major contributions.

1) This paper proposes sCCRF, which improves L1-CCRF [44] in two aspects. Firstly, sCCRF constrains the parameters in a theoretical view. Secondly, sCCRF extends to consider multiple neighboring variables. Experimental results demonstrate the effectiveness of sCCRF in prediction and feature selection.

2) LF-LCB substantially improves load forecasting with learning customer behaviors in our previous work [44]. The extensive analysis of learning customer behaviors can facilitate the research work in other market domains.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 gives a brief introduction to CCRF. Section 4 proposes sCCRF. In Section 5, learning customer behaviors using sCCRF is described in detail. The load forecasting process is introduced in Section 6. Experimental results are reported in Section 7. Section 8 further discusses the experimental results and the potential to extend learning customer behaviors to other market domains. Finally, conclusions are drawn in Section 9.

2 RELATED WORK

Load forecasting towards a whole grid or a specific customer has been deeply studied [7], [24], [39] using various approaches. Charytoniuk et al. [11] presented a nonparametric regression approach for short-term load forecasting. Their approach was derived from a load model in the form of a probability density function of load and affecting factors. A neural-wavelet approach was proposed by Gao and Tsoukalas for demand forecasting [19]. The wavelet transform was shown to be capable of capturing essential features of different types of loads and offers considerable promise for designing an on-line wavelet-based discriminator. Amin-Naseri and Soroush [4] used supervised and unsupervised learning to predict the daily peak load. Ali et al. [1] combined neural networks, time series models and ANOVA for load forecasting. Fan and Hyndman [15] proposed a semi-parametric additive model for short-term load forecasting.

As an active research topic, there have been several novel methods for load forecasting. Liu et al. [30] proposed a new way that performs quantile regression averaging on a set of sister point forecasts for probabilistic load forecasting. Zheng et al. [49] used a Long Short-Term Memory based recurrent neural network to capture the dynamic factors in Smart Grid for effective short-term load forecasting. Dong et al. [12] proposed Convolutional Neural Networks for large-scale load forecasting. They first clustered the data from different regions by K-means to alleviate the data imbalance, and then input regional data into deep neural networks. Those newly proposed methods have achieved improved
results on load forecasting, but they still target the whole grid, without consideration of customer behaviors.

Recently, researchers have conducted some work on aggregating customers to improve load forecasting. Srinivasan [38] manually divided different customers in a power grid into six groups, and introduced group method of data handling (GMDH) neural networks for load forecasting. In contrast, our method learns customer behaviors and thus clusters different customers adaptively. Alzate et al. [2] used spectral clustering to cluster customers with respect to the historical load data and reported improved accuracy in load forecasting. Alzate and Sinn [3] further explored kernel spectral clustering to aggregate customers, while their method was still limited to the unsupervised clustering of load data. Gulbinas et al. [21] segmented the energy usage data and constructed energy usage profiles to classify building occupants. They also defined energy-use efficiency, entropy and intensity to facilitate energy usages. Different from the previous work, our proposed method uses supervised learning to discover the relations between loads and external factors, which can provide more accurate descriptions of customer behaviors.

Fiot and Dinuzzo [16] used multi-task kernel learning to predict long-term load. Their method tried to discover the similarity of nodes in Smart Grid, and thus to improve the prediction of each node (customer). In predicting the long-term load, only time and calendar features are considered. Differently, our method introduces supervised learning to discover the customer behaviors under the influence of a series of external factors, and aggregates similar customers to improve the accuracy of load forecasting.

In the Conditional Random Field (CRF) research community, CCRF was proposed by Qin et al. [34] in 2009 to extend CRF to be capable of solving real-value regression problems. Since then, CCRF has been widely applied to various domains, such as learning to rank [34], expression recognition [6], and social recommendation [47]. A new sparse CCRF model is proposed in this paper. We explore learning CCRF with an L1 norm penalty term, and provide an effective learning approach. sCCRF gains extra feature selection capacity and extends CCRF to broad application domains.

Guo [22], [23] used CCRF to forecast the short-term power and gas usage in a building. Guo’s work demonstrated the advantages of CCRF and achieved superior performances in load forecasting in a small area. Our work promotes the study of CCRF in load forecasting from a different perspective. We propose sCCRF [44] and utilize it in learning customer behaviors to effectively aggregate customers, so as to improve the accuracy of load forecasting for the whole complex grid.

Sparse Gaussian conditional random fields (sGCRF) [46] is also an important variant of CCRF. Wytock and Kolter used it in demand forecasting and wind energy prediction [46]. However, sGCRF introduced parameters in each time step, which resulted in many more parameters than the previous CCRF. Such a large number of parameters may lead the customer clustering step to the “curse of dimensionality”, and therefore, we develop the new sCCRF instead of employing the traditional sGCRF.

3 AN INTRODUCTION TO CCRF

In this section, the concept of CCRF is introduced. As CCRF originates from CRF, we briefly introduce CRF first, and then extend CRF to CCRF.

3.1 Conditional Random Fields

CRF [27] was initially proposed for labeling sequence data. The chain-structured CRF, as illustrated in Fig. 1, is widely used.

[Figure 1: output nodes y_{i-1}, y_i, y_{i+1} and the observation X; labels: Edge Potential, Node Potential.]
Fig. 1. An illustration of a CRF with a chain structure.

Assume that X = {x_1, x_2, ..., x_m} is the given sequence of observations, and Y = {y_1, y_2, ..., y_n} is the label sequence to be predicted. CRF defines the conditional probability P(Y|X) in Equation 1.

$$P(Y|X) = \frac{1}{Z(X)} \exp(\Psi), \tag{1}$$

where Ψ is the energy function, and Z(X) is the partition function that normalizes P(Y|X).

The energy function Ψ is further defined as

$$\Psi = \sum_{i} \sum_{k=1}^{K_1} \alpha_k f_k(y_i, X) + \sum_{i,j} \sum_{k=1}^{K_2} \beta_k g_k(y_i, y_j, X), \tag{2}$$

where f_k(y_i, X) is called the node potential, g_k(y_i, y_j, X) is called the edge potential, and α_k and β_k are the corresponding weight parameters. In the energy function, the node potential captures the associations between inputs and outputs, while the edge potential captures the interactions between conditioned outputs.

The partition function Z(X) is defined in Equation 3.

$$Z(X) = \sum_{Y} \exp(\Psi) \tag{3}$$

CRF explicitly defines P(Y|X), which indicates that Y is determined by the whole observation X. Therefore, CRF can model the whole observed sequence for the output.

3.2 Continuous Conditional Random Fields

CRF outputs discrete values, while CCRF extends CRF to output real values. The definition of CCRF [34] differs from CRF in three aspects. 1) The output Y = {y_1, y_2, ..., y_n} can be a real value sequence. 2) The partition function Z(X) is alternatively defined as:

$$Z(X) = \int_{Y} \exp(\Psi)\, dY \tag{4}$$

3) The weights α and β must be positive to ensure the partition function is integrable.

To learn a CCRF model, maximum log-likelihood is used to find the optimal weights α and β. Given training data
D = {(X, Y)}_1^Q, where Q is the total number of training samples, the log-likelihood L(α, β) is maximized:

$$(\hat{\alpha}, \hat{\beta}) = \arg\max_{(\alpha, \beta)} L(\alpha, \beta), \tag{5}$$

where

$$L(\alpha, \beta) = \sum_{q=1}^{Q} \log P(Y_q | X_q). \tag{6}$$

The inference of CCRF is to find the most likely value for Y_k, provided an observed sequence X_k:

$$\hat{Y}_k = \arg\max_{Y_k} P(Y_k | X_k). \tag{7}$$

Radosavljevic et al. [35] have shown that P(Y|X) with quadratic potentials can be transformed into a multivariate Gaussian, which facilitates the learning and inference processes. Therefore, in our work, we design quadratic node and edge potentials to take advantage of the multivariate Gaussian form.

4 SPARSE CONTINUOUS CONDITIONAL RANDOM FIELDS

We propose sparse Continuous Conditional Random Fields (sCCRF) in this section. The L1 norm is introduced to regularize CCRF because the L1 norm penalty can result in a sparse model. However, the L1 norm is not differentiable at zero, so the previous learning method for CCRF can no longer be applied to sCCRF. Some special methods have been proposed to tackle learning with an L1 norm penalty [5], [48]. Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), proposed by Andrew and Gao [5], has been verified as an advantageous algorithm for L1-regularized log-linear models in [28]. We therefore employ the OWL-QN algorithm [5] to train sCCRF.

4.1 Introducing L1 norm to regularize CCRF

In previous CCRF [34], [35], the partition function takes the following form:

$$Z(X) = \int_{y} \exp\Big( \sum_{i} \sum_{k=1}^{K_1} -\alpha_k (y_i - X_{i,k})^2 + \sum_{i,j} \sum_{k=1}^{K_2} -\delta_k \beta_k (y_i - y_j)^2 \Big)\, dy. \tag{8}$$

In Equation 8, X denotes the feature matrix and y denotes the sequence to be predicted. When the variables in X and y are defined in infinite domains, both α and β are required to be positive to ensure that the partition function is integrable. However, this sufficient condition seems too strict. Glass et al. [20] extended the definitions of α and β to increase the modeling capacity of CCRF. We will discuss the constraints of α and β in Subsection 5.2 after CCRF is transformed to a multivariate Gaussian.

In sCCRF, the parameters α and β are shared across time steps (see Equation 2). Therefore, sCCRF can be applied to sequential data with large time steps (up to hundreds), which is similar to CCRF applied in expression analysis in video data [6]. For extremely large time steps, sequential data can be segmented and then input to sCCRF.

In machine learning, regularization has been commonly used in the learning process to achieve a model that generalizes to unseen data. In previous CCRF learning, L2 norm regularization has been used [6], [22], [35]. The L1 norm regularizer has been theoretically studied in [32] by Ng, and in practice, the L1 norm regularizer has gained roughly the same accuracy as the L2 norm regularizer [28]. The L1 norm also has a favorable property of selecting effective features, which can be utilized to analyze customer behaviors in our research. Therefore, we introduce the L1 norm to regularize the CCRF in the learning procedure. λ = <α, β>, a concatenation of the vectors α and β, is introduced to compactly represent the weights. The objective function to be minimized for sCCRF is designed in Equation 9,

$$F(\lambda) = -L(\lambda) + \rho \|\lambda\|_1, \tag{9}$$

where ‖·‖_1 stands for the L1 norm. In the objective function, the first term is the loss function, which is the negative log-likelihood of the training set (see Equation 6), and the second term is the L1 norm of λ, used as a regularization term. The parameter ρ balances the loss and the regularization term.

4.2 Learning sCCRF

For sCCRF learning, we introduce the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm, which extends the L-BFGS [31] algorithm to convex functions with an L1 norm penalty. The quasi-Newton algorithms gain the advantage of the second-order convergence rate with a small computation cost. These algorithms construct an approximation of the second-order Taylor expansion of the objective function, and then try to minimize the approximation. In the approximated Taylor expansion, the Hessian matrix is constructed with the first-order information gathered from previous steps. OWL-QN, which modifies L-BFGS, is motivated by the following basic idea. When the orthant is given, the L1 norm is determined and differentiable. Furthermore, the L1 norm is not related to the Hessian, which can be approximated by the loss term alone. Thus, OWL-QN in fact imitates L-BFGS steps in a chosen orthant.

Algorithm 1 sCCRF learning using OWL-QN
Input: Training samples D = {(X, Y)}_1^Q;
Output: Weight parameter vector λ;
1: Initialize: Initial point λ^0; S ⇐ {}, R ⇐ {}.
2: for k = 0 to T do
3:   Compute the pseudo-gradient ⋄F(λ^k)
4:   Choose an orthant ξ^k
5:   Construct H_k using S and R
6:   Compute search direction p^k
7:   Find λ^{k+1} with constrained line search
8:   if termination condition satisfied then
9:     Stop and return λ^{k+1}
10:   end if
11:   Update S with s_k = λ^{k+1} − λ^k
12:   Update R with r_k = −∇L(λ^{k+1}) + ∇L(λ^k)
13: end for

The learning procedure for sCCRF using OWL-QN is summarized in Algorithm 1. Before the explanation of Algorithm 1, we introduce two special functions [5] for convenience. The first one is the sign function σ: σ(x) results
in a value in {−1, 0, 1} according to whether x is negative, zero or positive. The second one is the projection function π: π(x; y), R^n → R^n, parameterized by y ∈ R^n, where

$$\pi_i(x; y) = \begin{cases} x_i & \text{if } \sigma(x_i) = \sigma(y_i) \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

It can be interpreted as projecting x onto the orthant defined by y.

In Algorithm 1, Step 1 chooses the initial λ, and initializes the sets S and R. S is for displacements s_k = λ^{k+1} − λ^k, and R is for changes in gradient r_k = −∇L(λ^{k+1}) + ∇L(λ^k). Steps 2-13 are the main iteration loop. Step 3 calculates the pseudo-gradient of F(λ) at λ, according to the following equation:

$$\diamond_i F(\lambda) = \begin{cases} \partial_i^{-} F(\lambda) & \text{if } \partial_i^{-} F(\lambda) > 0 \\ \partial_i^{+} F(\lambda) & \text{if } \partial_i^{+} F(\lambda) < 0 \\ 0 & \text{otherwise} \end{cases}, \tag{11}$$

where

$$\partial_i^{\pm} F(\lambda) = -\frac{\partial L(\lambda)}{\partial \lambda_i} + \begin{cases} \rho\,\sigma(\lambda_i) & \text{if } \lambda_i \neq 0 \\ \pm\rho & \text{if } \lambda_i = 0 \end{cases}. \tag{12}$$

In Equation 12, the term ∂L(λ)/∂λ_i is derived with respect to the specified model. In Step 4, an orthant ξ^k is chosen based on ⋄F(λ),

$$\xi_i^k = \begin{cases} \sigma(\lambda_i^k) & \text{if } \lambda_i^k \neq 0 \\ \sigma(-\diamond_i F(\lambda^k)) & \text{if } \lambda_i^k = 0 \end{cases}. \tag{13}$$

Step 5 constructs the inverse Hessian approximation H_k, which is constructed in the same way as in the traditional L-BFGS [31]. Step 6 then determines the search direction p^k, formulated by

$$p^k = \pi(H_k v^k; v^k), \tag{14}$$

where v^k = −⋄F(λ^k). Steps 7-10 aim at finding the next point λ^{k+1} using constrained line search, in which each point explored is projected back onto the chosen orthant: λ^{k+1} = π(λ^k + αp^k; ξ^k), where α controls the search step. Steps 11 and 12 update the sets S and R, respectively.

5 LEARNING CUSTOMER BEHAVIORS TO AGGREGATE CUSTOMERS

Customer aggregation tries to “smooth” the random behaviors of customers by clustering similar customers into the same group. Based on the “averaged” data of a customer cluster, a better predictor can be learned. LF-LCB is composed of four steps. 1) The load forecasting problem for each customer is modeled using sCCRF; 2) the sCCRF model is initially learned for analysis of customer behaviors; 3) all the customers are hierarchically clustered based on different customer behaviors; and 4) the representative sCCRF is fine-tuned for each customer cluster. In the following subsections, the four steps are described in detail.

5.1 Model Design

In our model, both node potentials and edge potentials bear a quadratic form. The node potential is defined as follows.

$$f_k(y_i, X) = -(y_i - X_{i,k})^2 \tag{15}$$

For an output y_i, the node potential associates it with the current observed feature vector x_i. As the number of features in x_i is D, D node potentials are generated for the current observation. The raw features after normalization are directly used in the node potential, so that we can analyze how much a feature relates to a certain customer’s load.

The output variable y is a vector. For two elements y_i and y_j, if |i − j| = 1, y_i and y_j are closest neighboring variables; if |i − j| = m, y_i and y_j are m neighboring variables. The edge potential, which captures the interactions between outputs, is defined as follows.

$$g_k(y_i, y_j, X) = -\frac{1}{2} s^k_{i,j} (y_i - y_j)^2, \tag{16}$$

where s^k_{i,j} is an indicator function defined as follows,

$$s^k_{i,j} = \begin{cases} 1 & \text{if } |i - j| \leq m \\ 0 & \text{otherwise} \end{cases}, \tag{17}$$

where m refers to the number of neighboring variables taken into account.

In our previous work, only the closest neighboring variables were considered [44]. Now we extend our previous work to take m neighboring variables into account. With the consideration of multiple neighboring variables, we can model the load forecasting for each customer more accurately, and hence improve the accuracy of load forecasting in a grid. Here, m is a critical parameter that influences the performance of LF-LCB. In the experimental part, Subsection 7.2 further analyzes the choice of a proper m. We also show that with the proper m, LF-LCB achieves enhanced performance compared with our previous work in Subsection 7.5.

For regression problems, edge potentials suffer from a weak feature constraint problem [22], which is briefly explained as follows. CRF is a maximum entropy model with feature constraints to perform structural learning. CRF learning tries to force the expectation of each feature with respect to the model to equal that with respect to the learning data. For the binary features in conventional CRF, knowing the mean is equivalent to knowing the full distribution. In contrast, the mean does not contain much information about the distribution of a continuous variable in CCRF. This is the cause of the weak feature constraint problem.

We follow the work by Guo [22] and introduce Predictive Clustering Trees (PCTs) [42] to tackle this problem. PCTs use a tree structure to supply rich feature constraints, which in effect divide the distribution of a continuous variable into several sub-distributions. Here, ∆x is introduced to denote the change of neighboring features, and ∆y denotes the change between y_i and y_j. The PCTs provide more sophisticated relationships between ∆y and ∆x through dividing s^k_{i,j} into more indicator functions δ_k^{(p)}. The value of δ_k^{(p)} is determined by its corresponding assertion. When the assertion holds, its value is “1”; otherwise “0”. Fig. 2 uses the temperature feature as an example to illustrate how PCTs work. When the assertion that ∆x is small (similar temperatures in the two hours) holds, the value of δ_k^{(1)} is “1” and PCTs stop expanding. When ∆x is not small, δ_k^{(1)} is “0” and PCTs continue expanding. Similar processes repeat for δ_k^{(2)} and δ_k^{(3)}.

In our situation, three different indicator functions (P = 3) can represent the changes (going up, going down or
staying similar) of a feature, and therefore are adequate to supply sufficient relationship information between ∆y and ∆x. As the assertions in PCTs are mutually exclusive, only one assertion holds in the end. Thus, we have s^k_{i,j} = Σ_{p=1}^{P} δ_k^{(p)}. In another view, through the divisions by PCTs, more indicator functions s^k_{i,j} are obtained. In the following text, we only use s^k_{i,j} and no longer mention δ_k^{(p)}.

[Figure 2: a decision tree rooted at (y_i − y_j)^2. First test “∆x: similar temperature”: Yes leads to δ_k^{(1)} = 1; No leads to a second test “∆x: dropped temperature”: Yes leads to δ_k^{(2)} = 1; No leads to δ_k^{(3)} = 1.]
Fig. 2. An illustration of how PCTs work with respect to the temperature feature. When the temperature does not change much, the branch δ_k^{(1)} holds. If the temperature goes down, it goes to the branch of δ_k^{(2)}. If the temperature goes up, the branch δ_k^{(3)} holds. Therefore, three levels of PCTs are sufficient to supply information of temperature change.

With the node potential in Equation 15 and the edge potential in Equation 16, the resultant CCRF model is formulated as follows.

$$P(y|X) = \frac{1}{Z(X)} \cdot \exp\Big( -\sum_{i=1}^{n} \sum_{k=1}^{D} \alpha_k (y_i - X_{i,k})^2 - \frac{1}{2} \sum_{i,j} \sum_{k=1}^{S} \beta_k s^k_{i,j} (y_i - y_j)^2 \Big) \tag{18}$$

Note that β_k is shared across different pairs of neighboring variables. We tried to introduce a parameter for each neighboring variable but it led to slight performance decay. We therefore use the shared β_k in edge potentials.

Following Radosavljevic’s work [35], the CCRF in Equation 18 can be derived into the following multivariate Gaussian form to facilitate learning and inference (the derivation process is shown in Appendix A).

$$P(y|X) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \cdot \exp\Big( -\frac{1}{2} (y - \mu(X))^T \Sigma^{-1} (y - \mu(X)) \Big) \tag{19}$$

In this Gaussian form, the precision matrix Σ^{-1} is the sum of two n × n matrices, further expressed as follows.

$$\Sigma^{-1} = 2(M^1 + M^2), \quad \text{where}$$
$$M^1_{i,j} = \begin{cases} \sum_{k=1}^{D} \alpha_k & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
$$M^2_{i,j} = \begin{cases} \sum_{k=1}^{S} \beta_k \sum_{r=1}^{n} s^k_{i,r} - \sum_{k=1}^{S} \beta_k s^k_{i,j} & \text{if } i = j \\ -\sum_{k=1}^{S} \beta_k s^k_{i,j} & \text{if } i \neq j \end{cases} \tag{20}$$

The diagonal matrix M^1 represents the contribution of the α terms (node potentials), and the symmetric matrix M^2 represents the contribution of the β terms (edge potentials). The mean µ(X) is computed by

$$\mu(X) = \Sigma \cdot \theta, \tag{21}$$

where θ is an n-dimensional vector, and each element is calculated by

$$\theta_i = 2 \sum_{k=1}^{D} \alpha_k X_{i,k} \tag{22}$$

5.2 Learning an sCCRF model

We use Algorithm 1 to optimize the weights of the built CCRF model. The constraints on the parameters α and β are discussed first. Our previous work [44] did not put any constraints on α or β, based on the thought that features and ground-truth were in limited ranges in practical problems. The previous way works but suffers theoretical limits. Here, we propose a simple way to tackle these constraints. Since CCRF has been derived into a multivariate Gaussian form, the constraint inherently comes from the positive-definiteness of the precision matrix Σ^{-1} [35]. This constraint can be incorporated into the line search step in the OWL-QN optimization process. In the line search step, the step size is chosen to ensure both descent of the objective function (Equation 9) and positive-definiteness of the precision matrix. This is similar to the method used in [46]. Therefore, we do not explicitly derive the constraints on the parameters α and β.

We derive the gradient of −L(λ) with respect to α_k and β_k for our model in Equation 18.

$$\nabla\alpha_k = -\frac{\partial L(\alpha, \beta)}{\partial \alpha_k} = -\sum_{q=1}^{Q} \frac{\partial \log P(y^{(q)} | X^{(q)})}{\partial \alpha_k} \tag{23}$$

$$\nabla\beta_k = -\frac{\partial L(\alpha, \beta)}{\partial \beta_k} = -\sum_{q=1}^{Q} \frac{\partial \log P(y^{(q)} | X^{(q)})}{\partial \beta_k} \tag{24}$$

In Equations 23 and 24, the derivations of ∂log P(y|X)/∂α_k and ∂log P(y|X)/∂β_k are simplified thanks to the multivariate Gaussian form, which is shown in Appendix B in detail. Based on the obtained ∇α_k and ∇β_k, Algorithm 1 is used to minimize the objective F(λ) to obtain the optimal weights for sCCRF.

The gradients are derived across the whole dataset. As the overall gradients are a linear sum of the gradients of each sample, they can still be computed for large datasets (via a queue buffer). Once gradients are obtained from the dataset, OWL-QN can proceed. Therefore, sCCRF can also work on large datasets.

5.3 Aggregating Customers

Customer behaviors can be represented by the weight vector λ of sCCRF. In the learned weights λ = <α, β> for a certain customer, each α_k reveals how much a feature influences the power usage of the customer, and each β_k reveals how much the feature variance influences the change of hourly usage. As we use sparse CCRF, the weights of unrelated features have been pushed to zero, and the remaining features with non-zero weights can reflect how much the load is influenced by the related features. The clustering process
is fully adaptive to the identified customer behaviors. We adjust the number of customer clusters to achieve the best performance. That is the major difference between learning customer behaviors and manual division of customers.
With the obtained sparse feature weights, a hierarchical clustering process can be designed to aggregate customers, which is illustrated in Fig. 3. First, each weight in λ is binarized. To be specific, all the non-zero weights are converted into "1", and the zero values remain. In the first layer, all customers are clustered according to the binarized weights. Customers who have the same binarized weights fall into the same cluster. Afterwards, each cluster in the first layer is further clustered according to the customers' non-zero weights to form clusters in the second layer. K-means, with a Euclidean distance criterion ∆, is utilized in the second-layer clustering. To be specific, ∆ represents the largest distance between a point within a cluster and the cluster center. K starts with a manually chosen value and then increases until the pre-set ∆ is met. ∆ is a critical parameter, which determines the number of final clusters and influences the final accuracy of load forecasting. This parameter is further analyzed in Subsection 7.2 in the experiments.

[Figure 3: all customers are split by binarized weights into first-layer clusters C1, ..., Ci; each first-layer cluster is split by non-zero weights into second-layer clusters Ci1, ..., Cij.]
Fig. 3. An illustration of clustering customers. All customers are first clustered according to the binarized weights (feature selection results). Then each cluster in the first layer is further clustered according to the non-zero weights using the K-means method.

A two-layer clustering tree is obtained (see Fig. 3), and clusters in each layer have clear physical meanings. In the first layer, clustered by the binarized weights, the features with weight "1" are related to the customers' power usage. Thus, customers in the same cluster are influenced by the same range of features. In practice, clusters in this layer can correspond to customer genres such as wind power producers, householders and office users. In the second layer, each cluster Ck is further divided into smaller clusters according to the non-zero weights, which indicate how much each feature influences the customer's power usage. After the second clustering, customers in each smaller cluster Ckl share similar sensitivity to the range of features. Office buildings are one example: buildings in one cluster may adjust their power usage according to temperature, while buildings in another cluster are less sensitive to the influence of temperature.
The hierarchical clustering process results in reliable customer clusters since it circumvents the "curse of dimensionality" in clustering problems. Thanks to sCCRF, the resultant weight parameters are sparse, and consequently we can design the hierarchical clustering to aggregate customers. Together, the sparse learning model and the hierarchical clustering form a coherent process to effectively aggregate customers.

5.4 Fine-tuning sCCRFs
After obtaining the clustering tree, we fine-tune an sCCRF for each cluster in the second layer. In Subsection 5.2, an sCCRF has been learned for each customer. For a customer cluster, a learned sCCRF is selected and fine-tuned. Fine-tuning an sCCRF for each cluster has two advantages. 1) Increasing prediction precision: for an individual customer, behaviors can be chaotic, and the load is hard to predict. In contrast, the customers' usage data appear "smoothed" in a customer cluster. 2) Reducing computational cost: for a cluster with N customers, only one fine-tuned sCCRF is needed in the end.
Our previous work [44] randomly selected an sCCRF within a customer cluster and then fine-tuned it. Here we present a more effective way to select a representative sCCRF in order to improve the efficiency of fine-tuning. The new selection rule is as follows. We define the cluster center as the center of the minimal sphere that contains all the customers' non-zero weight vectors. For a cluster with N customers, the cluster center is found first. The customer whose non-zero weight vector is closest to the center is targeted, and the corresponding sCCRF is selected.
Fine-tuning an sCCRF for a cluster is quite straightforward. For the selected sCCRF, the input features X^(q) remain, while the ground truth of the load becomes the average load of all customers in the cluster (note that load data are normalized). Then Algorithm 1 is employed to fine-tune this sCCRF with the input features and the new ground truth. The fine-tuning process converges quickly, because the weights in the selected sCCRF are already close to the optimal weights of the final sCCRF.

6 LOAD FORECASTING
With the learned sCCRF for each customer cluster, the hourly load can be predicted. Summing the predicted load of each cluster gives the final load for the whole grid.
To predict the load for each customer cluster, we find the most likely y given the observed feature X, as formulated in Equation 7. Thanks to the multivariate Gaussian form, the inference becomes quite efficient. To maximize P(y|X) in the multivariate Gaussian (see Equation 19), we simply make y equal to µ(X),

$$\hat{y} = \arg\max_y P(y|X) = \mu(X) = \Sigma \cdot \theta \tag{25}$$

Assuming there are N customer clusters formed in the grid, adding up the predicted load ŷi of each cluster element-wise, the final load yW of the whole grid is obtained by the following equation.

$$y_W = \sum_{i=1}^{N} \hat{y}_i \tag{26}$$
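As a rough illustration of Equations 25 and 26, the sketch below builds the precision matrix of Equation 20, solves Σ^{-1} ŷ = θ for one cluster, and adds the cluster predictions element-wise. It is a minimal pure-Python sketch rather than the authors' implementation: all function names are hypothetical, and the form θ_i = 2 Σ_k α_k X_{i,k} is an assumption obtained by expanding the node potential in Equation 18, since the definition of θ is not restated in this section.

```python
def precision_matrix(alpha, beta, sims, n):
    """Build Sigma^{-1} = 2 (M1 + M2) following Equation 20.

    alpha: node weights, beta: edge weights,
    sims: one n x n similarity matrix s^k per edge weight.
    """
    node = sum(alpha)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            edge = sum(b * s[i][j] for b, s in zip(beta, sims))
            if i == j:
                row = sum(b * sum(s[i]) for b, s in zip(beta, sims))
                q[i][j] = 2.0 * (node + row - edge)
            else:
                q[i][j] = -2.0 * edge
    return q


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def predict_cluster(alpha, beta, sims, X):
    """MAP prediction y_hat = mu(X), obtained from Sigma^{-1} y_hat = theta."""
    # Assumed form of theta (not restated in this section).
    theta = [2.0 * sum(a * x for a, x in zip(alpha, row)) for row in X]
    return solve(precision_matrix(alpha, beta, sims, len(X)), theta)


def grid_load(cluster_predictions):
    """Equation 26: element-wise sum of the per-cluster predictions."""
    return [sum(values) for values in zip(*cluster_predictions)]
```

With β = 0 the prediction reduces to an α-weighted average of the features in each hour; a positive β pulls neighboring hours toward each other. Solving the linear system avoids forming Σ explicitly, which is one common design choice for Gaussian models of this kind.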
Besides the exact inference, due to the Gaussian distribution, the 95%-confidence intervals of the approximate outputs can be obtained by the following equation:

$$\tilde{y} = \hat{y} \pm 1.96 \times \mathrm{diag}(\Sigma) \tag{27}$$

Equation 27 may assist decision making in uncertain environments.

7 EXPERIMENTS AND ANALYSIS
Four experiments were conducted from different perspectives to evaluate LF-LCB. Experiment 1: Analysis of internal parameters. As both the number of neighboring variables m and the clustering criterion ∆ affect the final load forecasting result, we try to find optimal values of m and ∆ for practical use of LF-LCB. This experiment also justifies the values of m and ∆ used in the following experiments. Experiment 2: Comparing sCCRF to state-of-the-art methods in load forecasting. We compare sCCRF with ARIMA [9], Support Vector Regression (SVR) [13], CCRF [22], Convolutional Neural Networks (CNN) [12] and LSTM [49] on load forecasting for three typical customers. Experiment 3: Evaluation of learning customer behaviors. Two other prediction methods based on sCCRF without the consideration of customer behaviors were constructed. We compared LF-LCB with the two methods to demonstrate the advantage of introducing learning customer behaviors to load forecasting. Moreover, customer clusters and load forecasting results of LF-LCB were visualized and analyzed. Experiment 4: Comparisons with related customer aggregation methods. LF-LCB was compared with hand-crafted customer clusters for load forecasting, our initial study that considered only the closest neighboring variables [44], and Alzate and Sinn's work that utilized spectral clustering to aggregate customers [3].
Our experiments were conducted on the platform of the Power Trading Agent Competition (Power TAC) [26]. Power TAC has drawn wide attention and has become a benchmark in the Smart Grid research community. Power TAC simulates a variety of customers with various behaviors in a grid. It also offers a wealth of features, including real-world weather conditions and real-time market status. In addition, the Power TAC server supplies logs of customers' hourly power usage from real-world data, which are regarded as the ground truth to evaluate the proposed LF-LCB.
The overall workflow of this study is as follows. We first configure Power TAC to run simulation games and collect training and testing data. The training data go through the learning customer behaviors process described in Section 5. Given the learned model, testing data go through the load forecasting process in Section 6 to obtain the predicted load of the grid.

7.1 Experiment Settings
Features used in this study include temporal features, calendar features, weather features and market features. These are all the available features extracted from Power TAC data. Weather features are from real-world hourly weather forecasting data (the location is undisclosed) [26]. Both input features and outputs are normalized into standard Gaussians in data pre-processing. Table 1 lists the contents of the four types of features, where the indexes are used for further discussion.

TABLE 1
Features used in LF-LCB

Feature            Content          Index
Temporal feature   hour of a day    t1
                   day of a week    t2
Calendar feature   holiday or not   c1
Weather feature    temperature      w1
                   wind strength    w2
                   wind direction   w3
                   cloudiness       w4
Market feature     lowest price     m1
                   average price    m2

We configured the Power TAC server and weather data server, and utilized Power TAC games to generate training and test data. Training data were generated by six games based on the data in 2009. Test data were from six games based on the data in 2010. The logged customer usages were regarded as the ground truth of the loads to be predicted. To induce varied features, three broker models, TacTex [41], cwiBroker [29] and GongBroker [45], were introduced to compete in the games. 40 customers, each with a certain population, were simulated in Power TAC. Two customers with populations up to tens of thousands were split into customers with populations of 100. The split customers were then given some random behaviors in the log data, such as power usage decreasing when going out at night, or load increasing when having a party at home. In the end, 538 customers were obtained. These customers were sufficient to analyze LF-LCB.
The configurations in LF-LCB are now described in detail. sCCRF modeled the 24-hour power usage sequence under the influence of hourly features. For the features in each hour, 9 node potentials (corresponding to 9 features) were generated. Edge potentials were generated among the m closest neighboring variables, where m is determined in Experiment 1. PCTs were applied to weather and market features, which fluctuate and thereby influence the power usage. It was not necessary to apply PCTs to temporal and calendar features. To ensure that the sCCRF for each customer converged, we set 50 iterations for OWL-QN. For the fine-tuned sCCRF for each customer cluster, 10 iterations were sufficient to ensure convergence.
In the proposed LF-LCB, there are three hyper-parameters. One is ρ in the cost function of sCCRF (Equation 9), which can be determined by cross-validation. The other two are m, which defines the number of neighboring variables taken into account, and ∆, which determines the granularity of the customer clusters. As these two parameters greatly affect the performance of LF-LCB, we show how to choose proper m and ∆ in Experiment 1. For the other experiments, we set m = 2 and ∆ = 0.05, which are the optimal values found in Experiment 1.

7.2 Experiment 1: Analysis of internal parameters
∆ and m are two important parameters in LF-LCB. We use grid search to find the optimal values for ∆ and m. m was searched in the outer loop and ∆ was searched in the inner
loop. Four different values were set for m: 1, 2, 3 and 5; five different values were set for ∆: 0.025, 0.05, 0.075, 0.10 and 0.15. The LF-LCB process was repeated with each combination of ∆ and m. For the predicted power usage in each hour, Mean Absolute Percentage Error (MAPE) was used to measure the precision of load forecasting. The overall MAPE (the average of the MAPE over 24 hours) indicates the performance of load forecasting. With grid search, we found that the optimal value of m was 2, and that of ∆ was 0.05.
We further analyzed the influence of m and ∆ independently. Table 2 shows the number of clusters and the overall MAPE under different ∆ values when m = 2.

TABLE 2
The influence of ∆ on the MAPE and the number of clusters

∆       Number of clusters   Overall MAPE (%)
0.025   60                   5.33
0.05    24                   4.08
0.075   23                   4.33
0.10    17                   5.41
0.15    10                   6.02

In Table 2, when ∆ was set to 0.025, 60 customer clusters were formed, and the overall MAPE was 5.33%. When we set ∆ = 0.05, 24 customer clusters were obtained, and the overall MAPE dropped to 4.08%. Comparing these two settings, we can see that a small ∆ results in a fine granularity of customer clusters. When a customer cluster is too small, the "smoothness" of the load data is compromised. That is why a small ∆ leads to a less competitive prediction result. When ∆ was set to 0.075, 23 customer clusters were formed, and the overall MAPE was 4.33%. This indicates that when ∆ changes from 0.05 to 0.075, the performance of LF-LCB does not change much. Besides, as the number of clusters also determines the number of final sCCRFs required, the range from 0.05 to 0.075 results in a small computational cost. When ∆ was 0.10, prediction accuracy declined greatly. Increasing ∆ further to 0.15 led to only a few clusters and a further decline in accuracy. From the above analysis, we suggest that a reasonable range of ∆ is [0.05, 0.075].
Table 3 shows the number of clusters and the overall MAPE under different m values when ∆ = 0.05.

TABLE 3
The influence of m on the MAPE and the number of clusters

m   Number of clusters   Overall MAPE (%)
1   26                   4.26
2   24                   4.08
3   27                   4.21
5   26                   4.32

We can see that m does not have much influence on the number of customer clusters, but does affect the final load forecasting accuracy. As the MAPE is already quite small, different values of m can result in considerable relative changes in prediction accuracy. When m increases beyond 2, the computational cost increases while prediction accuracy declines. In this evaluation, we therefore choose m = 2, i.e. we model the correlations between the current variable and its two neighboring variables.
In summary, ∆ can greatly affect the performance of LF-LCB, because it significantly influences the number of customer clusters and consequently affects the load forecasting results. With a proper ∆, an optimal m can further improve the accuracy of load forecasting. LF-LCB uses sCCRF, which requires the input data to go through a Gaussian normalization. For other similar problems using our method, as the input data are all normalized into a standard Gaussian, the suggested ranges of ∆ and m can also be valuable references.

7.3 Experiment 2: Comparing sCCRF with state-of-the-art methods
This experiment compares sCCRF with other methods on load forecasting of individual customers, which is the foundation for accurately predicting the load of a grid. We compare the proposed sCCRF with other state-of-the-art prediction methods, including ARIMA [9], Support Vector Regression (SVR) [13], CCRF [22], Convolutional Neural Networks (CNN) [12] and LSTM [49]. As these methods are not suitable for the analysis of customer behaviors, we compared them with sCCRF on load forecasting of individual customers for fairness. We chose three typical customers in Smart Grid, and used the six models to predict the load for each of them.
Householders, office users and solar energy producers are three types of customers exhibiting irregular behaviors. We therefore chose a householder, an office user and a solar energy producer to evaluate the performance of the six models in load forecasting. For each model, cross-validation was used to tune hyper-parameters. Table 4 lists the best results achieved by each model.

TABLE 4
The performance of six models on load forecasting of three customers. The results are measured by overall MAPE (%).

Customer       ARIMA   SVR    CCRF   CNN    LSTM   sCCRF
Householder    4.90    4.81   4.88   5.45   5.13   4.40
Office user    4.01    3.89   3.76   4.48   4.32   3.69
Solar energy   8.03    8.25   7.46   8.92   7.85   7.32

For householders, sCCRF achieves better accuracy than ARIMA, SVR and CCRF. CNN and LSTM do not perform well, which may be due to the limited number of training samples. For office users, sCCRF is slightly better than SVR and CCRF, while CNN and LSTM are less competitive. For solar energy producers, sCCRF still obtains the lowest MAPE. As the three kinds of customers show different behaviors, prediction results from different methods may vary a little. In general, sCCRF outperforms the other methods.
The major advantages of sCCRF may be attributed to two factors. 1. sCCRF models not only the mapping from input features to outputs, but also the correlations of output variables, while the other methods (except CCRF) cannot model correlations of outputs. 2. sCCRF introduces an L1 regularization term, and thus can generalize well when training samples are limited.
It is widely known that deep neural networks require a large quantity of training samples to achieve good performance. In this evaluation, the power of CNN and LSTM
is not fully shown, and we cannot easily come to the conclusion that CNN and LSTM are not competitive. Besides the competitive accuracy in load forecasting, sCCRF has another advantage in that it simultaneously selects effective features during training, which can be utilized in the analysis of customer behaviors.

7.4 Experiment 3: Evaluation of Learning Customer Behaviors

7.4.1 Comparing LF-LCB to baselines
We evaluated the contribution of learning customer behaviors in load forecasting by comparing LF-LCB with two other methods, both of which used sCCRF without considering customer behaviors. In method 1, one sCCRF was trained for each customer. The load of the whole grid was the sum of all individual customers' loads predicted by the sCCRFs. We named this method LF-S. In method 2, one sCCRF was trained for the whole grid, regardless of any individual customer behaviors. We used LF-W to denote this method. We also add the result of our previous work [44], denoted as LF-LCBo. The MAPE for each hour of the four methods is illustrated in Fig. 4.

[Figure 4: hourly MAPE (%) against the hours of a day for LF-LCB, LF-S, LF-W and LF-LCBo.]
Fig. 4. Performance comparison of LF-S, LF-W, LF-LCBo and LF-LCB

We can see from Fig. 4 that LF-S performs slightly better than LF-W, and LF-LCB outperforms the other two methods.
LF-S used sCCRF to predict the load for each customer, but some behaviors of an individual customer were random and might be impossible to predict. For instance, some household customers may occasionally go out to parties on any weekday. Thus, the weakness of LF-S came from the many accumulated errors resulting from random customer behaviors. LF-W utilized one sCCRF to predict the load for all the customers, but a single sCCRF failed to handle the various customers with different behaviors. As a result, the final load prediction of LF-W was not satisfactory.
In contrast, LF-LCB performed well because it overcame the disadvantages of the above two methods. LF-LCB used sCCRF to first analyze customer behaviors, then grouped customers with similar behaviors into one cluster. In a cluster of customers, the chaotic random behaviors were averaged, resulting in "smooth" power usage data. With a fine-tuned sCCRF for each customer cluster, the final prediction result was more accurate than that of the other two methods. LF-LCB also improved on the previous LF-LCBo, which is further discussed in Subsection 7.5.

7.4.2 Analysis of customer behaviors
Customer aggregation is based on customer behaviors, represented as a vector that reflects how the power usage of a customer is influenced by a series of external factors. Vector λ represents customer behaviors and indicates the power consumption patterns of different types of customers. For instance, the behaviors of a solar energy producer are strongly affected by weather conditions, thus the values of its weights are large on weather features. Customer behaviors are analyzed in the two-layered clustering process. Fig. 5 shows the clustering tree of LF-LCB. In the clustering tree, clusters (C1, ..., C7) are obtained in the first layer. In the second layer, the digit on each ellipse indicates the number of clusters, and 24 clusters are formed in the end.

[Figure 5: clustering tree rooted at "All customers"; second-layer cluster counts: C1 = 1, C2 = 1, C3 = 9, C4 = 2, C5 = 6, C6 = 2, C7 = 3.]
Fig. 5. Clustering tree of LF-LCB. Ci represents the cluster in the first layer. The digits indicate the numbers of clusters in the second layer.

In the first layer of the clustering tree, based on the binarized feature weights, we can determine whether customer behaviors in one cluster are influenced by certain features. Table 5 uses a binary matrix to show the relationships between the clusters and features. In Table 5, "1" indicates that the cluster is influenced by this feature, while "0" means that the feature is not related to this cluster. The clusters in the first layer show clear physical meanings. For instance, in Fig. 5, customers in C1 are wind power producers, and their behaviors are influenced by the related wind features. Customers in C2 are solar energy producers whose power usage is affected by time, temperature and cloudiness. Observing the types of customers in each cluster, we can see that they generally belong to the same customer category. For example, customers in C4 are thermal storage customers, while customers in C5 are householders and office users.

TABLE 5
Cluster and feature relation matrix

     t1  t2  c1  w1  w2  w3  w4  m1  m2
C1   1   0   0   0   1   1   0   0   0
C2   1   0   0   1   0   0   1   0   0
C3   1   1   1   0   0   0   0   0   1
C4   1   0   0   0   0   0   0   1   1
C5   1   1   1   1   0   0   1   0   0
C6   1   1   1   1   0   0   1   1   0
C7   1   0   0   1   0   0   0   1   0
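The two-layer aggregation behind a clustering tree such as Fig. 5 and a relation matrix such as Table 5 can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions, not the authors' code: the function names and the toy weight vectors are hypothetical, the K-means seeding (farthest-point) is an illustrative choice the paper does not specify, and K simply starts at 1 and grows until every point lies within ∆ of its cluster center.

```python
import math
from collections import defaultdict


def binarize(weights):
    """First-layer key: 1 for every non-zero weight, 0 otherwise."""
    return tuple(1 if w != 0 else 0 for w in weights)


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))


def kmeans(points, k, iters=100):
    """Lloyd's K-means with deterministic farthest-point seeding (assumed)."""
    centers = [tuple(points[0])]
    while len(centers) < k:
        centers.append(tuple(max(
            points, key=lambda p: min(math.dist(p, c) for c in centers))))
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centers)].append(p)
        updated = [
            tuple(sum(col) / len(groups[i]) for col in zip(*groups[i]))
            if groups[i] else centers[i]
            for i in range(k)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def second_layer(names, vectors, delta):
    """Increase K until every point lies within delta of its cluster center."""
    for k in range(1, len(vectors) + 1):
        centers = kmeans(vectors, k)
        labels = [nearest(v, centers) for v in vectors]
        radius = max(math.dist(v, centers[l]) for v, l in zip(vectors, labels))
        if radius <= delta:
            break
    clusters = defaultdict(list)
    for name, label in zip(names, labels):
        clusters[label].append(name)
    return sorted(clusters.values())


def two_layer(customers, delta):
    """customers maps a customer name to its learned weight vector."""
    first = defaultdict(list)
    for name, weights in customers.items():
        first[binarize(weights)].append(name)
    tree = {}
    for pattern, names in first.items():
        # Keep only the features that are non-zero in this first-layer cluster.
        vectors = [tuple(w for w, bit in zip(customers[n], pattern) if bit)
                   for n in names]
        tree[pattern] = second_layer(names, vectors, delta)
    return tree


# Hypothetical weights over four features, for illustration only.
customers = {
    "a": [0.50, 0.0, 0.20, 0.0],
    "b": [0.52, 0.0, 0.21, 0.0],
    "c": [0.90, 0.0, 0.80, 0.0],
    "d": [0.0, 0.3, 0.0, 0.0],
}
tree = two_layer(customers, delta=0.05)
```

Each key of `tree` is a binarized pattern, i.e. one row of a relation matrix like Table 5, and each value lists the second-layer clusters under it. Because the second layer only sees the non-zero coordinates of a single first-layer cluster, K-means runs on few points of low dimension.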
In the second layer of the clustering tree, customers are further clustered, resulting in 24 final clusters. Take C5 as an example. C5 is further clustered into 6 clusters based on the non-zero weights. In each customer cluster C5j, customers show similar responses to the influences of outside features. For the 24 customer clusters, 24 corresponding sCCRFs are fine-tuned. In the end, only 24 sCCRFs are maintained for load forecasting for the whole grid. Therefore, our LF-LCB results in a reasonable computational cost.
We also analyzed the stability of customer clustering. In the hierarchical clustering, the first layer is actually a classification by binarized weights, so it is fully stable. Customers in each cluster are further clustered according to the non-zero weights, whose dimensions have been greatly reduced. Moreover, the number of customers in each cluster is not large. Thus, the clustering processes in the second layer have quite stable results.
Finally, we give some examples of how customer behaviors relate to certain features. Three typical customers are shown in Table 6. The energy consumption patterns of the householder are related to all four categories of features. The behaviors of the chosen office user are related to time and calendar features. The behaviors of the solar energy producer are related to time, weather and market features.

TABLE 6
Relations of customer behaviors and features

Customer       Time   Calendar   Weather   Market
Householder    1      1          1         1
Office user    1      1          0         0
Solar energy   1      0          1         1

7.4.3 Visualizations
Load forecasting is a non-linear prediction problem. Fig. 6 visually shows the predicted loads and actual loads on two sample days, the 1st and 2nd of December in 2010.

[Figure 6: power usage (MWh) against hours (0 to 50) for actual usage and predicted usage.]
Fig. 6. The predicted loads and actual loads in two sample days on the 1st and 2nd of December in 2010.

We can see from Fig. 6 that the curve of predicted loads follows the trends of the actual load sequence, which demonstrates that our method captures the fluctuations of actual loads.
Fig. 7 shows the predicted 95%-confidence intervals on November 10th, 2010. The predicted 95%-confidence intervals may assist decision making in uncertain environments.

[Figure 7: power usage (MWh) against hours (1 to 24) for actual usage and predicted usage.]
Fig. 7. The predicted 95%-confidence intervals on the 10th of November in 2010.

7.5 Experiment 4: Comparing with other customer aggregation methods
LF-LCB was compared with four other customer aggregation methods. The first one is customer aggregation by manual rules. This method classifies customers by their known attributes, such as householder, office user and solar power producer, and then sCCRF is introduced to predict the load of each cluster. We call this method LF-M. The second method used sCCRF to analyze customer behaviors, and then directly used K-means to cluster customers. We use LF-K to denote this method. The third method was Alzate and Sinn's work [3], which used kernel spectral clustering to aggregate customers with respect to historical usage data, and then used periodic auto-regression to predict the load of each cluster. This method is referred to as LF-KSC for convenience. The fourth method used a mixture of linear regression combined with a mixture of Gaussian clustering, where customers were clustered according to historical data, and then the cluster load was predicted by a mixture of linear regression. This method is called LF-MLR for short.
For LF-M, only the customer attributes were used to aggregate customers, which was a simple way to divide customers manually. LF-K directly used K-means to aggregate customers, with K chosen through repeated evaluations. For LF-KSC, the RBF kernel with Spearman's distance, which was reported to perform best [3], was used. Validation was used to select the parameters for the kernel. LF-MLR was similar to LF-KSC in aggregating customers by historical usage data, and can be regarded as an improvement on LF-KSC via more sophisticated clustering and prediction methods.
All methods were compared on the cluster number, prediction accuracy and training time. We also add the previous result of LF-LCB in [44], denoted as LF-LCBo. Table 7 shows the compared items.

TABLE 7
Comparisons of customer aggregation methods for load forecasting

Method    Number of clusters   Overall MAPE (%)   Training time (h)
LF-M      8                    6.82               -
LF-K      15                   4.98               23.92
LF-KSC    14                   5.68               3.92
LF-MLR    13                   5.21               14.20
LF-LCBo   26                   4.51               4.33
LF-LCB    24                   4.08               4.06
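The overall MAPE reported in Table 7 follows the definition given in Subsection 7.2: the MAPE is computed for each hour and then averaged over the hours of a day. A minimal sketch of that computation is given below. The function names are hypothetical, and it is assumed that the loads passed in are on their original, non-zero scale rather than the normalized one.

```python
def hourly_mape(actual_days, predicted_days):
    """MAPE (%) for each hour of the day.

    Both arguments are lists of days; each day is a list of hourly loads.
    Actual loads are assumed to be non-zero.
    """
    hours = len(actual_days[0])
    result = []
    for h in range(hours):
        errors = [abs(a[h] - p[h]) / abs(a[h])
                  for a, p in zip(actual_days, predicted_days)]
        result.append(100.0 * sum(errors) / len(errors))
    return result


def overall_mape(actual_days, predicted_days):
    """Average of the hourly MAPE values over all hours of a day."""
    per_hour = hourly_mape(actual_days, predicted_days)
    return sum(per_hour) / len(per_hour)
```

For example, with two days of two hourly readings each, `overall_mape([[100, 200], [100, 200]], [[110, 180], [90, 200]])` averages an hourly MAPE of 10% and one of 5%.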
LF-M was regarded as a baseline customer aggregation method, where customers were aggregated manually. In LF-M, only 8 customer clusters were identified. It was hard to define the training time of LF-M, in which customers were clustered manually. In contrast, LF-LCB introduced a learning method to identify customer behaviors, and 24 clusters were obtained adaptively. We can see from Table 7 that LF-LCB significantly outperforms LF-M, which verifies the effectiveness of learning customer behaviors.
LF-K shows a good prediction result, but it has a large time cost, which can be attributed to running K-means directly on a large number of samples with large dimensions. In contrast, LF-LCB used two-layered clustering. The first layer was an efficient division of customers by binary features. For the second layer, the number of samples and dimensions had been greatly reduced. Thus, K-means can complete in a short time.
LF-KSC has fewer customer clusters, while LF-LCB gets better accuracy in the end. LF-KSC aggregates customers only on the patterns of customers' historical usage data, while LF-LCB considers the customer behaviors under the influences of different external factors. The comparison between LF-KSC and LF-LCB shows that the learned customer behaviors are more effective than the historical usage pattern alone for aggregating customers. In the experiments, LF-KSC shows a small advantage in training time. In fact, kernel spectral clustering is quite an expensive algorithm in time cost, while the time cost of LF-LCB can be further reduced by simply processing more customer models in parallel.
LF-MLR improves on LF-KSC in terms of accuracy, but LF-MLR consumes much more training time due to the mixture of Gaussian clustering. Comparing LF-MLR with LF-LCB, LF-LCB shows advantages in both prediction accuracy and training time. Similar to LF-KSC, LF-MLR clusters customers by historical usage data. The advantage of LF-LCB over LF-MLR again verifies the effectiveness of learning customer behaviors.
Compared with the previous LF-LCBo [44], our current work reduces the overall MAPE by 0.43 percentage points (from 4.51% to 4.08%). For load forecasting, a small improvement in accuracy can lead to benefits worth millions of dollars. Given that the overall MAPEs are already quite small, this is a considerable relative improvement of about 9.5%. The improvement is attributed to: 1) m neighboring variables are considered in sCCRF modeling; 2) more features are used in load forecasting; 3) an improvement in fine-tuning sCCRF.

8 DISCUSSION
This research proposed learning customer behaviors to aggregate customers (LF-LCB) to surmount the challenges of complex customer behaviors, so as to facilitate load forecasting in Smart Grid. Based on the experimental results, we further discuss the key issues in learning customer behaviors, the benefits gained from learning customer behaviors and our insights into extending learning customer behaviors to broad market domains.

8.1 Two key issues in learning customer behaviors
each customer cluster for the optimal solution. In order to achieve a good performance, there are two key issues to be addressed. The first issue is to find robust customer behavior patterns. Only with robust behavior patterns can we cluster the customers accurately. An appropriate learning method can discover the customer behavior patterns much more robustly than statistical criteria or manual classification. The second issue is to find an appropriate granularity of the customer clusters. The clustering granularity affects the final prediction accuracy and computational cost. A proper clustering granularity should strike a compromise between the separation of different customers and the smoothness of customer behaviors in the same cluster.
An efficient learning of customer behaviors should effectively resolve the above two issues. To discover robust customer behavior patterns, we proposed sCCRF in this paper. Experiments 2 and 3 verified that sCCRF could identify key features for different customers and obtain good accuracy in load forecasting. Thus, sCCRF can be an efficient learning method to model the load forecasting problem and to analyze customer behaviors. To find a proper clustering granularity, we used validation in the experiments. Experiment 1 showed how to choose the proper granularity of the clusters.
It is worth noting the coherence of the sparse learning model and the hierarchical clustering process. Thanks to sCCRF, two measures for customer aggregation were obtained. As a consequence, a two-layer clustering could effectively aggregate customers, free from the "curse of dimensionality". In Experiment 3, we found that the hierarchical clustering resulted in stable and reliable customer clusters.

8.2 The benefits of learning customer behaviors
The efficient learning of customer behaviors brings benefits for load forecasting in Smart Grid. 1) LF-LCB is superior to learning for the whole grid. In Experiment 3, LF-LCB outperformed LF-W. In a complex Smart Grid, LF-W could not perform well because it neglected customer behaviors. In contrast, LF-LCB introduced extra computations to handle the complex customer behaviors and resulted in high prediction accuracy. 2) LF-LCB performs better than load prediction for each single customer. LF-LCB also showed a better result than LF-S in Experiment 3. In LF-S, as the random behaviors of a single customer were hard to predict, learning from each customer did not achieve the best accuracy. In comparison, LF-LCB "smoothed" the random behaviors to some extent, and thus achieved a better prediction accuracy. LF-LCB also had a much lower computational cost than LF-S in load forecasting. In summary, Experiment 3 demonstrated that LF-LCB had the advantages of improving load forecasting accuracy and achieving a low computational cost.
In comparison with LF-M in Experiment 4, it was verified that learning customer behaviors is much more effective than manual classification of customers. In comparison with LF-KSC in Experiment 4, LF-LCB showed better perfor-
The essence of learning customer behaviors is to cluster mance, which demonstrated that clustering customers with
similar customers into the same group based on discov- learned patterns is more effective than clustering customers
ered customer behavior patterns, and consequently target using historical load data only.
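
To illustrate the two-layer aggregation discussed in this section, the following Python sketch groups customers first by which features carry non-zero learned weights and then by how close those weights are, and finally sums per-cluster forecasts into a grid-level forecast. It is a minimal sketch under stated assumptions, not the procedure used in the experiments: the choice of the two measures (sparsity pattern and weight values) is an assumption, a greedy leader clustering stands in for a proper hierarchical clustering, and the names support_pattern, two_layer_clusters, radius, and grid_forecast are hypothetical.

```python
from collections import defaultdict
from math import dist


def support_pattern(weights, tol=1e-8):
    """Mark which features kept a non-zero weight after sparse learning."""
    return tuple(abs(w) > tol for w in weights)


def two_layer_clusters(customer_weights, radius=0.5):
    """Cluster customers by sparsity pattern, then by closeness of weights.

    customer_weights: dict mapping customer id -> list of learned weights.
    radius: hypothetical distance threshold used in the second layer.
    Returns a list of clusters, each a list of customer ids.
    """
    # Layer 1 (assumed measure): customers that selected the same
    # features fall into the same coarse group.
    by_pattern = defaultdict(list)
    for cid, w in customer_weights.items():
        by_pattern[support_pattern(w)].append(cid)

    clusters = []
    for members in by_pattern.values():
        # Layer 2 (assumed measure): greedy leader clustering on the
        # weight vectors, a simple stand-in for hierarchical clustering.
        leaders = []  # pairs of (leader weight vector, member ids)
        for cid in members:
            w = customer_weights[cid]
            for leader_w, ids in leaders:
                if dist(leader_w, w) <= radius:
                    ids.append(cid)
                    break
            else:
                leaders.append((w, [cid]))
        clusters.extend(ids for _, ids in leaders)
    return clusters


def grid_forecast(cluster_forecasts):
    """Sum per-cluster load forecasts time step by time step."""
    return [sum(step) for step in zip(*cluster_forecasts)]


# Toy weights for four customers over three features (made-up values).
weights = {
    "a": [0.9, 0.0, 0.3],
    "b": [1.0, 0.0, 0.2],
    "c": [0.0, 0.5, 0.0],
    "d": [3.0, 0.0, 2.0],
}
clusters = two_layer_clusters(weights, radius=0.5)
print(clusters)
```

In this toy input, customers a, b, and d select the same features, but d responds far more strongly, so the second layer separates it. Because the first layer only compares sparsity patterns, distances are computed only within small groups, which is one way to read the remark about avoiding the “curse of dimensionality”.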
8.3 The extensions of learning customer behaviors

Learning customer behaviors to aggregate customers can also be applied to other market domains. With learned customer behaviors, similar customers can be aggregated. Consequently, decision making, such as for retail strategy or demand prediction, can target the customer clusters. In this scenario, the decisions tailored for a customer cluster will be better than the decisions aimed at the whole complex market. With proper clustering granularity, there would be a limited number of clusters, so the overall computational cost would be manageable.

We also placed an emphasis on the two key issues in general learning of customer behaviors. The first key issue is to choose an appropriate learning method to model the targeted problem to discover customer behavior patterns. The machine learning methods could be LASSO, LARS [14], L1-SVM [25], or any method that can perform feature selection. To model a temporal sequence problem, candidate learning methods could be sCCRF or Recurrent Neural Networks (RNN) [18]. The second key issue is to find a proper clustering granularity to cluster different customers. For different problems, the proper cluster granularity could be found through validations by experiments, as illustrated in Experiment 1. After the customers are clustered, each cluster can be treated independently. If better sales strategies are required, decisions can be tailored to target individual customer clusters. If the total customer demand is to be predicted, fine-tuning can be applied to each customer cluster, and then the total market demand can be obtained by a summation of the demands of each cluster. In general, learning customer behaviors can be an efficient solution to decision making in large-scale complex markets.

9 CONCLUSIONS

This paper proposed a load forecasting method through learning customer behaviors (LF-LCB), which utilized the proposed sCCRF to analyze customer behaviors by using the learned weights to reflect different energy consumption patterns of various customers. The results of experiments conducted from several perspectives supported the following two conclusions: 1) Learning customer behaviors to aggregate customers can improve the prediction precision and lead to a reasonable computation cost. 2) The proposed sCCRF is an efficient learning tool with feature selection capacity.

Our work can potentially facilitate research in related domains. Learning customer behaviors to aggregate customers in fact can supply a general methodology to assist better decision making towards various customers in a complex market environment. This is worth further exploration in other market domains. Evaluation results also indicate that the proposed sCCRF is effective in feature selection and prediction. Thus, sCCRF can also be applied in other related research fields.

ACKNOWLEDGMENTS

The authors would like to thank the organizers of Power TAC, which supplies a platform for the simulation of real-world Smart Grid market. Our experiments are conducted based on that platform. This work is supported by a Discovery Project (DP140100974) from the Australian Research Council.

REFERENCES

[1] A. Ali, S. Ghaderi, and S. Sohrabkhani. Forecasting electrical consumption by integration of neural network, time series and anova. Applied Mathematics and Computation, 186(2):1753–1761, 2007.
[2] C. Alzate, M. Espinoza, M. De, and J. Suykens. Identifying customer profiles in power load time series using spectral clustering. In Proceedings of International Conference on Artificial Neural Networks, pages 315–324. Springer, 2009.
[3] C. Alzate and M. Sinn. Improved electricity load forecasting via kernel spectral clustering of smart meters. In Proceedings of IEEE 13th International Conference on Data Mining, pages 943–948. IEEE, 2013.
[4] M. Amin-Naseri and A. Soroush. Combined use of unsupervised and supervised learning for daily peak load forecasting. Energy Conversion and Management, 49(6):1302–1308, 2008.
[5] G. Andrew and J. Gao. Scalable training of l1-regularized log-linear models. In Proceedings of the 24th international conference on Machine Learning, pages 33–40. ACM, 2007.
[6] T. Baltrusaitis, N. Banda, and P. Robinson. Dimensional affect recognition using continuous conditional random fields. In Proceedings of IEEE conference on Automatic Face and Gesture Recognition, pages 1–8. IEEE, 2013.
[7] R. Bhinge, N. Biswas, D. Dornfeld, J. Park, K. Law, M. Helu, and S. Rachuri. An intelligent machine monitoring system for energy prediction using a gaussian process regression. In Big Data (Big Data), 2014 IEEE International Conference on, pages 978–986. IEEE, 2014.
[8] M. Blum and O. François. Non-linear regression models for approximate bayesian computation. Statistics and Computing, 20(1):63–73, 2010.
[9] G. Box, G. Jenkins, G. Reinsel, and G. Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[10] W. Cabrera, D. Benhaddou, and C. Ordonez. Solar power prediction for smart community microgrid. SMARTCOMP, pages 1–6, 2016.
[11] W. Charytoniuk, M-S Chen, and P. Van. Nonparametric regression based short-term load forecasting. IEEE Transactions on Power Systems, 13(3):725–730, 1998.
[12] X. Dong, L. Qian, and L. Huang. Short-term load forecasting in smart grid: A combined cnn and k-means clustering approach. In Proceedings of IEEE International Conference on Big Data and Smart Computing, pages 119–125. IEEE, 2017.
[13] H. Drucker, C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. In Proceedings of Advances in neural information processing systems, pages 155–161, 1997.
[14] B. Efron, T. Hastie, L. Johnstone, and R. Tibshirani. Least angle regression. The Annals of statistics, 32(2):407–499, 2004.
[15] S. Fan and R. Hyndman. Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems, 27(1):134–141, 2012.
[16] J. Fiot and F. Dinuzzo. Electricity demand forecasting by multi-task learning. IEEE Transactions on Smart Grid, 2016.
[17] B. Frénay and M. Verleysen. Parameter-insensitive kernel in extreme learning for non-linear support vector regression. Neurocomputing, 74(16):2526–2531, 2011.
[18] K. Funahashi and Y. Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural networks, 6(6):801–806, 1993.
[19] R. Gao and L. Tsoukalas. Neural-wavelet methodology for load forecasting. Journal of Intelligent and Robotic Systems, 31(1-3):149–157, 2001.
[20] J. Glass, M. Ghalwash, M. Vukicevic, and Z. Obradovic. Extending the modelling capacity of gaussian conditional random fields while learning faster. In Proceedings of 30th Association for the Advancement of Artificial Intelligence, pages 1596–1602, 2016.
[21] R. Gulbinas, A. Khosrowpour, and J. Taylor. Segmentation and classification of commercial building occupants by energy-use efficiency and predictability. IEEE Transactions on Smart Grid, 6(3):1414–1424, 2015.
[22] H. Guo. Accelerated continuous conditional random fields for load forecasting. IEEE Transactions on Knowledge and Data Engineering, 27(8):2023–2033, 2015.
[23] H. Guo. Accelerated continuous conditional random fields for load forecasting. In Proceedings of IEEE 32nd International Conference on Data Engineering, pages 1492–1493, 2016.
[24] L. Hernandez, C. Baladron, J. Aguiar, R. Carro, A. Sanchez-Esguevillas, J. Lloret, and J. Massana. A survey on electric power demand forecasting: Future trends in smart grids, microgrids and smart buildings. IEEE Communications Surveys and Tutorials, 16(3):1460–1495, 2014.
[25] C. Hsieh, K. Chang, C. Lin, S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th international conference on Machine Learning, pages 408–415. ACM, 2008.
[26] W. Ketter, J. Collins, P. Reddy, and M. Weerdt. The 2014 power trading agent competition. ERIM Report Series Reference No. ERS-2014-004-LIS, 2014.
[27] J. Lafferty. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001), pages 282–289, 2001.
[28] T. Lavergne, O. Cappé, and F. Yvon. Practical very large scale crfs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 504–513. Association for Computational Linguistics, 2010.
[29] B. Liefers, J. Hoogland, and P. La. A successful broker agent for power tac. In Agent-Mediated Electronic Commerce. Designing Trading Strategies and Mechanisms for Electronic Markets, pages 99–113. Springer, 2014.
[30] B. Liu, J. Nowotarski, T. Hong, and R. Weron. Probabilistic load forecasting via quantile regression averaging on sister forecasts. IEEE Transactions on Smart Grid, 8(2):730–737, 2017.
[31] D. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
[32] A. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, pages 78–85. ACM, 2004.
[33] P. Palensky and D. Dietrich. Demand side management: Demand response, intelligent energy systems, and smart loads. IEEE Transactions on Industrial Informatics, 7(3):381–388, 2011.
[34] T. Qin, T. Liu, X. Zhang, D. Wang, and H. Li. Global ranking using continuous conditional random fields. In Proceedings of Advances in neural information processing systems, pages 1281–1288, 2009.
[35] V. Radosavljevic, S. Vucetic, and Z. Obradovic. Continuous conditional random fields for regression in remote sensing. In Proceedings of European conference on Artificial Intelligence, pages 809–814, 2010.
[36] M. Schmidt. minfunc: unconstrained differentiable multivariate optimization in matlab. http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html, 2005.
[37] G. Soffritti and G. Galimberti. Multivariate linear regression with non-normal errors: a solution based on mixture models. Statistics and Computing, 21(4):523–536, 2011.
[38] D. Srinivasan. Energy demand prediction using gmdh networks. Neurocomputing, 72(1):625–629, 2008.
[39] L. Suganthi and A. Samuel. Energy models for demand forecasting – a review. Renewable and Sustainable Energy Reviews, 16(2):1223–1240, 2012.
[40] J. Taylor and R. Buizza. Using weather ensemble predictions in electricity demand forecasting. International Journal of Forecasting, 19(1):57–70, 2003.
[41] D. Urieli and P. Stone. Tactex’13: a champion adaptive power trading agent. In Proceedings of international conference on Autonomous Agents and Multi-agent Systems, pages 1447–1448, 2014.
[42] D. Vail, M. M. Veloso, and J. Lafferty. Conditional random fields for activity recognition. In Proceedings of international conference on Autonomous Agents and Multiagent Systems, pages 235–242. ACM, 2007.
[43] X. Wang, F. Ren, C. Liu, and M. Zhang. l1-regularized continuous conditional random fields. In Pacific Rim International Conference on Artificial Intelligence, pages 793–804. Springer, 2016.
[44] X. Wang, M. Zhang, and F. Ren. Load forecasting in a smart grid through customer behaviour learning using l1-regularized continuous conditional random fields. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, pages 817–826. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
[45] X. Wang, M. Zhang, F. Ren, and T. Ito. Gongbroker: A broker model for power trading in smart grid markets. In Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), volume 2, pages 21–24. IEEE, 2015.
[46] M. Wytock and J. Z. Kolter. Sparse gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In Proceedings of international conference on Machine Learning, pages 1265–1273, 2013.
[47] X. Xin, I. King, H. Deng, and M. Lyu. A social recommendation framework based on multi-scale continuous conditional random fields. In Proceedings of the 18th ACM conference on Information and Knowledge Management, pages 1247–1256. ACM, 2009.
[48] J. Yu, S. Vishwanathan, S. Günter, and N. Schraudolph. A quasi-newton approach to nonsmooth convex optimization problems in machine learning. The Journal of Machine Learning Research, 11:1145–1200, 2010.
[49] J. Zheng, C. Xu, Z. Zhang, and X. Li. Electric load forecasting in smart grids using long-short-term-memory based recurrent neural network. In Proceedings of 51st Annual Conference on Information Sciences and Systems, pages 1–6. IEEE, 2017.

Xishun Wang received the B.Sc. degree from Xidian University, Xi’an, China, and the M.Sc. degree from the Chinese Academy of Sciences, Beijing, China. He is currently pursuing the Ph.D. degree in computer science at the University of Wollongong, Wollongong, NSW, Australia. His current research includes machine learning, data mining, and applying machine learning to Smart Grid.

Minjie Zhang received her B.Sc. degree in Computer Science from Fudan University, China, in 1982 and her Ph.D. degree in Computer Science from the University of New England, Australia, in 1996. She is currently a Professor of Computer Science and the Director of the Center for Big Data Analytics and Intelligent Systems at the University of Wollongong, Australia. She is an active researcher and has published over 200 research papers. Her research interests include multi-agent systems, agent-based modelling and simulation in complex domains, smart grid systems, and data mining and knowledge discovery in open environments.

Fenghui Ren received the B.Sc. degree from Xidian University, Xi’an, China, in 2003, and the M.Sc. and Ph.D. degrees from the University of Wollongong, Wollongong, NSW, Australia, in 2006 and 2010, respectively. He is currently an Australian Research Council Discovery Early Career Researcher Award Fellow and a Lecturer with the School of Computer Science and Software Engineering, University of Wollongong. He is an active researcher and has published over 50 research papers. His current research interests include agent-based concept modeling of complex systems, data mining and pattern discovery in complex domains, agent-based learning, and smart grid systems.
