IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 58, NO. 5, MAY 2011
Multiclass Filters by a Weighted Pairwise Criterion for EEG Single-Trial Classification
Haixian Wang, Member, IEEE
Abstract—The filtering technique for dimensionality reduction of multichannel electroencephalogram (EEG) recordings, modeled using common spatial patterns (CSP) and its variants, is commonly used in two-class brain–computer interfaces (BCI). For a multiclass problem, the optimization of certain separability criteria in the output space is not directly related to the classification error of EEG single-trial segments. In this paper, we derive a new discriminant criterion, termed the weighted pairwise criterion (WPC), for optimizing multiclass filters by minimizing the upper bound of the Bayesian error that is intentionally formulated for classifying EEG single-trial segments. The WPC approach pays more attention to close class pairs, which are more likely to be misclassified, than to far-away class pairs, which are already well separated. Moreover, we extend WPC by integrating the temporal information of EEG series. Computationally, we employ the rank-one update and power iteration technique to optimize the proposed discriminant criterion. Experiments on multiclass classification using the datasets of BCI competitions demonstrate the efficacy of the proposed method.
Index Terms—Bayesian classification error, brain–computer interfaces (BCI), common spatial patterns (CSP), multiclass filters, weighted pairwise criterion (WPC).
I. INTRODUCTION
ACCURATE classification of electroencephalogram (EEG) signals is the core problem in the community of brain–computer interfaces (BCI) [22]. A large number of modern signal processing and machine learning techniques have been used and developed [1], [8], [15]. One powerful and widely used method for processing multichannel EEG series is the filtering technique, represented by the common spatial patterns (CSP) [4]. The CSP approach, designed for the two-class problem [12], [16], [18], [19], seeks a few filters such that the ratio of the filtered variances between the two populations is maximized (or minimized). Because CSP makes use of only spatial information, spatiotemporal versions were also developed, for example, the common spatio-spectral patterns [11] and the local temporal CSP [20]. Many variants of CSP are reviewed in [4].
Manuscript received June 20, 2010; revised October 14, 2010 and December 30, 2010; accepted January 3, 2011. Date of publication January 13, 2011; date of current version April 20, 2011. This work was supported in part by the National Natural Science Foundation of China under Grants 61075009 and 60803059, in part by the Qing Lan Project, and in part by the Fund for the Program of Excellent Young Teachers at Southeast University. The author is with the Key Laboratory of Child Development and Learning Science of the Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing 210096, China (e-mail: hxwang@seu.edu.cn). Digital Object Identifier 10.1109/TBME.2011.2105869
The CSP approach was originally suggested for the two-class paradigm. Multiclass extensions have been investigated in the literature. One trivial extension is to divide the multiclass problem into many two-class situations and apply CSP repeatedly [4], [7]. Another conventional extension is the joint approximate diagonalization (JAD) of M covariance matrices, where M is the number of classes [7]. This is based on the observation that CSP simultaneously diagonalizes two covariance matrices. The JAD was further investigated from the perspectives of mutual information and brain source separation [9], [10]. The JAD approach is actually a decomposition technique rather than a classification method. Recently, Zheng and Lin [23] presented a multiclass extension via Bayesian classification error estimation. Their discriminant criterion is derived by minimizing the upper bound of the Bayesian error of classifying ω^T x, where x is an EEG signal recorded at a specific time point and ω is a filter. While this is a reasonable criterion to optimize spatial filters, a more direct approach uses the same features for optimizing the spatial filters as for classification. Denoting a multivariate time series of bandpass filtered single-trial EEG by X, these features are, in CSP, ω^T X X^T ω (cf. [4, eq. (2)], [16, eq. (2)], and [23, eqs. (29)–(31)]). The quantity ω^T X X^T ω is the variance of the bandpass filtered EEG signals, which equals the band power. So, the band power ω^T X X^T ω with an appropriate spatial filter ω in fact corresponds to the effect of event-related desynchronization/synchronization (ERD/ERS), which is an effective neurophysiological feature for the classification of brain activities [4]. Specifically, the so-called idle rhythms, reflected around 10 Hz, can be observed over motor and sensorimotor areas in most persons. These idle rhythms are attenuated during motor activity. This physiological phenomenon is termed the ERD effect because of the loss of synchrony in the neural population; by contrast, the rebound of the rhythmic activity is termed ERS. In this paper, we directly target the classification of ω^T X X^T ω, for which the upper bound of the Bayesian classification error is estimated. Accordingly, by minimizing this upper bound, we develop a new discriminant criterion that is directly related to the classification of EEG single-trial segments. The proposed criterion takes the form of a sum over weighted pairwise classes and is referred to as the weighted pairwise criterion (WPC). The WPC approach puts heavier weights onto close class pairs, which are more likely to be misclassified, and deemphasizes the influence of far-away class pairs, where the classes are already well separated. The weighting strategy helps make the criterion well suited to producing separability in the output space, which has been witnessed in pattern classification problems [13], [14]. Computationally,
the proposed WPC method is conveniently implemented by the rank-one update and power iteration technique. Moreover, we compare WPC with [23]. The classification targets of the two methods are different, and the discriminant criteria, which are obtained by minimizing the upper bounds of the respective Bayesian classification errors, are thus different. We also extend WPC by integrating the temporal information of EEG series into the covariance matrix formulation. The efficacy of the proposed approach is demonstrated on the classification of four motor imagery tasks on two datasets of BCI competitions.

The remainder of this paper is organized as follows. In Section II, CSP and its multiclass extension via Bayesian classification error estimation based on EEG sampling points are briefly reviewed. In Section III, we derive the upper bound of the Bayesian error that is intentionally formulated for classifying EEG single-trial segments, propose the WPC to minimize this upper bound, and give the optimization procedure using the rank-one update and power iteration technique. The comparison with [23] and the extension integrating temporal information are presented in Section IV. The experimental results are presented in Section V. Finally, Section VI concludes the paper.
II. BRIEF REVIEW OF CSP AND ITS MULTICLASS SITUATION
Let x ∈ R^K be an EEG signal at a specific time point with K electrodes. We view x, recorded while performing a certain mental task, as a K-dimensional random variable generated from a Gaussian distribution. Suppose that C = {c_1, ..., c_M} is the set of mental conditions to be investigated. We consider the multiclass (M > 2) classification problem that assigns EEG single-trial segments to the M predefined brain states. Given class c_l (l ∈ {1, 2, ..., M}), the random variable x is assumed to be Gaussian distributed according to x|c_l ∼ N(0, Σ_l), where Σ_l is the covariance matrix. The Gaussian assumption does not sacrifice generality when studying linear filters and statistics up to second order [10]. For the purpose of classification, we wish to learn G (G < K) filters (linear transformation vectors) ω_g ∈ R^K from the finite training data such that the filtered features are more discriminative for predicting class labels than the raw EEG data. Hereafter, the terms conditions and class labels are used interchangeably.
A. CSP: Two-Class Paradigm
The CSP method provides a powerful way of extracting EEG features related to the modulation of ERD/ERS. The CSP algorithm applies to the two-class situation only. It solves for filters such that the projected EEG series have the maximum ratio of variances between the two classes. Maximizing the variance ratio actually characterizes the ERD/ERS effects. Let X = [x_1, ..., x_N] ∈ R^{K×N} be a segment of EEG series during one trial, where x_i is the multichannel EEG signal at a specific time point i, and N denotes the number of sampled time points. The CSP approach solves spatial filters by simultaneously diagonalizing the estimated covariance matrices under the two conditions. The covariance matrices of the two classes are estimated as

Σ̂_l = (1/(N|I_l|)) Σ_{t∈I_l} X_t X_t^T    (1)
where I_l (l ∈ {1, 2}) denotes the set of indices of trials belonging to class c_l, and |I_l| is the cardinality of the set I_l. The spatial filters of CSP can alternatively be formulated as an optimization problem [4], [18]
ω̂ = arg {max, min}_{ω∈R^K} (ω^T Σ̂_1 ω) / (ω^T Σ̂_2 ω)    (2)
where the notation {max, min} means that maximizing and minimizing the Rayleigh quotient are of equal interest. The spatial filters thus are obtained by solving the generalized eigenvalue equation

Σ̂_1 ω̂ = λ Σ̂_2 ω̂.    (3)

The eigenvalue λ measures the ratio of variances between the two classes. For the purpose of classification, the filters are specified by choosing several generalized eigenvectors associated with eigenvalues from both ends of the eigenvalue spectrum. The variances of the spatially filtered EEG data are discriminative features, which are input into a classifier.
B. Multiclass Situation by Bayesian Error Estimation
The CSP is suitable for two-class classification only. Zheng and Lin [23] addressed the multiclass paradigm via Bayesian classification error estimation. By the assumption that the distribution of the EEG sampling point x conditioned on class c_l is Gaussian, i.e., p_l = N(x; 0, Σ_l), the filtered EEG signal y = ω^T x is also Gaussian distributed according to N(y; 0, ω^T Σ_l ω). Based on the Gaussian distribution, Zheng and Lin [23] obtained the upper bound of the Bayesian error of classifying ω^T x, given by

ε(ω) ≤ qM(M − 1)/2 − (√3 q/2) Σ_{l=1}^{M} |ω^T(Σ_l − Σ̄)ω| / (ω^T Σ̄ ω)    (4)

where q is the common a priori probability of the M classes, and Σ̄ = Σ_{m=1}^{M} q Σ_m. Minimizing the upper bound of the Bayesian classification error is equivalent to maximizing the discriminant criterion

J(ω) = Σ_{l=1}^{M} |ω^T(Σ_l − Σ̄)ω| / (ω^T Σ̄ ω).    (5)

So, the G filters are defined as [23]

ω_1 = arg max_ω J(ω)    (6)
⋮
ω_G = arg max_{ω^T Σ̄ ω_g = 0, g=1,...,G−1} J(ω).    (7)

Using suitable estimates of Σ_l and Σ̄, the defined filters can be determined one by one via the rank-one update and power iteration procedure.
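For concreteness, the criterion of [23] with equal priors q = 1/M can be evaluated as follows (a sketch; `zheng_lin_criterion` is a hypothetical name, and the absolute-value form is the one consistent with the pairwise bounds used later in the comparison):

```python
import numpy as np

def zheng_lin_criterion(w, covs):
    """J(w) of (5) with equal priors q = 1/M:
    sum_l |w^T (Sigma_l - Sigma_bar) w| / (w^T Sigma_bar w)."""
    covs = np.asarray(covs)            # shape (M, K, K)
    bar = covs.mean(axis=0)            # Sigma_bar = sum_m q Sigma_m, q = 1/M
    return sum(abs(w @ (S - bar) @ w) for S in covs) / (w @ bar @ w)
```

Maximizing this quotient over w (subject to the orthogonality constraints of (6)–(7)) yields the filters of [23].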
III. WPC
For the specific classification of a multiclass EEG single-trial segment X, a more direct approach is to optimize the feature ω^T X X^T ω, rather than ω^T x, since the former is what is actually used for classification. Ideally, one would obtain the optimal criterion based on the Bayesian classification error of classifying ω^T X X^T ω. The Bayesian classification error is, in general, too complex to be calculated directly. Therefore, an upper bound of the Bayesian classification error, which at the same time should be easy to optimize in practice, is usually estimated as a suboptimal criterion. In this section, we develop a new discriminant criterion based on the upper bound of the Bayesian classification error of EEG single trials. Note that we take ω^T X X^T ω, rather than ω^T x, as the target in deriving the upper bound of the multiclass Bayesian classification error.
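The derivation in the following subsection rests on approximating ω^T X X^T ω, a scaled χ² variable, by a Gaussian with mean N ω^T Σ_l ω and variance 2N(ω^T Σ_l ω)². This approximation can be checked numerically; here is a Monte Carlo sketch under the stated Gaussian assumption (all names and numbers are illustrative):

```python
import numpy as np

# Monte Carlo check: given class l, w^T X X^T w should have mean
# N * (w^T Sigma_l w) and variance 2N * (w^T Sigma_l w)^2 for large N.
rng = np.random.default_rng(0)
K, N_pts, n_trials = 4, 100, 5000
Sigma = np.diag([1.0, 0.5, 0.25, 0.1])       # assumed class covariance
w = np.array([1.0, 1.0, 0.0, 0.0])           # an arbitrary filter
s = w @ Sigma @ w                            # w^T Sigma w

X = rng.multivariate_normal(np.zeros(K), Sigma, size=(n_trials, N_pts))
y = np.einsum('tnk,k->tn', X, w)             # projected samples w^T x_i
stat = (y ** 2).sum(axis=1)                  # one w^T X X^T w per trial

mean_err = stat.mean() - N_pts * s           # should be near 0
var_ratio = stat.var() / (2 * N_pts * s**2)  # should be near 1
```

With N = 100 sampling points the empirical mean and variance of the statistic agree closely with the Gaussian approximation used in the derivation.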
A. Upper Bound of Multiclass Bayesian Error
Recalling X = [x_1, ..., x_N], we have ω^T X X^T ω = Σ_{i=1}^{N} (ω^T x_i)², where N is the number of sample points in one trial. Under the assumption of an independent Gaussian distribution N(0, Σ_l) of x_i conditioned on class c_l, the quantity (ω^T X X^T ω)/(ω^T Σ_l ω) follows a χ² distribution with N degrees of freedom. Usually, the number of sampling points N is very large, say N > 30. By the central limit theorem, ω^T X X^T ω conditioned on class c_l approaches the Gaussian distribution with mean N ω^T Σ_l ω and variance 2N(ω^T Σ_l ω)². For the time being, we assume that 2N(ω^T Σ_l ω)² is less than one; the general case will be addressed later on. Denote by ε_lm(ω) the Bayesian error between classes c_l and c_m, i.e.,

ε_lm(ω) = q_l P(f_ω(X) ≠ c_l | c_l) + q_m P(f_ω(X) ≠ c_m | c_m)    (8)
where P(f_ω(X) ≠ c_l | c_l) is the probability that samples belonging to class c_l are misclassified, f_ω(·) denotes the Bayesian classifier, and q_l and q_m are the a priori probabilities of classes c_l and c_m, respectively. Since the data ω^T X X^T ω conditioned on each class are (approximately) Gaussian distributed with variance less than one and mean being, for example, N ω^T Σ_l ω if conditioned on class c_l, it follows that

q_l P(f_ω(X) ≠ c_l | c_l) + q_m P(f_ω(X) ≠ c_m | c_m) ≤ ∫_{D_m} q_l p_l(x) dx + ∫_{D_l} q_m p_m(x) dx    (9)

where p_l(x) and p_m(x) are the probability density functions of the Gaussian distributions N(N ω^T Σ_l ω, 1) and N(N ω^T Σ_m ω, 1), respectively, and D_m and D_l are defined as

D_m = {x : q_m p_m(x) ≥ q_l p_l(x)}    (10)
D_l = {x : q_l p_l(x) > q_m p_m(x)}.    (11)

Suppose q_l = q_m = q. Then, we have [21]

∫_{D_m} p_l(x) dx + ∫_{D_l} p_m(x) dx = 1 − erf( N|ω^T(Σ_l − Σ_m)ω| / (2√2) )    (12)

where the error function (erf) is defined as erf(x) = (2/√π) ∫_0^x e^{−u²} du. By (8), (9), and (12), the Bayesian error between classes c_l and c_m in the 1-D feature space after being projected onto ω is expressed as

ε_lm(ω)/q ≤ 1 − erf( N|ω^T(Σ_l − Σ_m)ω| / (2√2) ).    (13)

It is still complex to optimize ω via (13), since ω is embedded in the error function. We would like to isolate ω from the error function, and thus present the following inequality: for 0 ≤ x ≤ a, we have

erf(x) ≥ (1/a) erf(a) x.    (14)

The equality holds when taking x = 0 or x = a. The proof is given in Appendix A. Let

δ_lm(ω) = |ω^T Σ_l ω − ω^T Σ_m ω|    (15)

be the absolute distance between classes c_l and c_m in the reduced 1-D feature space. By (14), we have

erf( N δ_lm(ω) / (2√2) ) ≥ (1/λ_lm) erf( N λ_lm / (2√2) ) δ_lm(ω)    (16)

where λ_lm is the maximum value of δ_lm(ω). Note that we have required, at the beginning of this section, that the magnitude of ω be subject to the constraint 2N(ω^T Σ_l ω)² ≤ 1. The left and right expressions of (16) are not equal for all directions of ω; they are equal only at the maximum or the minimum value. Combining (13) and (16), we have

ε_lm(ω)/q ≤ 1 − (1/λ_lm) erf( N λ_lm / (2√2) ) δ_lm(ω).    (17)

For the M-class problem, the upper bound of the Bayesian error is calculated as [5]

ε(ω) ≤ Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} ε_lm(ω) ≤ Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} q [ 1 − (1/λ_lm) erf( N λ_lm / (2√2) ) δ_lm(ω) ].    (18)

B. Discriminant Criterion Based on Upper Bound of Multiclass Bayesian Error

To minimize the Bayesian error, we should minimize its upper bound, which reduces to maximizing the following discriminant criterion:

J_P(ω) = Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} (1/λ_lm) erf( N λ_lm / (2√2) ) δ_lm(ω).    (19)

Let

α_lm = (1/λ_lm) erf( N λ_lm / (2√2) ).    (20)

Then, J_P(ω) can be rewritten as

J_P(ω) = Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} α_lm δ_lm(ω).    (21)
Fig. 1. Weighting function α(u) = (1/u) erf(u).
The α_lm can be viewed as a weight imposed on the pairwise classes c_l and c_m. Since N ω^T Σ_l ω and N ω^T Σ_m ω are the distribution means of classes c_l and c_m in the reduced 1-D feature space, respectively, the quantity N λ_lm in (20) is the maximum distance (with respect to ω) between the two class means. It reflects the separability of the two classes. Note that α(u) = (1/u) erf(u) is a monotonically decreasing function of u, as shown in Fig. 1. So, the pairwise class weight α_lm in (20) is monotonically decreasing with respect to N λ_lm. That is, in (21), we impose heavier weights onto close class pairs, which are more likely to be misclassified. Emphasizing the close class pairs helps make the criterion well suited to producing separability in the output space.
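The monotonicity of the weighting function in Fig. 1 is easy to verify numerically; by continuity, α(0) := 2/√π (a small illustrative check):

```python
import math

def alpha(u):
    """Weighting function alpha(u) = erf(u)/u of Fig. 1;
    alpha(0) = 2/sqrt(pi) by continuity."""
    return 2.0 / math.sqrt(math.pi) if u == 0 else math.erf(u) / u

# Monotonically decreasing: closer class pairs (small u) get larger weights.
vals = [alpha(u) for u in (0.0, 0.5, 1.0, 2.0, 4.0)]
assert all(a > b for a, b in zip(vals, vals[1:]))
```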
C. Discriminant Criterion: General Case of ω

In the earlier derivation, we require that the variance 2N(ω^T Σ_l ω)² be less than 1. This requirement is satisfied by restricting the length of ω. It suffices to restrict ω such that ω^T(√(2N) M Σ̄)ω = 1. In fact, for any class c_l, we have 2N(ω^T Σ_l ω)² < 2N(ω^T M Σ̄ ω)² = 1. So, we can normalize ω as ω ← ω/√(ω^T(√(2N) M Σ̄)ω). Accordingly, the term ω^T(Σ_l − Σ_m)ω is normalized as

ω^T(Σ_l − Σ_m)ω ← ω^T(Σ_l − Σ_m)ω / (ω^T(√(2N) M Σ̄)ω).

As a result, in the weighting function (20), λ_lm is transformed into the largest absolute generalized eigenvalue of Σ_l − Σ_m with respect to √(2N) M Σ̄. For the general case of ω, based on criterion (19), we present the multiclass discriminant criterion

J̃_P(ω) = Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} (1/λ'_lm) erf( √N λ'_lm / (4M) ) |ω^T(Σ_l − Σ_m)ω| / (ω^T Σ̄ ω)    (22)

where λ'_lm denotes the largest absolute generalized eigenvalue of Σ_l − Σ_m with respect to Σ̄. The value λ'_lm actually measures the closeness between classes c_l and c_m in the input space. Based on criterion (22), we define a set of filters as follows:
the closeness between classes c _{l} and c _{m} in the input space. Based on criterion (22), we deﬁne a set of ﬁlters as follows:
(23)
˜
ω _{1} = arg max J _{P} (ω)
···
ω _{G} = arg
ω
max
^{T}
¯
ω Σ ω _{g} = 0,
g = 1,
−1
,G
˜
J _{P} (ω).
(24)
This discriminant criterion produces a set of discriminant vec tors to minimize the upper bound of the Bayesian error. We
see that the gth discriminant vector ω _{g} is determined such that
˜
J _{P} is maximized in the (K − g + 1)dimensional space that is
¯
perpendicular (with respect to Σ) to the space spanned by previ
ously obtained discriminant vectors ω _{1} through ω _{g}_{−}_{1} . We refer
to
J _{P} as WPC, which is to minimize the upper bound of the
Bayesian error for multiclass EEG singletrial classiﬁcation.
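A sketch of how one pairwise weight in (22) might be computed (`pair_weight` is a hypothetical helper name; the generalized eigenvalue is taken with respect to Σ̄):

```python
import numpy as np
from scipy.linalg import eigh
from math import erf, sqrt

def pair_weight(S_l, S_m, S_bar, N, M):
    """Weight of the (l, m) pair in criterion (22):
    (1/lam) * erf(sqrt(N) * lam / (4M)), where lam is the largest
    absolute generalized eigenvalue of S_l - S_m w.r.t. S_bar."""
    lam = np.abs(eigh(S_l - S_m, S_bar, eigvals_only=True)).max()
    return erf(sqrt(N) * lam / (4 * M)) / lam
```

Close class pairs (small λ'_lm) receive larger weights, in line with the discussion of Fig. 1.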
D. Implementation of WPC

The proposed WPC method can be conveniently implemented by using the rank-one update and power iteration technique, without resorting to a complex optimization algorithm, as in [23].
Let ψ = Σ̄^{1/2} ω and

J̆_P(ψ) = Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} (1/λ'_lm) erf( √N λ'_lm / (4M) ) |ψ^T(Σ̃_l − Σ̃_m)ψ| / (ψ^T ψ)    (25)

where Σ̃_l = Σ̄^{−1/2} Σ_l Σ̄^{−1/2} and λ'_lm is the largest absolute eigenvalue of Σ̃_l − Σ̃_m. Then, the solution of ω is converted into the optimization of ψ, formulated as

ψ_1 = arg max_ψ J̆_P(ψ)    (26)
⋮
ψ_G = arg max_{ψ^T ψ_g = 0, g=1,...,G−1} J̆_P(ψ).    (27)

Let s_lm ∈ {1, −1} be the sign of ψ^T(Σ̃_l − Σ̃_m)ψ. Then, |ψ^T(Σ̃_l − Σ̃_m)ψ| = s_lm ψ^T(Σ̃_l − Σ̃_m)ψ. Let
H(s) = Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} (1/λ'_lm) erf( √N λ'_lm / (4M) ) s_lm (Σ̃_l − Σ̃_m)    (28)

where s has the entries {s_lm}_{l,m=1,...,M}. The optimization of ψ can be further formulated as

ψ_1 = arg max_ψ max_{s∈S} (ψ^T H(s) ψ) / (ψ^T ψ)    (29)
⋮
ψ_G = arg max_{s∈S} max_{ψ^T ψ_g = 0, g=1,...,G−1} (ψ^T H(s) ψ) / (ψ^T ψ)    (30)
where S is the sign space of all possible s. Clearly, the first vector ψ_1 is the first principal eigenvector (which can be obtained by power iteration) of H(s*), where
s* is the sign vector that results in the largest first principal eigenvalue among all possible s. Once the first vector ψ_1 is obtained, we proceed to find the second vector ψ_2 in the orthogonally complementary space of ψ_1, i.e., in the space spanned by I_K − ψ_1 ψ_1^T, where I_K denotes the K-dimensional identity matrix. Therefore, ψ_2 is solved as the first principal eigenvector of the deflated matrix (I_K − ψ_1 ψ_1^T) H(s*) (I_K − ψ_1 ψ_1^T). Note that we use the same symbol s*, but it is not necessarily the same as the one producing ψ_1. Generally, suppose the first g vectors ψ_1, ..., ψ_g have been obtained. The (g + 1)th vector is determined in the orthogonally complementary space of ψ_1, ..., ψ_g, i.e., in the space spanned by I_K − U_g U_g^T, where U_g is the matrix of an orthonormal basis of ψ_1, ..., ψ_g, which, for example, can be obtained by the Gram–Schmidt orthogonalization procedure. So, ψ_{g+1} is solved as the first principal eigenvector of (I_K − U_g U_g^T) H(s*) (I_K − U_g U_g^T). Given the obtained ψ_{g+1}, according to the Gram–Schmidt orthogonalization procedure, the basis matrix U_{g+1} is formed by padding U_g as U_{g+1} = [U_g, u_{g+1}], where u_{g+1} = (ψ_{g+1} − U_g(U_g^T ψ_{g+1})) / ‖ψ_{g+1} − U_g(U_g^T ψ_{g+1})‖. In theory, ψ_{g+1} is orthogonal to U_g, i.e., U_g^T ψ_{g+1} = 0, which implies that u_{g+1} is simply the normalized ψ_{g+1}; we keep the Gram–Schmidt orthogonalization step for computational precision in practice. Note that I_K − U_{g+1} U_{g+1}^T = (I_K − u_{g+1} u_{g+1}^T)(I_K − U_g U_g^T), which makes it feasible to compute (I_K − U_{g+1} U_{g+1}^T) H(s*) (I_K − U_{g+1} U_{g+1}^T) for the next step by updating (I_K − U_g U_g^T) H(s*) (I_K − U_g U_g^T), multiplying by I_K − u_{g+1} u_{g+1}^T on both sides.

In practice, the covariance matrices Σ_l (l = 1, ..., M) are usually unknown and thus need to be estimated. The expression Σ̂_l defined in (1) provides a way of estimation. We summarize the optimization procedure of multiclass filters via the WPC approach in Table I.
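The whole procedure of Table I can be sketched as follows. This is an illustrative implementation under the stated assumptions (`wpc_filters` and all variable names are ours); the exhaustive search over the sign space S is only feasible for small M:

```python
import numpy as np
from itertools import product
from math import erf, sqrt

def wpc_filters(covs_tilde, N, G, n_iter=200):
    """Sketch of (29)-(30): per filter, search the sign space
    exhaustively, run power iteration on the deflated H(s), and
    deflate with a Gram-Schmidt-updated projector. `covs_tilde`
    are the whitened covariances Sigma_tilde_l."""
    M, K = len(covs_tilde), covs_tilde[0].shape[0]
    pairs = [(l, m) for l in range(M) for m in range(l + 1, M)]
    diffs, weights = [], []
    for l, m in pairs:
        D = covs_tilde[l] - covs_tilde[m]
        lam = np.abs(np.linalg.eigvalsh(D)).max()
        diffs.append(D)
        weights.append(erf(sqrt(N) * lam / (4 * M)) / lam)

    def principal(H):
        # Power iteration for the first principal eigenvector of H.
        v = np.ones(K) / sqrt(K)
        for _ in range(n_iter):
            hv = H @ v
            n = np.linalg.norm(hv)
            if n < 1e-12:        # H annihilates v; stop early
                break
            v = hv / n
        return v @ H @ v, v      # (Rayleigh quotient, eigenvector)

    P = np.eye(K)                # current deflation projector
    U = np.zeros((K, 0))         # orthonormal basis of found filters
    psis = []
    for _ in range(G):
        best = (-np.inf, None)
        for signs in product([1, -1], repeat=len(pairs)):
            H = sum(s * a * D for s, a, D in zip(signs, weights, diffs))
            best = max(best, principal(P @ H @ P), key=lambda t: t[0])
        psi = best[1]
        psis.append(psi)
        u = psi - U @ (U.T @ psi)          # Gram-Schmidt for precision
        u /= np.linalg.norm(u)
        U = np.hstack([U, u[:, None]])
        P = (np.eye(K) - np.outer(u, u)) @ P  # rank-one projector update
    return np.column_stack(psis)
```

The returned columns are the ψ_g; mapping back to ω_g = Σ̄^{−1/2} ψ_g recovers the spatial filters.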
E. Classification

Suppose ω_g (g = 1, ..., G) are the G filters obtained by WPC. For any EEG data segment X_t, we extract the features as

σ_g² = ω_g^T X_t X_t^T ω_g,  (g = 1, ..., G).    (31)

The extracted features on the training EEG data are used to design a classifier. For a testing EEG segment, its features are extracted in the same way and input into the trained classifier to predict its class label.
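Feature extraction (31) reduces each trial to a G-dimensional vector of filtered variances; a sketch (with a hypothetical helper name):

```python
import numpy as np

def wpc_features(trials, W):
    """Features (31): sigma_g^2 = w_g^T X X^T w_g for each trial X
    (K x N) and each filter w_g (columns of W). Returns an array of
    shape (n_trials, G)."""
    return np.array([np.sum((W.T @ X) ** 2, axis=1) for X in trials])
```

These variance features, computed on the training trials, are what the classifier is trained on; in the experiments below they are further processed (FDA) before classification.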
IV. COMPARISON AND EXTENSION
In this section, we compare the proposed WPC approach with [23]. The starting points and formulations of the two methods are completely different. We also extend WPC by integrating temporal information.
A. Comparison With [23]
TABLE I
OPTIMIZATION PROCEDURE OF MULTICLASS FILTERS VIA THE WPC APPROACH

Both the proposed WPC approach and [23] are based on Bayesian error estimation. However, the classification targets, and hence the criteria, of the two methods are different, as summarized in Table II. The discriminant criterion of [23] is derived
by minimizing the upper bound of the Bayesian error of classifying ω^T x, while WPC directly takes the feature ω^T X X^T ω used in EEG single-trial classification as its target. It is noted that

q Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} |ω^T(Σ_l − Σ_m)ω| ≤ Σ_{l=1}^{M} |ω^T(Σ_l − Σ̄)ω| ≤ 2q Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} |ω^T(Σ_l − Σ_m)ω|.    (32)
TABLE II
COMPARISON BETWEEN WPC AND ZHENG AND LIN [23]
The proof is given in Appendix B. Let

J̃(ω) = Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} |ω^T(Σ_l − Σ_m)ω| / (ω^T Σ̄ ω).    (33)

Then, we have q J̃(ω) ≤ J(ω) ≤ 2q J̃(ω). So, maximizing J(ω) can be roughly performed by maximizing J̃(ω). Using ψ = Σ̄^{1/2} ω, maximizing J̃(ω) is equivalent to maximizing

J̆(ψ) = Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} |ψ^T(Σ̃_l − Σ̃_m)ψ| / (ψ^T ψ)    (34)

which sums the M(M − 1)/2 pairs of absolute distances between ψ^T Σ̃_l ψ and ψ^T Σ̃_m ψ subject to ‖ψ‖ = 1. The maximization of (34), however, may not be very appropriate for classifying multiclass EEG single trials in some cases. For example, consider the situation in which one class has a large difference (in terms of Σ) from the other classes. To maximize the criterion J̆(ψ), the class pairs (say c_l and c_m) that have large differences (between Σ̃_l and Σ̃_m) heavily control the selection of the direction of ψ. Note that ψ is of unit length. As a result, the remote class is projected as far as possible from the other classes, while close classes are more likely to be merged.
By contrast, with the weight (1/λ'_lm) erf( √N λ'_lm / (4M) ), the criterion J̆_P(ψ) in (25), or J̃_P(ω) in (22), deemphasizes the influence of large class differences, where the classes are already well separated, and gives great emphasis to small class differences, where the close classes are more likely to be confused. An interesting connection between WPC and [23] is that, when applied in the two-class paradigm, criterion J̃_P(ω) and criterion J̃(ω) produce the same solution of ω. This, however, does not necessarily imply that the upper bounds of the Bayesian errors of the two methods are equal. On the other hand, comparing the upper bounds of the Bayesian errors derived by the two methods is meaningless, since they are derived for classifying different objects.
B. Extension to WPC: Integrating Temporal Information

The set of filters of WPC is obtained by considering classification of the projected variance ω^T X X^T ω. Note that X is a segment of an EEG single-trial time course. The covariance formulation X X^T, however, is globally independent of time; the temporal information is completely ignored. From the study of neurophysiology, EEG signals are usually nonstationary. It is useful to integrate the temporal information into the covariance formulation, reflecting the temporal manifold of the EEG time course [20]. Specifically, by the fact that X X^T = (1/(2N)) Σ_{i,j=1}^{N} (x_i − x_j)(x_i − x_j)^T, we use the temporally local covariance matrix

C = (1/(2N)) Σ_{i,j=1}^{N} (x_i − x_j)(x_i − x_j)^T A(i, j)    (35)

for covariance modeling instead of the time-independent X X^T. The time-dependent adjacency value A(i, j) is defined such that only temporally close sample pairs, say {x_i ∼ x_j : |i − j| < τ} with τ being a temporal range parameter, are selected to contribute to the summation (35). The value A(i, j) is monotonically decreasing with respect to the temporal distance between selected sample pairs. In this paper, the adjacency matrix A is defined using Tukey's tricube weighting function [6]

A(i, j) = (1 − (|i − j|/τ)³)³ if |i − j| < τ, and 0 otherwise.    (36)
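The adjacency (36) and the resulting matrix L = (1/N)E (introduced just below) can be built directly; the identity C = X L X^T follows from (35) for symmetric A. A sketch (hypothetical function name):

```python
import numpy as np

def temporal_laplacian(N, tau):
    """A(i, j) of (36) (Tukey's tricube) and L = (1/N)(D - A), where D
    is diagonal with the row sums of A, so that the temporally local
    covariance (35) equals X @ L @ X.T."""
    d = np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    A = np.where(d < tau, (1.0 - (d / tau) ** 3) ** 3, 0.0)
    L = (np.diag(A.sum(axis=1)) - A) / N
    return A, L
```

Note that the rows of L sum to zero (a Laplacian property), so constant signals contribute nothing to the temporally local covariance.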
With some algebraic derivation, C is compactly expressed as C = (1/N) X E X^T, where E = D − A is the Laplacian matrix, A = (A(i, j))_{i,j=1,...,N}, and D is the diagonal matrix whose diagonal entries are the row sums of A. Let L = (1/N) E. Then, under the same probabilistic assumption as in the previous section, the Gaussian quadratic form ω^T X L X^T ω conditioned on class c_l has mean tr(L) ω^T Σ_l ω and variance 2 tr(L²)(ω^T Σ_l ω)², where tr(·) is the trace operator. If ω^T X L X^T ω is treated as the target feature for classification purposes, then (28) is accordingly modified by integrating temporal information as

H_TI(s) = Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} (1/λ'_lm) erf( tr(L) λ'_lm / (4M √(tr(L²))) ) s_lm (Σ̃_l − Σ̃_m).    (37)

Note that, in this case, the difference between the means of classes c_l and c_m becomes tr(L)(ω^T(Σ_l − Σ_m)ω), and ω is normalized as ω ← ω/√(ω^T(√(2 tr(L²)) M Σ̄)ω).

In the implementation of the temporal extension of WPC, the covariance matrices Σ_l (l = 1, ..., M) are estimated as

Σ̌_l = (1/(N|I_l|)) Σ_{t∈I_l} X_t L X_t^T.    (38)
The optimization procedure can be carried out similarly to WPC. The features are extracted as

σ̌_g² = ω̌_g^T X_t L X_t^T ω̌_g,  (g = 1, ..., G)    (39)

where ω̌_g are the filters of the temporal extension of WPC.

1) Choice of τ: The additional parameter τ is determined from the data using a three-way cross-validation strategy. This strategy contains two nested loops. In the outer loop, the samples are divided into T_1 folds, of which one fold is treated as the testing set. The testing samples are used for the estimation of generalization ability and are not involved in the solutions of the filters and the parameter. In the inner loop, the remaining T_1 − 1 folds are further divided into T_2 folds, of which one fold is treated as the validation set while the remaining T_2 − 1 folds are
treated as the training set. For each τ, the filters are solved on the training set, and the recognition rate is then calculated on the validation set. This procedure is repeated T_2 times with a different validation fold each time, and the average recognition rate across the T_2 folds is recorded as the recognition accuracy. We select the τ that results in the maximum recognition accuracy and then solve the filters using all the T_2 folds with the selected optimal τ. With the filters obtained, we calculate the recognition rate on the testing set specified in the outer loop. This procedure is repeated T_1 times with a different fold as the testing set each time, and the average recognition rate across the T_1 folds is computed as the final recognition accuracy.
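The nested selection of τ described above might be sketched as follows (schematic only; `solve` and `evaluate` are placeholder callables for filter-plus-classifier fitting and accuracy scoring, which the paper leaves to the WPC/TI pipeline):

```python
import numpy as np

def select_tau(trials, taus, solve, evaluate, T1=3, T2=3, seed=0):
    """Three-way CV sketch: the inner T2-fold loop picks tau by
    validation accuracy; the outer T1-fold loop estimates the
    generalization accuracy of that choice."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(trials))
    outer = np.array_split(idx, T1)
    accs = []
    for t1 in range(T1):
        test = outer[t1]
        rest = np.concatenate([f for k, f in enumerate(outer) if k != t1])
        inner = np.array_split(rest, T2)

        def inner_acc(tau):
            # Average validation accuracy of tau over the inner folds.
            scores = []
            for t2 in range(T2):
                val = inner[t2]
                tr = np.concatenate([f for k, f in enumerate(inner) if k != t2])
                scores.append(evaluate(solve(tr, tau), val))
            return np.mean(scores)

        best_tau = max(taus, key=inner_acc)
        # Refit on all inner folds with the selected tau; score on the test fold.
        accs.append(evaluate(solve(rest, best_tau), test))
    return np.mean(accs)
```

The testing folds never influence the choice of τ or the filters, matching the protocol above.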
V. EXPERIMENTS
We evaluate the effectiveness of the proposed multiclass methods on two publicly available datasets of BCI competitions. These two datasets consist of four-class motor imagery EEG signals. We compare the classification performance of the proposed multiclass methods with the multiclass CSP using one-versus-rest and using JAD [7], the multiclass information-theoretic feature extraction [10], and the multiclass CSP presented in [23].
A. EEG Datasets Used for Evaluation
1) Dataset IIIa of BCI Competition III: This dataset follows a four-class motor imagery paradigm recorded from three subjects (k3b, k6b, and l1b) [3]. The subjects, sitting relaxed in a normal chair, were asked to perform four different motor imagery tasks (i.e., left hand, right hand, one foot, and tongue) following cues, which were presented in randomized order. In each trial, the cue was displayed from the third second and lasted for 1.25 s. At the same time, the motor imagery started and continued until the fixation cross disappeared at the seventh second. So, the duration of the motor imagery in each trial was 4 s. For subject k3b, there were 90 trials for each mental task; for subjects k6b and l1b, each mental task cue appeared 60 times. In our experiment, we discard four trials of subject k6b because of missing data. The EEG measurements were recorded with 60 sensors using a 64-channel Neuroscan system. The left and right mastoids served as reference and ground, respectively. The EEG signals were sampled at 250 Hz and filtered between the cutoff frequencies of 1 and 50 Hz, with the notch filter on.

2) Dataset IIa of BCI Competition IV: This dataset contains EEG signals recorded during a cue-based four-class motor imagery task from nine subjects [17]. Each trial started with a short acoustic warning tone along with a fixation cross displayed on the black screen. After 2 s, a visual cue was presented for 1.25 s, instructing the subjects to carry out the desired motor imagery task (i.e., the imagination of movement of the left hand, right hand, both feet, or tongue) from the third second until the fixation cross disappeared at the sixth second. Each subject participated in two sessions recorded on different days. There were 288 trials in each session for each subject, i.e., 72 trials per task. Twenty-two electrodes were used to record the EEG signals, which were sampled at 250 Hz and filtered between the cutoff frequencies of 0.5 and 100 Hz, with the notch filter on.
B. Experimental Settings and Results
The data are bandpass filtered between 5 and 35 Hz using a fifth-order Butterworth filter, as in [9] and [10]. The EEG segments recorded during the motor imagery period, i.e., from the third second to the seventh second in dataset IIIa of BCI competition III and from the third second to the sixth second in dataset IIa of BCI competition IV, are used in the experiment. We exploit the threefold cross-validation strategy to evaluate the classification accuracy. That is, we partition all the trials of each class per subject into three divisions, where each division is used as testing data while the remaining two divisions are used as training data. This procedure is repeated three times until each division has been used once as testing data. In each repetition, for each filter obtained on the training data, features are obtained by projection on the 15 frequency bands of 2-Hz width in the range 5–35 Hz [9], [10]. Consequently, we obtain a (15G)-dimensional feature vector for each trial, where G is the number of filters selected on the training data. We use the three-way cross-validation procedure with T_1 = T_2 = 3 to determine the value of G, where G varies from 2 to 10 in steps of 2. The (15G)-dimensional feature vectors are further reduced to 3-D vectors^1 by using Fisher discriminant analysis (FDA) [21]. It should be noted that the spatial filters, the value of G, and the FDA weights are all calculated on the basis of the training data and then applied to the testing data. The conventional classifier of the nearest class mean with Euclidean distance [21] is adopted to predict the class labels of the testing samples.
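The evaluation pipeline above can be sketched as follows. Feature extraction (filtering, band projection, FDA) is omitted; this is a minimal illustration of the threefold split and the nearest-class-mean rule, with names that are illustrative rather than taken from the paper's code:

```python
# Minimal sketch: threefold partition of each class's trials and a
# nearest-class-mean classifier with Euclidean distance.
import math

def threefold_splits(n_trials):
    """Partition trial indices 0..n_trials-1 into three near-equal folds."""
    idx = list(range(n_trials))
    k, r = divmod(n_trials, 3)
    folds, start = [], 0
    for i in range(3):
        size = k + (1 if i < r else 0)
        folds.append(idx[start:start + size])
        start += size
    return folds

def nearest_class_mean(train_feats, train_labels, x):
    """Predict the label whose class-mean feature vector is closest to x."""
    groups = {}
    for f, y in zip(train_feats, train_labels):
        groups.setdefault(y, []).append(f)
    best, best_d = None, math.inf
    for y, rows in groups.items():
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        d = math.dist(mean, x)
        if d < best_d:
            best, best_d = y, d
    return best
```

In each repetition, one fold serves as testing data while the other two form the training data on which the filters, G, and the FDA weights are fitted.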
Table III reports the classification accuracies obtained with the multiclass filters solved by the various methods. Note that we also evaluate the classification accuracy of WPC integrating temporal information (WPC/TI), where the parameter τ is determined by the three-way cross-validation procedure with T_1 = T_2 = 3; here τ is varied logarithmically from 1 to 5 in steps of 1. It is observed that the proposed WPC method achieves much better classification accuracy than the existing multiclass methods, and WPC/TI further improves the results in most cases. The improvement of WPC/TI is attributed to the local temporal modeling. That WPC/TI results in lower classification accuracies than WPC in a few cases may be due to overfitting.
C. Comparison With BCI Competition IV
For dataset IIa of BCI competition IV, to compare with the results of the winners, we use the evaluation of session-to-session transfer from session one to session two in terms of the kappa score, simulating the competition scenario. The session-to-session transfer procedure is much simpler than cross-validation: we use the first session as training data and the second session as testing data. All the experimental settings are the same as described in the previous section, except that the training data and the testing data are now fixed. The classification accuracy is summarized in Table IV. It can be seen that our proposed methods achieve fairly good classification performance compared with the results obtained by the best two competitors. Note that the results obtained by the multiclass CSP using one-versus-rest and using JAD are slightly different from those reported in [9], since different classifiers and time segments are used. In our experiment, a simple classification procedure is employed to reveal the effectiveness of the multiclass filters obtained by WPC and WPC/TI. The classification performance may be further improved by solving filters in narrower frequency bands, tuning the optimal time segment for each trial, and/or using other sophisticated classifiers. The goal of this paper is to demonstrate the effectiveness of the weighted scheme for solving multiclass filters: with the same experimental settings for all the methods, the weighted pairwise design produces a much higher classification accuracy.
^1 Since the number of classes is four, we can obtain at most three dimensions of features by FDA, which is known as the rank-limit problem.
TABLE III
COMPARISON OF THE CLASSIFICATION ACCURACIES (%) OF THE PROPOSED WPC AND WPC/TI METHODS WITH THE EXISTING MULTICLASS METHODS FOR EACH SUBJECT ON THE DATASETS OF BCI COMPETITIONS, WHERE M1–M6 REFER TO MULTICLASS CSP USING ONE-VERSUS-REST, MULTICLASS CSP USING JAD, MULTICLASS INFORMATION THEORETIC FEATURE EXTRACTION, MULTICLASS CSP IN [23], WPC, AND WPC/TI, RESPECTIVELY
TABLE IV
KAPPA SCORES OF VARIOUS MULTICLASS METHODS FOR EACH SUBJECT ON DATASET IIA OF BCI COMPETITION IV USING SESSION-TO-SESSION TRANSFER, WHERE NO. 1 AND NO. 2 REFER TO THE BEST TWO COMPETITORS, AND M1–M6 ARE THE SAME AS THOSE IN TABLE III
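The kappa score used to rank competitors is Cohen's kappa, which corrects accuracy for chance agreement. A minimal sketch from a confusion matrix follows (the function name and matrix convention are illustrative):

```python
# Cohen's kappa from a confusion matrix C, where C[i][j] counts trials of
# true class i predicted as class j.

def cohen_kappa(C):
    n = sum(sum(row) for row in C)
    p_o = sum(C[i][i] for i in range(len(C))) / n          # observed accuracy
    row = [sum(r) for r in C]                              # true-class totals
    col = [sum(C[i][j] for i in range(len(C))) for j in range(len(C))]
    p_e = sum(ri * cj for ri, cj in zip(row, col)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

For a balanced four-class task, random guessing gives p_e of roughly 0.25 and a kappa near 0, while perfect classification gives a kappa of 1.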
VI. CONCLUSION
In this paper, we propose a new discriminant criterion, called WPC, for optimizing multiclass filters. The approach is established by minimizing the upper bound of the Bayesian error of classifying EEG single-trial segments, which takes the form of a sum over individual class pairs weighted according to their closeness. We place special emphasis on closer class pairs, which are more likely to cause misclassification; in other words, the contributions of different class pairs to the discriminant criterion are deliberately biased. Computationally, the WPC algorithm is conveniently solved by the rank-one update and power iteration technique.
The proposed WPC approach is intentionally formulated for classifying EEG single-trial data: it takes into account the classification errors of EEG trials between pairs of classes. While a criterion derived from the Bayesian error of classifying EEG sampling points is also reasonable, large pairwise class differences may then play an overwhelming role in the optimization. By contrast, WPC directly uses the same features for optimizing the spatial filters as for classification. Moreover, we extend WPC by integrating the temporal information of the EEG series into the covariance matrix formulation. The effectiveness of the proposed WPC method is demonstrated by the classification of four motor imagery tasks on two datasets of the BCI competitions.
Finally, we point out that the Bayesian error estimation heavily relies on the assumption of an independent Gaussian distribution. This assumption, however, does not hold strictly in applications, since EEG data usually have an autocorrelation structure. One possible remedy is to use a Gaussian mixture model instead of a single Gaussian distribution. We are studying this issue theoretically and practically.
APPENDIX A
PROOF OF (14)
For $0 \le x \le a$, in the error function
$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-u^2}\, du$$
we use the variable substitution $v = \frac{a}{x} u$. Then, we have
$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^a e^{-\left(\frac{x}{a}\right)^2 v^2}\, \frac{x}{a}\, dv.$$
Since $0 \le \frac{x}{a} \le 1$, we have $e^{-(x/a)^2 v^2} \ge e^{-v^2}$ on $[0, a]$, and it follows that
$$\mathrm{erf}(x) \ge \frac{x}{a} \cdot \frac{2}{\sqrt{\pi}} \int_0^a e^{-v^2}\, dv = \frac{x}{a}\,\mathrm{erf}(a)$$
which is (14).
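The bound in (14) can be checked numerically with the standard library's erf. This is a quick sanity check, not part of the proof:

```python
# Numerical check of erf(x) >= (x/a) * erf(a) for 0 <= x <= a.
import math

def lower_bound_holds(a, steps=100):
    """Verify the bound on an evenly spaced grid over [0, a]."""
    for i in range(steps + 1):
        x = a * i / steps
        if math.erf(x) < (x / a) * math.erf(a) - 1e-12:
            return False
    return True
```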