
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 58, NO. 5, MAY 2011

Multiclass Filters by a Weighted Pairwise Criterion for EEG Single-Trial Classification

Haixian Wang, Member, IEEE

Abstract—The filtering technique for dimensionality reduction of multichannel electroencephalogram (EEG) recordings, modeled using common spatial patterns and its variants, is commonly used in two-class brain–computer interfaces (BCI). For a multiclass problem, the optimization of certain separability criteria in the output space is not directly related to the classification error of EEG single-trial segments. In this paper, we derive a new discriminant criterion, termed weighted pairwise criterion (WPC), for optimizing multiclass filters by minimizing the upper bound of the Bayesian error that is intentionally formulated for classifying EEG single-trial segments. The WPC approach pays more attention to close class pairs that are more likely to be misclassified than far away class pairs that are already well separated. Moreover, we extend WPC by integrating temporal information of EEG series. Computationally, we employ the rank-one update and power iteration technique to optimize the proposed discriminant criterion. The experiments of multiclass classification on the datasets of BCI competitions demonstrate the efficacy of the proposed method.

Index Terms—Bayesian classification error, brain–computer interfaces (BCI), common spatial patterns (CSP), multiclass filters, weighted pairwise criterion (WPC).

I. INTRODUCTION

Accurate classification of electroencephalogram (EEG) signals is the core problem in the community of brain–computer interfaces (BCI) [22]. A large number of modern signal processing and machine learning techniques have been used and developed for this purpose [1], [8], [15]. One powerful and widely used method for processing multichannel EEG series is the filtering technique, represented by the common spatial patterns (CSP) [4]. The CSP approach, designed for the two-class problem [12], [16], [18], [19], seeks a few filters such that the ratio of the filtered variances between the two populations is maximized (or minimized). Because CSP makes use of only spatial information, spatio-temporal versions were also developed, for example, the common spatio-spectral patterns [11] and the local temporal CSP [20]. The review in [4] covers many variants of CSP.

Manuscript received June 20, 2010; revised October 14, 2010 and December 30, 2010; accepted January 3, 2011. Date of publication January 13, 2011; date of current version April 20, 2011. This work was supported in part by the National Natural Science Foundation of China under Grants 61075009 and 60803059, in part by the Qing Lan Project, and in part by the Fund for the Program of Excellent Young Teachers at Southeast University. The author is with the Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing 210096, China (e-mail: hxwang@seu.edu.cn). Digital Object Identifier 10.1109/TBME.2011.2105869

The CSP approach was originally suggested for the two-class paradigm. Multiclass extensions have been investigated in the literature. One trivial extension is to divide the multiclass problem into many two-class situations and then apply CSP repeatedly [4], [7]. Another conventional extension is the joint approximate diagonalization (JAD) of M covariance matrices, where M is the number of classes [7]. This is based on the observation that CSP simultaneously diagonalizes two covariance matrices. The JAD was further investigated from the perspective of mutual information and brain source separation [9], [10]. The JAD approach is actually a decomposition technique rather than a classification method. Recently, Zheng and Lin [23] presented a multiclass extension via Bayesian classification error estimation. Their discriminant criterion is derived by minimizing the upper bound of the Bayesian error of classifying ω^T x, where x is an EEG signal recorded at a specific time point and ω is a filter. While this is a reasonable criterion for optimizing spatial filters, a more direct approach uses the same features for optimizing the spatial filters as for classification. Denoting a multivariate time series of band-pass filtered single-trial EEG by X, these features are, in CSP, ω^T X X^T ω (cf. [4, eq. (2)], [16, eq. (2)], and [23, eqs. (29)–(31)]). The quantity ω^T X X^T ω is the variance of the band-pass filtered EEG signals; it is equal to the band power. So, the band power ω^T X X^T ω with an appropriate spatial filter ω in fact corresponds to the effect of event-related desynchronization/synchronization (ERD/ERS), which is an effective neurophysiological feature for the classification of brain activities [4]. Specifically, the so-called idle rhythms, reflected around 10 Hz, can be observed over motor and sensorimotor areas in most persons. These idle rhythms are attenuated during motor processing. This physiological phenomenon is termed the ERD effect because of the loss of synchrony in the neural population. By contrast, the rebound of the rhythmic activity is termed ERS.

In this paper, we directly target the classification of ω^T X X^T ω, for which the upper bound of the Bayesian classification error is estimated. Accordingly, by minimizing the upper bound of the Bayesian classification error, we develop a new discriminant criterion that is directly related to the classification of EEG single-trial segments. The proposed criterion takes the form of a sum over weighted pairwise classes and is referred to as the weighted pairwise criterion (WPC). The WPC approach puts heavier weights onto close class pairs, which are more likely to be misclassified, and de-emphasizes the influence of far away class pairs, where the classes are already well separated. The weighting strategy helps make the criterion suited to producing separability in the output space, which has been witnessed in pattern classification problems [13], [14].



Computationally, the proposed WPC method is conveniently implemented by the rank-one update and power iteration technique. Moreover, we compare WPC with [23]. The classification targets of the two methods are different, and the discriminant criteria, which are obtained by minimizing the upper bound of the Bayesian classification error, are thus different. We also extend WPC by integrating temporal information of EEG series into the covariance matrix formulation. The efficacy of the proposed approach is demonstrated on the classification of four motor imagery tasks on two datasets of BCI competitions.

The remainder of this paper is organized as follows. In Section II, the CSP and its multiclass extension via Bayesian classification error estimation based on EEG sampling points are briefly reviewed. In Section III, we derive the upper bound of the Bayesian error that is intentionally formulated for classifying EEG single-trial segments, propose the WPC to minimize this upper bound, and give the optimization procedure using the rank-one update and power iteration technique. The comparison with [23] and the extension by integrating temporal information are presented in Section IV. The experimental results are presented in Section V. Finally, Section VI concludes the paper.

II. BRIEF REVIEW OF CSP AND ITS MULTICLASS SITUATION

Let x ∈ R^K be an EEG signal at a specific time point with K electrodes. We view x, recorded while performing a certain mental task, as a K-dimensional random variable that is generated from a Gaussian distribution. Suppose that C = {c_1, ..., c_M} is the set of mental conditions to be investigated. We consider the multiclass (M > 2) classification problem that assigns EEG single-trial segments into the M predefined brain states. Given class c_l (l ∈ {1, 2, ..., M}), the random variable x is assumed to be Gaussian distributed according to x | c_l ∼ N(0, Σ_l), where Σ_l is the covariance matrix. The Gaussian assumption does not sacrifice generality when studying linear filters and statistics of up to second order [10]. For the purpose of classification, we wish to learn G (G < K) filters (linear transformation vectors) ω_g ∈ R^K using the finite training data such that the filtered features are more discriminative for predicting class labels than using the raw EEG data. Hereafter, the terms conditions and class labels are used interchangeably.

A. CSP: Two-Class Paradigm

The CSP method provides a powerful way of extracting EEG features related to the modulation of ERD/ERS. The CSP algorithm applies to the two-class situation only. It solves for filters such that the projected EEG series have the maximum ratio of variances between the two classes. Maximizing the variances actually characterizes the ERD/ERS effects. Let X = [x_1, ..., x_N] ∈ R^{K×N} be a segment of EEG series during one trial, where x_i is the multichannel EEG signal at a specific time point i, and N denotes the number of sampled time points. The CSP approach solves spatial filters by simultaneously diagonalizing the estimated covariance matrices under the two conditions. The covariance matrices of the two classes are estimated as

$$\hat{\Sigma}_l = \frac{1}{N\,|I_l|} \sum_{t \in I_l} X_t X_t^T \qquad (1)$$

where I_l (l ∈ {1, 2}) denotes the set of indices of trials belonging to class c_l, and |I_l| is the cardinality of the set I_l. The spatial filters of CSP can alternatively be formulated as an optimization problem [4], [18]

$$\hat{\omega} = \arg\{\max,\min\}_{\omega \in \mathbb{R}^K} \frac{\omega^T \hat{\Sigma}_1 \omega}{\omega^T \hat{\Sigma}_2 \omega} \qquad (2)$$

where the notation {max, min} means that maximizing or minimizing the Rayleigh quotient is of equal interest. The spatial filters are thus obtained by solving the generalized eigenvalue equation

$$\hat{\Sigma}_1 \hat{\omega} = \hat{\lambda}\, \hat{\Sigma}_2 \hat{\omega}. \qquad (3)$$

The eigenvalue λ̂ measures the ratio of variances between the two classes. For the purpose of classification, the filters are specified by choosing several generalized eigenvectors associated with eigenvalues from both ends of the eigenvalue spectrum. The variances of the spatially filtered EEG data are discriminative features, which are input into a classifier.
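For concreteness, the two-class CSP computation in (1)–(3) can be sketched in a few lines of Python. This is only a minimal illustration assuming each trial is stored as a NumPy array of shape (K, N); the helper names (class_covariance, csp_filters, variance_features) are ours, not from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def class_covariance(trials):
    # Estimate of (1): trials is a list of (K, N) arrays belonging to one class.
    K, N = trials[0].shape
    return sum(X @ X.T for X in trials) / (N * len(trials))

def csp_filters(trials_1, trials_2, n_filters=6):
    # Generalized eigenvectors of (Sigma_1, Sigma_2), cf. (2) and (3).
    S1, S2 = class_covariance(trials_1), class_covariance(trials_2)
    eigvals, eigvecs = eigh(S1, S2)          # eigenvalues in ascending order
    half = n_filters // 2                    # filters from both ends of the spectrum
    idx = np.r_[np.arange(half), np.arange(len(eigvals) - half, len(eigvals))]
    return eigvecs[:, idx]                   # shape (K, n_filters)

def variance_features(X, W):
    # omega^T X X^T omega for each filter: the band-power feature fed to a classifier.
    Y = W.T @ X
    return np.sum(Y ** 2, axis=1)
```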

B. Multiclass Situation by Bayesian Error Estimation

The CSP is suitable for two-class classification only. Zheng and Lin [23] addressed the multiclass paradigm via Bayesian classification error estimation. Under the assumption that the distribution of the EEG sampling point x conditioned on class c_l is Gaussian, i.e., p_l = N(x; 0, Σ_l), the filtered EEG signal y = ω^T x is also Gaussian distributed according to N(y; 0, ω^T Σ_l ω). Based on the Gaussian distribution, Zheng and Lin [23] obtained the upper bound of the Bayesian error of classifying ω^T x, given by

$$\bar{\varepsilon}(\omega) \le \frac{qM(M-1)}{2} - \frac{q}{3\sqrt{2}\,M} \sum_{l=1}^{M} \frac{|\omega^T(\Sigma_l - \bar{\Sigma})\omega|}{\omega^T \bar{\Sigma}\,\omega} \qquad (4)$$

where q is the common a priori probability of the M classes, and Σ̄ = Σ_{m=1}^{M} q Σ_m. Minimizing this upper bound of the Bayesian classification error is equivalent to maximizing the discriminant criterion

$$J(\omega) = \sum_{l=1}^{M} \frac{|\omega^T(\Sigma_l - \bar{\Sigma})\omega|}{\omega^T \bar{\Sigma}\,\omega}. \qquad (5)$$

So, the G filters are defined as [23]

$$\omega_1 = \arg\max_{\omega} J(\omega) \qquad (6)$$
$$\cdots$$
$$\omega_G = \arg\max_{\substack{\omega^T \bar{\Sigma}\,\omega_g = 0 \\ g = 1,\ldots,G-1}} J(\omega). \qquad (7)$$

Using suitable estimates of Σ_l and Σ̄, the defined filters can be determined one by one via the rank-one update and power iteration procedure.
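As a reference point for the comparison in Section IV, the criterion (5) is easy to evaluate once the class covariances are estimated. The sketch below assumes equal priors q = 1/M and uses illustrative variable names; the actual filter optimization of (6)–(7) is not shown here.

```python
import numpy as np

def zheng_lin_criterion(w, covs):
    # J(w) of (5): sum_l |w^T (Sigma_l - Sigma_bar) w| / (w^T Sigma_bar w).
    covs = np.asarray(covs)              # shape (M, K, K)
    sigma_bar = covs.mean(axis=0)        # with q = 1/M this is the plain average covariance
    denom = w @ sigma_bar @ w
    return sum(abs(w @ (S - sigma_bar) @ w) for S in covs) / denom
```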


III. WPC

For the specific classification of multiclass EEG single-trial segments X, a more direct approach is to optimize the feature ω^T X X^T ω, rather than ω^T x, which is the feature used for classification. Ideally, one would obtain the Bayesian-classification-error-based optimal criterion for the classification of ω^T X X^T ω. The Bayesian classification error is, in general, too complex to be calculated directly. Therefore, an upper bound of the Bayesian classification error, which should also be easy to optimize in practice, is usually estimated as a suboptimal criterion. In this section, we develop a new discriminant criterion based on the upper bound of the Bayesian classification error of EEG single trials. It is noted that we take ω^T X X^T ω, rather than ω^T x, as our target in deriving the upper bound of the multiclass Bayesian classification error.

A. Upper Bound of Multiclass Bayesian Error

Recalling X = [x_1, ..., x_N], we have ω^T X X^T ω = Σ_{i=1}^{N} (ω^T x_i)², where N is the number of sample points in one trial. Under the assumption of an independent Gaussian distribution N(0, Σ_l) of x_i conditioned on class c_l, the quantity (ω^T X X^T ω)/(ω^T Σ_l ω) follows a χ² distribution with N degrees of freedom. Usually, the number of sampling points N is very large, say N > 30. By the central limit theorem, ω^T X X^T ω conditioned on class c_l approaches a Gaussian distribution with mean N ω^T Σ_l ω and variance 2N(ω^T Σ_l ω)². For the time being, we assume that 2N(ω^T Σ_l ω)² is less than one; the general case is addressed later on. Denote by ε_lm(ω) the Bayesian error between classes c_l and c_m, i.e.,

$$\varepsilon_{lm}(\omega) = q_l P(f_\omega(X) \neq c_l \mid c_l) + q_m P(f_\omega(X) \neq c_m \mid c_m) \qquad (8)$$

where P(f_ω(X) ≠ c_l | c_l) is the probability that samples belonging to class c_l are misclassified into class c_m, f_ω(·) denotes the Bayesian classifier, and q_l and q_m are the a priori probabilities of classes c_l and c_m, respectively. Since the data ω^T X X^T ω conditioned on each class are (approximately) Gaussian distributed with variance less than one and mean, for example, N ω^T Σ_l ω when conditioned on class c_l, it follows that

$$q_l P(f_\omega(X) \neq c_l \mid c_l) + q_m P(f_\omega(X) \neq c_m \mid c_m) \le \int_{D_m} q_l\, p_l(x)\,dx + \int_{D_l} q_m\, p_m(x)\,dx \qquad (9)$$

where p_l(x) and p_m(x) are the probability density functions of the Gaussian distributions N(N ω^T Σ_l ω, 1) and N(N ω^T Σ_m ω, 1), respectively, and D_m and D_l are defined as

$$D_m = \{x : q_m p_m(x) \ge q_l p_l(x)\} \qquad (10)$$
$$D_l = \{x : q_l p_l(x) > q_m p_m(x)\}. \qquad (11)$$

Suppose q_l = q_m = q. Then, we have [21]

$$\int_{D_m} p_l(x)\,dx + \int_{D_l} p_m(x)\,dx = 1 - \operatorname{erf}\!\left(\frac{N\,|\omega^T(\Sigma_l - \Sigma_m)\omega|}{2\sqrt{2}}\right) \qquad (12)$$

where the error function (erf) is defined as erf(x) = (2/√π) ∫_0^x e^{−u²} du. By (8), (9), and (12), the Bayesian error between classes c_l and c_m in the 1-D feature space after being projected onto ω is expressed as

$$\varepsilon_{lm}(\omega) \le q\left[1 - \operatorname{erf}\!\left(\frac{N\,|\omega^T(\Sigma_l - \Sigma_m)\omega|}{2\sqrt{2}}\right)\right]. \qquad (13)$$

It is still complex to optimize ω via (13), since ω is embedded in the error function. We would like to isolate ω from the error function. We present the following inequality. For 0 ≤ x ≤ a, we have

$$\operatorname{erf}(x) \ge \frac{\operatorname{erf}(a)}{a}\, x. \qquad (14)$$

The equality holds when taking x = 0 or x = a. The proof is given in Appendix A. Let

$$\delta_{lm}(\omega) = |\omega^T \Sigma_l \omega - \omega^T \Sigma_m \omega| \qquad (15)$$

be the absolute distance between classes c_l and c_m in the reduced 1-D feature space. By (14), we have

$$\operatorname{erf}\!\left(\frac{N\,\delta_{lm}(\omega)}{2\sqrt{2}}\right) \ge \frac{1}{\lambda_{lm}} \operatorname{erf}\!\left(\frac{N\,\lambda_{lm}}{2\sqrt{2}}\right) \delta_{lm}(\omega) \qquad (16)$$

where λ_lm is the maximum value of δ_lm(ω). Note that we have required, at the beginning of this section, that the magnitude of ω be subject to the constraint 2N(ω^T Σ_l ω)² ≤ 1. The left and right expressions of (16) are not equal for all directions of ω; the two expressions are equal when δ_lm(ω) attains its maximum or its minimum value. Combining (13) and (16), we have

$$\varepsilon_{lm}(\omega) \le q\left[1 - \frac{1}{\lambda_{lm}} \operatorname{erf}\!\left(\frac{N\,\lambda_{lm}}{2\sqrt{2}}\right) \delta_{lm}(\omega)\right]. \qquad (17)$$

For the M-class problem, the upper bound of the Bayesian error is calculated as [5]

$$\varepsilon(\omega) \le \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} \varepsilon_{lm}(\omega) \le \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} q\left[1 - \frac{1}{\lambda_{lm}} \operatorname{erf}\!\left(\frac{N\,\lambda_{lm}}{2\sqrt{2}}\right) \delta_{lm}(\omega)\right]. \qquad (18)$$

B. Discriminant Criterion Based on Upper Bound of Multiclass Bayesian Error

To minimize the Bayesian error, we should minimize its upper bound, which reduces to maximizing the following discriminant criterion:

$$J_P(\omega) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} \frac{1}{\lambda_{lm}} \operatorname{erf}\!\left(\frac{N\,\lambda_{lm}}{2\sqrt{2}}\right) \delta_{lm}(\omega). \qquad (19)$$

Let

$$\alpha_{lm} = \frac{1}{\lambda_{lm}} \operatorname{erf}\!\left(\frac{N\,\lambda_{lm}}{2\sqrt{2}}\right). \qquad (20)$$

Then, J_P(ω) can be rewritten as

$$J_P(\omega) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} \alpha_{lm}\, \delta_{lm}(\omega). \qquad (21)$$


Fig. 1. Weighting function α(u) = (1/u) erf(u).

The α_lm can be viewed as a weight imposed on the pairwise classes c_l and c_m. Since N ω^T Σ_l ω and N ω^T Σ_m ω are the distribution means of classes c_l and c_m in the reduced 1-D feature space, respectively, the quantity N λ_lm in (20) is the maximum distance (with respect to ω) between the two class means. It reflects the separability of the two classes. Note that α(u) = (1/u) erf(u) is a monotonically decreasing function of u, as shown in Fig. 1. So, the pairwise class weighting function α_lm in (20) is monotonically decreasing with respect to N λ_lm. That is, in (21), we impose heavier weights onto close class pairs, which are more likely to be misclassified. The close class pairs are thus given more emphasis, which helps make the criterion suited to producing separability in the output space.
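The monotone decrease of the weighting function α(u) = (1/u) erf(u) in Fig. 1, which is what shifts weight toward close class pairs, can be checked numerically with a few lines (an illustrative check, not part of the paper):

```python
import numpy as np
from scipy.special import erf

u = np.linspace(0.05, 5.0, 200)
alpha = erf(u) / u                  # alpha(u) = (1/u) erf(u), cf. Fig. 1
assert np.all(np.diff(alpha) < 0)   # strictly decreasing: closer class pairs receive larger weights
```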

C. Discriminant Criterion: General Case of ω

In the earlier derivation, we require that the variance 2N(ω^T Σ_l ω)² be less than 1. This requirement is satisfied by restricting the length of ω. It suffices to restrict ω such that ω^T(√(2N) M Σ̄)ω = 1. In fact, for any class c_l, we then have 2N(ω^T Σ_l ω)² < 2N(ω^T M Σ̄ ω)² = 1. So, we can normalize ω as

$$\omega \leftarrow \frac{\omega}{\sqrt{\omega^T \bigl(\sqrt{2N}\, M \bar{\Sigma}\bigr)\,\omega}}.$$

Accordingly, the term ω^T(Σ_l − Σ_m)ω is normalized as

$$\omega^T(\Sigma_l - \Sigma_m)\omega \;\leftarrow\; \frac{\omega^T(\Sigma_l - \Sigma_m)\omega}{\omega^T \bigl(\sqrt{2N}\, M \bar{\Sigma}\bigr)\,\omega}.$$

As a result, in the weighting function (20), λ_lm is transformed into the largest absolute generalized eigenvalue of Σ_l − Σ_m with respect to √(2N) M Σ̄. For the general case of ω, based on criterion (19), we present the multiclass discriminant criterion

$$\tilde{J}_P(\omega) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} \frac{1}{\lambda_{lm}} \operatorname{erf}\!\left(\frac{\sqrt{N}\,\lambda_{lm}}{4M}\right) \frac{|\omega^T(\Sigma_l - \Sigma_m)\omega|}{\omega^T \bar{\Sigma}\,\omega} \qquad (22)$$

where λ_lm denotes the largest absolute generalized eigenvalue of Σ_l − Σ_m with respect to Σ̄. The value λ_lm actually measures the closeness between classes c_l and c_m in the input space. Based on criterion (22), we define a set of filters as follows:

$$\omega_1 = \arg\max_{\omega} \tilde{J}_P(\omega) \qquad (23)$$
$$\cdots$$
$$\omega_G = \arg\max_{\substack{\omega^T \bar{\Sigma}\,\omega_g = 0 \\ g = 1,\ldots,G-1}} \tilde{J}_P(\omega). \qquad (24)$$

This discriminant criterion produces a set of discriminant vectors that minimize the upper bound of the Bayesian error. We see that the gth discriminant vector ω_g is determined such that J̃_P is maximized in the (K − g + 1)-dimensional space that is perpendicular (with respect to Σ̄) to the space spanned by the previously obtained discriminant vectors ω_1 through ω_{g−1}. We refer to J̃_P as the WPC, which is to minimize the upper bound of the Bayesian error for multiclass EEG single-trial classification.
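The WPC criterion (22) can be evaluated directly by precomputing the pairwise generalized eigenvalues λ_lm. The following sketch assumes equal priors and sample covariance estimates; the function and variable names are illustrative, not from the paper.

```python
import numpy as np
from scipy.linalg import eigvalsh
from scipy.special import erf

def wpc_pairwise_weights(covs, N):
    # (1/lambda_lm) * erf(sqrt(N) * lambda_lm / (4M)) for every class pair, cf. (22).
    M = len(covs)
    sigma_bar = np.mean(covs, axis=0)
    weights = {}
    for l in range(M):
        for m in range(l + 1, M):
            # largest absolute generalized eigenvalue of (Sigma_l - Sigma_m) w.r.t. Sigma_bar
            lam = np.max(np.abs(eigvalsh(covs[l] - covs[m], sigma_bar)))
            weights[(l, m)] = erf(np.sqrt(N) * lam / (4 * M)) / lam
    return weights, sigma_bar

def wpc_criterion(w, covs, N):
    # tilde-J_P(w) of (22): weighted sum of pairwise projected distances.
    weights, sigma_bar = wpc_pairwise_weights(covs, N)
    denom = w @ sigma_bar @ w
    return sum(a * abs(w @ (covs[l] - covs[m]) @ w) for (l, m), a in weights.items()) / denom
```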

D. Implementation of WPC

The proposed WPC method can be implemented by using the rank-one update and power iteration technique, without resorting to complex optimization algorithms, similarly to [23].

Let ψ = Σ̄^{1/2} ω and

$$\breve{J}_P(\psi) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} \frac{1}{\lambda_{lm}} \operatorname{erf}\!\left(\frac{\sqrt{N}\,\lambda_{lm}}{4M}\right) \frac{|\psi^T(\breve{\Sigma}_l - \breve{\Sigma}_m)\psi|}{\psi^T \psi} \qquad (25)$$

where Σ̆_l = Σ̄^{−1/2} Σ_l Σ̄^{−1/2} and λ_lm is the largest absolute eigenvalue of Σ̆_l − Σ̆_m. Then, the solution of ω is converted into the optimization of ψ, formulated as

$$\psi_1 = \arg\max_{\psi} \breve{J}_P(\psi) \qquad (26)$$
$$\cdots$$
$$\psi_G = \arg\max_{\substack{\psi^T \psi_g = 0 \\ g = 1,\ldots,G-1}} \breve{J}_P(\psi). \qquad (27)$$

Let s_lm ∈ {−1, 1} be the sign of ψ^T(Σ̆_l − Σ̆_m)ψ.

Then, |ψ^T(Σ̆_l − Σ̆_m)ψ| = s_lm ψ^T(Σ̆_l − Σ̆_m)ψ. Let

$$H(s) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} \frac{1}{\lambda_{lm}} \operatorname{erf}\!\left(\frac{\sqrt{N}\,\lambda_{lm}}{4M}\right) s_{lm}\,(\breve{\Sigma}_l - \breve{\Sigma}_m) \qquad (28)$$

where s has the entries {s_lm}_{l,m=1,...,M}. The optimization of ψ can be further formulated as

$$\psi_1 = \arg\max_{s \in S}\; \max_{\psi} \frac{\psi^T H(s)\,\psi}{\psi^T \psi} \qquad (29)$$
$$\cdots$$
$$\psi_G = \arg\max_{s \in S}\; \max_{\substack{\psi^T \psi_g = 0 \\ g = 1,\ldots,G-1}} \frac{\psi^T H(s)\,\psi}{\psi^T \psi} \qquad (30)$$

where S is the sign space of all possible s. Clearly, the first vector ψ_1 is the first principal eigenvector (which can be obtained by power iteration) of H(s*), where


s* is the sign pattern that results in the largest first principal eigenvalue over all possible s. Once the first vector ψ_1 is obtained, we proceed to find the second vector ψ_2 in the orthogonally complementary space of ψ_1, i.e., in the space spanned by I_K − ψ_1 ψ_1^T, where I_K denotes the K-dimensional identity matrix. Therefore, ψ_2 is solved as the first principal eigenvector of the deflated matrix (I_K − ψ_1 ψ_1^T) H(s*) (I_K − ψ_1 ψ_1^T). Note that we use the same symbol s*, but it is not necessarily the same as the one producing ψ_1. Generally, suppose the first g vectors ψ_1, ..., ψ_g have been obtained. The (g + 1)th vector is determined in the orthogonally complementary space of ψ_1, ..., ψ_g, i.e., in the space spanned by I_K − U_g U_g^T, where U_g is a matrix whose columns form an orthonormal basis of ψ_1, ..., ψ_g, which, for example, can be obtained by the Schmidt orthogonalization procedure. So, ψ_{g+1} is solved as the first principal eigenvector of (I_K − U_g U_g^T) H(s*) (I_K − U_g U_g^T). Given the obtained ψ_{g+1}, according to the Schmidt orthogonalization procedure, the basis matrix U_{g+1} is formed by padding U_g as U_{g+1} = [U_g, u_{g+1}], where u_{g+1} = (ψ_{g+1} − U_g(U_g^T ψ_{g+1})) / ‖ψ_{g+1} − U_g(U_g^T ψ_{g+1})‖. In theory, ψ_{g+1} is orthogonal to U_g, i.e., U_g^T ψ_{g+1} = 0, which implies that u_{g+1} is simply the normalized ψ_{g+1}; we keep the Schmidt orthogonalization procedure for computational precision in practice. Note that I_K − U_{g+1} U_{g+1}^T = (I_K − u_{g+1} u_{g+1}^T)(I_K − U_g U_g^T), which makes it feasible to compute (I_K − U_{g+1} U_{g+1}^T) H(s*) (I_K − U_{g+1} U_{g+1}^T) for the next step by updating (I_K − U_g U_g^T) H(s*) (I_K − U_g U_g^T) through multiplying by I_K − u_{g+1} u_{g+1}^T from both sides. In practice, the covariance matrices Σ_l (l = 1, ..., M) are usually unknown and thus need to be estimated. The expression Σ̂_l defined in (1) provides a way of estimation. We summarize the optimization procedure of multiclass filters via the WPC approach in Table I.
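The sketch below is one possible reading of the procedure summarized in Table I, under the assumptions stated in the text (equal priors, exhaustive search over the sign space S, which has 2^{M(M−1)/2} elements and is small for M = 4, and plain power iteration for the principal eigenvector). All names, such as wpc_filters, are ours and serve only to illustrate the flow of this section, not to reproduce the author's code.

```python
import itertools
import numpy as np
from scipy.linalg import eigvalsh, sqrtm, inv
from scipy.special import erf

def wpc_filters(covs, N, G, iters=200, seed=0):
    covs = np.asarray(covs, dtype=float)
    M, K, _ = covs.shape
    sigma_bar = covs.mean(axis=0)
    W = np.real(inv(sqrtm(sigma_bar)))             # Sigma_bar^{-1/2}; psi = Sigma_bar^{1/2} omega
    covs_w = np.array([W @ S @ W for S in covs])   # whitened covariances breve-Sigma_l
    pairs = [(l, m) for l in range(M) for m in range(l + 1, M)]
    diffs = [covs_w[l] - covs_w[m] for l, m in pairs]
    lams = [np.max(np.abs(eigvalsh(D))) for D in diffs]
    weights = [erf(np.sqrt(N) * lam / (4 * M)) / lam for lam in lams]

    def H(signs):                                  # the sign-dependent matrix of (28)
        return sum(a * s * D for a, s, D in zip(weights, signs, diffs))

    def leading_eig(A):                            # power iteration for the principal eigenpair
        v = np.random.default_rng(seed).standard_normal(A.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(iters):
            v = A @ v
            v /= np.linalg.norm(v)
        return v @ A @ v, v

    psis, P = [], np.eye(K)                        # P projects onto the complement of found filters
    for _ in range(G):
        best = None
        for signs in itertools.product([1, -1], repeat=len(pairs)):
            val, vec = leading_eig(P @ H(signs) @ P)
            if best is None or val > best[0]:
                best = (val, vec)
        psi = best[1]
        psis.append(psi)
        P = P @ (np.eye(K) - np.outer(psi, psi))   # deflation: next psi orthogonal to the previous ones
    return np.column_stack([W @ psi for psi in psis])   # map back: omega_g = Sigma_bar^{-1/2} psi_g
```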

E. Classification

Suppose ω_g (g = 1, ..., G) are the G filters obtained by WPC. For any EEG data segment X_t, we extract the features as

$$\sigma_g^2 = \omega_g^T X_t X_t^T \omega_g, \qquad (g = 1, \ldots, G). \qquad (31)$$

The extracted features on the training EEG data are used to design a classifier. For a testing EEG segment, its features are extracted in the same way and are input into the trained classifier to predict its class label.
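A minimal version of this feature extraction, together with a simple classifier (nearest class mean, as used later in Section V), might look as follows; the helper names are illustrative.

```python
import numpy as np

def extract_features(trials, W):
    # sigma_g^2 = omega_g^T X_t X_t^T omega_g for each trial and each filter, cf. (31).
    return np.array([np.sum((W.T @ X) ** 2, axis=1) for X in trials])

def nearest_class_mean(train_feats, train_labels, test_feats):
    classes = np.unique(train_labels)
    means = np.array([train_feats[train_labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_feats[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
```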

IV. COMPARISON AND EXTENSION

In this section, we compare the proposed WPC approach with [23]. The starting points and formulations of the two methods are completely different. We also extend WPC by integrating temporal information.

A. Comparison With [23]

TABLE I
OPTIMIZATION PROCEDURE OF MULTICLASS FILTERS VIA THE WPC APPROACH

Both the proposed WPC approach and [23] are based on Bayesian error estimation. However, the classification targets, and hence the criteria, of the two methods are different, as summarized in Table II. The discriminant criterion of [23] is derived by minimizing the upper bound of the Bayesian error of classifying ω^T x, while WPC directly takes the feature ω^T X X^T ω used in EEG single-trial classification as its target. It is noted that

$$q \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} |\omega^T(\Sigma_l - \Sigma_m)\omega| \;\le\; \sum_{l=1}^{M} |\omega^T(\Sigma_l - \bar{\Sigma})\omega| \;\le\; 2q \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} |\omega^T(\Sigma_l - \Sigma_m)\omega|. \qquad (32)$$
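Inequality (32) can be probed numerically with random symmetric positive-definite covariances (a sanity check only, not a substitute for the proof in Appendix B):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 4, 6
covs = []
for _ in range(M):
    A = rng.standard_normal((K, K))
    covs.append(A @ A.T + K * np.eye(K))       # random SPD covariance
sigma_bar = np.mean(covs, axis=0)              # equal priors, q = 1/M
q = 1.0 / M
w = rng.standard_normal(K)

pairwise = sum(abs(w @ (covs[l] - covs[m]) @ w)
               for l in range(M) for m in range(l + 1, M))
centered = sum(abs(w @ (S - sigma_bar) @ w) for S in covs)
assert q * pairwise <= centered + 1e-9         # left inequality of (32)
assert centered <= 2 * q * pairwise + 1e-9     # right inequality of (32)
```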


TABLE II
COMPARISON BETWEEN WPC AND ZHENG AND LIN [23]

The proof is given in Appendix B. Let

$$\tilde{J}(\omega) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} \frac{|\omega^T(\Sigma_l - \Sigma_m)\omega|}{\omega^T \bar{\Sigma}\,\omega}. \qquad (33)$$

Then, we have q J̃(ω) ≤ J(ω) ≤ 2q J̃(ω). So, maximizing J(ω) can be roughly performed by maximizing J̃(ω). Using ψ = Σ̄^{1/2} ω, maximizing J̃(ω) is equivalent to maximizing

$$\breve{J}(\psi) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} \frac{|\psi^T(\breve{\Sigma}_l - \breve{\Sigma}_m)\psi|}{\psi^T \psi} \qquad (34)$$

which is the sum of M(M − 1)/2 pairs of absolute distances between ψ^T Σ̆_l ψ and ψ^T Σ̆_m ψ, subject to ‖ψ‖ = 1. The maximization of (34), however, may not be very appropriate for classifying multiclass EEG single-trials in some cases. For example, consider the situation in which one class has a large difference (in terms of Σ) from the other classes. To maximize the criterion J̆(ψ), the class pairs (say c_l and c_m) that have large differences (between Σ̆_l and Σ̆_m) heavily control the selection of the direction of ψ. Note that ψ is of unit length. As a result, the remote class is projected as far as possible from the other classes, while close classes are more likely to be merged.

By contrast, with the weight (1/λ_lm) erf(√N λ_lm / (4M)), the criterion J̆_P(ψ) in (25), or J̃_P(ω) in (22), de-emphasizes the influence of large class differences, where the classes are already well separated, and gives greater emphasis to small class differences, where the close classes are more likely to be confused. An interesting connection between WPC and [23] is that, when applied to the two-class paradigm, criterion J̃_P(ω) and criterion J(ω) produce the same solution of ω. This, however, does not necessarily imply that the upper bounds of the Bayesian errors of the two methods are equal. On the other hand, comparing the upper bounds of the Bayesian errors derived by the two methods is meaningless, since they are derived for classifying different objects.

B. Extension to WPC: Integrating Temporal Information

The set of filters of WPC is obtained by considering the classification of the projected variance ω^T X X^T ω. Note that X is a segment of an EEG single-trial time course. The covariance formulation XX^T, however, is globally independent of time; the temporal information is completely ignored. From the study of neurophysiology, EEG signals are usually nonstationary. It is useful to integrate the temporal information into the covariance formulation, reflecting the temporal manifold of the EEG time course [20]. Specifically, by the fact that XX^T = (1/(2N)) Σ_{i,j=1}^{N} (x_i − x_j)(x_i − x_j)^T, we use the temporally local covariance matrix

$$C = \frac{1}{2N} \sum_{i,j=1}^{N} (x_i - x_j)(x_i - x_j)^T A(i,j) \qquad (35)$$

for covariance modeling instead of the time-independent XX^T. The time-dependent adjacency value A(i,j) is defined such that only temporally close sample pairs, say {x_i, x_j : |i − j| < τ} with τ being a temporal range parameter, contribute to the summation (35). The value A(i,j) is monotonically decreasing with respect to the temporal distance between selected sample pairs. In this paper, the adjacency matrix A is defined using Tukey's tricube weighting function [6]

$$A(i,j) = \begin{cases} \bigl(1 - \bigl(|i-j|/\tau\bigr)^3\bigr)^3, & |i-j| < \tau \\ 0, & \text{else.} \end{cases} \qquad (36)$$
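A sketch of the temporally local covariance in (35)–(36): build the tricube adjacency A, form the Laplacian E = D − A, and contract it with the trial matrix (the compact form C = (1/N) X E X^T is derived in the next paragraph). The helper names are ours; only NumPy is assumed.

```python
import numpy as np

def tricube_adjacency(N, tau):
    # A(i, j) of (36): (1 - (|i - j| / tau)^3)^3 inside the temporal window, 0 outside.
    d = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])
    return np.where(d < tau, (1.0 - (d / tau) ** 3) ** 3, 0.0)

def local_covariance(X, tau):
    # C of (35), in the compact form C = (1/N) X E X^T with E = D - A.
    K, N = X.shape
    A = tricube_adjacency(N, tau)
    E = np.diag(A.sum(axis=1)) - A
    return (X @ E @ X.T) / N
```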

With some algebraic derivations, C is compactly expressed as C = (1/N) X E X^T, where E = D − A is the Laplacian matrix, A = (A(i,j))_{i,j=1,...,N}, and D is the diagonal matrix whose diagonal entries are the row sums of A. Let L = (1/N) E. Then, under the same probabilistic assumption as in the previous section, the Gaussian quadratic form ω^T X L X^T ω conditioned on class c_l has mean tr(L) ω^T Σ_l ω and variance 2 tr(L²)(ω^T Σ_l ω)², where tr(·) is the trace operator. If ω^T X L X^T ω is treated as the target feature for classification purposes, then (28) is accordingly modified by integrating temporal information as

$$H_{TI}(s) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} \frac{1}{\lambda_{lm}} \operatorname{erf}\!\left(\frac{\operatorname{tr}(L)\,\lambda_{lm}}{4M\sqrt{\operatorname{tr}(L^2)}}\right) s_{lm}\,(\breve{\Sigma}_l - \breve{\Sigma}_m). \qquad (37)$$

Note that, in this case, the difference between the means of classes c_l and c_m becomes tr(L)(ω^T(Σ_l − Σ_m)ω), and ω is normalized as ω ← ω / √(ω^T(√(2 tr(L²)) M Σ̄)ω).

In the implementation of the temporal extension of WPC, the covariance matrices Σ_l (l = 1, ..., M) are estimated as

$$\check{\Sigma}_l = \frac{1}{N\,|I_l|} \sum_{t \in I_l} X_t L X_t^T. \qquad (38)$$

The optimization procedure can be carried out similarly to WPC. The features are extracted as

$$\check{\sigma}_g^2 = \check{\omega}_g^T X_t L X_t^T \check{\omega}_g, \qquad (g = 1, \ldots, G) \qquad (39)$$

where ω̌_g are the filters of the temporal extension of WPC.

1) Choice of τ: The additional parameter τ is determined from the data using a three-way cross-validation strategy. This strategy contains two nested loops. In the outer loop, the samples are divided into T_1 folds, one of which is treated as the testing set. The testing samples are used only for the estimation of generalization ability and are not involved in solving the filters or selecting the parameter. In the inner loop, the remaining T_1 − 1 folds are further divided into T_2 folds, in which one fold is treated as the validation set while the remaining T_2 − 1 folds are


treated as the training set. For each τ, the filters are solved on the training set, and the recognition rate is then calculated on the validation set. This procedure is repeated T_2 times with a different validation set each time. The average recognition rate across the T_2 folds is recorded as the recognition accuracy, and we select the τ that results in the maximum recognition accuracy. We then solve the filters using all of the T_2 folds with the optimal τ selected earlier. With the filters obtained, we calculate the recognition rate on the testing set specified in the outer loop. This procedure is repeated T_1 times with a different fold as the testing set each time. The average recognition rate across the T_1 folds is computed as the final recognition accuracy.
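The nested (three-way) cross-validation just described can be sketched as follows. The callables solve_filters and score stand in for the WPC/TI filter optimization and for the validation/test accuracy computation, respectively; they, like the other names, are placeholders.

```python
import numpy as np

def nested_cv_tau(trials, labels, taus, solve_filters, score, T1=3, T2=3, seed=0):
    # Outer loop estimates generalization; inner loop selects tau (placeholder callables).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(trials))
    outer = np.array_split(idx, T1)
    accs = []
    for k in range(T1):
        test = outer[k]
        train_val = np.concatenate([outer[j] for j in range(T1) if j != k])
        inner = np.array_split(train_val, T2)
        tau_scores = []
        for tau in taus:                      # inner loop: average validation accuracy per tau
            fold_scores = []
            for i in range(T2):
                val = inner[i]
                tr = np.concatenate([inner[j] for j in range(T2) if j != i])
                filters = solve_filters([trials[t] for t in tr], labels[tr], tau)
                fold_scores.append(score(filters, [trials[t] for t in val], labels[val]))
            tau_scores.append(np.mean(fold_scores))
        best_tau = taus[int(np.argmax(tau_scores))]
        # refit on all inner folds with the selected tau, evaluate on the held-out test fold
        filters = solve_filters([trials[t] for t in train_val], labels[train_val], best_tau)
        accs.append(score(filters, [trials[t] for t in test], labels[test]))
    return float(np.mean(accs))
```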

V. EXPERIMENTS

We evaluate the effectiveness of the proposed multiclass methods on two publicly available datasets of BCI competitions. Both datasets contain four-class motor imagery EEG signals. We compare the classification performance of the proposed multiclass methods with the multiclass CSP using one-versus-rest and using JAD [7], the multiclass information theoretic feature extraction [10], and the multiclass CSP presented in [23].

A. EEG Datasets Used for Evaluation

1) Dataset IIIa of BCI Competition III: This dataset is of a four-class motor imagery paradigm recorded from three subjects (k3b, k6b, and l1b) [3]. The subjects, sitting relaxed in a normal chair, were asked to perform four different motor imagery tasks (i.e., left hand, right hand, one foot, and tongue) according to cues, which were presented in a randomized order. In each trial, the cue was displayed from the third second and lasted for 1.25 s. At the same time, the motor imagery started and continued until the fixation cross disappeared at the seventh second. So, the duration of the motor imagery in each trial was 4 s. For subject k3b, there were 90 trials for each mental task, and for subjects k6b and l1b, each mental task cue appeared 60 times. In our experiment, we discard four trials of subject k6b because of missing data. The EEG measurements were recorded using 60 sensors of a 64-channel Neuroscan system. The left and right mastoids served as reference and ground, respectively. The EEG signals were sampled at 250 Hz and band-pass filtered between 1 and 50 Hz with the notch filter on.

2) Dataset IIa of BCI Competition IV: This dataset contains EEG signals recorded during a cue-based four-class motor imagery task from nine subjects [17]. Each trial started with a short acoustic warning tone along with a fixation cross displayed on the black screen. After 2 s, a visual cue was presented for 1.25 s, instructing the subjects to carry out the desired motor imagery task (i.e., the imagination of movement of the left hand, right hand, both feet, or tongue) from the third second until the fixation cross disappeared at the sixth second. Each subject participated in two sessions recorded on different days. There were 288 trials in each session for each subject, i.e., 72 trials per task. Twenty-two electrodes were used to record the EEG signals, which were sampled at 250 Hz and band-pass filtered between 0.5 and 100 Hz with the notch filter on.


B. Experimental Settings and Results

The data are band-pass filtered between 5 and 35 Hz using a fifth-order Butterworth filter, as in [9] and [10]. The EEG segments recorded during the motor imagery period, i.e., from the third second to the seventh second in dataset IIIa of BCI competition III and from the third second to the sixth second in dataset IIa of BCI competition IV, are used in the experiment. We use a three-fold cross-validation strategy to evaluate the classification accuracy. That is, we partition all the trials of each class per subject into three divisions, in which each division is used as testing data while the remaining two divisions are used as training data. This procedure is repeated three times until each division has been used once as testing data. In each repetition, for each filter obtained on the training data, features are obtained by projection on the 15 frequency bands of 2-Hz width in the range 5–35 Hz [9], [10]. Consequently, we obtain a (15G)-dimensional feature vector for each trial, where G is the number of filters selected on the training data. That is, we use the three-way cross-validation procedure with T_1 = T_2 = 3 to determine the value of G, where G varies from 2 to 10 in steps of 2. The (15G)-dimensional feature vectors are further reduced to 3-D vectors¹ by using Fisher discriminant analysis (FDA) [21]. It should be noted that the spatial filters, the value of G, and the FDA weights are all calculated on the basis of the training data and then applied to the testing data. The conventional classifier of the nearest class mean with Euclidean distance [21] is adopted to predict the class labels of the testing samples.
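The per-trial feature assembly described above (15 sub-bands of 2 Hz between 5 and 35 Hz, band power per spatial filter, FDA to 3-D, nearest class mean) could be sketched as below. scipy.signal is used for the band-pass filtering, and scikit-learn's LinearDiscriminantAnalysis serves only as a stand-in for the FDA step; this is an illustrative reading of the settings, not the author's code.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

BANDS = [(lo, lo + 2) for lo in range(5, 35, 2)]        # fifteen 2-Hz bands within 5-35 Hz

def band_power_features(X, W, fs=250.0):
    # Concatenate omega^T X_b X_b^T omega over the 15 sub-bands for one trial X (K x N).
    feats = []
    for lo, hi in BANDS:
        b, a = butter(5, [lo, hi], btype="bandpass", fs=fs)
        Xb = filtfilt(b, a, X, axis=1)                   # band-pass filtered trial
        feats.append(np.sum((W.T @ Xb) ** 2, axis=1))    # one value per spatial filter
    return np.concatenate(feats)                         # length 15 * G

def reduce_and_classify(train_F, train_y, test_F):
    fda = LinearDiscriminantAnalysis(n_components=3).fit(train_F, train_y)
    Ztr, Zte = fda.transform(train_F), fda.transform(test_F)
    classes = np.unique(train_y)
    means = np.array([Ztr[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(Zte[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]             # nearest class mean in the FDA space
```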

Table III reports the classification accuracies obtained with the multiclass filters solved by the various methods. Note that we also evaluate the classification accuracy of WPC integrating temporal information (WPC/TI), where the parameter τ is determined by the three-way cross-validation procedure with T_1 = T_2 = 3; here, τ is varied logarithmically from 1 to 5 in steps of 1. It is observed that the proposed WPC method achieves much better classification accuracy than the existing multiclass methods, and WPC/TI further improves the results in most cases. The improvement of WPC/TI is attributed to the local temporal modeling. The reason that WPC/TI results in lower classification accuracies than WPC in a few cases may be overfitting.

C. Comparison With BCI Competition IV

For dataset IIa of BCI competition IV, to compare with the results of the winners, we use the evaluation of session-to-session transfer from session one to session two in terms of the kappa score, simulating the competition scenario. The procedure of the session-to-session transfer is much simpler than the cross-validation. Specifically, we use the first session as training data and the second session as testing data. All the experimental settings are the same as described in the previous section, except that the training data and the testing data are now fixed. The classification accuracy is summarized in Table IV. It can be seen that our proposed methods achieve fairly good classification performance compared with the results obtained by the best two competitors.

1 Since the number of classes is four, we can obtain at most three dimensions of features by FDA, which is known as the rank-limit problem.


TABLE III
COMPARISON OF THE CLASSIFICATION ACCURACIES (%) OF THE PROPOSED WPC AND WPC/TI METHODS WITH THE EXISTING MULTICLASS METHODS FOR EACH SUBJECT ON THE DATASETS OF BCI COMPETITIONS, WHERE M1–M6 REFER TO MULTICLASS CSP USING ONE-VERSUS-REST, MULTICLASS CSP USING JAD, MULTICLASS INFORMATION THEORETIC FEATURE EXTRACTION, MULTICLASS CSP IN [23], WPC, AND WPC/TI, RESPECTIVELY

TABLE IV
KAPPA SCORES OF VARIOUS MULTICLASS METHODS FOR EACH SUBJECT ON DATASET IIA OF BCI COMPETITION IV USING SESSION-TO-SESSION TRANSFER, WHERE NO. 1 AND NO. 2 REFER TO THE BEST TWO COMPETITORS, AND M1–M6 ARE THE SAME AS THOSE OF TABLE III

Note that the results obtained by the multiclass CSP using one-versus-rest and using JAD are slightly different from those reported in [9], since different classifiers and time segments are used. In our experiment, a simple classification procedure is employed to reveal the effectiveness of the multiclass filters obtained by WPC and WPC/TI. The classification performance may be further improved by solving filters in narrower frequency bands, tuning the optimal time segment for each trial, and/or using other sophisticated classifiers. The goal of this paper is to demonstrate the effectiveness of the weighted scheme for solving multiclass filters: while we use the same experimental settings for all the methods, the weighted pairwise design produces a much higher classification accuracy.

VI. CONCLUSION

In this paper, we propose a new discriminant criterion, called WPC, for optimizing multiclass filters. The approach is established by minimizing the upper bound of the Bayesian error of classifying EEG single-trial segments, resulting in a sum of weights imposed on individual pairwise classes according to their closeness. We place special emphasis on the effect of closer classes, which are more likely to cause misclassification. In other words, the contributions of different class pairs to the discriminant criterion are deliberately biased.

Computationally, the WPC algorithm is conveniently solved by the rank-one update and power iteration technique. The proposed WPC approach is intentionally formulated for classifying EEG single-trial data; it takes into account the classification errors of EEG trials between pairs of classes. While the criterion derived from the Bayesian error of classifying EEG sampling points is reasonable, large pairwise class differences may play an overwhelming role in its optimization. By contrast, WPC directly uses the same features for optimizing the spatial filters as for classification. Moreover, we extend WPC by integrating the temporal information of EEG series into the covariance matrix formulation. The effectiveness of the proposed WPC method is demonstrated by the classification of four motor imagery tasks on two datasets of BCI competitions. Finally, we point out that the Bayesian error estimation relies heavily on the assumption of an independent Gaussian distribution. This assumption, however, does not hold stringently in applications, since EEG data usually have an autocorrelation structure. One possible remedy is to use a Gaussian mixture model instead of a single Gaussian distribution. We are studying this issue theoretically and practically.

APPENDIX A

PROOF OF (14)

For 0 ≤ x ≤ a, in the error function

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-u^2}\, du$$

we use the variable substitution v = (a/x)u (i.e., u = (x/a)v). Then, we have

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^a e^{-v^2 (x/a)^2}\, \frac{x}{a}\, dv.$$

Since 0 ≤ x/a ≤ 1, it follows that

$$\operatorname{erf}(x) \ge \frac{2}{\sqrt{\pi}}\, \frac{x}{a} \int_0^a e^{-v^2}\, dv = \frac{x}{a}\operatorname{erf}(a)$$

which establishes (14).
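As a quick numerical illustration of the bound just derived (not part of the original appendix):

```python
import numpy as np
from scipy.special import erf

a = 2.0
x = np.linspace(0.0, a, 101)
assert np.all(erf(x) >= (erf(a) / a) * x - 1e-12)   # erf(x) >= (x/a) erf(a) on [0, a]
```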