
Information Sciences 512 (2020) 795–812

Contents lists available at ScienceDirect

Information Sciences
journal homepage: www.elsevier.com/locate/ins

A novel approach for learning label correlation with application to feature selection of multi-label data

Xiaoya Che a, Degang Chen b,∗, Jusheng Mi c

a School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
b School of Mathematics and Physics, North China Electric Power University, Beijing 102206, China
c College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang 050016, China

Article info

Article history:
Received 23 January 2019
Revised 8 October 2019
Accepted 13 October 2019
Available online 14 October 2019

Keywords:
Multi-label learning
Global label correlation
Local label correlation
Feature selection

Abstract

Each example of multi-label data is represented as an object with its feature vector (i.e., an instance) while being related to multiple labels. Learning label correlation can effectively reduce the number of labels that need to be predicted and optimize the classification performance. For this reason, label correlation plays a crucial role in multi-label learning and has been explored by many existing algorithms. Generally, every label has its own indispensable features, and it is reasonable to assume that a higher repetitiveness of indispensable features represents higher correlations among labels. Inspired by this fact, the essential elements for each label, which are composed of indispensable features, are constructed in this paper. A method is proposed for learning label correlation and applying it to feature selection of multi-label data based on the overlap of the different families of essential elements related to the labels. Firstly, the essential elements of each label are defined and characterized to reflect the internal connection between features and labels. In addition, a process for calculating the essential elements of a single label is provided. Secondly, by considering the overlap of the essential element collections that are determined by the different labels, the relevancy of labels and the corresponding relevance judgement matrix of the label set are described. Therefore, several labels with strong relationships are assigned to a relevant group of labels. Meanwhile, local and global label correlations can be computed. Thus, a novel multi-label learning algorithm, called CLSF, is presented to select a compact subset of indispensable features for each relevant group of labels by applying local label correlation to feature selection of multi-label data. Finally, comprehensive experiments on eleven benchmark data sets clearly illustrate the effectiveness and efficiency of CLSF against five other multi-label learning algorithms.

© 2019 Elsevier Inc. All rights reserved.

1. Introduction

In traditional supervised learning, it is always assumed that each sample only has one kind of semantic information, i.e.,
only has one label. However, in real-world problems, examples may be complicated and have multiple class labels at the


∗ Corresponding author. E-mail address: chengdegang@263.net (D. Chen).

https://doi.org/10.1016/j.ins.2019.10.022
0020-0255/© 2019 Elsevier Inc. All rights reserved.

same time. For instance, a document may cover several topics, such as government and reform [22]; a scenic image may be
attached with a set of semantic classes, such as urban and beach [17]; and a gene may be annotated with multiple functional
classes, such as energy, transcription and metabolism [33]. From the above cases, examples of multi-label data in a training
set are related to multiple labels. Multi-label learning tasks can deal with examples that have different labels simultaneously,
and they widely exist in various applications, such as image annotation [31], music emotion classification [7] and social
network [2]. The ultimate aim of multi-label classification is to construct a model that can accurately predict the possible
label set of test instances (i.e., an object with its feature vector). The challenge of multi-label classification is that the size of
the classifier's output space is exponential in the number of possible class labels (i.e., 2^s, where s is the number of possible
class labels). The huge size of the output space leads to difficulties in the multi-label classification of big data [21]. Meanwhile,
the dimensions of objects or features in multi-label data sets are also increasing rapidly [25], and different concepts are related
to various labels. There is a great need for approaches that offer effective and efficient multi-label classification. The
landmark difference between multi-label learning and traditional supervised learning is that there are various relationships
among different labels in multi-label learning. The information about these relationships among labels is usually useful for
learning other related labels [20]. Therefore, global and local label correlations effectively reduce the labels to be predicted
and improve the performance of multi-label classification.
To improve the accuracy and speed of multi-label learning, many algorithms that exploit the label correlation of multi-label
data tend to investigate global [13,14] and local label correlations [4,24,32]. They are roughly grouped into three major
categories [20], that is, first-order, second-order and high-order approaches. Second-order methods deal with the multi-label
learning problem by exploring the pairwise relationships between labels, so the multi-label learning problem is decomposed
into $C_s^2$ pairwise comparison problems [20]. Two examples are the local positive and negative pairwise label correlation
(LPLC) method [9] and the multi-label learning approach using local correlation (ML-LOC) provided by Huang [23]. However,
in the real world, label correlations can be rather complex and more than second-order. High-order approaches are needed
to model interactions among all the class labels, that is, to consider that each label is affected by all the other labels.
Examples of high-order approaches include the method designed by Yu [30], where two novel multi-label classification
algorithms based on variable precision neighborhood rough sets were developed, named multi-label classification using rough
sets (MLRS) and its variant using local correlation (MLRS-LC). High-order approaches may have higher computational complexity
and less adaptability. Various correlations in the label set are essentially determined by repeats of indispensable features for
the corresponding labels, because each label has its indispensable features, which are closely related and absolutely necessary
to the label. However, most of the existing methods do not consider which vital features contribute to the relevance of labels,
and they do not naturally split the label set into several disjoint subsets of relevant labels in which the labels have a strong
correlation.
Feature selection is an excellent way to increase the efficiency of models in multi-label learning. The irrelevant
or unnecessary features for each label are eliminated from the original feature set in the training data. Label-specific
features formed an outstanding feature selection method, first defined by Zhang [21] and called LIFT, in which each
label has its own specific characteristics; however, LIFT did not discover the discriminant information hidden between
positive and negative instances. Then, Zhang [15] proposed a novel algorithm, called ML-DFL. In ML-DFL, a matrix is
produced whose elements represent the similarity based on the distance between the positive and negative instances.
Recently, entropy-based methods have also attracted significant attention in multi-label learning, and several related multi-label
feature selection methods have been put forward. Lee and Kim [10] selected an effective compact set of features
by maximizing the dependency between the selected features and the label set, using second-order approximation functions
of multivariate mutual information. Furthermore, Lee and Kim [11] conducted a theoretical analysis to investigate
why a score function that considers lower-degree interaction information can effectively select a compact feature set,
and also proposed a new score function called D2F, which has excellent performance in multi-label classification. They
[12] first applied mutual information to improve a genetic algorithm (GA) based multi-label feature selection algorithm
and presented a novel memetic algorithm to refine the population of feature subsets generated by the GA by
adding relevant features to, and removing redundant features from, multiple labels. A new multi-label feature selection
method, SCLS, was provided by Lee and Kim [13] to further improve previous studies. This method employed an effective
approximation for dependency calculations that had not yet been considered in the multi-label feature selection problem.
Lin and Hu also presented many excellent entropy-based multi-label feature selection methods, such as MDMR
[27], which combined mutual information with a max-dependency and min-redundancy algorithm to select a superior feature
subset for multi-label learning. A method called multi-label feature selection with label correlation (MUCO) was also
proposed [26] based on fuzzy mutual information. Lin et al. [28] provided a multi-label feature selection algorithm with
streaming labels. Meanwhile, Chen [16] explored feature selection for multi-label data with the help of kernels and mutual
information. To explore feature selection in multi-label learning, Li [7] used two functions to reflect the certainty and
uncertainty between the labels and equivalence classes. Then, the uncertainties conveyed by the labels were analyzed and
a new type of feature selection was proposed, called complementary decision reduct (CDR). However, existing methods

for feature selection of multi-label data have only focused on feature selection and usually overlook the impact of label
correlation.

To address the above-stated problems, we consider the essential elements of a label for exploring the relevance between features
and labels, where an essential element is composed of indispensable features for the label. Then, a novel process for calculating
the local label correlation, with application to feature selection of multi-label data, is proposed. A simple example is
used as a supplement to explain the basic concepts, increase readability and facilitate understanding of the theoretical
part of the proposed method. Table 1.1 lists a simple multi-label data set with five objects, three features and two labels,
where the five objects are divided into a quotient set with four equivalence classes based on the original feature set A. Tables
1.2 and 1.3 are single-label data sets with labels {l1} and {l2}, where the discernibility information contained is determined by
the original feature set A and the label {l1} or {l2}. Considering Table 1.2 as an example, because they share the same value of
label {l1}, the equivalence classes {x4} and {x5} do not need to be discriminated from each other. Only the equivalence
class {x2, x3} should be discriminated from the other equivalence classes. By keeping the positive and negative consistency
(equivalent to discernibility) conveyed by the features and label unchanged, the essential elements of the label are defined and
described through the discernibility matrix, where an essential element is composed of indispensable features for
the label. The discernibility matrix of label {l1} or {l2} that is defined in our paper implies the entire certainty information
of Table 1.2 or 1.3. The relevancy between labels is computed based on all the essential elements in the discernibility matrix
to obtain the corresponding relevance judgement matrix of the label set, and then the relativity of any given label with the
other labels can be judged. Therefore, the whole label set is automatically divided into several disjoint relevant label groups,
and labels with strong relationships are assigned to the same group of relevant labels. In Table 1.4, the two labels in the label set Y
have a strong relationship, that is, there is only one relevant label group Y1 = {l1, l2} in Y. Thus, the corresponding global and local
label correlations can be computed. The equivalence classes induced by A are reassigned to the two regions in Table 1.4 by
considering the weight of each equivalence class and the label correlation. Tables 1.5 to 1.10 characterize the different quotient
sets determined by all the feature subsets. The equivalence classes in the quotient sets induced by {a2} and {a1, a3} are
appropriate for retaining the discernibility implied in Table 1.4, whereas the quotient sets computed from {a1, a2} and {a2, a3}
are too fine and the quotient sets calculated from {a1} and {a3} are too coarse for Table 1.4. The aim of our multi-label
feature selection method, CLSF, is to find the minimal feature subset that keeps the discernibility information between the
original feature set and the label group Y1 unchanged. Therefore, {a2} is the most suitable choice. This algorithm
simplifies the process and effectively improves the speed of predicting possible labels for unseen instances in multi-label
classification.
The remainder of this paper is organized as follows. Section 2 reviews some basic concepts of multi-label classification
and five performance evaluation metrics. Section 3 provides the definition and computation process of essential elements. Then,
the concept of label relevancy and the corresponding relevance judgement matrix are provided. Meanwhile, the label set is split
into several relevant label groups, and then global and local label correlations are explored. In Section 4, the labels in a relevant
label group are integrated to reconstruct a binary relation by considering the local label correlation and the classification error
rate. A novel algorithm, called correlation-labels-specific features (CLSF), is proposed. In Section 5, eleven benchmark data sets are
selected to test CLSF against six other multi-label classification algorithms in terms of five performance evaluation metrics. In Section 6,
we conclude this paper with a summary and an outlook for further research.

2. Preliminaries

To further clarify the content of our discussion, we present several formal notations and the five evaluation metrics of multi-label
classification used in this paper.
Multi-label classification [6] is a supervised learning problem in which every instance may be related to multiple labels.
In the multi-label learning process, X = {(x, A(x)) | x ∈ U} ⊆ R^m is the domain of m-dimensional input instances,
where A = {a_1, a_2, ..., a_m} is a set of features. Any object x ∈ U is described by a vector with m feature values A(x) =
[a_1(x), a_2(x), ..., a_m(x)]. The set Y = {l_1, l_2, ..., l_s} is a finite set of s possible labels. For each object x ∈ U, an s-dimensional
label value vector Y(x) = [l_1(x), l_2(x), ..., l_s(x)] represents its possible labels. If label l_j belongs to object x ∈ U, then
l_j(x) = 1; otherwise l_j(x) = −1. The set of possible labels is denoted as K(x) = {l_j | l_j ∈ Y, l_j(x) = 1} ⊆ Y. The set D =
{(A(x_i), Y(x_i)) | x_i ∈ U, i = 1, 2, ..., n} denotes a training data set with n instances and their related labels. Here, subscripts
are used to avoid ambiguity with label dimensions. Therefore, l_j(x_i) corresponds to the binary relevance of the jth label in Y
associated with the ith object in U. In this paper, the features of multi-label data take discrete values obtained by the
discretization algorithm FIMUS [18].
For any feature subset B ⊆ A, there is an associated indiscernibility relation [34] R_B:

R_B = {(x, y) ∈ U × U | ∀a ∈ B, a(x) = a(y)}.

Obviously, R_B forms a quotient set (i.e., a partition into equivalence classes) of U, denoted by U/R_B = {[x]_B | x ∈ U}, where [x]_B is
the equivalence class determined by x ∈ U with respect to the indiscernibility relation R_B, that is, [x]_B = {y ∈ U | (x, y) ∈ R_B}.
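Computing the quotient set amounts to grouping objects by their feature signature on B. The following Python sketch is illustrative only (the function name and toy data are our own, not from the paper):

```python
from collections import defaultdict

def quotient_set(data, B):
    """Equivalence classes of the indiscernibility relation R_B:
    x ~ y iff a(x) == a(y) for every feature a in B.
    `data` maps each object to a dict of its feature values."""
    classes = defaultdict(set)
    for x, values in data.items():
        signature = tuple(values[a] for a in sorted(B))  # values on B
        classes[signature].add(x)
    return list(classes.values())

# Toy data: four objects described by two features.
data = {"x1": {"a1": 0, "a2": 1},
        "x2": {"a1": 0, "a2": 1},
        "x3": {"a1": 0, "a2": 2},
        "x4": {"a1": 1, "a2": 2}}

print(quotient_set(data, {"a1", "a2"}))  # classes {x1,x2}, {x3}, {x4}
print(quotient_set(data, {"a1"}))        # coarser: {x1,x2,x3}, {x4}
```

Note that a smaller feature subset B can only merge classes, never split them, which is why U/R_B becomes coarser as features are removed.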
Specifically, objects with or without a given label are considered positive or negative objects. For a given class label
l_j ∈ Y, the sets of positive and negative training objects are denoted as [21]:

Pos_j = {x_i | (A(x_i), Y(x_i)) ∈ D, l_j ∈ K(x_i)},
Neg_j = {x_i | (A(x_i), Y(x_i)) ∈ D, l_j ∉ K(x_i)}.

H : X → R denotes a set of real-valued classifiers. The final purpose of the multi-label classification task is to output a real-valued
classifier that optimizes a multi-label evaluation metric (such as Coverage) with h(x) = K(x). The performance evaluation metrics
of multi-label classification algorithms differ from those of single-label classification algorithms. Five multi-label evaluation metrics
[20] are used to evaluate the performance of multi-label classification algorithms: Hamming loss, Average Precision,
Coverage, One-error and Ranking loss.

(1) Hamming loss: computes the ratio of object-label pairs that are misclassified between the predicted labels and the
ground-truth labels, normalized over the total number of objects and labels. The smaller the value of Hamming loss,
the better the performance of the algorithm.

$$\text{Hamming loss} = 1 - \frac{1}{ns}\sum_{i=1}^{n}\sum_{j=1}^{s}\delta\big(l_j(x_i) = \hat{l}_j(x_i)\big),$$

where $\hat{l}_j(x_i)$ is the predicted value of label $l_j$ for object $x_i$.

(2) Average Precision: evaluates the average fraction of relevant labels ranked higher than a particular label l_j belonging to
x_i. The bigger the value of this metric, the better the performance of the algorithm.

$$\text{Average Precision} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|K(x_i)|}\sum_{l_j\in K(x_i)}\frac{|R(x_i,l_j)|}{\mathrm{rank}(x_i,l_j)},$$

where $R(x_i,l_j)=\{l_k \mid \mathrm{rank}(x_i,l_k)\le \mathrm{rank}(x_i,l_j),\, l_k\in K(x_i)\}$ and $|\cdot|$ denotes the cardinality of a set. Here $\mathrm{rank}(x_i,l_j)=\sum_{k=1}^{s}\delta(f_k(x_i)\ge f_j(x_i))$ indicates the rank of $l_j$ for $x_i$ when all class labels are sorted in descending order according to
$\{f_1(x_i), f_2(x_i), \ldots, f_s(x_i)\}$, and $\delta(\lambda)=1$ if $\lambda$ holds and $0$ otherwise.
(3) Coverage: expresses how far, on average, we need to move down the ranked list of labels so as to cover all the
ground-truth labels of the object. The smaller the value of Coverage, the better the performance of the algorithm.

$$\text{Coverage} = \frac{1}{n}\sum_{i=1}^{n}\Big(\max_{l_j\in K(x_i)}\mathrm{rank}(x_i,l_j) - 1\Big).$$

(4) One-error: evaluates how often the top-ranked predicted label is not in the relevant label set of the object. The
smaller the value of One-error, the better the performance.

$$\text{One-error} = \frac{1}{n}\sum_{i=1}^{n}\delta\Big(\arg\max_{l\in Y} f_l(x_i)\notin K(x_i)\Big).$$

(5) Ranking loss: calculates the average fraction of wrongly ordered label pairs, that is, pairs in which an irrelevant label of an
object is ranked higher than a relevant label.

$$\text{Ranking loss} = \frac{1}{n}\sum_{i=1}^{n}\frac{\big|\{(l_k,l_j)\mid f_k(x_i)\le f_j(x_i),\,(l_k,l_j)\in K(x_i)\times \overline{K}(x_i)\}\big|}{|K(x_i)|\,|\overline{K}(x_i)|},$$

where $\overline{K}(x_i)$ is the complementary set of $K(x_i)$ with respect to $Y$.
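To make these definitions concrete, the following Python sketch (not from the paper; the label matrix and score values are made-up toy data) computes Hamming loss, Coverage and One-error exactly as defined above, with labels encoded in {−1, +1}:

```python
def rank(f, j):
    """rank(x, l_j): number of labels scored at least as high as l_j,
    i.e. the position of l_j when scores are sorted in descending order."""
    return sum(1 for fk in f if fk >= f[j])

def hamming_loss(Y_true, Y_pred):
    """1 minus the fraction of correctly predicted object-label pairs."""
    n, s = len(Y_true), len(Y_true[0])
    equal = sum(1 for yt, yp in zip(Y_true, Y_pred)
                for t, p in zip(yt, yp) if t == p)
    return 1 - equal / (n * s)

def coverage(Y_true, F):
    """Average over objects of max rank among relevant labels, minus one."""
    total = 0
    for y, f in zip(Y_true, F):
        total += max(rank(f, j) for j, v in enumerate(y) if v == 1) - 1
    return total / len(Y_true)

def one_error(Y_true, F):
    """Fraction of objects whose top-scoring label is not relevant."""
    errs = 0
    for y, f in zip(Y_true, F):
        top = max(range(len(f)), key=lambda j: f[j])
        errs += y[top] != 1
    return errs / len(Y_true)

# Toy ground truth (rows = objects, columns = labels) and scores f_j(x_i).
Y_true = [[ 1, -1,  1],
          [-1,  1,  1]]
F      = [[0.9, 0.2, 0.7],
          [0.1, 0.8, 0.6]]
Y_pred = [[1 if v > 0.5 else -1 for v in row] for row in F]

print(hamming_loss(Y_true, Y_pred))  # 0.0
print(coverage(Y_true, F))           # 1.0
print(one_error(Y_true, F))          # 0.0
```

For the first object the relevant labels l_1 and l_3 have ranks 1 and 2, so one extra step down the ranked list is needed, giving the Coverage value of 1.0.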

3. Label correlation in multi-label data based on essential elements of label

Multi-label training data D = {(A(x), Y(x)) | x ∈ U}, where U is the domain of input objects, A is a set of features and
Y = {l_1, l_2, ..., l_s} is the label set, can be transformed into multiple single-label data sets D_j = {(A(x), l_j(x)) | x ∈ U, 1 ≤ j ≤ s} with the
same objects and their related features. This section defines and characterizes the essential elements of a label, where each
essential element consists of indispensable features for the label; then, the calculation method for the essential elements of a
label is given, and finally the local and global label correlations are computed.

3.1. Essential elements of label by considering discernibility matrix

In any single-label data set D_j, incongruity exists between the features and the label l_j because of the limited cognitive level and
capacity of humans [3,8]. The incongruity reflects the ability of the features to depict the label and characterizes the uncertain
information contained in the data D_j. A quotient set, which is induced by the indiscernibility relation and composed of equivalence
classes, is the most basic tool for exploring the incongruity conveyed by the features and the label. By considering the objects in each
equivalence class with the features in A and the positive and negative label values, a definition of inconsistent and consistent
equivalence classes in the quotient set U/R_A is proposed to reflect the certain and uncertain information of the data D_j.

Definition 3.1.1. For each label l_j ∈ Y, U/R_{l_j} = {Pos_j, Neg_j} is the partition into positive and negative objects, and U/R_A is the quotient set with
respect to A. An equivalence class [x]_A ∈ U/R_A is called a positive consistent equivalence class with respect to l_j, denoted as
l_j([x]_A) = 1, if [x]_A ⊆ Pos_j; [x]_A is a negative consistent equivalence class with respect to l_j, denoted as l_j([x]_A) = −1, if
[x]_A ⊆ Neg_j; otherwise, [x]_A is an inconsistent equivalence class with respect to l_j, denoted as l_j([x]_A) = 0.

All the inconsistent equivalence classes with l_j([x]_A) = 0 imply that there is an incongruity between label l_j and A, which
reflects the uncertain information in the data D_j. The equivalence classes contained in Pos_j or Neg_j carry the certain information
of D_j defined by l_j and A. Consistent equivalence classes are constructed to partition Pos_j and Neg_j and can serve as
appropriate prototypes for expressing the certain and uncertain information included in D_j. Therefore, U/R_A is
split into three regions: the positive consistent region Pos_A(l_j), the negative consistent region Neg_A(l_j), and the inconsistent
boundary region Bn_A(l_j) = U − Pos_A(l_j) ∪ Neg_A(l_j).
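The three-region split described above can be sketched as follows in Python (an illustrative sketch; the helper name `split_regions` and the toy data are our own):

```python
def split_regions(quotient, pos_j):
    """Assign each equivalence class of U/R_A to the positive consistent
    region, the negative consistent region or the boundary region of a
    label l_j, following Definition 3.1.1.  `pos_j` is the set Pos_j of
    positive objects: classes fully inside it are positive consistent,
    classes disjoint from it are negative consistent, the rest are
    inconsistent (boundary)."""
    pos_region, neg_region, boundary = [], [], []
    for cls in quotient:
        if cls <= pos_j:                 # [x]_A ⊆ Pos_j  →  l_j([x]_A) = 1
            pos_region.append(cls)
        elif cls.isdisjoint(pos_j):      # [x]_A ⊆ Neg_j  →  l_j([x]_A) = -1
            neg_region.append(cls)
        else:                            # inconsistent   →  l_j([x]_A) = 0
            boundary.append(cls)
    return pos_region, neg_region, boundary

quotient = [{"x1", "x2"}, {"x3"}, {"x4", "x5"}]
pos_j = {"x1", "x2", "x4"}               # objects carrying label l_j
pos_r, neg_r, bn_r = split_regions(quotient, pos_j)
# positive region: [{x1, x2}], negative region: [{x3}], boundary: [{x4, x5}]
```

Here {x4, x5} lands in the boundary because it mixes a positive object (x4) with a negative one (x5).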
The number of equivalence classes in the quotient set U/R_A induced by the feature set A that are consistent with the quotient
set U/R_{l_j} derived from label l_j determines the certain information in D_j. It can also be used to equivalently characterize
the uncertain information in D_j. From this point of view, to keep the certain and uncertain information included in D_j constant, the
positive and negative consistent regions confirmed by U/R_A and U/R_{l_j} must be guaranteed to remain unchanged. To do so,
the equivalence classes in different regions must be distinguished and identified. For example, any equivalence class
[x]_A ⊆ Pos_A(l_j) cannot be merged with another equivalence class [y]_A ⊄ Pos_A(l_j). That being said, there must be at least one
feature in A with the ability to discriminate [x]_A ⊆ Pos_A(l_j) from [y]_A ⊄ Pos_A(l_j), that is, ∃a ∈ A such that a([x]_A) ≠ a([y]_A). It should be
noticed that if [x]_A ⊆ Pos_A(l_j) and [y]_A ⊄ Pos_A(l_j), then l_j([x]_A) ≠ l_j([y]_A) must also hold.
The concept of the distribution discernibility matrix is provided to store all the pairs of equivalence classes that need to be
distinguished and all the features with the capacity to identify each pair of equivalence classes.

Definition 3.1.2. ∀x, y ∈ U, l_j ∈ Y, we denote the pairs of equivalence classes in different regions as P_j^* = {([x]_A, [y]_A) | l_j([x]_A) ≠
l_j([y]_A)}. The distribution discernibility matrix with respect to label l_j is denoted as P_j = (P_j([x]_A, [y]_A))_{|U/R_A| × |U/R_A|}, where the element P_j([x]_A, [y]_A) is defined by

$$P_j([x]_A,[y]_A)=\begin{cases}\{a\in A\mid a([x]_A)\ne a([y]_A)\}, & ([x]_A,[y]_A)\in P_j^*;\\ \emptyset, & \text{otherwise}.\end{cases}$$

For any pair of equivalence classes ([x]_A, [y]_A), the two constraints given in Definition 3.1.2 are utilized to determine the
corresponding element P_j([x]_A, [y]_A) in the distribution discernibility matrix. In other words, the non-empty discernibility

Table 1
A multi-label data set.

U A Y

a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 l1 l2 l3 l4 l5 l6

x1 0.10 0.20 5.00 0.37 2.0 0.30 0.20 0.11 6.6 0.70 1 1 1 1 −1 −1
x2 0.17 0.30 6.00 0.48 3.0 0.22 0.14 0.29 6.5 0.94 −1 −1 −1 1 −1 1
x3 0.26 0.25 5.50 0.55 1.0 0.10 0.27 0.32 6.2 0.83 −1 −1 −1 −1 −1 1
x4 0.80 0.45 6.00 0.29 5.1 4.90 0.88 0.40 4.1 0.50 −1 −1 1 1 1 1
x5 0.77 0.56 5.00 0.31 4.6 5.70 0.76 0.39 5.1 0.52 1 1 −1 1 1 −1
x6 0.10 0.98 2.00 0.79 7.2 0.59 0.26 0.71 4.9 0.12 1 1 1 −1 1 1
x7 0.73 0.45 6.40 0.22 5.0 5.80 0.79 0.48 5.2 0.59 1 −1 −1 −1 −1 −1
x8 0.23 0.94 3.00 0.85 6.8 0.72 0.31 0.83 4.8 0.04 1 −1 −1 1 −1 1
x9 0.44 0.74 0.23 0.92 6.0 0.24 0.38 0.90 5.7 0.65 1 1 1 1 1 1
x10 0.94 1.00 7.00 0.97 8.0 6.00 0.89 0.99 8.3 1.89 −1 −1 1 1 −1 −1
x11 0.52 0.69 0.11 0.87 2.0 0.14 0.61 0.28 7.0 0.57 1 1 −1 −1 −1 −1

Table 2
A discretized multi-label data set.

U A Y

a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 l1 l2 l3 l4 l5 l6

x1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 −1 −1
x2 1 1 2 1 1 1 1 1 2 1 −1 −1 −1 1 −1 1
x3 1 1 2 1 1 1 1 1 2 1 −1 −1 −1 −1 −1 1
x4 2 1 2 1 2 2 2 1 1 1 −1 −1 1 1 1 1
x5 2 1 2 1 2 2 2 1 1 1 1 1 −1 1 1 −1
x6 1 2 1 2 2 1 1 2 1 1 1 1 1 −1 1 1
x7 2 1 2 1 2 2 2 1 1 1 1 −1 −1 −1 −1 −1
x8 1 2 1 2 2 1 1 2 1 1 1 −1 −1 1 −1 1
x9 1 2 1 2 2 1 1 2 1 1 1 1 1 1 1 1
x10 3 3 3 3 3 3 3 3 3 3 −1 −1 1 1 −1 −1
x11 2 2 1 2 1 1 2 1 2 1 1 1 −1 −1 −1 −1

feature subset P_j([x]_A, [y]_A) holds for ([x]_A, [y]_A) ∈ P_j^*. It is necessary to distinguish equivalence classes [x]_A and [y]_A belonging
to different consistent regions in Definition 3.1.1, and each feature in P_j([x]_A, [y]_A) can discern the equivalence classes
[y]_A and [x]_A in the quotient set U/R_A induced by the feature set A; otherwise, P_j([x]_A, [y]_A) = ∅ means that the equivalence
classes [y]_A and [x]_A belong to the same consistent region. All the non-empty elements of the distribution discernibility
matrix P_j imply the entire uncertain information of the data D_j included between label l_j and the feature set A. The discernibility
matrix proposed in Definition 3.1.2 is symmetric, that is, P_j([x]_A, [y]_A) = P_j([y]_A, [x]_A) and P_j([x]_A, [x]_A) = ∅. Therefore,
the pairs of equivalence classes ([x]_A, [y]_A) and ([y]_A, [x]_A) are treated as the same in the calculation process of the algorithms
provided in this paper. Thus, only the elements in the lower triangle of the discernibility matrix P_j need to be calculated, which reduces
the computational complexity.
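A minimal Python sketch of this construction (illustrative only; the function name and the two-feature toy data are assumptions, and, as discussed above, only the lower triangle is enumerated):

```python
from itertools import combinations

def discernibility_elements(feat_val, region_label):
    """Non-empty elements of the distribution discernibility matrix P_j
    (Definition 3.1.2): for each pair of equivalence classes lying in
    different regions (their l_j values differ), collect the indices of
    the features whose values differ on the two classes.  Symmetry lets
    us enumerate each unordered pair once."""
    elements = []
    for c1, c2 in combinations(range(len(feat_val)), 2):
        if region_label[c1] != region_label[c2]:   # ([x]_A,[y]_A) ∈ P_j^*
            diff = {a for a, (u, v) in
                    enumerate(zip(feat_val[c1], feat_val[c2])) if u != v}
            elements.append(diff)
    return elements

# Three equivalence classes described by two features, with l_j([x]_A)
# values 1, -1, 1 respectively (made-up toy data).
feat_val = [(1, 1), (1, 2), (2, 2)]
region_label = [1, -1, 1]
print(discernibility_elements(feat_val, region_label))  # [{1}, {0}]
```

The pair of classes with the same label value (the first and third) contributes no element, mirroring the ∅ entries of P_j.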
The following example is provided to illustrate the calculation process of the distribution discernibility matrix for each label.

Example 3.1.1. A multi-label training data set D = {(A(x), Y(x)) | x ∈ U} is presented in Table 1, where U = {x_1, x_2, ..., x_11} is the
domain of objects, A = {a_1, a_2, ..., a_10} is the feature set and Y = {l_1, l_2, ..., l_6} is the label set.
For convenience of constructing the quotient set, each feature in the domain of input instances X = {(x, A(x)) | x ∈ U} ⊆ R^10 is
first divided into three equal-width intervals, based on the discretization algorithm FIMUS [18], as illustrated in
Table 2.
The quotient set induced by A is computed as U/R_A = {X_1, X_2, X_3, X_4, X_5} = {{x_1, x_2, x_3}, {x_4, x_5, x_7}, {x_6, x_8, x_9}, {x_10},
{x_11}}.
For every label, the sets of positive and negative training objects are calculated respectively as U/R_{l_1} =
{Pos_1, Neg_1} = {{x_1, x_5, x_6, x_7, x_8, x_9, x_11}, {x_2, x_3, x_4, x_10}}, U/R_{l_2} = {{x_1, x_5, x_6, x_9, x_11}, {x_2, x_3, x_4, x_7, x_8, x_10}}, U/R_{l_3} =
{{x_1, x_4, x_6, x_9, x_10}, {x_2, x_3, x_5, x_7, x_8, x_11}}, U/R_{l_4} = {{x_1, x_2, x_4, x_5, x_8, x_9, x_10}, {x_3, x_6, x_7, x_11}}, U/R_{l_5} = {{x_4, x_5, x_6, x_9},
{x_1, x_2, x_3, x_7, x_8, x_10, x_11}}, U/R_{l_6} = {{x_2, x_3, x_4, x_6, x_8, x_9}, {x_1, x_5, x_7, x_10, x_11}}.
Then, the equivalence classes in the quotient set U/R_A are divided into three regions for every label according to
Definition 3.1.1: Pos_A(l_1) = {x_6, x_8, x_9, x_11} = {X_3 ∪ X_5}, Neg_A(l_1) = {X_4}, Bn_A(l_1) = {X_1 ∪ X_2}; Pos_A(l_2) = {X_5}, Neg_A(l_2) = {X_4},
Bn_A(l_2) = {X_1 ∪ X_2 ∪ X_3}; Pos_A(l_3) = {X_4}, Neg_A(l_3) = {x_11} = {X_5}, Bn_A(l_3) = {X_1 ∪ X_2 ∪ X_3}; Pos_A(l_4) = {X_4}, Neg_A(l_4) =
{X_5}, Bn_A(l_4) = {X_1 ∪ X_2 ∪ X_3}; Pos_A(l_5) = ∅, Neg_A(l_5) = {X_1 ∪ X_4 ∪ X_5}, Bn_A(l_5) = {X_2 ∪ X_3}; Pos_A(l_6) = {X_3}, Neg_A(l_6) =
{X_4 ∪ X_5}, Bn_A(l_6) = {X_1 ∪ X_2}.

For simplicity, the digit k is used here to represent feature a_k. Then, the elements of the distribution discernibility matrix
of each label are given by

P_1 =
⎛ ∅                ∅                  {2,3,4,5,8,9}      A    {1,2,3,4,7}    ⎞
⎜ ∅                ∅                  {1,2,3,4,6,7,8}    A    {2,3,4,5,6,9}  ⎟
⎜ {2,3,4,5,8,9}    {1,2,3,4,6,7,8}    ∅                  A    ∅              ⎟
⎜ A                A                  A                  ∅    A              ⎟
⎝ {1,2,3,4,7}      {2,3,4,5,6,9}      ∅                  A    ∅              ⎠

and P_2 = P_3 = P_4 = {∅, {1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}, A}; P_5 = {∅, {1, 5, 6, 7, 9}, {1, 5, 7, 8, 9}, {2, 3, 4, 5, 8, 9},
{2, 3, 4, 5, 6, 9}, A}; P_6 = {∅, {2, 3, 4, 5, 8, 9}, {1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}, {1, 2, 3, 4, 6, 7, 8}, A}. □

Based on each feature, the feature values of the objects in Table 1 are discretized into several intervals, forming a
corresponding granular structure containing some granules (i.e., equivalence classes); the granules in
all granular structures are then intersected to obtain the needed discrete data. If the partition intervals are too small for
each feature, then too many features will result in discrete data that is too fine after intersection. There are ten features
in Example 3.1.1, unlike the hundreds of features in the data sets of Section 5, so the three bins for each feature
and the five equivalence classes based on the binary relation in Table 2 are appropriate to vividly reflect the whole calculation
process of the proposed method.
As can be seen, the cardinalities of the elements of P_1 in the aforementioned example differ. Consider P_1(X_1, X_5) ⊂ P_1(X_2, X_3),
where the inclusion relationship implies that any feature in P_1(X_1, X_5) can not only separate the pair of equivalence classes (X_1,
X_5), but also identify (X_2, X_3) at the same time. Does this phenomenon mean that the element P_1(X_2, X_3) is redundant for reflecting
the uncertain information in the discernibility matrix P_1? A well-defined answer to this question can be given.

Definition 3.1.3. P_j([x]_A, [y]_A) ∈ P_j is referred to as an essential element of P_j, denoted as E_j([x]_A, [y]_A), if there does not exist
another element of P_j that is its proper subset. All the essential elements of label l_j are briefly denoted as E_j = {E_j | E_j ⊆
P_j}.

In any discernibility matrix, the essential elements are the most basic components and reflect the entire incongruity
between the label and the features. It is necessary to calculate and store every essential element of P_j for exploring the uncertain
information of the discernibility matrix portrayed by the label and A. The following proposition demonstrates this
characterization.

Proposition 3.1.1.
(1) For any pair of equivalence classes ([x]_A, [y]_A) ∈ P_j^*, there exists at least one essential element E_j ∈ E_j that can distinguish it.
(2) For any E_j ∈ E_j, if the essential element is removed, then there exists at least one pair of equivalence classes in P_j^* that cannot be
identified.
Proof. (1) ∀([x]_A, [y]_A) ∈ P_j^*, there must exist an element P_j ∈ P_j such that P_j([x]_A, [y]_A) = {a ∈ A | (a([x]_A) ≠ a([y]_A)) ∧
(l_j([x]_A) ≠ l_j([y]_A))} based on Definition 3.1.2. If P_j is an essential element, that is, there does not exist another element
of P_j that is its proper subset, then P_j = E_j ∈ E_j. Otherwise, by Definition 3.1.3 there must be an essential element E_j ⊆ P_j that can distinguish ([x]_A, [y]_A),
and possibly some other pairs of equivalence classes as well.
(2) For any pair of equivalence classes, there is a corresponding element of P_j that can distinguish the equivalence
classes. Assume that an essential element E_j ∈ E_j is removed and that all pairs of equivalence classes in P_j^* can still be identified
by E_j − {E_j}. In other words, every pair of equivalence classes that is identified by E_j ⊆ P_j can also be distinguished by another
essential element E'_j ∈ E_j − {E_j} with E'_j ⊆ P_j. Then, there exists P_j satisfying P_j = E_j, and P_j has another essential
element E'_j such that E'_j ≠ E_j and E'_j ⊆ P_j = E_j. Therefore, E'_j ⊂ E_j, which conflicts with the fact that E_j is an essential element
of E_j. Thus, the assumption is invalid. □

Elements in the discernibility matrix P j with the fewest features must be essential elements, and essential elements do not contain each other. Once an essential element is selected, it can eliminate the elements in P j that contain it, while the other essential elements cannot be deleted. Meanwhile, the elements with the fewest features among the remaining elements are still essential elements. This process is repeated until P j = ∅, at which point all the essential elements of label lj have been collected. According to the analysis above, an algorithm for calculating the essential elements of any label is provided.
To calculate the essential elements for a label, the following major operations are needed: (1) the first step is to compute the discernibility matrix of the label, with time complexity O(|U/RA |2 ); (2) the second step is to collect the set of essential elements in the discernibility matrix, with time complexity O(|U/RA |2 |A|). Therefore, the time complexity of Algorithm 3.1.1 is O(|U/RA |2 |A|).
The essential elements for each label can be computed in the following example.

Example 3.1.2 (Continued from Example 3.1.1). The set of essential elements of the discernibility matrix of every label is provided respectively: E1 = {{2, 3, 4, 5, 8, 9}, {1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}}; E2 = E3 = E4 = {{1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}}; E5 = {{1, 5, 6, 7, 9}, {1, 5, 7, 8, 9}, {2, 3, 4, 5, 8, 9}, {2, 3, 4, 5, 6, 9}}; E6 = {{2, 3, 4, 5, 8, 9}, {1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}}. □
802 X. Che, D. Chen and J. Mi / Information Sciences 512 (2020) 795–812

Algorithm 3.1.1 Essential elements for each label based on discernibility matrix.
Input: (1). Training data D = {(A(x ), Y (x ))|x ∈ U }
(U = {x1 , x2 , . . . , xn }, A = {a1 , a2 , . . . , am }, Y = {l1 , l2 , . . . , ls } )
Output: Collection of essential elements of any label E j (1 ≤ j ≤ s )
Initialize: E j = ∅
1. calculate quotient set U/RA with respect to A
2. compute discernibility matrix of label l j , P j = (Pj ([xs ]A , [xt ]A ))|U/RA |×|U/RA |
3. count U/Rl j = {Pos j , Neg j } with respect to label l j
4. while (P j ≠ ∅)
5. select Pj ([xs0 ]A , [xt0 ]A ) satisfying
|Pj ([xs0 ]A , [xt0 ]A )| = min{|Pj ([xs ]A , [xt ]A )| : Pj ([xs ]A , [xt ]A ) ∈ P j , Pj ([xs ]A , [xt ]A ) ≠ ∅}
6. let E j = [E j ; Pj ([xs0 ]A , [xt0 ]A )]
7. for each Pj ([xs ]A , [xt ]A ) ∈ P j with Pj ([xs ]A , [xt ]A ) ≠ ∅
8. if Pj ([xs0 ]A , [xt0 ]A ) ⊂ Pj ([xs ]A , [xt ]A ), let Pj ([xs ]A , [xt ]A ) = ∅
9. if Pj ([xs0 ]A , [xt0 ]A ) = Pj ([xs ]A , [xt ]A ), let Pj ([xs ]A , [xt ]A ) = ∅
10. endfor
11. let Pj ([xs0 ]A , [xt0 ]A ) = ∅.
12. endwhile
13. Return E j .
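As a sketch, the greedy extraction of Algorithm 3.1.1 can be written in Python, assuming the non-empty entries of the discernibility matrix are given as sets of feature indices (this entry layout is an assumption for illustration):

```python
def essential_elements(discern_entries):
    """Greedy sketch of Algorithm 3.1.1: collect the minimal non-empty
    entries of a discernibility matrix; each entry is a set of feature
    indices (assumed input layout)."""
    pool = [frozenset(e) for e in discern_entries if e]
    essentials = []
    while pool:
        e = min(pool, key=len)        # a smallest remaining entry is essential
        essentials.append(set(e))
        # discard e itself, duplicates of e, and every superset of e
        pool = [p for p in pool if not e <= p]
    return essentials
```

Because a smallest remaining entry can never have a proper subset left in the pool, every element appended this way satisfies Definition 3.1.3.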

To distinguish the corresponding pairs of equivalence classes, every indispensable feature in an essential element is equally capable. A compact subset of indispensable features can be selected from the essential elements in E j one by one until all the pairs of equivalence classes in Pj∗ are recognized. The essential elements of a discernibility matrix contain every possible composition of such a compact subset of indispensable features.

3.2. Local and global label correlations through using essential elements

By considering the overlap of different essential element families of the discernibility matrix related to label, label rele-
vancy and the corresponding relevance judgement matrix are proposed to estimate the relationships among multiple labels.
Therefore, a novel model for calculating local and global label correlations is provided.
In a practical data set, the difference between some essential elements in E j and some other essential elements in Ek may be very small. For instance, suppose that there is another label l7 in Table 1, and the families of essential elements related to l7 and l2 are E7 = {E71 , E72 , E73 } = {{1, 2, 3, 4, 7}, {2, 3, 4, 5, 8, 9}, {1, 5, 7, 8, 9}} and E2 = {E21 , E22 , E23 } = {{1, 2, 3, 4, 7}, {2, 3, 4, 5, 6, 9}, {1, 5, 7, 8, 9}}, where E71 = E21 , E73 = E23 and E72 △ E22 = {6, 8}. If E7 and E2 are simply intersected, then E72 and E22 are judged unequal and cannot be used to calculate the relevancy between l7 and l2 , even though there is little difference between E7 and E2 . To address this phenomenon, which regularly appears in large amounts of multi-label data, the restriction of counting the overlap of two families of essential elements can be relaxed from exact equality to near coincidence, leading to the following definition of label relevancy.

Definition 3.2.1. For any l j , lk ∈ Y, a function γ : Y × Y → {0, 1} defined by

γ ( j, k ) = 1, if |E j ⊡ Ek |/|E j ∪ Ek | ≥ α; 0, otherwise,

is a label relevancy with respect to the sets of essential elements, where E j ⊡ Ek = {E j ∈ E j | there exists Ek ∈ Ek with |E j ∩ Ek |/|E j ∪ Ek | ≥ β}. Then, the relevance judgement matrix is referred to as Rel = (γ ( j, k ))|Y |×|Y | , 1 ≤ j, k ≤ s.
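Under the assumed representation of each family E j as a list of feature-index sets, Definition 3.2.1 can be sketched in Python (α and β default to the values used in Example 3.2.1):

```python
def label_relevancy(Ej, Ek, alpha=0.75, beta=0.85):
    """Sketch of gamma(j, k) from Definition 3.2.1; Ej and Ek are lists of
    feature-index sets (the essential elements of two labels)."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    # elements of Ej nearly coinciding (Jaccard >= beta) with some element of Ek
    overlap = [E for E in Ej if any(jaccard(E, F) >= beta for F in Ek)]
    # |Ej U Ek|: size of the union of the two families
    union = {frozenset(E) for E in Ej} | {frozenset(E) for E in Ek}
    return 1 if len(overlap) / len(union) >= alpha else 0
```

With the families from Example 3.1.2, this returns 1 for (E1 , E6 ) and 0 for (E1 , E2 ), matching the first row of Rel in Example 3.2.1.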

Existing algorithms for mining label correlation only depict the possible relevance of a label pair or of the whole label set. Compared with them, our algorithm provides a relevance judgement matrix to indicate the relationships among the labels in Y based on the essential element collections determined by the label and feature set. Then, the label set Y is divided into several disjoint groups of relevant labels, denoted as clu(Y ) = {Y |Y ⊆ Y }. Every label in a group of relevant labels has a close relationship with the fixed label, which is selected from the group. For a given fixed label, a group of relevant labels is gathered around it with the help of Definition 3.2.1. In this paper, a first group of relevant labels is found from the first column (or row) of the relevance judgement matrix, the labels already attached to this group are removed, another group is found from the next unassigned column (or row), and so on until the entire label set Y has been traversed. The first label in each group of relevant labels is selected as the fixed label and the others are related labels.
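The column-scanning procedure just described can be sketched in Python (0-indexed labels; Rel given as a nested list — an illustrative reading of the text):

```python
def group_labels(rel):
    """Partition labels into disjoint groups of relevant labels by
    scanning the relevance judgement matrix row by row (sketch)."""
    s = len(rel)
    assigned = set()
    groups = []
    for j in range(s):
        if j in assigned:
            continue
        # collect every not-yet-assigned label relevant to label j
        group = [k for k in range(s) if k not in assigned and rel[j][k] == 1]
        assigned.update(group)
        groups.append(group)
    return groups
```

On the matrix Rel of Example 3.2.1 this yields [[0, 5], [1, 2, 3], [4]], i.e. the groups {l1 , l6 }, {l2 , l3 , l4 }, {l5 }.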

Using the fixed label as a benchmark, several label pairs are constructed from the fixed label and the other related labels in a group of relevant labels. For any label pair in a group of relevant labels, objects in an equivalence class whose two labels are the same (or opposite) provide positive (or negative) correlation support.

Definition 3.2.2. For each group of relevant labels Y, let lj be the fixed label and lk be any related label. Then, the local positive and negative correlations of the label pair (lj , lk ) with respect to equivalence class [x]A are formulated as follows,

pos j,k ([x]A ) = |{y ∈ [x]A | l j (y ) × lk (y ) = 1}| / |[x]A |,

neg j,k ([x]A ) = |{y ∈ [x]A | l j (y ) × lk (y ) = −1}| / |[x]A |.
Obviously, pos j,k ([x]A ) + neg j,k ([x]A ) = 1. Because the equivalence classes in the quotient set U/RA are of different sizes, a local label correlation is proposed by weighting each equivalence class. Then, the local correlation of a label pair on each equivalence class is explored.

Definition 3.2.3. For each group of relevant labels Y, let lj be the fixed label and lk be any related label. Then, the local correlation of the label pair (lj , lk ) on equivalence class [x]A is formulated as follows,

loc j,k ([x]A ) = |[x]A |/|U |, if pos j,k ([x]A ) ≥ η; −|[x]A |/|U |, if neg j,k ([x]A ) ≥ η; 0, otherwise.

η is a parameter that usually needs to be greater than 0.5. In other words, when the local positive (or negative) label correlation enjoys a decided advantage, the whole equivalence class is considered to be positively (or negatively) related between lj and lk . If the local positive and negative label correlations are close to each other, then it is believed that the label pair (lj , lk ) has no obvious relevance on this equivalence class.

The global label correlation is obtained by aggregating the local label correlations of the label pair over all equivalence classes, as follows.

Definition 3.2.4. For each group of relevant labels Y, let lj be the fixed label and lk be any related label. Then, the global correlation of the label pair (lj , lk ) is formulated as follows,

glo( j, k ) = Σ[x]A ∈U/RA loc j,k ([x]A ).
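Definitions 3.2.2–3.2.4 translate directly into code; the sketch below assumes each label is a dict mapping objects to values in {−1, +1} (an assumed encoding):

```python
def local_correlation(eq_class, lj, lk, n_total, eta=0.6):
    """Sketch of pos/neg (Definition 3.2.2) and loc (Definition 3.2.3);
    eq_class is a list of objects, n_total is |U|."""
    size = len(eq_class)
    pos = sum(1 for y in eq_class if lj[y] * lk[y] == 1) / size
    neg = sum(1 for y in eq_class if lj[y] * lk[y] == -1) / size
    if pos >= eta:
        return size / n_total       # weighted positive support
    if neg >= eta:
        return -size / n_total      # weighted negative support
    return 0.0


def global_correlation(partition, lj, lk, n_total, eta=0.6):
    """glo(j, k): sum of local correlations over U/R_A (Definition 3.2.4)."""
    return sum(local_correlation(c, lj, lk, n_total, eta) for c in partition)
```

The weight |[x]A |/|U| makes larger equivalence classes contribute more to the global correlation, as intended by the definitions.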

The following Example 3.2.1 shows how to determine the groups of relevant labels and calculate the local and global label correlations for each label pair.

Example 3.2.1 (Continued from Example 3.1.1). Let α = 75%, β = 85%, η = 60%. Then, the relevance judgement matrix of label set Y defined in Definition 3.2.1 is counted as

Rel =
⎛1 0 0 0 0 1⎞
⎜0 1 1 1 0 1⎟
⎜0 1 1 1 0 1⎟
⎜0 1 1 1 0 1⎟
⎜0 0 0 0 1 0⎟
⎝1 1 1 1 0 1⎠

Groups of relevant labels are counted as clu(Y ) = {Y1 , Y2 , Y3 } = {{l1 , l6 }, {l2 , l3 , l4 }, {l5 }}, where l1 , l2 , l5 are respectively the fixed labels in Y1 , Y2 , Y3 , and the corresponding sets of label pairs are L(Y1 ) = {(l1 , l6 )}, L(Y2 ) = {(l2 , l3 ), (l2 , l4 )}, L(Y3 ) = ∅.
Therefore, the local and global label correlations for each label pair are loc1,6 (X1 ) = −3/11, loc1,6 (X2 ) = −3/11, loc1,6 (X3 ) = 3/11, loc1,6 (X4 ) = 1/11, loc1,6 (X5 ) = −1/11; loc2,3 (X1 ) = 3/11, loc2,3 (X2 ) = −3/11, loc2,3 (X3 ) = 3/11, loc2,3 (X4 ) = −1/11, loc2,3 (X5 ) = −1/11; loc2,4 (X1 ) = 3/11, loc2,4 (X2 ) = −3/11, loc2,4 (X3 ) = 3/11, loc2,4 (X4 ) = −1/11, loc2,4 (X5 ) = −1/11; glo(1, 6 ) = −3/11; glo(2, 3 ) = glo(2, 4 ) = 1/11. □

The local label correlation can be formulated in the following algorithm based on the essential element collection of each label defined in Definition 3.1.3.
The time complexity of counting the relevance judgement matrix is O(|Y |2 ). The time complexity of calculating the local label correlations of the equivalence classes concerning Y is O(|Y |2 · |U/RA | ) by supposing the extreme situation that a group of relevant labels Y ∈ clu(Y ) satisfies Y = Y . Hence, the overall time complexity is O(|Y |2 · |U/RA | ).

Algorithm 3.1.1 Local label correlation based on essential elements.


Input: (1) Training data D = {(A(x ), Y (x ))|x ∈ U }
(U = {x1 , x2 , . . . , xn }, A = {a1 , a2 , . . . , am }, Y = {l1 , l2 , . . . , ls } )
(2). Quotient set: U/RA = {[x]A |x ∈ U }
(3). Family of essential elements sets with respect to label: E = {E j |1 ≤ j ≤ s}
(4). Parameters α , β , η
Output: clu(Y ) = {Y |Y ⊆ Y }, loc = {loc j,k , 1 ≤ j, k ≤ s}
// Partitioning groups of relevant labels.
1. for 1 ≤ j, k ≤ s
2. compute γ ( j, k ) by E j and Ek
3. endfor
4. count relevance judgement matrix Rel = (γ ( j, k ))|Y |×|Y |
5. calculate groups of relevant labels clu(Y )
// Computing set of label pairs in each group of relevant labels.
6. for each group of relevant labels Y ⊆ Y
7. if |Y | = 1, then l j ∈ Y is fixed label and set of label pairs is L(Y ) = ∅
8. else select a fixed label l j and L(Y ) = {(l j , lk )|lk = l j , lk ∈ Y }
9. endfor
// Computing local label correlation.
10. If |Y | > 1
11. for each label pair (l j , lk ) ∈ L(Y ).
12. for every equivalence class [x]A ∈ U/RA
13. count positive and negative label correlations pos j,k ([x]A ), neg j,k ([x]A )
14. calculate local label correlation loc j,k ([x]A )
15. endfor
16. endfor
17. Return clu(Y ), loc

4. Correlation-labels-specific features of relevant label group

In multi-label classification, D = {(A(x ), Y (x ))|x ∈ U } is a training data set, where U is the domain of input objects, A is the set of features with quotient set U/RA = {[x]A |x ∈ U }, and Y = {l1 , l2 , . . . , ls } is the label set, divided into several disjoint groups of relevant labels clu(Y ) = {Y |Y ⊆ Y }. In Section 3, several labels with strong relationships are assigned to a group of relevant labels. Naturally, whether the labels in a group of relevant labels can be integrated into a binary relation is a question worthy of in-depth deliberation. Equivalence classes are divided into several disjoint regions by considering their local label correlations and classification error rates, and then the corresponding discernibility matrix can be calculated to distinguish the equivalence classes belonging to different regions. Therefore, a compact subset of indispensable features for each group of relevant labels can be effectively captured while maintaining the recognition capability of the discernibility matrix. For each group of relevant labels, this compact subset of indispensable features is called correlation-labels-specific features in this paper.
To aggregate the labels in a group of relevant labels, ideas are borrowed from boosting algorithms [22]. The purpose of boosting algorithms is to find a highly accurate classification rule by combining different base hypotheses that are moderately accurate (usually better than 50%). In this section, the separate procedure induced by the local label correlation of each label pair is taken as the base learner. Then, the weights of these base learners are given according to their classification error rates. Hence, these base hypotheses are combined into a single rule called the final or combined hypothesis.

Definition 4.1. For each group of relevant labels Y having more than one label, its corresponding set of label pairs is L(Y ) = {(l j , lk )|lk ≠ l j , lk ∈ Y }, where lj is the fixed label and lk is a related label. ∀(lj , lk ), the base learner G j,k : U/RA → {−1, +1} concerning the local label correlation defined on U/RA is formulated as follows,

G j,k ([x]A ) = 1, if loc j,k ([x]A ) ≥ 0; −1, if loc j,k ([x]A ) < 0.

The classification error rate with label pair (lj , lk ) is defined as

e( j,k ) ([x]A ) = Σy∈[x]A I[(l j (y ), lk (y )) ≠ (l j (y ), l′k (y ))] / |[x]A |, if loc j,k ([x]A ) ≠ 0; 1/2, if loc j,k ([x]A ) = 0,

where l′k (y ) is the predicted value of related label lk for object y according to the fixed label lj and the local label correlation. If loc j,k ([x]A ) > 0, then l′k (y ) = l j (y ); if loc j,k ([x]A ) < 0, then l′k (y ) = −l j (y ). Therefore, the coefficient of G j,k on equivalence class [x]A is referred to as

α( j,k ) ([x]A ) = (1/2) ln ((1 − e( j,k ) ([x]A )) / e( j,k ) ([x]A )).

Hence, a linear combination of the base learners with respect to group of relevant labels Y on [x]A can be constructed as follows,

fY ([x]A ) = Σ(l j ,lk )∈L(Y ) α( j,k ) ([x]A )G j,k ([x]A ).

Even when e( j,k ) ([x]A ) = 0.01%, the absolute value of the coefficient |α( j,k ) ([x]A )| is only 4.6051, while e( j,k ) ([x]A ) = 0 obviously gives |α( j,k ) ([x]A )| = ∞. For convenience of calculation, let |α( j,k ) ([x]A )| = 10 if e( j,k ) ([x]A ) = 0. For a group of relevant labels, every equivalence class has a linear combination fY of the base learners, and then these equivalence classes can be assigned to |Y | + 2 regions by considering the distance between the maximum and minimum values of fY ([x]A ). The step size of the regions for group of relevant labels Y is

s(Y ) = (max[x]A ∈U/RA fY ([x]A ) − min[x]A ∈U/RA fY ([x]A )) / (|Y | + 2).

The final learner GY with Y on [x]A is defined as GY ([x]A ) = ⌈ fY ([x]A )/s(Y )⌉, where ⌈·⌉ represents the smallest integer not less than a number.
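As a minimal sketch of the aggregation in Definition 4.1 (the per-pair list layout and the handling of the capped coefficient are assumptions based on the text):

```python
import math

def combine_base_learners(loc_values, errors, cap=10.0):
    """f_Y([x]_A): weighted vote of the base learners over the label pairs
    of a group (sketch of Definition 4.1). loc_values[i] and errors[i]
    belong to the i-th label pair of L(Y')."""
    f = 0.0
    for loc, e in zip(loc_values, errors):
        g = 1.0 if loc >= 0 else -1.0            # base learner G_{j,k}
        if e == 0:
            a = cap                               # cap replaces ln(1/0)
        elif e == 1:
            a = -cap
        else:
            a = 0.5 * math.log((1 - e) / e)       # AdaBoost-style weight
        f += a * g
    return f
```

Two zero-error pairs voting in opposite directions cancel to f = 0, while a pair with e = 1/7 contributes 0.5 ln 6 ≈ 0.896, consistent with the values fY2 (X1 ) = 10.896 and fY2 (X3 ) = 9.104 appearing in Example 4.1.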

For example, suppose a group of relevant labels has two labels, that is, L(Y ) = {(l j , lk )}, where lj is the fixed label and lk is the related label. In this paper, it is reasonable to set the number of regions with respect to (lj , lk ) as four, corresponding to the equivalence classes with pos j,k ([x]A ) = 1 and neg j,k ([x]A ) = 0; 0 < neg j,k ([x]A ) ≤ pos j,k ([x]A ) < 1; 0 < pos j,k ([x]A ) < neg j,k ([x]A ) < 1; and pos j,k ([x]A ) = 0 and neg j,k ([x]A ) = 1. If a label group has three labels, that is, L(Y ) = {(l j , lk ), (l j , lt )}, where lj is the fixed label and lk , lt are the related labels, it is believed that the equivalence classes with 0 < neg j,k ([x]A ) ≤ pos j,k ([x]A ) < 1, 0 < pos j,t ([x]A ) < neg j,t ([x]A ) < 1 and those with 0 < neg j,t ([x]A ) ≤ pos j,t ([x]A ) < 1, 0 < pos j,k ([x]A ) < neg j,k ([x]A ) < 1 behave the same, and these classes are assigned to a same region. Therefore, the number of regions with Y is five, that is, |Y | + 2.
The equivalence classes in the quotient set U/RA derived from feature set A are distributed into |Y | + 2 regions. The distribution discernibility matrix, which can be used to compute correlation-labels-specific features for labels having strong correlations, is briefly stated as follows.

Definition 4.2. Let Y be any group of relevant labels with more than one label. ∀x, y ∈ U, the pairs of equivalence classes from different regions are denoted as PY∗ = {([x]A , [y]A )|GY ([x]A ) ≠ GY ([y]A )}. The distribution discernibility matrix for Y is described as PY = (PY ([x]A , [y]A ))|U/RA |×|U/RA | , where the element PY ([x]A , [y]A ) is defined by

PY ([x]A , [y]A ) = {a ∈ A|a([x]A ) ≠ a([y]A )}, if ([x]A , [y]A ) ∈ PY∗ ; ∅, otherwise.
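Definition 4.2 can be sketched as follows, assuming each equivalence class is represented by a prototype dict of feature values (an illustrative encoding):

```python
def distribution_entry(cx, cy, features, gx, gy):
    """P_Y([x]_A, [y]_A): the features separating two equivalence classes
    that lie in different regions (sketch; cx/cy are prototype value
    dicts, gx/gy the final-learner regions G_Y of the two classes)."""
    if gx == gy:
        return set()                   # same region: nothing to discern
    return {a for a in features if cx[a] != cy[a]}
```

Entries of this matrix feed the same essential-element extraction used for a single label in Section 3.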

For a given group of relevant labels Y ⊆ Y, the set of essential elements of the distribution discernibility matrix in Definition 4.2 with respect to Y is denoted as EY = {EY |EY ⊆ PY }. Then, the correlation-labels-specific features for a group of relevant labels with more than one label are counted through these essential elements, while the correlation-labels-specific features for a single label have been treated in Section 3. The method of computing correlation-labels-specific features for each group of relevant labels is called CLSF, which is a feature selection algorithm for multi-label data. Therefore, for each group of relevant labels, the possible labels of each test object can be predicted by CLSF together with the multi-label classifier ML-KNN. The following example illustrates the computation process of the correlation-labels-specific features for each group of relevant labels.

Example 4.1 (Continued from Example 3.1.1). In each group of relevant labels, the linear combination of the base learners on each equivalence class is determined as fY1 (X1 ) = fY1 (X2 ) = fY1 (X5 ) = −10, fY1 (X3 ) = fY1 (X4 ) = 10; fY2 (X1 ) = 10.896, fY2 (X2 ) = 0, fY2 (X3 ) = 9.104, fY2 (X4 ) = fY2 (X5 ) = −20.
The equivalence classes in the quotient set U/RA are assigned into four and five regions for Y1 and Y2 , respectively. The final learner for each group of relevant labels is GY1 (X1 ) = GY1 (X2 ) = GY1 (X5 ) = 1, GY1 (X3 ) = GY1 (X4 ) = 4; GY2 (X4 ) = GY2 (X5 ) = 1, GY2 (X3 ) = GY2 (X2 ) = 3, GY2 (X1 ) = 4. That is, the regions are {X1 ∪ X2 ∪ X5 , X3 ∪ X4 } for Y1 and {X1 , X2 ∪ X3 , X4 ∪ X5 } for Y2 .
Based on Definitions 4.1 and 4.2, the sets of all elements in a distribution discernibility matrix with re-
spect to Y1 , Y2 and Y3 are PY1 = {∅, A, {a1 , a2 , a3 , a4 , a6 , a7 , a8 }, {a1 , a5 , a7 , a8 , a9 }, {a2 , a3 , a4 , a5 , a8 , a9 }}; PY2 =
{∅, A, {a2 , a3 , a4 , a5 , a8 , a9 }, {a1 , a2 , a3 , a4 ,a7 }, {a1 , a5 , a7 , a8 , a9 }, {a1 , a5 , a6 , a7 , a9 }, {a2 , a3 , a4 , a5 , a8 , a9 }}; PY3 =
{∅, A, {a2 , a3 , a4 , a5 , a6 , a9 }, {a1 , a5 , a6 , a7 , a9 }, {a1 , a5 , a7 , a8 , a9 }, {a2 , a3 , a4 , a5 , a8 , a9 }}, where |Y3 | = 1.
Obviously, the essential elements of each group of relevant labels are EY1 = {{a1 , a2 , a3 , a4 , a6 , a7 , a8 }, {a1 , a5 , a7 , a8 , a9 }, {a2 , a3 , a4 , a5 , a8 , a9 }}; EY2 = {{a2 , a3 , a4 , a5 , a8 , a9 }, {a1 , a2 , a3 , a4 , a7 }, {a1 , a5 , a7 , a8 , a9 }, {a1 , a5 , a6 , a7 , a9 }, {a2 , a3 , a4 , a5 , a8 , a9 }}; EY3 = {{a2 , a3 , a4 , a5 , a6 , a9 }, {a1 , a5 , a6 , a7 , a9 }, {a1 , a5 , a7 , a8 , a9 }, {a2 , a3 , a4 , a5 , a8 , a9 }}.

Hence, the correlation-labels-specific features for each group of relevant labels are confirmed as {a8 }; {a1 , a5 }; {a5 }.


The correlation-labels-specific features can be formulated in Algorithm 4.1 based on essential elements and the boosting strategy.

Algorithm 4.1 Correlation-labels-specific features for relevant label group.


Input: (1). Training data D = {(A(x ), Y (x ))|x ∈ U }
(U = {x1 , x2 , . . . , xn }, A = {a1 , a2 , . . . , am }, Y = {l1 , l2 , . . . , ls } )
(2). Quotient set: U/RA = {[x]A |x ∈ U }
(3). All groups of relevant labels: clu(Y ) = {Y |Y ⊆ Y }
(4). Label pairs of each group of relevant labels: L(Y ) = {(l j , lk )|lk = l j , lk ∈ Y } and
local label correlation loc
Output: A compact set of features with respect to each group of relevant labels B ⊆ A
Initialize: B = ∅
1. for each group of relevant labels Y ⊆ Y
2. select the number of regions |Y | + 2
3. determine the distance of each step s(Y )
// Aggregating labels in each group of relevant labels
1. for each equivalence class [x]A in U/RA
2. for every label pair (l j , lk ) in L(Y )
3. if loc j,k ([x]A ) > 0, then the base learner G j,k ([x]A ) = 1
4. elseif loc j,k ([x]A ) < 0 then G j,k ([x]A ) = −1
5. else then G j,k ([x]A ) = 0
6. calculate classification error rate e( j,k ) ([x]A ) and coefficient α( j,k ) ([x]A )
7. endfor
8. construct linear combination of base learner fY ([x]A ) and final learner GY ([x]A )
9. endfor
// Compute essential elements of every group of relevant labels
10. set up discernibility matrix PY
11. similar to Algorithm 3.1.1, calculate set of essential elements EY
// Compute a compact set of features with respect to each group of relevant labels
12. while (EY ≠ ∅)
13. select the most frequent feature a among the elements of EY
14. let B = B ∪ {a}
15. let EY = EY − {EY | EY ∈ EY , a ∈ EY }
16. endwhile
17. Return B

The time complexity of Algorithm 4.1 is summarized as follows: (1) assume that Y = Y, which is the most complex case; then the time complexity of aggregating the labels in Y is O(|Y |2 · |U/RA | ). (2) The time complexity of storing the essential elements of PY is O(|U/RA |2 · |A|). (3) In steps 12 to 16, the number of while-loop iterations is |EY | ≤ |U/RA |2 and the time complexity of one iteration is O(|A|). Overall, the time complexity of Algorithm 4.1 is O(|U/RA | · (|Y |2 + |U/RA | · |A|)).
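Steps 12–16 of Algorithm 4.1 form a greedy hitting-set loop; a sketch under the assumption that the family of essential elements is given as a list of feature-index sets:

```python
from collections import Counter

def compact_features(essentials):
    """Greedily pick the feature hitting the most remaining essential
    elements until every element is covered (sketch of steps 12-16 of
    Algorithm 4.1)."""
    pool = [set(e) for e in essentials if e]
    selected = set()
    while pool:
        counts = Counter(a for e in pool for a in e)
        best = counts.most_common(1)[0][0]      # most frequent feature
        selected.add(best)
        pool = [e for e in pool if best not in e]
    return selected
```

On EY1 from Example 4.1 the unique most frequent feature is a8 (it appears in all three essential elements), so the loop returns {a8 }, matching the result reported there.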

5. Experimental results

The proposed multi-label classification algorithm, called CLSF-MK, is composed of the multi-label feature selection algorithm CLSF and the multi-label classifier ML-KNN. This section describes the characteristics of the data sets, the comparative methods and the evaluation metrics, and then empirically illustrates the superiority of CLSF-MK in multi-label feature selection and classification.

5.1. Data sets and experimental settings

Eleven benchmark data sets, containing multi-label instances from different application domains, are selected to test CLSF-MK. Among them, the first eight data sets come from the Mulan Library [5]. Computer, Recreation and Reference are three independent subsets of the web page multi-label classification data set Yahoo. For better readability, detailed information about these data sets is presented in Table 3.
Six existing multi-label classification methods are chosen to demonstrate the effectiveness of our proposed algorithm. The algorithms applied for comparison are ML-KNN (a lazy learning approach to multi-label learning) in [19], LPLC (local positive and negative pairwise label correlations) in [9], CDR (complementary decision reduct) in [7], SCLS (scalable criterion for large label

Table 3
The performance of multi-label feature selection of CLSF and CDR.

Datasets    |U|   |Y|   |clu(Y)|             |Y|                    |A|    BY                              B̄Y    CDR
Emotion     593   6     2                    3;3                    72     63;47                           55    53
Birds       645   20    2                    5;15                   260    197;184                         187   119
Scene       2407  6     2                    5;1                    294    218;230                         220   178
Yeast       2417  14    1                    14                     103    90                              90    86
Reuters     2000  7     3                    3;3;1                  243    167;147;175                     159   132
Recreation  5000  22    1                    22                     606    546                             546   477
Computer    5000  33    4                    30;1;1;1               681    556;438;387;403                 394   522
Reference   5000  33    5                    29;1;1;1;1             793    605;617;615;634;662             608   609
Medical     978   45    12 (= 1 + 11)        34;1;…;1               1449   16;18;9;5;3;75;8;2;18;11;9;12   15    51
Enron       1702  53    32 (= 4 + 28)        2;3;6;14;1;…;1         1001   (32 values, listed below)       86    81
Cal500      502   174   117 (= 26 + 5 + 86)  3;…;3;2;…;2;1;…;1      68     (117 values, listed below)      19    32

BY for Enron: 77;92;92;88;51;15;21;8;4;5;5;25;24;55;23;25;20;36;4;4;8;6;19;31;4;8;31;53;20;21;25;11
BY for Cal500: 24;23;24;24;23;25;24;24;25;24;23;25;26;26;26;24;26;25;25;26;25;24;25;26;26;19;25;25;26;26;25;22;19;18;21;20;20;18;18;18;11;10;19;11;9;9;10;15;14;17;13;13;11;20;14;15;17;21;20;16;15;20;19;15;17;18;13;13;14;12;16;18;20;17;13;10;19;15;19;18;12;15;16;14;15;15;16;17;16;11;13;9;9;12;16;14;17;18;17;12;10;11;12;15;14;15;13;10;15;11;12;17;13;11;15;12;8

set) in [13], MDMR (max-dependency and min-redundancy) in [29], and MUCO (multi-label feature selection with label correlation) in [27]. The multi-label evaluation metrics Average Precision, Ranking Loss, Coverage, One Error and Hamming Loss are selected to evaluate the performance.
ML-KNN and LPLC are multi-label classification algorithms and do not perform feature selection. In LPLC, the parameter α, which is set to 0.7 in this section, controls the tradeoff between positive and negative correlations. CDR, SCLS, MDMR, MUCO and CLSF-MK are multi-label feature selection algorithms, among which only CDR does not consider label correlation. CDR and CLSF-MK select the same feature subset for each label in the label set (or relevant label group) with a fixed number of features. We chose ML-KNN as the multi-label classifier for these six algorithms with k = 10, where the parameter k is the number of nearest neighbors of a given object measured on the feature values. To ensure the fairness of the experimental results, the same number of selected features per data set is used for SCLS, MDMR, MUCO and CLSF-MK. Relationships between the labels and features cannot be considered in advance, so as to eliminate causal confusion in the discretization pretreatment; that is, an unsupervised discretization method is necessary for our method. With a fixed number of discrete intervals, the granular structure becomes finer and finer as the number of features increases. The discretization conditions for the eleven data sets applied in this section are appropriately relaxed to enhance the stability of our algorithm because of the large number of features. Therefore, a discretization algorithm called FIMUS [18] is selected in CLSF-MK and CDR to divide the numerical features of the eleven data sets into two equal-width intervals. SCLS uses a supervised discretization method [1] to discretize these numerical features. In CLSF-MK, the parameters are set to α = 0.4, β = 0.6 and η = 0.6 to measure the relativity degree of the essential elements and the overlap of the essential element sets related to different labels. If the classification error rate of an equivalence class with any label pair is 0, then the coefficient of the base learner concerning the equivalence class is 10. The number of regions for a group of relevant labels is |Y | + 2, where Y is any group of relevant labels in clu(Y ). All experiments were run on a server equipped with a 4.2 GHz Intel Core i7-7700K CPU and 80 GB of RAM.

5.2. Performance analysis on CLSF-MK

In this paper, each label in a data set for MDMR, SCLS and MUCO shares the same number of compact features, namely the weighted average number of features selected per label by CLSF, to predict the possible labels of test objects. ML-KNN and LPLC are multi-label classification algorithms without feature selection, so only the numbers of selected features for our method and CDR need to be provided.
Table 3 illustrates the basic information of the data sets and the ability of CLSF and CDR to select features and segment the labels of Y, where “Datasets” presents the name of each data set, “|U|” is the number of instances, “|Y |” is the number of labels, “|clu(Y )|” is the number of relevant label groups, “|Y| ” is the number of labels in every group of relevant labels, “|A|” is the number of features, “BY ” is the number of indispensable features for each group of relevant labels, “B̄Y ” is the weighted average number of selected features per label, and “CDR” is the average number of selected features for CDR. The weighted average means that every label is covered on average by B̄Y indispensable features in A, calculated by

B̄Y = (1/|Y |) ΣY ∈clu(Y ) |BY | · |Y |.

Table 4
Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of Average Precision ↑.

Datesets ML-KNN LPLC CDR SCLS MDMR MUCO CLSF-MK

Birds 0.6889 ± 0.0060 0.6733 ± 0.0213 0.6448 ± 0.0106 0.7144 ± 0.0297 0.6840 ± 0.0178 0.7158 ± 0.0267 0.7184 ± 0.0301
Yeast 0.7541 ± 0.0028 0.7526 ± 0.0174 0.7028 ± 0.0077 0.7602 ± 0.0082 0.7535 ± 0.0070 0.7535 ± 0.0063 0.7610 ± 0.0097
Emotion 0.7578 ± 0.0230 0.6870 ± 0.0301 0.5669 ± 0.0320 0.7749 ± 0.0209 0.7656 ± 0.0053 0.7754 ± 0.0225 0.7798 ± 0.0156
Cal500 0.4793 ± 0.0018 0.4675 ± 0.0153 0.4878 ± 0.0105 0.4865 ± 0.0112 0.4880 ± 0.0093 0.4899 ± 0.0085 0.4867 ± 0.0081
Scene 0.8514 ± 0.0002 0.8239 ± 0.0099 0.3670 ± 0.0419 0.7969 ± 0.0435 0.7986 ± 0.0272 0.7979 ± 0.0320 0.8381 ± 0.0469
Medical 0.7700 ± 0.0043 0.7115 ± 0.0222 0.8057 ± 0.0387 0.7871 ± 0.0200 0.7273 ± 0.0248 0.7014 ± 0.0137 0.8066 ± 0.0169
Reuters 0.8559 ± 0.0172 0.8564 ± 0.0105 0.6283 ± 0.0143 0.8103 ± 0.0157 0.8283 ± 0.0107 0.8563 ± 0.0056 0.8730 ± 0.0082
Enron 0.5446 ± 0.0072 0.5365 ± 0.0289 0.6210 ± 0.0255 0.6193 ± 0.0280 0.5971 ± 0.0388 0.6014 ± 0.0348 0.6211 ± 0.0204
Computer 0.6390 ± 0.0055 0.4843 ± 0.1112 0.6032 ± 0.0104 0.6406 ± 0.0136 0.6468 ± 0.0126 0.6521 ± 0.0131 0.6491 ± 0.0103
Recreation 0.4699 ± 0.0144 0.3107 ± 0.0173 0.3770 ± 0.0090 0.4864 ± 0.0107 0.4816 ± 0.0139 0.4775 ± 0.0100 0.4871 ± 0.0084
Reference 0.6175 ± 0.0019 0.2657 ± 0.0355 0.5640 ± 0.0135 0.6274 ± 0.0112 0.6328 ± 0.0092 0.6375 ± 0.0149 0.6445 ± 0.0040

Average 0.6753 0.5972 0.5799 0.6822 0.6731 0.6808 0.6969


Ave. Rank 4.4 5.8 5.5 3.6 4.1 3.2 1.5

Table 5
Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of Ranking Loss ↓.

Datasets ML-KNN LPLC CDR SCLS MDMR MUCO CLSF-MK

Birds 0.1266 ± 0.0015 0.1521 ± 0.0110 0.1604 ± 0.0099 0.1146 ± 0.0141 0.1206 ± 0.0069 0.1112 ± 0.0110 0.1110 ± 0.0134
Yeast 0.1748 ± 0.0030 0.1886 ± 0.0155 0.2114 ± 0.0067 0.1713 ± 0.0052 0.1728 ± 0.0046 0.1753 ± 0.0047 0.1707 ± 0.0076
Emotion 0.1955 ± 0.0226 0.2775 ± 0.0393 0.4291 ± 0.0326 0.1824 ± 0.0119 0.2059 ± 0.0048 0.1822 ± 0.0218 0.1821 ± 0.0142
Cal500 0.1907 ± 0.0013 0.2213 ± 0.0075 0.1816 ± 0.0041 0.1859 ± 0.0047 0.1883 ± 0.0058 0.1851 ± 0.0042 0.1850 ± 0.0037
Scene 0.0876 ± 0.0055 0.1156 ± 0.0069 0.5736 ± 0.0734 0.1227 ± 0.0213 0.1231 ± 0.0166 0.1211 ± 0.0170 0.1048 ± 0.0228
Medical 0.0614 ± 0.0054 0.0573 ± 0.0067 0.0559 ± 0.0112 0.0550 ± 0.0061 0.0685 ± 0.0071 0.0817 ± 0.0068 0.0548 ± 0.0106
Reuters 0.0964 ± 0.0116 0.0948 ± 0.0090 0.2619 ± 0.0138 0.1266 ± 0.0110 0.1134 ± 0.0075 0.0945 ± 0.0096 0.0849 ± 0.0064
Enron 0.1101 ± 0.0084 0.1730 ± 0.0142 0.1031 ± 0.0119 0.1023 ± 0.0130 0.1093 ± 0.0114 0.1046 ± 0.0158 0.0968 ± 0.0095
Computer 0.0891 ± 0.0031 0.2555 ± 0.0739 0.1021 ± 0.0048 0.0890 ± 0.0050 0.0858 ± 0.0043 0.0848 ± 0.0034 0.0846 ± 0.0036
Recreation 0.1879 ± 0.0049 0.3916 ± 0.0316 0.2216 ± 0.0050 0.1790 ± 0.0047 0.1817 ± 0.0073 0.1817 ± 0.0048 0.1811 ± 0.0036
Reference 0.0916 ± 0.0003 0.2656 ± 0.0198 0.1057 ± 0.0054 0.0881 ± 0.0041 0.0848 ± 0.0036 0.0854 ± 0.0040 0.0813 ± 0.0040

Average 0.1283 0.1993 0.2187 0.1288 0.1322 0.1280 0.1216


Ave. Rank 4.6 5.7 5.5 3.3 4.4 3.4 1.3

Table 6
Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of One Error ↓.

Datasets ML-KNN LPLC CDR SCLS MDMR MUCO CLSF-MK

Birds 0.3675 ± 0.0114 0.3984 ± 0.0316 0.4388 ± 0.0178 0.3442 ± 0.0447 0.4217 ± 0.0288 0.3444 ± 0.0320 0.3442 ± 0.0440
Yeast 0.2418 ± 0.0024 0.2294 ± 0.0188 0.2487 ± 0.0056 0.2289 ± 0.0109 0.2297 ± 0.0084 0.2458 ± 0.0078 0.2288 ± 0.0093
Emotion 0.3428 ± 0.0459 0.4403 ± 0.0437 0.5651 ± 0.0491 0.3152 ± 0.0465 0.3221 ± 0.0141 0.3015 ± 0.0260 0.2969 ± 0.0281
Cal500 0.1215 ± 0.0063 0.1960 ± 0.0265 0.1155 ± 0.0248 0.1154 ± 0.0148 0.1156 ± 0.0168 0.1156 ± 0.0168 0.1165 ± 0.0159
Scene 0.2435 ± 0.0072 0.2661 ± 0.0125 0.8549 ± 0.0737 0.3356 ± 0.0817 0.3319 ± 0.0472 0.3361 ± 0.0553 0.2699 ± 0.0265
Medical 0.2740 ± 0.0174 0.3888 ± 0.0333 0.2465 ± 0.0431 0.2638 ± 0.0219 0.3209 ± 0.0368 0.4384 ± 0.0499 0.2372 ± 0.0145
Reuters 0.2135 ± 0.0215 0.2075 ± 0.0111 0.5770 ± 0.0278 0.2810 ± 0.0233 0.2655 ± 0.0136 0.2250 ± 0.0061 0.2236 ± 0.0327
Enron 0.3555 ± 0.0411 0.3300 ± 0.0445 0.2962 ± 0.0250 0.3407 ± 0.0829 0.3731 ± 0.0984 0.3354 ± 0.0377 0.3167 ± 0.0243
Computer 0.4138 ± 0.0132 0.5616 ± 0.1489 0.4762 ± 0.0126 0.4338 ± 0.0175 0.4312 ± 0.0157 0.4220 ± 0.0159 0.4260 ± 0.0127
Recreation 0.6345 ± 0.0125 0.8056 ± 0.0113 0.8040 ± 0.0136 0.6324 ± 0.0106 0.6636 ± 0.0159 0.6770 ± 0.0141 0.6606 ± 0.0088
Reference 0.4525 ± 0.0055 0.9486 ± 0.0326 0.5340 ± 0.0139 0.4666 ± 0.0120 0.4666 ± 0.0098 0.4536 ± 0.0214 0.4523 ± 0.0101
Average 0.3328 0.4338 0.4688 0.3416 0.3584 0.3545 0.3248
Ave. Rank 3.6 4.8 5.9 3.5 4.6 4.1 2.2

In Table 3, CLSF and CDR both remove a noticeable number of unnecessary features with respect to the labels on all data sets.
On Computer, Reference, Medical and Cal500, CLSF selects a smaller compact feature subset than CDR. The ability of CLSF
and CDR to delete useless features is comparable on Emotion, Yeast and Enron, while CLSF removes slightly fewer redundant
features than CDR on the other four data sets. Nevertheless, combining Tables 4–8, it is clear that CLSF achieves obvious
performance advantages over CDR on the five multi-label evaluation metrics for every data set. As a supplement to
Table 3, Fig. 1 clearly depicts the numbers of features in the compact feature subsets for each relevant label group of Cal500 and
Enron. CLSF extracts features efficiently from Cal500 (and Enron), selecting 8 ∼ 26 (respectively 4 ∼ 92) indispensable
features per relevant label group and a weighted average of 19 (respectively 86) features per label. Table 3 shows that
CLSF is also generally efficient on the other nine data sets. For instance, on Yeast and Recreation, the labels
possess strong correlations with one another, and about 90% of the features in the feature set are indispensable. CLSF
splits the label sets of Birds, Emotion and Scene into two relevant label groups each and selects on average about 75% of the indispensable
features for each label. Reuters has 7 labels and 3 relevant label groups, which are related to 159 indispensable features for

Table 7
Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of Hamming Loss ↓.

Datasets ML-KNN LPLC CDR SCLS MDMR MUCO CLSF-MK

Birds 0.0567 ± 0.0030 0.2529 ± 0.0065 0.0585 ± 0.0072 0.0554 ± 0.0056 0.0638 ± 0.0019 0.0550 ± 0.0042 0.0549 ± 0.0066
Yeast 0.2000 ± 0.0022 0.2167 ± 0.0117 0.2320 ± 0.0044 0.1963 ± 0.0022 0.1973 ± 0.0016 0.1999 ± 0.0009 0.1960 ± 0.0054
Emotion 0.2301 ± 0.0124 0.4504 ± 0.0107 0.3141 ± 0.0165 0.2153 ± 0.0159 0.2257 ± 0.0080 0.2139 ± 0.0106 0.2130 ± 0.0096
Cal500 0.1398 ± 0.0012 0.1608 ± 0.0032 0.1372 ± 0.0040 0.1370 ± 0.0063 0.1408 ± 0.0064 0.1372 ± 0.0060 0.1401 ± 0.0034
Scene 0.1021 ± 0.0009 0.3015 ± 0.0091 0.1713 ± 0.0088 0.1184 ± 0.0220 0.1174 ± 0.0173 0.1208 ± 0.0190 0.1026 ± 0.0218
Medical 0.0172 ± 0.0006 0.0718 ± 0.0013 0.0142 ± 0.0017 0.0143 ± 0.0010 0.0164 ± 0.0008 0.0206 ± 0.0018 0.0142 ± 0.0010
Reuters 0.0743 ± 0.0043 0.2101 ± 0.0339 0.1646 ± 0.0015 0.0904 ± 0.0040 0.0869 ± 0.0040 0.0733 ± 0.0036 0.0552 ± 0.0034
Enron 0.0565 ± 0.0039 0.1768 ± 0.0162 0.0571 ± 0.0058 0.0543 ± 0.0056 0.0555 ± 0.0049 0.0554 ± 0.0066 0.0542 ± 0.0042
Computer 0.0380 ± 0.0013 0.2177 ± 0.0368 0.0443 ± 0.0016 0.0396 ± 0.0014 0.0391 ± 0.0014 0.0383 ± 0.0015 0.0388 ± 0.0013
Recreation 0.0595 ± 0.0008 0.3094 ± 0.0608 0.0645 ± 0.0009 0.0594 ± 0.0014 0.0601 ± 0.0014 0.0602 ± 0.0013 0.0593 ± 0.0011
Reference 0.0291 ± 0.0002 0.1276 ± 0.0325 0.0355 ± 0.0006 0.0293 ± 0.0010 0.0302 ± 0.0007 0.0308 ± 0.0013 0.0301 ± 0.0008

Average 0.0912 0.2269 0.1176 0.0918 0.0939 0.0914 0.0881


Ave. Rank 3.4 6.9 5.3 2.9 4.2 3.5 1.8

Table 8
Comparison results between CLSF-MK and other six algorithms (mean ± std.) in terms of Coverage ↓.

Datasets ML-KNN LPLC CDR SCLS MDMR MUCO CLSF-MK

Birds 3.512 ± 0.1124 3.741 ± 0.3147 4.234 ± 0.3571 3.132 ± 0.2802 3.318 ± 0.3112 3.112 ± 0.2864 3.107 ± 0.1444
Yeast 6.449 ± 0.0366 6.711 ± 0.2063 6.784 ± 0.1016 6.340 ± 0.0467 6.388 ± 0.0488 6.405 ± 0.0697 6.315 ± 0.1223
Emotion 1.944 ± 0.0427 2.232 ± 0.1930 3.150 ± 0.0519 1.860 ± 0.0794 1.973 ± 0.1222 1.880 ± 0.1181 1.857 ± 0.1027
Cal500 131.5 ± 0.4841 148.5 ± 1.734 129.9 ± 2.033 131.6 ± 1.758 132.1 ± 1.960 131.2 ± 2.073 131.0 ± 2.149
Scene 0.5246 ± 0.0440 0.6050 ± 0.0391 2.969 ± 0.3985 0.6987 ± 0.1067 0.7034 ± 0.0875 0.6921 ± 0.0879 0.6329 ± 0.0973
Medical 3.634 ± 0.2290 3.241 ± 0.2560 2.855 ± 0.4781 3.188 ± 0.4650 4.110 ± 0.5382 4.685 ± 0.4101 3.463 ± 0.5349
Reuters 0.7545 ± 0.1045 0.6980 ± 0.0688 1.705 ± 0.0908 0.9410 ± 0.0674 0.8650 ± 0.0391 0.7390 ± 0.0406 0.6930 ± 0.0445
Enron 14.75 ± 0.9089 20.53 ± 1.201 14.37 ± 1.576 14.31 ± 1.899 14.94 ± 1.910 14.49 ± 2.052 13.64 ± 1.269
Computer 4.226 ± 0.1893 9.794 ± 2.067 4.742 ± 0.2710 4.259 ± 0.2874 4.156 ± 0.2448 4.089 ± 0.2274 4.078 ± 0.2105
Recreation 4.988 ± 0.1603 8.874 ± 0.5220 5.715 ± 0.1374 4.800 ± 0.1366 4.854 ± 0.2115 4.861 ± 0.1414 4.838 ± 0.1278
Reference 3.498 ± 0.0438 9.419 ± 0.6589 3.963 ± 0.2006 3.385 ± 0.1480 3.281 ± 0.1400 3.300 ± 0.1407 3.157 ± 0.0692

Average 15.98 19.49 16.40 15.86 16.06 15.95 15.71


Ave. Rank 4.3 5.5 5.3 3.4 4.5 3.5 1.6

Fig. 1. The performance of CLSF for Cal500 and Enron.

label on average, and 167, 147 and 175 features for the three label groups. Among the remaining data sets, Computer, Reference and Medical
contain many labels and involve many outliers in their relevancy judgement matrices. Reference includes 8 ∼ 26 features per
relevant label group and 608 indispensable features per label on average. In Computer, 556, 448, 387, 403 and 394 indispensable features
are associated with the relevant label groups and labels. Medical is divided by CLSF into 12 relevant label groups,
where one group contains 34 strongly connected labels and the other 11 groups each contain only one label. The numbers of features selected per relevant
label group for Medical are concentrated between 5 ∼ 18, with one group of 75. As a result, the classification performance of CLSF-MK, which combines
CLSF with ML-KNN, is effectively promoted.
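The per-label averages quoted above (e.g. a weighted average of 19 features per label for Cal500 and 86 for Enron) can be read as a label-count-weighted mean over the relevant label groups. The helper below is an illustrative sketch of that reading, not the authors' code; the `groups` input format is an assumption.

```python
def weighted_avg_features_per_label(groups):
    """One plausible reading of the 'weighted average ... features for every
    label': each label inherits the feature count selected for its relevant
    label group, and the counts are averaged over all labels.

    groups -- list of (n_labels_in_group, n_features_selected_for_group)
              pairs (an assumed format, not the authors' data layout).
    """
    total_labels = sum(n for n, _ in groups)
    return sum(n * f for n, f in groups) / total_labels
```

For example, two groups with 2 and 1 labels and 10 and 4 selected features give (2 × 10 + 1 × 4) / 3 = 8 features per label.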
To demonstrate the classification performance of each algorithm more clearly and specifically, the best experimental
results of each multi-label algorithm over all data sets are reported in Tables 4–8, where “↑” indicates “the bigger the
better”, while “↓” indicates “the smaller the better”. For a fair comparison, CLSF-MK and the other six comparing multi-label
classification algorithms are each run five times on randomly partitioned training (80%) and testing (20%) data.
The algorithms are tuned by five-fold internal cross validation on the training data. Based on these experimental results, the following
observations can be made:
Compared with the other four multi-label feature selection methods, the Average Precision values of CLSF-MK are the
largest on nine data sets, the exceptions being Cal500 and Computer. On almost all data sets other than Scene, the Average
Precision values of CLSF-MK are better than those of ML-KNN and LPLC. The Ranking Loss values of CLSF-MK are also better
than those of the other four feature selection methods (CDR, SCLS, MDMR and MUCO) on nine data sets, excluding Cal500 and Recreation.
Except on Scene, CLSF-MK outperforms ML-KNN in Ranking Loss, and on every data set its Ranking Loss is better than that of LPLC.
Based on Table 6, the One Error values of CLSF-MK are better than those of the other
four feature selection methods on Birds, Yeast, Emotion, Scene, Medical, Reuters and Reference. On Scene and Computer, the One
Error values of CLSF-MK are close to those of ML-KNN, and on the other nine data sets CLSF-MK is better than ML-KNN; on every
data set its One Error is better than that of LPLC. For Hamming Loss, CLSF-MK is close to ML-KNN on Scene and Computer
and smaller than ML-KNN on the other nine data sets, smaller than LPLC
on all data sets, and better than the other four feature selection algorithms
on nine data sets, excluding Cal500 and Computer. According to Table 8, the Coverage values of CLSF-MK are better
than those of CDR, SCLS, MDMR and MUCO on eight data sets, the exceptions being Cal500, Medical and Recreation.
On the ten data sets other than Scene, the Coverage values of CLSF-MK are better than those of ML-KNN, and CLSF-MK also gives better
results than LPLC on all data sets.
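For reference, the five metrics compared above follow the standard multi-label definitions surveyed in the review literature (e.g. [20]). The sketch below implements four of them in plain Python for a binary label matrix `Y_true` and a real-valued score matrix; tie handling in Ranking Loss varies across implementations, so it should be read as illustrative rather than as the exact code used in the experiments.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of misclassified example-label pairs (lower is better)."""
    n, s = len(Y_true), len(Y_true[0])
    return sum(yt != yp
               for row_t, row_p in zip(Y_true, Y_pred)
               for yt, yp in zip(row_t, row_p)) / (n * s)

def one_error(Y_true, scores):
    """Fraction of examples whose top-ranked label is not relevant."""
    errs = 0
    for row_t, row_s in zip(Y_true, scores):
        top = max(range(len(row_s)), key=row_s.__getitem__)
        errs += row_t[top] != 1
    return errs / len(Y_true)

def ranking_loss(Y_true, scores):
    """Average fraction of (relevant, irrelevant) label pairs ranked in the
    wrong order; ties are counted as errors here."""
    total = 0.0
    for row_t, row_s in zip(Y_true, scores):
        rel = [j for j, y in enumerate(row_t) if y == 1]
        irr = [j for j, y in enumerate(row_t) if y == 0]
        if not rel or not irr:
            continue
        bad = sum(row_s[r] <= row_s[i] for r in rel for i in irr)
        total += bad / (len(rel) * len(irr))
    return total / len(Y_true)

def coverage(Y_true, scores):
    """Average 0-based rank depth needed to cover all relevant labels
    (assumes every example has at least one relevant label)."""
    total = 0
    for row_t, row_s in zip(Y_true, scores):
        order = sorted(range(len(row_s)), key=lambda j: -row_s[j])
        rank = {j: r for r, j in enumerate(order)}  # 0-based rank per label
        total += max(rank[j] for j, y in enumerate(row_t) if y == 1)
    return total / len(Y_true)
```

With the 0-based rank convention for Coverage, a value such as 0.5246 on Scene means that, on average, only about half an extra ranking step beyond the top label is needed to reach all relevant labels.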
From Tables 4 to 8, we observe that CLSF-MK outperforms the other six algorithms, achieving the best average
values and average ranks on all five evaluation metrics. By considering label correlation, CLSF-MK performs better than the other three
feature selection algorithms (SCLS, MDMR and MUCO) on ten data sets for Hamming Loss,
Ranking Loss and Coverage, and on nine data sets for Average Precision and One Error. On all five evaluation metrics, CLSF-MK
shows better classification ability than ML-KNN and LPLC except on one to three data sets, which indicates that CLSF-MK is superior
to ML-KNN and LPLC even though the latter use all features. The performance of CLSF-MK is also better than that of CDR on all but
one or two data sets. These facts demonstrate that the feature selection method based on label correlation,
CLSF-MK, indeed improves the classification performance, and that the proposed method does not cause ML-KNN to lose important
features.
In general, each multi-label feature selection algorithm has its limitations, and no single algorithm is suitable for all
situations; it is rare for one algorithm to achieve the best values on every evaluation metric. Nevertheless, we can conclude
that CLSF-MK performs better than ML-KNN, LPLC, CDR, SCLS, MDMR and MUCO overall, which indicates the effectiveness and
practicability of the method proposed in this paper.
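The “Ave. Rank” rows in Tables 4–8 can be reproduced by ranking the algorithms on each data set (rank 1 = best) and averaging over the data sets. A minimal sketch, with tied values sharing the mean of the positions they occupy:

```python
def average_ranks(results, lower_is_better=True):
    """Average rank of each algorithm over data sets, as in the 'Ave. Rank'
    rows: on each data set the best value receives rank 1, and tied values
    share the mean of their 1-based positions.

    results -- {algorithm: [metric value on data set 1, 2, ...]}
    """
    algs = list(results)
    n_data = len(next(iter(results.values())))
    totals = {a: 0.0 for a in algs}
    for d in range(n_data):
        sign = 1 if lower_is_better else -1
        vals = sorted((sign * results[a][d], a) for a in algs)
        i = 0
        while i < len(vals):
            j = i
            while j < len(vals) and vals[j][0] == vals[i][0]:
                j += 1
            mean_pos = (i + 1 + j) / 2  # mean of positions i+1 .. j
            for _, a in vals[i:j]:
                totals[a] += mean_pos
            i = j
    return {a: totals[a] / n_data for a in algs}
```

For loss-type metrics such as Ranking Loss the default `lower_is_better=True` applies; for Average Precision one would pass `lower_is_better=False`.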

6. Conclusions

This paper focused on improving the efficiency and performance of multi-label classification by exploring feature selec-
tion and label correlation. We presented a novel feature selection method for multi-label data, named CLSF, which is based
on label correlation and has several advantages over existing methods. First, the essential elements of each label are defined
and calculated by considering the uncertain and certain information determined by the label and the feature set. Then, a label
relevancy measure between labels and the corresponding relevancy judgement matrix are constructed from the overlap of the
different families of essential elements related to each label, a step that is neglected in existing algorithms. This matrix
reflects the correlation among labels, so the label set can be divided into several relevant label groups while both local
and global label correlations are taken into account. CLSF thus handles feature selection and local label correlation
simultaneously: the labels in a group of relevant labels are integrated to reconstruct a binary relation, and, to simplify
multi-label classification, the indispensable features with respect to the closely related labels are selected for each
relevant label group. In summary, the proposed method improves the effectiveness and accuracy of prediction.
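As a rough illustration of the grouping step (not the paper's exact construction, which defines the relevancy γ(j, k) through overlaps of essential-element families), one can threshold a pairwise overlap of per-label indispensable feature sets and take connected components of the resulting relevancy judgement matrix. The Jaccard overlap and the threshold `alpha` below are placeholder choices:

```python
def relevancy_matrix(feature_sets, alpha=0.5):
    """feature_sets: one set of indispensable feature indices per label.
    Relevancy is sketched here as a thresholded Jaccard overlap; the paper's
    gamma(j, k) is defined on families of essential elements instead."""
    s = len(feature_sets)
    rel = [[0] * s for _ in range(s)]
    for j in range(s):
        for k in range(s):
            inter = len(feature_sets[j] & feature_sets[k])
            union = len(feature_sets[j] | feature_sets[k]) or 1
            rel[j][k] = 1 if inter / union >= alpha else 0
    return rel

def relevant_label_groups(rel):
    """Split the label set into disjoint relevant label groups: connected
    components of the relevancy judgement matrix."""
    s = len(rel)
    seen, groups = set(), []
    for start in range(s):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            j = stack.pop()
            comp.append(j)
            for k in range(s):
                if rel[j][k] and k not in seen:
                    seen.add(k)
                    stack.append(k)
        groups.append(sorted(comp))
    return groups
```

Labels whose indispensable features overlap heavily end up in the same group, while labels with disjoint feature sets form singleton groups, mirroring the behaviour reported for data sets such as Medical.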
There are several feasible directions for subsequent research. The discretization method applied by Kim and Lee [13] sug-
gests investigating a more suitable discretization of multi-label numerical data for calculating the relationships among
labels. Inspired by Kim and Lee [13,14] and Lin and Hu [27,29], future work could define the dependency of a feature relative
to a relevant label group and the exclusion among features in the compact feature subsets with respect to that group, and
then select the minimum number of features that keeps the deterministic information between the feature set and the given
label group unchanged. In addition, other methods of synthesizing the labels in a relevant label group could be pursued to
further improve classification performance.

Declaration of competing interest

The authors declare that they have no conflict of interest regarding this work, and no commercial or associative interest
that represents a conflict of interest in connection with the work submitted.

Acknowledgements

This paper is supported by grants of the National Natural Science Foundation of China (61573127, 71471060) and the fund
of North China Electric Power University.

Appendix A

Abbreviations Explanations

X = { (x, A(x ))|x ∈ U } X is set of input instances, A is set of features, U is domain of input objects
Y = {l1 , l2 , . . . , ls } Y is finite set of s possible labels
D = { (A(x ), Y (x ))|x ∈ U } Training data set with instances and their related labels
K (x ) Set of possible labels of object x
l j ( xi ) The ith object xi associating with the jth label l j
RB Indiscernibility relation with respect to subset of features B
U/RB = {[x]B |x ∈ U } U/RB is quotient set with respect to B, [x]B is equivalence class with respect to B
U/l j = {Pos j , N eg j } Quotient set of positive (Pos j ) and negative (N eg j ) objects with respect to label l j
PosA (l j )(NegA (l j )) Positive (Negative) consistent region with respect to l j
Pj∗ Set of pairs of equivalence classes to be discerned with respect to l j
Pj Distribution discernibility matrix with respect to l j
P j ( [x]A , [y]A ) Discernibility feature set discerning [x]A and [y]A with respect to l j
E j = {E j |E j ⊆ P j } Group of essential elements with respect to l j
γ ( j, k ) Label relevancy with respect to labels l j and lk
α Threshold on the coincidence degree between groups of essential elements
β Threshold on the coincidence degree between essential elements
Rel = (γ ( j, k ))|Y |×|Y | Relevancy judgement matrix with respect to set of labels Y
clu(Y ) Disjoint relevant label groups
pos j,k ([x]A )(neg j,k ([x]A )) Local positive (negative) correlation of label pair (l j , lk ) with respect to [x]A
loc j,k ([x]A ) Local correlation of label pair (l j , lk ) with respect to [x]A
η Local correlation parameter
glo( j, k ) Global correlation with respect to label pair (l j , lk )
L (Y ) Collection of label pairs with respect to group of relevant labels Y
G j,k ([x]A ) The base learner of label pair (l j , lk ) with respect to [x]A
e j,k ([x]A ) Classification error rate of label pair (l j , lk ) with respect to [x]A
α j,k ([x]A ) The coefficient of G j,k with respect to [x]A
fY ([x]A ) Linear combination of base learners with respect to Y
s(Y ) Each step of region with respect to Y
GY ([x]A ) The final learner with respect to Y
PY∗ Set of pairs of equivalence classes to be discerned with respect to Y
PY Distribution discernibility matrix with respect to Y
PY ([x]A , [y]A ) Discernibility feature set discerning [x]A and [y]A with respect to Y
EY = {EY |EY ⊆ PY } Group of essential elements with respect to Y
|·| Cardinality of a set
· Smallest integer not less than a number

References

[1] A. Cano, J. Luna, E. Gibaja, S. Ventura, LAIM discretization for multi-label data, Inf. Sci. 330 (2016) 370–384.
[2] A. Ghazikhani, R. Monsefi, H.S. Yazdi, Online neural network model for non-stationary and imbalanced data stream classification, Int. J. Mach. Learn.
Cybern. 5 (1) (2014) 51–62.
[3] D. Chen, Y. Yang, Z. Dong, An incremental algorithm for attribute reduction with variable precision rough sets, Appl. Soft Comput. 45 (2016) 129–149.
[4] G. Nan, Q. Li, R. Dou, et al., Local positive and negative correlation-based k-labelsets for multi-label classification, Neurocomputing 318 (2018) 90–101.
[5] G. Tsoumakas, E. Spyromitros-Xioufis, V. Vilcek, et al., MULAN: a java library for multi-label learning, J. Mach. Learn. Res. 12 (7) (2011) 2411–2414.
[6] G. Tsoumakas, I. Katakis, Multi-label classification: an overview, Int. J. Data Warehousing Min. 3 (3) (2007) 1–13.
[7] H. Li, D. Li, Y. Zhai, et al., A novel attribute reduction approach for multi-label data based on rough set theory, Inf. Sci. 367–368 (2016) 827–847.
[8] J. Dai, H. Hu, W. Wu, et al., Maximal-discernibility-pair-based approach to attribute reduction in fuzzy rough sets, IEEE Trans. Fuzzy Syst. 26 (4) (2018)
2174–2187.
[9] J. Huang, G. Li, S. Wang, et al., Multi-label classification by exploiting local positive and negative pairwise label correlation, Neurocomputing 257 (2017)
164–174.
[10] J. Lee, D. Kim, Feature selection for multi-label classification using multivariate mutual information, Pattern Recognit. Lett. 34 (2013) 349–357.
[11] J. Lee, D. Kim, Mutual information-based multi-label feature selection using interaction information, Expert Syst. Appl. 42 (2015) 2013–2025.
[12] J. Lee, D. Kim, Memetic feature selection algorithm for multi-label classification, Inf. Sci. 293 (2015) 80–96.
[13] J. Lee, D. Kim, SCLS: Multi-label feature selection based on scalable criterion for large label set, Pattern Recognit. 66 (2017) 342–352.
[14] J. Lee, H. Kim, N. Kim, et al., An approach for multi-label classification by directed acyclic graph with label correlation maximization, Inf. Sci. 351
(2016) 101–114.
[15] J. Zhang, M. Fang, X. Li, Multi-label learning with discriminative features for each label, Neurocomputing 154 (2015) 305–316.
[16] L. Chen, D. Chen, H. Wang, Alignment based kernel selection for multi-label learning, Neural Process. Lett. (2018), doi:10.1007/s11063-018-9863-z.
[17] M. Boutell, J. Luo, X. Shen, C.M. Brown, Learning multi-label scene classification, Pattern Recognit. 37 (2004) 1757–1771.
[18] M. Rahman, M. Islam, FIMUS: a framework for imputing missing values using co-appearance, correlation and similarity analysis, Knowl.-Based Syst.
56 (3) (2014) 311–327.
[19] M. Zhang, Z. Zhou, ML-KNN: A lazy learning approach to multi-label learning, Pattern Recognit. 40 (7) (2007) 2038–2048.
[20] M. Zhang, Z. Zhou, A review on multi-label learning algorithms, IEEE Trans. Knowl. Data Eng. 26 (8) (2014) 1819–1837.
[21] M. Zhang, L. Wu, LIFT: multi-label learning with label-specific features, IEEE Trans. Pattern Anal. Mach. Intell. 37 (1) (2015) 1609–1614.
[22] R. Schapire, Y. Singer, Boostexter: a boosting-based system for text categorization, Mach. Learn. 39 (2–3) (2000) 135–168.

[23] S. Huang, Z. Zhou, Multi-label learning by exploiting label correlations locally, in: Proceedings of the 26th AAAI Conference on Artificial Intelligence,
2012, pp. 949–955.
[24] W. Weng, Y. Lin, S. Wu, et al., Multi-label learning based on label-specific features and local pairwise label correlation, Neurocomputing 273 (2018)
385–394.
[25] Y. Lin, J. Li, P. Lin, G. Lin, J. Chen, Feature selection via neighborhood multi-granulation fusion, Knowl. Based Syst. 67 (2014) 162–168.
[26] Y. Lin, Q. Hu, J. Liu, J. Chen, J. Duan, Multi-label feature selection based on neighborhood mutual information, Appl. Soft Comput. 38 (2016) 244–256.
[27] Y. Lin, Q. Hu, J. Liu, et al., Streaming feature selection for multi-label learning based on fuzzy mutual information, IEEE Trans. Fuzzy Syst. 25 (6) (2017)
1491–1507.
[28] Y. Lin, Q. Hu, J. Zhang, X. Wu, et al., Multi-label feature selection with streaming labels, Inf. Sci. 372 (2016) 256–275.
[29] Y. Lin, Q. Hu, J. Liu, et al., Multi-label feature selection based on max-dependency and min-redundancy, Neurocomputing 168 (2015) 92–103.
[30] Y. Yu, W. Pedrycz, D. Miao, Multi-label classification by exploiting label correlations, Expert Syst. Appl. 41 (6) (2014) 2989–3004.
[31] Y. Yu, W. Pedrycz, D. Miao, Neighborhood rough sets based multi-label classification for automatic image annotation, Int. J. Approx. Reason. 54 (9)
(2013) 1373–1387.
[32] Y. Zhu, J. Kwok, Z. Zhou, Multi-label learning with global and local label correlation, IEEE Trans. Knowl. Data Eng. 30 (2017) 1081–1094.
[33] Z. Barutcuoglu, R.E. Schapire, O.G. Troyanskaya, Hierarchical multi-label prediction of gene function, Bioinformatics 22 (7) (2006) 830–836.
[34] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (5) (1982) 341–356.
