ing a linear classifier on a hidden subspace is non-properly learnable.

At a high level, we assume that a sample is a triplet (x, o, y), where x ∈ R^d is the complete example, o ⊆ {1, . . . , d} is the set of observable attributes, and y ∈ Y is the label. The learner observes only (x_o, y), where x_o omits any attribute not in o. Our goal is, given a sample S = {(x^(i)_{o_i}, y^(i))}_{i=1}^m, to output a classifier f_S such that, w.h.p.,

    E[ℓ(f_S(x_o), y)]  ≤  min_{w ∈ R^d, ||w|| ≤ 1} E[ℓ(w · x, y)] + ε,

where ℓ is the loss function. Namely, we would like our classifier f_S to compete with the best linear classifier for the completely observable data.
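For concreteness, a minimal sketch of this data model (the type and helper names are ours, not the paper's): the learner never sees the complete example x, only the observed view (x_o, y).

```python
# A tiny illustration of the data model above: a sample is a triplet (x, o, y),
# but the learner only ever receives the observed view (x_o, y). Names are ours.
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    x: np.ndarray        # complete example in R^d (never shown to the learner)
    o: frozenset         # indices of the observable attributes
    y: float             # label

def observed_view(s: Sample):
    """Return (x_o, y): the attributes indexed by o, in increasing order, plus the label."""
    idx = sorted(s.o)
    return s.x[idx], s.y

s = Sample(x=np.array([0.3, -1.2, 0.7, 0.0]), o=frozenset({0, 2}), y=1.0)
print(observed_view(s))    # (array([0.3, 0.7]), 1.0)
```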
Our main result achieves this task (under mild regularity conditions) using a computationally efficient algorithm for any convex Lipschitz-bounded loss function. Our basic result requires a sample size which is quasi-polynomial, but we complement it with a kernel construction which can guarantee efficient learning under appropriate large-margin assumptions. Our kernel depends only on the intersection of observable values of two inputs, and is efficiently computable. (We give a more detailed overview of our main results in Section 2.)

We complement our theoretical contributions with experimental findings that show superior classification performance both on synthetic data and on publicly available recommendation data.

Previous work. Classification with missing data is a well-studied subject in statistics, with numerous books and papers devoted to it (see, e.g., (Little & Rubin, 2002)). The statistical treatment of missing data is broad, and to a fairly large extent assumes parametric models both for the data-generating process and for the process that creates the missing data. One of the most popular models for the missing-data process is Missing Completely at Random (MCAR), where the missing attributes are selected independently of the values.

We outline a few of the main approaches to handling missing data in the statistics literature. The simplest method is to discard records with missing data; even this assumes independence between the examples with missing values and their labels. In order to estimate simple statistics, such as the expected value of an attribute, one can use importance sampling methods, where the probability of an attribute being missing can depend on its value (e.g., using the Horvitz-Thompson estimator (Horvitz & Thompson, 1952)). A large body of techniques is devoted to imputation procedures, which complete the missing data. This can be done by replacing a missing attribute by its mean (mean imputation), by using a regression based on the observed values (regression imputation), or by sampling the other examples to complete the missing value (hot deck).¹ The imputation methodologies share a similar goal with matrix completion, namely reducing the problem to one with complete data; however, their methodologies and motivating scenarios are very different. Finally, one can build a complete Bayesian model for both the observed and unobserved data and use it to perform inference (e.g., (Vishwanathan et al., 2005)). As with almost any Bayesian methodology, its success depends largely on selecting the right model and prior, and this is even ignoring the computational issues which make inference in many of those models intractable.

¹ We remark that our model implicitly includes the mean-imputation and 0-imputation methods and therefore will always outperform them.

In the machine learning community, missing data was considered in the framework of limited attribute observability (Ben-David & Dichterman, 1998) and its many refinements (Dekel et al., 2010; Cesa-Bianchi et al., 2010; 2011; Hazan & Koren, 2012). However, to the best of our knowledge, the low-rank property is not captured by previous work, nor is the extreme amount of missing data. More importantly, much of that research focuses on selecting which attributes to observe, or on attributes that are missing at test or train time (see also (Eban et al., 2014; Globerson & Roweis, 2006)). In our case the learner has no control over which attributes are observable in an example, and the domain is fixed. The latter case is captured in the work of (Chechik et al., 2008), who rescale inner products according to the amount of missing data. Their method, however, does not entail theoretical guarantees on reconstruction in the worst case, and it gives rise to non-convex programs.

A natural and intuitive methodology is to treat the labels (both known and unknown) as an additional column in the data matrix and to complete the data using a matrix completion algorithm, thereby obtaining the classification. Indeed, exactly this was proposed in the innovative work of (Goldberg et al., 2010), who connected the methodology of matrix completion to prediction from missing data. Although this is a natural approach, we show that completion is neither necessary nor sufficient for classification. Furthermore, the techniques for provably completing a low-rank matrix are only known under probabilistic models with restricted distributions (Srebro, 2004; Candes & Recht, 2009; Lee et al., 2010; Salakhutdinov & Srebro, 2010; Shamir & Shalev-Shwartz, 2011), and we were not able to use agnostic, non-probabilistic matrix completion algorithms such as (Srebro et al., 2004; Hazan et al., 2012) for our purposes.
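As a concrete reference point for the methodology just described (labels as an additional column, completed by a low-rank method), here is a simplified sketch using an iterative-SVD completion. It is our own stand-in under stated assumptions, not the mcb/mc1 algorithms of (Goldberg et al., 2010); missing entries are marked with NaN.

```python
# Simplified sketch of the "labels as an extra column" idea: stack [features | labels],
# treat unknown labels (and missing features) as missing entries, run an iterative
# rank-truncated SVD completion, and read off the completed label column.
# This is an illustrative stand-in, not Goldberg et al.'s actual convex relaxation.
import numpy as np

def complete_low_rank(M, rank=5, iters=100):
    """Iteratively fill missing entries (NaN) with a rank-`rank` SVD approximation."""
    mask = ~np.isnan(M)
    filled = np.where(mask, M, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(mask, M, low_rank)      # keep observed entries, update the rest
    return filled

def classify_by_completion(X_obs, y_train, n_train, rank=5):
    """X_obs may contain NaNs; labels of the last rows are unknown (NaN)."""
    y_col = np.full((X_obs.shape[0], 1), np.nan)
    y_col[:n_train, 0] = y_train
    completed = complete_low_rank(np.hstack([X_obs, y_col]), rank=rank)
    return np.sign(completed[n_train:, -1])       # sign of the completed label column
```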
Is matrix completion sufficient and/or necessary? We demonstrate that classification with missing data is provably different from matrix completion. We start by considering a learner that tries to complete the missing entries in an unsupervised manner and then performs classification on the completed data; this approach is closely akin to imputation techniques, generative models, and any other two-step unsupervised/supervised algorithm. Our example shows that even under realizability assumptions, such an algorithm may fail. We then proceed to analyze the approach previously mentioned, namely treating the labels as an additional column.

To see that unsupervised completion is insufficient for prediction, consider the example in Figure 1: the original data is represented by filled red and green dots and it is linearly separable. Each data point has one of its two coordinates missing (this can even be done at random); in the figure, the arrow from each instance points to the observed attribute. However, a rank-one completion, namely projection onto the pink hyperplane, is possible, and it admits no separation. The problem is clearly that the mapping to a low dimension is independent of the labels, and therefore we should not expect properties that depend on the labels, such as linear separability, to be maintained.

First note that there is no rank-1 completion of the matrix M in Eq. 1. On the other hand, there is more than one rank-2 completion, and each leads to a different classification of the test point. The first two possible completions complete the odd columns to a constant 1 vector and the even columns to a constant −1 vector, and then complete the labeling whichever way you choose. We can also complete the first and last rows to a constant 1 vector and the second row to a constant −1 vector. All possible completions lead to an optimal solution w.r.t. Problem G but have different outcomes w.r.t. classification. We stress that this is not a sample complexity issue: even if we observe an abundant amount of data, the completion task is still ill-posed.

Finally, matrix completion is also not necessary for prediction. Consider a movie recommendation dataset with two separate populations, French and Chinese, where each population reviews a different set of movies. Even if each population has low rank, performing successful matrix completion in this case is impossible (and intuitively it does not make sense in such a setting). However, linear classification is possible via a single linear classifier, for example by setting all non-observed entries to zero. For a numerical example, return to the matrix M in Eq. 1: we observe only three instances, hence the classification task is easy but does not lead to reconstruction of the missing entries.
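The "not necessary" direction can be made concrete with a small sketch (our own illustration, not an experiment from the paper): two populations observe disjoint feature blocks, so the joint matrix cannot be completed, yet a single linear classifier over 0-imputed vectors does well on both populations.

```python
# Two populations observe disjoint feature blocks, so the full matrix has no
# recoverable completion across populations; still, one linear classifier over
# 0-imputed vectors attains high training accuracy. Illustration only.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 500                                     # ambient dimension, samples per population
blocks = [np.arange(0, 10), np.arange(10, 20)]     # disjoint observed feature blocks

X_rows, y_rows = [], []
for b in blocks:
    w_b = rng.normal(size=b.size)                  # population-specific labeling direction
    Z = rng.normal(size=(n, b.size))               # observed part of this population
    y = np.sign(Z @ w_b)
    X = np.zeros((n, d))
    X[:, b] = Z                                    # 0-imputation: unobserved entries stay 0
    X_rows.append(X)
    y_rows.append(y)

X = np.vstack(X_rows)
y = np.concatenate(y_rows)

# A single least-squares linear classifier for both populations on 0-imputed data.
w = np.linalg.lstsq(X, y, rcond=None)[0]
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```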
There exists an algorithm (independent of D) that receives a sample S = {(x^i_{o_i}, y_i)}_{i=1}^m of size

    m ∈ Ω( L² Γ(ε)² log(1/δ) / ε² )

and returns a target function f_S that is ε-good with probability at least 1 − δ. The algorithm runs in time poly(|S|).

As the sample complexity in Theorem 1 is quasi-polynomial, the result has limited practical value in many situations. However, as the next theorem states, f_S can actually be computed by applying a kernel trick. Thus, under further large-margin assumptions we can significantly improve performance.

Theorem 2. For every γ ≥ 0, there exists an embedding over missing data

    φ_γ : x_o → R^Γ,

such that

    E[ℓ(v* · φ_γ(x_o), y)]  ≤  min_{w ∈ B_d(1)} E[ℓ(w · x, y)] + ε.

To summarize, Theorem 2 states that we can embed the sample points with missing attributes in a high-dimensional, finite Hilbert space of dimension Γ, such that:

• The scalar product between embedded points can be computed efficiently. Hence, by the conventional representer argument, the task of empirical risk minimization is tractable.

• Following the conventional analysis of kernel methods: under large-margin assumptions in the ambient space, we can compute a predictor with scalable sample complexity and computational efficiency.
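To make the interface in the bullets above concrete, here is a sketch of kernelized regularized ERM (squared loss, representer theorem) with a placeholder kernel that depends only on the attributes observed in both inputs. The placeholder, a polynomial kernel over 0-imputed vectors, is our own stand-in and is not the embedding φ_γ or Algorithm 1 from the paper.

```python
# Sketch of the kernel-ERM pipeline suggested by Theorem 2, with a placeholder
# kernel that uses only the attributes observed in *both* inputs.
# NOTE: this is an illustration of the interface, not the paper's phi_gamma.
import numpy as np

def k_gamma(x, o, xp, op, gamma=3):
    shared = o & op                                # attributes observed in both inputs
    s = sum(x[j] * xp[j] for j in shared)
    return (1.0 + s) ** gamma                      # PSD: polynomial kernel over 0-imputed vectors

def fit_kernel_erm(X, O, y, gamma=3, lam=1e-2):
    """Regularized ERM with squared loss; by the representer theorem f = sum_i alpha_i k(x_i, .)."""
    m = len(y)
    K = np.array([[k_gamma(X[i], O[i], X[j], O[j], gamma) for j in range(m)] for i in range(m)])
    alpha = np.linalg.solve(K + lam * np.eye(m), y)
    return alpha

def predict(alpha, X, O, x, o, gamma=3):
    return np.sign(sum(a * k_gamma(xi, oi, x, o, gamma) for a, xi, oi in zip(alpha, X, O)))
```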
where f_{w,Q}(x_o) is the linear predictor defined by w over the subspace defined by the matrix Q, as formally defined in Definition 2.

Given the aforementioned efficient kernel mapping φ_γ, we consider the following kernel-gradient-based online algorithm for classification, called KARMA (Kernelized Algorithm for Risk-minimization with Missing Attributes).

Our main result for the fully adversarial online setting is given next, and is proved in the supplementary material and in the full version. Notice that the subspace E* and the associated projection matrix Q* are chosen by an adversary and are unknown to the algorithm.

Theorem 3. For any γ > 1, λ > 0, X > 0, ρ > 0, B > 0, L-Lipschitz convex loss function ℓ, and λ-regular sequence {(x_t, o_t, y_t)} w.r.t. subspace E* and associated projection matrix Q* such that ||x_t||_∞ < X, running Algorithm 1 with η_t = 1/(ρt) sequentially outputs {v_t ∈ R^t} that satisfy

    Σ_t ℓ(v_t^⊤ φ_γ(x^t_{o_t}), y_t) − min_{||w||≤1} Σ_t ℓ(f_{w,Q*}(x^t_{o_t}), y_t)
        ≤ (2L²XΓ/ρ)(1 + log T) + T·(ρ/2)·B² + LT·e^{−λγ}/λ.

As discussed, we consider a model where a distribution D is fixed over R^d × O × Y, where O = 2^{{1,...,d}} consists of all subsets of {1, . . . , d}. We will generally denote elements of R^d by x, w, v, u and elements of O by o. We denote by B_d the unit ball of R^d, and by B_d(r) the ball of radius √r.

Given a subset o, we denote by P_o : R^d → R^{|o|} the projection onto the indices in o, i.e., if i_1 ≤ i_2 ≤ · · · ≤ i_k are the elements of o in increasing order, then (P_o x)_j = x_{i_j}. Given a matrix A and a set of indices o, we let

    A_{o,o} = P_o A P_o^⊤.

Finally, given a subspace E ⊆ R^d, we denote by P_E : R^d → R^d the projection onto E.

3.2. Model Assumptions

Definition 1 (λ-regularity). We say that D is λ-regular with associated subspace E if the following holds with probability 1 (w.r.t. the joint random variables (x, o)):

1. ||P_o x|| ≤ 1.

2. x ∈ E.
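For concreteness, the projection notation above can be written as a few numpy helpers (the helper names are ours, not the paper's):

```python
# Small sketch of the notation above: P_o as coordinate selection, A_{o,o} as the
# principal submatrix P_o A P_o^T, and P_E as orthogonal projection onto a subspace E
# given by an orthonormal basis.
import numpy as np

def project_o(x, o):
    """P_o x: keep the coordinates indexed by o, in increasing order."""
    idx = np.sort(np.asarray(list(o)))
    return x[idx]

def submatrix_oo(A, o):
    """A_{o,o} = P_o A P_o^T: the principal submatrix on the indices in o."""
    idx = np.sort(np.asarray(list(o)))
    return A[np.ix_(idx, idx)]

def project_E(x, U):
    """P_E x for a subspace E spanned by the orthonormal columns of U (d x r)."""
    return U @ (U.T @ x)
```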
• karma: We’ve applied our method in all experiments. We evaluated the loss with γ ∈ {1, 2, 3, 4} and C ∈ {10^−5, 10^−4, . . . , 10^4, 10^5}, and chose γ and λ using a holdout set. A constant feature was added to allow a bias term and the data was normalized. For binary classification we let ℓ be the hinge loss, for multi-class tasks we used the multiclass hinge loss as described in (Crammer & Singer, 2002), and for regression tasks we used the squared loss (which was also used at test time).

• 0-imputation: 0-imputation simply corresponds to γ = 1. A constant feature was added to allow a bias term and the data was normalized so that the mean of each feature was 0.

• mcb/mc1: mcb and mc1 are methods suggested by (Goldberg et al., 2010). They are inspired by a convex relaxation of Problem G and differ in the way a bias term is introduced. When applying mcb and mc1 in binary classification tasks, we used the algorithm as suggested there. We ran their iterative algorithm until the incremental change was smaller than 1e−5 and used the logistic loss function to fit the entries. In multi-class tasks the label matrix was given by a {0, 1} matrix with a 1 entry at the correct label. In regression tasks we used the squared loss to fit the labels (as this is the loss we tested the results against).

• geom: Finally, we’ve applied geom (Chechik et al., 2008). This is an algorithm that tries to maximize the margin of the classifier, where the margin is computed with respect to the subspace of revealed entries. We applied this method only to binary classification, as it was designed specifically as a correction to hinge-loss methods. Again we used the iterative algorithm as suggested there, with 5 iterations. Regularization parameters were chosen using a holdout set.

The subset of observed features was drawn using an iid Bernoulli distribution: each feature had probability p = 0.1, 0.2, . . . , 0.9, 1 of being observed. The results appear in Figure 2 (square loss vs. fraction of observed features p).

Figure 2. Missing Completely At Random features.

5.1.2. Toy Data with Large Margin

As discussed earlier, under a large-margin assumption our algorithm enjoys strong guarantees. For this we considered a different type of noise model, one that guarantees a large margin. First we constructed a fully observed sample of size m = 10^4: the data resides in an r = 20 dimensional subspace of a d = 200 dimensional linear space. We then divided the features into three types, {A1, A2, A3}. Each subset contained "enough information", and we made sure that for each subset Ai there is a classifier w_{Ai} with large margin that correctly classifies x_o when o = Ai (in fact, a random division of the features led to this). Thus, if at each turn o = Ai for some i, the 0-imputation method would be guaranteed to perform well. However, in our noise model, for each sample each type had probability 1/3 of appearing (independently, conditioned on the event that at least one type of feature appears). Thus there is no guarantee that 0-imputation will work well.

Such a noise distribution can model, for example, a scenario where we collect our data by letting people answer a survey. A person is likely to answer a certain type of question and not answer a different type (say, for example, he
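A rough sketch of the toy construction described in Section 5.1.2, under our own modelling assumptions (the per-group large-margin check mentioned above is omitted):

```python
# Sketch (our own, not the paper's code) of the large-margin toy data:
# data in an r=20-dimensional subspace of R^200, features split into three
# groups A1, A2, A3, each group revealed with probability 1/3, conditioned on
# at least one group being revealed. The margin verification step is omitted.
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 200, 20, 10_000

basis = np.linalg.qr(rng.normal(size=(d, r)))[0]     # orthonormal basis of the hidden subspace
X = rng.normal(size=(m, r)) @ basis.T                # fully observed sample inside the subspace
w = basis @ rng.normal(size=r)                       # a labeling direction within the subspace
y = np.sign(X @ w)

groups = np.array_split(rng.permutation(d), 3)       # random division into A1, A2, A3

masks = np.zeros((m, d), dtype=bool)
for i in range(m):
    keep = rng.random(3) < 1 / 3
    while not keep.any():                            # condition on at least one group appearing
        keep = rng.random(3) < 1 / 3
    for g, k in zip(groups, keep):
        masks[i, g] = k

X_obs = np.where(masks, X, 0.0)                      # the learner's view (0 marks "missing")
```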
[Figure: comparison of karma, 0-imputation, and mcb/mc1; the y-axis ranges over 0–1 and the x-axis over 200–1000.]

A joke rated by almost all users was used as a label (specifically, joke number 5). The MovieLens data set includes users who rated various movies, together with meta-data such as age, gender and occupation. We used the meta-data to construct three different tasks. In the appendix we give the details of the data set sizes and the percentage of missing values in each task. Throughout, regression targets were normalized to have mean zero and standard deviation 1.
Acknowledgements: EH: This research was supported in part by the European Research Council project SUBLRN and by the Israel Science Foundation grant 810/11.

RL is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship.

YM: This research was supported in part by The Israeli Centers of Research Excellence (I-CORE) program (Center No. 4/11), by a grant from the Israel Science Foundation, and by a grant from the United States-Israel Binational Science Foundation (BSF).

References

Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17:2-3:255–287, 2011. URL http://sci2s.ugr.es/keel/missing.php.

Ben-David, S. and Dichterman, E. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56(3):277–298, 1998.

Berthet, Q. and Rigollet, P. Complexity theoretic lower bounds for sparse principal component detection. J. Mach. Learn. Res., W&CP, 30:1046–1066 (electronic), 2013.

Candes, E. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2009.

Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Efficient learning with partially observed attributes. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Online learning of noisy data. IEEE Transactions on Information Theory, 57(12):7907–7931, December 2011. ISSN 0018-9448. doi: 10.1109/TIT.2011.2164053.

Dekel, Ofer, Shamir, Ohad, and Xiao, Lin. Learning to classify with missing and corrupted features. Machine Learning, 81(2):149–178, November 2010. ISSN 0885-6125.

Eban, Elad, Mezuman, Elad, and Globerson, Amir. Discrete Chebyshev classifiers. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Globerson, Amir and Roweis, Sam. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning, pp. 353–360. ACM, 2006.

Goldberg, Andrew B., Zhu, Xiaojin, Recht, Ben, Xu, Jun-Ming, and Nowak, Robert D. Transduction with matrix completion: Three birds with one stone. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, pp. 757–765, 2010.

Goldberg, Ken, Roeder, Theresa, Gupta, Dhruv, and Perkins, Chris. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001. URL http://www.ieor.berkeley.edu/~goldberg/jester-data/.

Hazan, Elad and Koren, Tomer. Linear regression with limited observation. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 – July 1, 2012, 2012.

Hazan, Elad, Kale, Satyen, and Shalev-Shwartz, Shai. Near-optimal algorithms for online matrix prediction. In COLT, pp. 38.1–38.13, 2012.

Hazan, Elad, Livni, Roi, and Mansour, Yishay. Classification with low rank and missing data. arXiv preprint arXiv:1501.03273, 2015.

Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663–685, 1952.

Lee, J., Recht, B., Salakhutdinov, R., Srebro, N., and Tropp, J. A. Practical large-scale optimization for max-norm regularization. In NIPS, pp. 1297–1305, 2010.