oxstar@SJTU
January 6, 2011
Ex. 6.2 Show that $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$ for local linear regression. Define $b_j(x_0) = \sum_{i=1}^N (x_i - x_0)^j l_i(x_0)$. Show that $b_0(x_0) = 1$ for local polynomial regression of any degree (including local constants). Show that $b_j(x_0) = 0$ for all $j \in \{1, 2, \ldots, k\}$ for local polynomial regression of degree $k$. What are the implications of this on the bias?
Proof For local polynomial regression of degree $k$, the vector-valued function is $b(x)^T = (1, x, \ldots, x^k)$ and $B$ has rows $b(x_i)^T$, so the fit reproduces polynomials of degree at most $k$ exactly; in particular

$$\sum_{i=1}^N l_i(x_0)\, x_i^a = x_0^a, \qquad a = 0, 1, \ldots, k. \tag{1}$$

Taking $a = 0$ gives $b_0(x_0) = \sum_{i=1}^N l_i(x_0) = 1$ for local polynomial regression of any degree, including local constants.

Now fix $j \in \{1, \ldots, k\}$ and expand $(x_i - x_0)^j$ as a product of $j$ factors without combining like terms. Each of the $2^j$ resulting terms can be written as $(-1)^b x_i^a x_0^b$, where $a + b = j$. Obviously, the number of positive terms equals the number of negative terms, i.e. $\sum_{n=1}^{2^j} (-1)^{b_n} = 0$. Summing each term over $i$ and using (1), we get

$$\sum_{i=1}^N (-1)^b x_i^a x_0^b\, l_i(x_0) = (-1)^b x_0^b \sum_{i=1}^N l_i(x_0)\, x_i^a = (-1)^b x_0^b x_0^a = (-1)^b x_0^j. \tag{2}$$

Hence

$$b_j(x_0) = \sum_{i=1}^N (x_i - x_0)^j l_i(x_0) = \sum_{n=1}^{2^j} (-1)^{b_n} x_0^j = x_0^j \sum_{n=1}^{2^j} (-1)^{b_n} = 0.$$

In particular, taking $k = 1$ gives $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$ for local linear regression.
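As a numerical sanity check (a sketch of my own, not part of the original solution; the helper name and the Gaussian kernel choice are assumptions), we can compute the equivalent-kernel weights $l_i(x_0)$ for a degree-$k$ local fit and verify that $b_0(x_0) = 1$ and $b_j(x_0) = 0$ for $j = 1, \ldots, k$:

```python
import numpy as np

# For a degree-k local fit with any positive kernel, the equivalent-kernel
# weights l_i(x0) should satisfy b_0(x0) = 1 and b_j(x0) = 0, j = 1, ..., k.

def local_poly_weights(x, x0, k, bandwidth=0.5):
    """l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W with a Gaussian kernel W."""
    B = np.vander(x, N=k + 1, increasing=True)      # rows b(x_i)^T = (1, x_i, ..., x_i^k)
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)  # kernel weights
    b0 = x0 ** np.arange(k + 1)                     # b(x0)^T
    return b0 @ np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 50))
x0 = 0.4
moments = {k: [np.sum((x - x0) ** j * local_poly_weights(x, x0, k))
               for j in range(k + 1)]
           for k in (0, 1, 2)}   # moments[k][0] is b_0(x0); moments[k][j] is b_j(x0)
```

The check holds for any design points and any positive kernel weights, since it only uses the exact reproduction of polynomials of degree $\le k$ by weighted least squares.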
Hence, Taylor-expanding $f(x_i)$ about $x_0$ with coefficients $k_m = 1/m!$, the bias of a degree-$k$ fit is

$$
\begin{aligned}
\mathrm{E}\hat f(x_0) - f(x_0) &= \sum_{i=1}^N l_i(x_0) f(x_i) - f(x_0) \\
&= f(x_0) \sum_{i=1}^N l_i(x_0) - f(x_0) + f'(x_0) \sum_{i=1}^N (x_i - x_0)\, l_i(x_0) \\
&\quad + k_2 f''(x_0) \sum_{i=1}^N (x_i - x_0)^2 l_i(x_0) + \ldots + k_{k+1} f^{(k+1)}(x_0) \sum_{i=1}^N (x_i - x_0)^{k+1} l_i(x_0) + \ldots \\
&= k_{k+1} f^{(k+1)}(x_0) \sum_{i=1}^N (x_i - x_0)^{k+1} l_i(x_0) + \ldots,
\end{aligned}
$$

since $b_0(x_0) = 1$ and $b_1(x_0) = \ldots = b_k(x_0) = 0$. The bias therefore involves only derivatives of $f$ of order $k + 1$ and higher: local polynomial regression of degree $k$ removes all bias terms up to order $k$.
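The leading bias term involving $f^{(k+1)}$ can be seen numerically. The sketch below (my own illustration, with assumed function names; not from the original solution) evaluates the bias $\sum_i l_i(x_0) f(x_i) - f(x_0)$ exactly for a noiseless $f$, so no simulation is needed:

```python
import numpy as np

# Higher-degree local fits kill the low-order bias terms, so at a fixed
# bandwidth the bias at an interior point shrinks as the degree grows.

def local_poly_weights(x, x0, k, bandwidth=0.5):
    B = np.vander(x, N=k + 1, increasing=True)
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    b0 = x0 ** np.arange(k + 1)
    return b0 @ np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)

f = np.sin
x = np.linspace(0.0, np.pi, 101)
x0 = 1.0
# bias = E f_hat(x0) - f(x0) = sum_i l_i(x0) f(x_i) - f(x0), computed exactly
bias = {k: abs(local_poly_weights(x, x0, k) @ f(x) - f(x0)) for k in (0, 1, 2, 3)}
```

Degrees 0 and 1 leave a bias of order $h^2 f''(x_0)$, while degrees 2 and 3 defer it to higher-order derivatives, so their bias is far smaller at the same bandwidth.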
Ex. 6.3 Show that $||l(x)||$ (Section 6.1.2) increases with the degree of the local polynomial.

In the unweighted case the smoother matrix is $L = B(B^T B)^{-1} B^T$, whose $i$-th row is $l(x_i)^T$, so we have

$$
\begin{aligned}
\sum_{i=1}^N ||l(x_i)||^2 = \mathrm{trace}(LL^T)
&= \mathrm{trace}\big(B(B^T B)^{-1} (B^T B) (B^T B)^{-1} B^T\big) \\
&= \mathrm{trace}\big((B^T B)^{-1} B^T B\big) \quad \text{// Prof. Zhang has proved it} \\
&= \mathrm{trace}(I_{d+1}) = d + 1.
\end{aligned}
$$

Hence the average of $||l(x_i)||^2$ over the training points is $(d + 1)/N$, which increases with the degree $d$ of the local polynomial.
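The trace identity is easy to verify numerically (a sketch of mine, not from the original solution; the design points are arbitrary):

```python
import numpy as np

# For the unweighted smoother matrix L = B (B^T B)^{-1} B^T, whose rows are
# the weight vectors l(x_i)^T, trace(L L^T) = sum_i ||l(x_i)||^2 = d + 1.

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 40)

totals = {}
for d in (0, 1, 2, 3):
    B = np.vander(x, N=d + 1, increasing=True)   # rows b(x_i)^T, degree d
    L = B @ np.linalg.solve(B.T @ B, B.T)        # hat/smoother matrix
    totals[d] = np.trace(L @ L.T)                # = sum_i ||l(x_i)||^2
```

Note that $L$ is a projection here, so $\mathrm{trace}(LL^T) = \mathrm{trace}(L)$, its rank; either way the value is $d + 1$ and grows with the degree.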
Ex. 6.4 Suppose that the p predictors X arise from sampling relatively smooth analog curves
at p uniformly spaced abscissa values. Denote by Cov(X|Y ) = Σ the conditional covariance matrix
of the predictors, and assume this does not change much with Y . Discuss the nature of Mahalanobis
choice A = Σ−1 for the metric in (6.14). How does this compare with A = I? How might you
construct a kernel A that (a) downweights high-frequency components in the distance metric; (b)
ignores them completely?
Answer $D = \sqrt{(x - x_0)^T \Sigma^{-1} (x - x_0)}$ is called the Mahalanobis distance of the point $x$ to $x_0$. It takes the correlations of the data set into account; if the predictors are highly correlated, the Mahalanobis distance is much more accurate than the Euclidean distance.

If $A = I$, then $d = \sqrt{(x - x_0)^T (x - x_0)}$ is the Euclidean distance of the point $x$ to $x_0$. Prior to smoothing, we should standardize each predictor, for example

$$x_i' = \frac{x_i - \mathrm{E}(x_i)}{\sqrt{\mathrm{Var}(x_i)}}.$$

With the standardized predictors, the covariance matrix $\Sigma$ becomes its standardized version, the correlation matrix. Hence $A = I$ amounts to assuming $\rho(x_i, x_j) = 0$ for all $i \neq j$, i.e. that the coordinates of $x$ are uncorrelated, whereas $A = \Sigma^{-1}$ accounts for the strong correlations induced by sampling smooth curves at nearby abscissae.
(a) If we want to construct a kernel $A$ that downweights high-frequency components in the distance metric, we can decrease the entries of $A$ acting on those components (e.g. shrink the corresponding $\mathrm{Cov}(x_i, x_j)$ or $\rho(x_i, x_j)$ terms) in order to suppress their influence. (b) If we want $A$ to ignore them completely, we can set the corresponding entries to $0$.
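One concrete way to realize (a) and (b) is a frequency-weighted quadratic form $A = F^T \mathrm{diag}(w)\, F$, where the rows of $F$ are an orthogonal frequency basis (a DCT here): small weights $w$ downweight high frequencies, and $w = 0$ ignores them. The sketch below is my own illustration under assumed choices (quadratic "smooth curves", an unnormalized DCT-II basis), not part of the original solution:

```python
import numpy as np

# Predictors sampled from smooth curves at p uniform abscissae are highly
# correlated; compare A = Sigma^{-1} (Mahalanobis) with A = I (Euclidean),
# then build a frequency-weighted A that ignores high frequencies entirely.

rng = np.random.default_rng(2)
p, n = 8, 500
t = np.linspace(0.0, 1.0, p)
coef = rng.normal(size=(n, 3))                     # random low-order curves
X = coef @ np.vstack([np.ones(p), t, t ** 2])      # n curves sampled at p points
Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(p) # ridge: rank-3 covariance

x, x0 = X[0], X[1]
d_euc = np.sqrt((x - x0) @ (x - x0))                       # A = I
d_mah = np.sqrt((x - x0) @ np.linalg.solve(Sigma, x - x0)) # A = Sigma^{-1}

# Unnormalized DCT-II rows: an orthogonal basis ordered by frequency.
F = np.array([[np.cos(np.pi * k * (j + 0.5) / p) for j in range(p)]
              for k in range(p)])
w = np.where(np.arange(p) < 3, 1.0, 0.0)   # keep low frequencies, drop the rest
A = F.T @ np.diag(w) @ F                   # (b): zero weight ignores completely
d_low = np.sqrt((x - x0) @ A @ (x - x0))
```

Replacing the hard $0/1$ weights `w` with a smoothly decaying sequence gives the downweighting in (a) instead of complete removal.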