
Solution to Assignment 4 Machine Learning 2011-3

Juan Sebastián Otálora M. - Code: 257867. October 30, 2011


1. Regression on strings

a) Implement a function that calculates a kernel over fixed-length strings,

k : \Sigma^d \times \Sigma^d \to \mathbb{R},

which counts the number of coincidences between two strings. The kernel function implemented is

k(S_1, S_2) = \sum_{k=0}^{d-1} I(S_1(k), S_2(k)),

where each position k at which the characters of both strings agree contributes one to the sum:

I(S_1(k), S_2(k)) = 1 if S_1(k) = S_2(k), and 0 otherwise.
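As a sketch, the position-wise comparison above can be implemented directly (the function name is illustrative, not taken from the assignment code):

```python
def coincidence_kernel(s1, s2):
    """Count the positions at which two equal-length strings coincide."""
    if len(s1) != len(s2):
        raise ValueError("the kernel is defined for fixed-length strings")
    return sum(1 for a, b in zip(s1, s2) if a == b)
```

For example, coincidence_kernel("cat", "car") counts the two matching positions.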

b) Implement the kernel ridge-regression (KRR) algorithm. The implementation was made in Python. I used a class to define the model of the ridge regression algorithm, where the vectors X and Y and the kernel matrix K are stored as attributes, and the predicted output g(x) is defined as a method of the class:
# Kernel Ridge Regression implementation
class RidgeRegression():
    def __init__(self, DataSet, X, Y, lamBda):
        self.Kernel = Kernels()
        self.X = X
        self.Y = np.matrix(Y)
        self.lamBda = lamBda
        self.DS = DataSet
        self.K = np.zeros((len(X), len(X)))
        # Filling the kernel matrix
        for i in xrange(len(X)):
            for j in xrange(len(X)):
                self.K[i, j] = \
                    self.Kernel.Kernel_Coincidences(X[i], X[j])
        self.G = self.K + self.lamBda * np.eye(len(X))

lambda      MSE                  ASE
0.1         0.00109577189561     0.518036260195
0.001       1.09609519003e-05    0.517193455207
0.0001      1.09632383034e-06    0.517193370519
0.0000001   6.65936224441e-08    0.517193369664

Table 1: Errors over the training set with kernel ridge regression using the coincidence kernel.
        self.G = lg.inv(self.G)
        self.G = self.Y * self.G

    def gx(self, x):
        k = []
        for i in xrange(len(self.X)):
            k.append(
                self.Kernel.Kernel_Coincidences(self.X[i], x))
        predicted = np.matrix(k)
        predicted = float(
            self.G * np.transpose(predicted))
        return predicted
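Equivalently, the same model can be written as a small self-contained sketch in NumPy, solving the regularized linear system instead of forming the inverse explicitly (the function names are illustrative):

```python
import numpy as np

def coincidence_kernel(s1, s2):
    """Count the positions at which two equal-length strings coincide."""
    return sum(a == b for a, b in zip(s1, s2))

def krr_fit(X, y, lam):
    """Dual coefficients alpha = (K + lam*I)^(-1) y of kernel ridge regression."""
    n = len(X)
    K = np.array([[coincidence_kernel(a, b) for b in X] for a in X], dtype=float)
    return np.linalg.solve(K + lam * np.eye(n), np.asarray(y, dtype=float))

def krr_predict(X_train, alpha, x):
    """Predicted output g(x) = sum_i alpha_i * k(x_i, x)."""
    k = np.array([coincidence_kernel(xi, x) for xi in X_train], dtype=float)
    return float(k @ alpha)
```

With a very small lambda the model interpolates the training targets, which matches the behavior of the errors reported in Table 1 as lambda decreases.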

c) Use the KRR implementation and the kernel k to train a model using the training data set in http://dis.unal.edu.co/fgonza/courses/2008-I/ml/assign4-train.txt. Evaluate the error of the model on the training data set. Plot the prediction of the model on the training data along with the real output values (results may be sorted in descending order by the real output value).

The training of the model was carried out by creating an object of the class from item (b), with the downloaded training data set and different values of lambda as parameters. The error of the model on the training data set was evaluated using the mean squared error (MSE) and the average squared error (ASE). The former is calculated as

MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,

and the ASE is

ASE = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2.
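As a minimal sketch, the MSE above can be computed as follows (the ASE variant follows the same pattern with its own normalization):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: (1/N) * sum_i (y_i - y_hat_i)**2."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))
```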

Table 1 reports error computations for different values of lambda, which were chosen arbitrarily; formally, lambda should be selected by cross-validation. The following graph shows how the prediction of the KRR behaves with respect to the original values of the strings:

lambda      MSE                 ASE
0.1         0.12038751306       0.836936675799
0.001       1.09609519003e-05   0.517193455207
0.0001      1.09632383034e-06   0.517193370519
0.0000001   6.65936224441e-08   0.517193369664

Table 2: Errors over the test data set with kernel ridge regression using the coincidence kernel.

Figure 1: Plot of the coincidence-kernel regression for problem 1. The black squares are the values of the original training data and the white circles are the predictions of the program over the training data.

d) Evaluate the trained model on the test data set http://dis.unal.edu.co/fgonza/courses/2008-I/ml/assign4-test.txt. Plot the results and discuss them.

Figure 2 shows how the model behaves on the test data set. As expected, it does not fit as well as on the training data set; however, it keeps a good prediction performance. The errors are shown in Table 2:

Figure 2: Plot of the coincidence-kernel regression for the test data set. The black squares are the values of the original test data and the white circles are the predictions of the program over the test data.

e) Build a new kernel, k', by composing the kernel k with a more complex kernel (polynomial, Gaussian, etc.). Repeat steps (c) and (d).

2. Let x = \{x_1, \ldots, x_n\} be a subset of an input data set X. Consider a kernel function k : X \times X \to \mathbb{R} which induces a feature space \Phi(X):

a) Deduce an expression that allows calculating the average distance to the center of mass of the image of the set x in the feature space:

\frac{1}{n} \sum_{i=1}^{n} \|\Phi(x_i) - \Phi_S(x)\|_{\Phi(X)},

where the center of mass is defined as

\Phi_S(x) = \frac{1}{n} \sum_{i=1}^{n} \Phi(x_i).

Expanding the norm through the inner product:

\frac{1}{n} \sum_{i=1}^{n} \|\Phi(x_i) - \Phi_S(x)\|_{\Phi(X)}
= \frac{1}{n} \sum_{i=1}^{n} \sqrt{\langle \Phi(x_i) - \Phi_S(x),\, \Phi(x_i) - \Phi_S(x) \rangle}

= \frac{1}{n} \sum_{i=1}^{n} \sqrt{\Big\langle \Phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \Phi(x_j),\ \Phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \Phi(x_j) \Big\rangle}

= \frac{1}{n} \sum_{i=1}^{n} \sqrt{\langle \Phi(x_i), \Phi(x_i) \rangle
- \frac{1}{n} \sum_{j=1}^{n} \langle \Phi(x_i), \Phi(x_j) \rangle
- \frac{1}{n} \sum_{j=1}^{n} \langle \Phi(x_j), \Phi(x_i) \rangle
+ \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} \langle \Phi(x_k), \Phi(x_l) \rangle}.

Since the dot product \langle \cdot, \cdot \rangle is defined over the real numbers, we can say

\langle \Phi(x_i), \Phi(x_j) \rangle + \langle \Phi(x_j), \Phi(x_i) \rangle = 2 \langle \Phi(x_i), \Phi(x_j) \rangle.

So:

= \frac{1}{n} \sum_{i=1}^{n} \sqrt{\langle \Phi(x_i), \Phi(x_i) \rangle
- \frac{2}{n} \sum_{j=1}^{n} \langle \Phi(x_i), \Phi(x_j) \rangle
+ \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} \langle \Phi(x_k), \Phi(x_l) \rangle}

= \frac{1}{n} \sum_{i=1}^{n} \sqrt{k(x_i, x_i)
- \frac{2}{n} \sum_{j=1}^{n} k(x_i, x_j)
+ \frac{1}{n^2} \sum_{m=1}^{n} \sum_{l=1}^{n} k(x_m, x_l)}.

In terms of the Gram matrix G, with 1_{n \times 1} the all-ones vector:

= \frac{1}{n} \sum_{i=1}^{n} \sqrt{G(i, i) - \frac{2}{n} \, (G \, 1_{n \times 1})(i) + \frac{1}{n^2} \, 1_{1 \times n} \, G \, 1_{n \times 1}}.
b) Use the previous expression to calculate the average distance to the center of mass (ADCM) of the following point set in R^2, x = {(0,1), (1,3), (2,4), (3,1), (1,2)}, in the feature spaces induced by the following kernels:

1) k(x, y) = <x, y>: ADCM = 2.18159005893
2) k(x, y) = <x, y>^2: ADCM = 7.91462892984
3) k(x, y) = (<x, y> + 1)^5: ADCM = 693.201082839
4) Gaussian kernel with sigma = 1: ADCM = 0.840321201414

3. Controlling the model complexity

a) Download the Wisconsin Breast Cancer data set from http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) and divide it into a training set and a test set (50/50).

b) Train an SVM using a linear kernel. Find an optimal complexity parameter, C, by plotting the training and test error vs. the complexity parameter. Use a logarithmic scale for C: 2^-5, 2^-4, ..., 2^15. Discuss the results.

We trained the SVM using a linear kernel and found that the optimal value for the complexity parameter is C = 8; the graphic shown in figure [] lets us see clearly how the training and test errors behave as C varies.

c) Repeat item (b) using a Gaussian kernel with a fixed sigma value.

d) Repeat (c) varying sigma and keeping C fixed.

4. Train an SVM for detecting whether a word belongs to English or Spanish:

a) Build training and test data sets. You can use the most frequent words in http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists. Consider words at least 4 characters long and ignore accents.

b) Use an SVM software package that supports string kernels: LIBSVM, Shogun, etc.

c) Use cross-validation to find an appropriate complexity parameter.

d) Evaluate the performance of the SVM on the test data set.
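The cross-validation in (c) can be sketched generically. In the sketch below, the `fit` and `score` callables stand in for training the string-kernel SVM at a given C and measuring validation accuracy; all names are illustrative, not from a specific package:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal validation folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def select_param(params, fit, score, X, y, k=5):
    """Return the parameter with the best mean validation score.

    fit(p, X_tr, y_tr) -> model; score(model, X_va, y_va) -> float (higher is better).
    """
    folds = kfold_indices(len(X), k)
    best_p, best_s = None, -np.inf
    for p in params:
        fold_scores = []
        for i in range(k):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(p, X[tr], y[tr])
            fold_scores.append(score(model, X[va], y[va]))
        s = float(np.mean(fold_scores))
        if s > best_s:
            best_p, best_s = p, s
    return best_p

# For the complexity parameter one would sweep C over the logarithmic grid:
C_grid = [2.0 ** e for e in range(-5, 16)]
```

The same loop serves problem 3 (b) as well, with the grid 2^-5, ..., 2^15 shown above.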
