
KNN ALGORITHM

DESCRIPTION
A nearest-neighbor classification object in which both the distance metric ("nearest") and the number of neighbors can be altered. The object classifies new observations using the predict method. Because the object contains the data used for training, it can also compute resubstitution predictions.

CLASSIFICATION
mdl = ClassificationKNN.fit(X,Y) creates a k-nearest neighbor classification model.

mdl = ClassificationKNN.fit(X,Y,Name,Value) creates a classifier with additional options specified by one or more Name,Value pair arguments.

For details, see ClassificationKNN.fit.

Input Arguments

X

Matrix of predictor values. Each column of X represents one variable, and each row represents one observation.

Y

Grouping variable of response values with the same number of elements (rows) as X. Each entry in Y is the response to the data in the corresponding row of X.

Properties

BreakTies

String specifying the method predict uses to break ties if multiple classes have the same smallest cost. By default, ties occur when multiple classes have the same number of nearest points among the K nearest neighbors.

'nearest'   Use the class with the nearest neighbor among tied groups.
'random'    Use a random tiebreaker among tied groups.
'smallest'  Use the smallest index among tied groups.

'BreakTies' applies when 'IncludeTies' is false.

Change BreakTies using dot addressing: mdl.BreakTies = newBreakTies
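The three tie-breaking rules can be illustrated with a small, language-neutral sketch in Python. This is not how the toolbox implements BreakTies; the function name break_tie, its arguments, and the sample data are hypothetical, and "smallest index" is assumed to mean the position of the class in the class list.

```python
import random

def break_tie(tied, neighbor_classes, neighbor_distances, class_order,
              method="smallest", rng=None):
    """Pick one class from the set `tied` (hypothetical helper).

    neighbor_classes / neighbor_distances describe the K nearest neighbors;
    class_order lists all classes in index order (assumed meaning of 'index').
    """
    if method == "smallest":
        # Class that appears first in the class list.
        return min(tied, key=class_order.index)
    if method == "nearest":
        # Class of the closest neighbor that belongs to a tied class.
        candidates = [(d, c) for c, d in zip(neighbor_classes, neighbor_distances)
                      if c in tied]
        return min(candidates, key=lambda pair: pair[0])[1]
    if method == "random":
        rng = rng or random.Random()
        return rng.choice(sorted(tied, key=class_order.index))
    raise ValueError("unknown tie-breaking method: %r" % method)

# Illustrative data: two votes each for 'a' and 'b' among four neighbors.
class_order = ["a", "b", "c"]
neighbor_classes = ["b", "a", "b", "a"]
neighbor_distances = [0.1, 0.2, 0.3, 0.4]
tied = {"a", "b"}

by_index = break_tie(tied, neighbor_classes, neighbor_distances, class_order, "smallest")
by_nearest = break_tie(tied, neighbor_classes, neighbor_distances, class_order, "nearest")
```

Here 'smallest' selects 'a' because it comes first in class_order, while 'nearest' selects 'b' because the closest neighbor (distance 0.1) belongs to 'b'.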

CategoricalPredictors

Specification of which predictors are categorical.

'all'  All predictors are categorical.
[]     No predictors are categorical.

ClassNames

List of elements in the training data Y with duplicates removed. ClassNames can be a numeric vector, vector of categorical variables (nominal or ordinal), logical vector, character array, or cell array of strings. ClassNames has the same data type as the data in the argument Y. Change ClassNames using dot addressing: mdl.ClassNames = newClassNames

Cost

Square matrix, where Cost(i,j) is the cost of classifying a point into class j if its true class is i. Cost is K-by-K, where K is the number of classes. Change a Cost matrix using dot addressing: mdl.Cost = costMatrix

Distance

String or function handle specifying the distance metric. The allowable strings depend on the NSMethod parameter, which you set in ClassificationKNN.fit, and which exists as a field in ModelParams.

NSMethod    Distance Metric Names
exhaustive  Any distance metric of ExhaustiveSearcher
kdtree      'cityblock', 'chebychev', 'euclidean', or 'minkowski'

For definitions, see Distance Metrics. The distance metrics of ExhaustiveSearcher:

Value  Description

'cityblock'  City block distance.

'chebychev'  Chebychev distance (maximum coordinate difference).

'correlation'  One minus the sample linear correlation between observations (treated as sequences of values).

'cosine'  One minus the cosine of the included angle between observations (treated as vectors).

'euclidean'  Euclidean distance.

'hamming'  Hamming distance, percentage of coordinates that differ.

'jaccard'  One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ.

'mahalanobis'  Mahalanobis distance, computed using a positive definite covariance matrix C. The default value of C is the sample covariance matrix of X, as computed by nancov(X). To specify a different value for C, use the 'Cov' name-value pair.

'minkowski'  Minkowski distance. The default exponent is 2. To specify a different exponent, use the 'P' name-value pair.

'seuclidean'  Standardized Euclidean distance. Each coordinate difference between X and a query point is scaled, meaning divided by a scale value S. The default value of S is the standard deviation computed from X, S = nanstd(X). To specify another value for S, use the Scale name-value pair.

'spearman'  One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

@distfun  Distance function handle. distfun has the form

function D2 = DISTFUN(ZI,ZJ)
% calculation of distance
...

where

ZI is a 1-by-N vector containing one row of X or Y.

ZJ is an M2-by-N matrix containing multiple rows of X or Y.

D2 is an M2-by-1 vector of distances, and D2(k) is the distance between observations ZI and ZJ(k,:).
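To make the one-row-against-many-rows shape of a distance function concrete, here is a minimal Python sketch of four of the listed metrics. It mirrors the ZI, ZJ, D2 contract described above but is not the toolbox code; the function names and sample vectors are hypothetical, and inputs are assumed to contain no NaN values.

```python
def minkowski_distfun(zi, zj, p=2):
    """Distance from one row zi (length N) to each row of zj (M2 rows).

    Returns a list of M2 distances; p=2 gives the Euclidean distance.
    """
    return [sum(abs(a - b) ** p for a, b in zip(zi, row)) ** (1.0 / p)
            for row in zj]

def cityblock_distfun(zi, zj):
    """Sum of absolute coordinate differences."""
    return [sum(abs(a - b) for a, b in zip(zi, row)) for row in zj]

def chebychev_distfun(zi, zj):
    """Maximum absolute coordinate difference."""
    return [max(abs(a - b) for a, b in zip(zi, row)) for row in zj]

def hamming_distfun(zi, zj):
    """Fraction of coordinates that differ."""
    return [sum(a != b for a, b in zip(zi, row)) / len(zi) for row in zj]

# Illustrative data: one query row against two other rows.
zi = [0.0, 0.0]
zj = [[3.0, 4.0], [0.0, 1.0]]

d_euclidean = minkowski_distfun(zi, zj)   # [5.0, 1.0]
d_cityblock = cityblock_distfun(zi, zj)   # [7.0, 1.0]
d_chebychev = chebychev_distfun(zi, zj)   # [4.0, 1.0]
d_hamming = hamming_distfun(zi, zj)       # [1.0, 0.5]
```

Each function returns one distance per row of zj, which corresponds to the M2-by-1 result D2.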

Change Distance using dot addressing: mdl.Distance = newDistance

If NSMethod is kdtree, you can use dot addressing to change Distance only among the types 'cityblock', 'chebychev', 'euclidean', or 'minkowski'.

DistanceWeight

String or function handle specifying the distance weighting function.

DistanceWeight    Meaning
'equal'           No weighting
'inverse'         Weight is 1/distance
'inversesquared'  Weight is 1/distance^2
@fcn              fcn is a function that accepts a matrix of nonnegative distances, and returns a matrix the same size containing nonnegative distance weights. For example, 'inversesquared' is equivalent to @(d)d.^(-2).

Change DistanceWeight using dot addressing: mdl.DistanceWeight = newDistanceWeight
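The effect of the weighting schemes on a neighbor vote can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names distance_weights and weighted_vote and the sample data are hypothetical, and zero distances (which would divide by zero under the inverse schemes) are not handled here.

```python
def distance_weights(distances, scheme="equal"):
    """Map nonnegative distances to nonnegative weights (hypothetical helper)."""
    if callable(scheme):
        # Custom weighting function, analogous to passing @fcn.
        return list(scheme(distances))
    if scheme == "equal":
        return [1.0 for _ in distances]
    if scheme == "inverse":
        return [1.0 / d for d in distances]       # assumes d > 0
    if scheme == "inversesquared":
        return [d ** -2 for d in distances]       # assumes d > 0
    raise ValueError("unknown weighting scheme: %r" % scheme)

def weighted_vote(neighbor_classes, distances, scheme="equal"):
    """Sum the weights per class and return the class with the largest total."""
    totals = {}
    for cls, w in zip(neighbor_classes, distance_weights(distances, scheme)):
        totals[cls] = totals.get(cls, 0.0) + w
    return max(totals, key=totals.get)

# Illustrative data: one close 'a' neighbor and two farther 'b' neighbors.
neighbor_classes = ["a", "b", "b"]
distances = [0.5, 2.0, 2.0]

vote_equal = weighted_vote(neighbor_classes, distances, "equal")      # 'b' (1 vs 2)
vote_inverse = weighted_vote(neighbor_classes, distances, "inverse")  # 'a' (2 vs 1)
```

With equal weights the two 'b' neighbors win by count; with inverse weighting the single close 'a' neighbor outweighs them.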

DistParameter

Additional parameter for the distance metric.

Distance Metric  Parameter
'mahalanobis'    Positive definite covariance matrix C.
'minkowski'      Minkowski distance exponent, a positive scalar.
'seuclidean'     Vector of positive scale values with length equal to the number of columns of X.

For values of the distance metric other than those in the table, DistParameter must be [].

Change DistParameter using dot addressing: mdl.DistParameter = newDistParameter
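As an example of how such a parameter enters the distance, the following Python sketch computes a standardized Euclidean distance with a per-column scale vector. It is a sketch under assumptions, not the toolbox implementation: the function names are hypothetical, the default scale is taken to be the sample standard deviation of each column, and the data are assumed to contain no NaN values.

```python
import math
import statistics

def column_scales(X):
    """Sample standard deviation of each column of X (assumed default scale)."""
    return [statistics.stdev(col) for col in zip(*X)]

def seuclidean(x, y, scale):
    """Euclidean distance after dividing each coordinate difference by its scale."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, scale)))

# Illustrative data: the second column has ten times the spread of the first.
X = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]

scale = column_scales(X)              # [2.0, 20.0]
d = seuclidean(X[0], X[1], scale)     # sqrt(1 + 1)
```

After scaling, both columns contribute equally to the distance even though their raw ranges differ by a factor of ten.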

IncludeTies

Logical value indicating whether predict includes all the neighbors whose distance values are equal to the Kth smallest distance. If IncludeTies is true, predict includes all these neighbors. Otherwise, predict uses exactly K neighbors (see 'BreakTies'). Change IncludeTies using dot addressing: mdl.IncludeTies = newIncludeTies
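The difference between the two settings can be sketched in Python. The helper select_neighbors and the sample distances are hypothetical; when ties are not included, this sketch simply keeps the first K indices in sorted order, which is an assumption about how equal distances are ordered.

```python
def select_neighbors(distances, k, include_ties=False):
    """Return indices of the neighbors used for prediction (hypothetical helper).

    With include_ties=False exactly k indices are returned. With
    include_ties=True every point whose distance is at most the k-th
    smallest distance is returned, so the result may have more than k entries.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    if not include_ties:
        return order[:k]
    kth_distance = distances[order[k - 1]]
    return [i for i in order if distances[i] <= kth_distance]

# Illustrative data: three points share the second-smallest distance.
distances = [0.5, 1.0, 1.0, 2.0, 1.0]

exact_k = select_neighbors(distances, 2)                        # [0, 1]
with_ties = select_neighbors(distances, 2, include_ties=True)   # [0, 1, 2, 4]
```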

ModelParams

Parameters used in training mdl.

NObservations

Number of observations used in training mdl. This can be less than the number of rows in the training data, because data rows containing NaN values are not part of the fit.
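A small Python sketch shows why the observation count can be lower than the number of training rows. The helper count_fit_rows and the data are hypothetical, and only NaN values in the predictors are considered here.

```python
import math

def count_fit_rows(X):
    """Count rows of X that contain no NaN values (hypothetical helper)."""
    return sum(1 for row in X if not any(math.isnan(v) for v in row))

# Illustrative data: four rows, one of which has a missing predictor value.
X = [[1.0, 2.0], [float("nan"), 3.0], [4.0, 5.0], [6.0, 7.0]]

n_rows = len(X)                    # 4
n_observations = count_fit_rows(X) # 3
```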

NumNeighbors

Positive integer specifying the number of nearest neighbors in X to find for classifying each point when predicting. Change NumNeighbors using dot addressing: mdl.NumNeighbors = newNumNeighbors

PredictorNames

Cell array of names for the predictor variables, in the order in which they appear in the training data X. Change PredictorNames using dot addressing: mdl.PredictorNames = newPredictorNames

Prior

Prior probabilities for each class. Prior is a numeric vector whose entries relate to the corresponding ClassNames property. Add or change a Prior vector using dot addressing: mdl.Prior = priorVector

ResponseName

String describing the response variable Y. Change ResponseName using dot addressing: mdl.ResponseName = newResponseName

W

Numeric vector of nonnegative weights with the same number of rows as Y. Each entry in W specifies the relative importance of the corresponding observation in Y. Change W using dot addressing: mdl.W = newW

X

Numeric matrix of predictor values. Each column of X represents one predictor (variable), and each row represents one observation.

Y

Numeric vector of response values with the same number of rows as X. Each entry in Y is the response to the data in the corresponding row of X.

Methods

crossval      Cross-validated k-nearest neighbor classifier
edge          Edge of k-nearest neighbor classifier
fit           Fit k-nearest neighbor classifier
loss          Loss of k-nearest neighbor classifier
margin        Margin of k-nearest neighbor classifier
predict       Predict k-nearest neighbor classification
resubEdge     Edge of k-nearest neighbor classifier by resubstitution
resubLoss     Loss of k-nearest neighbor classifier by resubstitution
resubMargin   Margin of k-nearest neighbor classifier by resubstitution
resubPredict  Predict resubstitution response of k-nearest neighbor classifier
template      k-nearest neighbor classifier template for ensemble

Definitions

Prediction

ClassificationKNN predicts the classification of a point Xnew using a procedure equivalent to this:

1. Find the NumNeighbors points in the training set X that are nearest to Xnew.
2. Find the NumNeighbors response values Y to those nearest points.
3. Assign the classification label Ynew that has smallest expected misclassification cost among the values in Y.
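The three steps above can be sketched in Python as follows. This is a minimal illustration, not the toolbox implementation, and it rests on several assumptions that the text does not state: Euclidean distance, equal neighbor weights, class probabilities estimated as the fraction of neighbors in each class, and a default cost of 0 on the diagonal and 1 elsewhere. The function name knn_predict and the sample data are hypothetical.

```python
import math

def knn_predict(X, Y, x_new, k, class_names, cost=None):
    """Classify x_new by smallest expected misclassification cost (sketch)."""
    # Step 1: indices of the k training points nearest to x_new.
    dists = [math.dist(row, x_new) for row in X]
    nearest = sorted(range(len(X)), key=lambda i: dists[i])[:k]

    # Step 2: response values of those nearest points.
    labels = [Y[i] for i in nearest]

    # Assumed probability estimate: fraction of neighbors in each class.
    n = len(class_names)
    posterior = [labels.count(c) / k for c in class_names]

    # Assumed default cost: 0 for a correct class, 1 for any other class.
    if cost is None:
        cost = [[0 if i == j else 1 for j in range(n)] for i in range(n)]

    # Step 3: expected cost of predicting class j, summed over true classes i,
    # where cost[i][j] plays the role of Cost(i,j).
    expected = [sum(posterior[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return class_names[min(range(n), key=lambda j: expected[j])]

# Illustrative data: two 'a' points and one 'b' point near the query.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
Y = ["a", "a", "b", "b", "b"]
class_names = ["a", "b"]

label_default = knn_predict(X, Y, [0.2, 0.2], 3, class_names)
# Make it expensive to classify a true 'b' point as 'a'.
label_costly = knn_predict(X, Y, [0.2, 0.2], 3, class_names, cost=[[0, 1], [5, 0]])
```

With the default cost the majority class 'a' is returned. With the second cost matrix the expected cost of predicting 'a' is (1/3)(5) and that of predicting 'b' is (2/3)(1), so the prediction changes to 'b' even though 'a' has more neighbors.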