
2.2. A tutorial on statistical-learning for scientific data processing


Statistical learning

Machine learning [1] is a technique with a growing importance, as the size of the datasets experimental sciences are facing is rapidly growing. Problems it tackles range from building a prediction function linking different observations, to classifying observations, or learning the structure in an unlabeled dataset. This tutorial will explore statistical learning, that is, the use of machine learning techniques with the goal of statistical inference [2]: drawing conclusions on the data at hand.
sklearn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy [3], scipy [4], matplotlib [5]).

2.2.1. Statistical learning: the setting and the estimator object in scikit-learn

2.2.1.1. Datasets
The scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as a list of multidimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis. A simple example shipped with the scikit: iris dataset
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)

It is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width, as detailed in iris.DESCR. When the data is not initially in the (n_samples, n_features) shape, it needs to be preprocessed in order to be used by the scikit. An example of reshaping data would be the digits dataset.


[6]

The digits dataset is made of 1797 8x8 images of hand-written digits


>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> import pylab as pl
>>> pl.imshow(digits.images[-1], cmap=pl.cm.gray_r)
<matplotlib.image.AxesImage object at ...>

To use this dataset with the scikit, we transform each 8x8 image into a feature vector of length 64
>>> data = digits.images.reshape((digits.images.shape[0], -1))

2.2.1.2. Estimator objects


Fitting data: the main API implemented by scikit-learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data. All estimator objects expose a fit method that takes a dataset (usually a 2-d array):
>>> estimator.fit(data)

Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the corresponding attribute:
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1

Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending with an underscore:
>>> estimator.estimated_param_
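As a concrete sketch of this generic API, the snippet below uses PCA (an estimator introduced later in this tutorial) purely as an illustration; the variable name iris_data is only illustrative and the fit and attribute outputs are omitted:

>>> from sklearn import datasets, decomposition
>>> iris_data = datasets.load_iris().data           # a (150, 4) observations array
>>> estimator = decomposition.PCA(n_components=2)   # parameter set at instantiation...
>>> estimator.n_components                          # ...and readable as an attribute
2
>>> estimator.fit(iris_data)                        # learn from the data (repr output omitted)
>>> estimator.components_                           # estimated parameter: trailing underscore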

2.2.2. Supervised learning: predicting an output variable from high-dimensional observations

The problem solved in supervised learning

Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually

called target or labels. Most often, y is a 1D array of length n_samples.

All supervised estimators [7] in the scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.

Vocabulary: classification and regression

If the prediction task is to classify the observations in a set of finite labels, in other words to name the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task. In scikit-learn, for classification tasks, y is a vector of integers.

Note: See the Introduction to machine learning with Scikit-learn Tutorial for a quick run-through on the basic machine learning vocabulary used within Scikit-learn.

2.2.2.1. Nearest neighbor and the curse of dimensionality


Classifying irises: The iris dataset is a classification task consisting in identifying 3 different types of irises (Setosa, Versicolour, and Virginica) from their petal and sepal length and width:
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris_X = iris.data
>>> iris_y = iris.target
>>> np.unique(iris_y)
array([0, 1, 2])

2.2.2.1.1. k-Nearest neighbors classifier


The simplest possible classifier is the nearest neighbor [8]: given a new observation X_test, find in the training set (i.e. the data used to train the estimator) the observation with the closest feature vector. (Please see the Nearest Neighbors section of the online Scikit-learn documentation for more information about this type of classifier.)

Training set and testing set

While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the data used to fit the estimator, as this would not be evaluating the performance of the estimator on new data. This is why datasets are often split into train and test data.

KNN (k nearest neighbors) classification example:


[9]

>>> # Split iris data in train and test data
>>> # A random permutation, to split the data randomly
>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test  = iris_X[indices[-10:]]
>>> iris_y_test  = iris_y[indices[-10:]]
>>> # Create and fit a nearest-neighbor classifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
           warn_on_equidistant=True, weights='uniform')
>>> knn.predict(iris_X_test)
array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
>>> iris_y_test
array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
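The same kind of split can also be obtained with the train_test_split helper that appears later in this tutorial; a minimal sketch (the variable names, test_size and random_state values here are only illustrative):

>>> from sklearn.cross_validation import train_test_split
>>> X_tr, X_te, y_tr, y_te = train_test_split(iris_X, iris_y,
...                                           test_size=0.1, random_state=0)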

2.2.2.1.2. The curse of dimensionality


For an estimator to be effective, you need the distance between neighboring points to be less than some value d, which depends on the problem. In one dimension, this requires on average n ~ 1/d points. In the context of the above KNN example, if the data is described by just one feature with values ranging from 0 to 1 and with n training observations, then new data will be no further away than 1/n. Therefore, the nearest neighbor decision rule will be efficient as soon as 1/n is small compared to the scale of between-class feature variations.

If the number of features is p, you now require n ~ 1/d^p points. Let's say that we require 10 points in one dimension: now 10^p points are required in p dimensions to pave the [0, 1] space. As p becomes large, the number of training points required for a good estimator grows exponentially. For example, if each point is just a single number (8 bytes), then an effective KNN estimator in a paltry p~20 dimensions would require more training data than the current estimated size of the entire internet (1000 Exabytes or so).

This is called the curse of dimensionality [10] and is a core problem that machine learning addresses.
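A quick back-of-the-envelope check of that claim, reusing the figures from the paragraph above (10 points per dimension, p = 20 dimensions, 8 bytes per point); this is only a rough sketch:

>>> n_points = 10 ** 20    # 10 points per dimension, 20 dimensions
>>> n_points * 8 / 1e18    # bytes needed, expressed in Exabytes
800.0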


2.2.2.2. Linear model: from regression to sparsity


Diabetes dataset

The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure) measured on 442 patients, and an indication of disease progression after one year:
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X_train = diabetes.data[:-20]
>>> diabetes_X_test  = diabetes.data[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test  = diabetes.target[-20:]

The task at hand is to predict disease progression from physiological variables.

2.2.2.2.1. Linear regression


LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.

[11]

Linear models:
>>> from sklearn import linear_model
>>> regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> print regr.coef_
[   0.30349955 -237.63931533  510.53060544  327.73698041 -814.13170937
  492.81458798  102.84845219  184.60648906  743.51961675   76.09517222]
>>> # The mean square error
>>> np.mean((regr.predict(diabetes_X_test) - diabetes_y_test)**2)
2004.56760268...
>>> # Explained variance score: 1 is perfect prediction
>>> # and 0 means that there is no linear relationship
>>> # between X and Y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...


2.2.2.2.2. Shrinkage
If there are few data points per dimension, noise in the observations induces high variance:

[12]

>>> X = np.c_[.5, 1].T
>>> y = [.5, 1]
>>> test = np.c_[0, 2].T
>>> regr = linear_model.LinearRegression()
>>> import pylab as pl
>>> pl.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1 * np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     pl.plot(test, regr.predict(test))
...     pl.scatter(this_X, y, s=3)

A solution in high-dimensional statistical learning is to shrink the regression coefficients to zero: any two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge regression:

[13]

>>> regr = linear_model.Ridge(alpha=.1)
>>> pl.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1 * np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     pl.plot(test, regr.predict(test))
...     pl.scatter(this_X, y, s=3)


This is an example of bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower the variance. We can choose alpha to minimize left out error, this time using the diabetes dataset rather than our synthetic data:
>>> alphas = np.logspace(-4, -1, 6)
>>> print [regr.set_params(alpha=alpha
...             ).fit(diabetes_X_train, diabetes_y_train,
...             ).score(diabetes_X_test, diabetes_y_test) for alpha in alphas]

[0.5851110683883..., 0.5852073015444..., 0.5854677540698..., 0.5855512036503..., 0.5830717085554..., 0.57058999437...]

Capturing noise in the fitted parameters that prevents the model from generalizing to new data is called overfitting [14]. The bias introduced by the ridge regression is called a regularization [15].

2.2.2.2.3. Sparsity
Fitting only features 1 and 2
[16]

[17]

[18]


[19]

A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions and one of the target variable). It is hard to develop an intuition on such a representation, but it may be useful to keep in mind that it would be a fairly empty space.

We can see that, although feature 2 has a strong coefficient on the full model, it conveys little information on y when considered with feature 1. To improve the conditioning of the problem (i.e. mitigating the curse of dimensionality), it would be interesting to select only the informative features and set non-informative ones, like feature 2, to 0. Ridge regression will decrease their contribution, but not set them to zero. Another penalization approach, called Lasso (least absolute shrinkage and selection operator), can set some coefficients to zero. Such methods are called sparse methods, and sparsity can be seen as an application of Occam's razor: prefer simpler models.
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha
...             ).fit(diabetes_X_train, diabetes_y_train)
...             .score(diabetes_X_test, diabetes_y_test)
...             for alpha in alphas]

>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute='auto',
   tol=0.0001, warm_start=False)
>>> print regr.coef_
[   0.         -212.43764548  517.19478111  313.77959962 -160.8303982    -0.
 -187.19554705   69.38229038  508.66011217   71.84239008]

Different algorithms for the same problem

Different algorithms can be used to solve the same mathematical problem. For instance, the Lasso object in the scikit-learn solves the lasso regression problem using a coordinate descent [20] method, which is efficient on large datasets. However, the scikit-learn also provides the LassoLars object using the LARS algorithm, which is very efficient for problems in which the weight vector estimated is very sparse (i.e. problems with very few observations).
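A minimal sketch of that alternative, reusing the diabetes arrays and the best_alpha found above (the choice of alpha is only illustrative, and outputs are omitted):

>>> regr_lars = linear_model.LassoLars(alpha=best_alpha)
>>> regr_lars.fit(diabetes_X_train, diabetes_y_train)   # fitted LassoLars (repr omitted)
>>> regr_lars.score(diabetes_X_test, diabetes_y_test)   # should be close to the Lasso score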

2.2.2.2.4. Classification


[21]

For classification, as in the labeling iris [22] task, linear regression is not the right approach as it will give too much weight to data far from the decision frontier. A linear approach is to fit a sigmoid function or logistic function:

>>> logistic = linear_model.LogisticRegression(C=1e5)
>>> logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, penalty='l2',
          random_state=None, tol=0.0001)

This is known as LogisticRegression.

[23]
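The fitted model is then used exactly like the KNN classifier above; a short sketch (outputs omitted):

>>> logistic.predict(iris_X_test)        # predicted classes for the held-out irises
>>> logistic.predict_proba(iris_X_test)  # one probability column per class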

Multiclass classification

If you have several classes to predict, an option often used is to fit one-versus-all classifiers and then use a voting heuristic for the final decision.

Shrinkage and sparsity with logistic regression

The C parameter controls the amount of regularization in the LogisticRegression object: a large value for C results in less regularization. penalty='l2' gives Shrinkage (i.e. non-sparse coefficients), while penalty='l1' gives Sparsity.

Exercise

Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last 10% and test prediction performance on these observations.
from sklearn import datasets, neighbors, linear_model

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

Solution: ../../auto_examples/exercises/plot_digits_classification_exercise.py [24]

2.2.2.3. Support vector machines (SVMs)

2.2.2.3.1. Linear SVMs


Support Vector Machines belong to the discriminant model family: they try to find a combination of samples to build a plane maximizing the margin between the two classes. Regularization is set by the C parameter: a small value for C means the margin is calculated using many or all of the observations around the separating line (more regularization); a large value for C means the margin is calculated on observations close to the separating line (less regularization).

Unregularized SVM / Regularized SVM (default)

[25]

[26]

[27]

SVMs can be used in regression (SVR, Support Vector Regression) or in classification (SVC, Support Vector Classification).
>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, shrinking=True, tol=0.001,
  verbose=False)
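To see the effect of the C parameter described above, the same model can be refit with different amounts of regularization; a sketch (the values 0.01 and 100 are arbitrary illustrations, outputs omitted):

>>> svc_more_reg = svm.SVC(kernel='linear', C=0.01)   # small C: more regularization
>>> svc_less_reg = svm.SVC(kernel='linear', C=100.)   # large C: less regularization
>>> svc_more_reg.fit(iris_X_train, iris_y_train).score(iris_X_test, iris_y_test)
>>> svc_less_reg.fit(iris_X_train, iris_y_train).score(iris_X_test, iris_y_test)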

Warning: Normalizing data

For many estimators, including the SVMs, having datasets with unit standard deviation for each feature is important to get good prediction.
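One way to obtain such normalized features is the preprocessing module; a minimal sketch (whether scaling actually helps depends on the data, and in practice the same centering and scaling learned on the training set must also be applied to the test set):

>>> from sklearn import preprocessing
>>> iris_X_train_scaled = preprocessing.scale(iris_X_train)  # zero mean, unit variance per feature
>>> svc.fit(iris_X_train_scaled, iris_y_train)               # repr output omitted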

2.2.2.3.2. Using kernels


Classes are not always linearly separable in feature space. The solution is to build a decision function that is not linear but may be polynomial instead. This is done using the kernel trick, which can be seen as creating a decision energy by positioning kernels on observations:

Linear kernel

[28]

>>> svc = svm.SVC(kernel='linear')

Polynomial kernel

[29]

>>> svc = svm.SVC(kernel='poly',
...               degree=3)
>>> # degree: polynomial degree

RBF kernel (Radial Basis Function)

[30]

>>> svc = svm.SVC(kernel='rbf')
>>> # gamma: inverse of size of
>>> # radial kernel

Interactive example

See the SVM GUI to download svm_gui.py; add data points of both classes with right and left button, fit the model and change parameters and data.

Exercise

Try classifying classes 1 and 2 from the iris dataset with SVMs, using the first 2 features. Leave out 10% of each class and test prediction performance on these observations.

Warning: the classes are ordered, do not leave out the last 10%, you would be testing on only one class.

Hint: You can use the decision_function method on a grid to get intuitions.
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]

Solution: ../../auto_examples/exercises/plot_iris_exercise.py [31]

2.2.3. Model selection: choosing estimators and their parameters

2.2.3.1. Score, and cross-validated scores


As we have seen, every estimator exposes a score method that can judge the quality of the fit (or the prediction) on new data. Bigger is better.
>>> from sklearn import datasets, svm
>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998

To get a better measure of prediction accuracy (which we can use as a proxy for goodness of fit of the model), we can successively split the data in folds that we use for training and testing:
>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
...     # We use 'list' to copy, in order to 'pop' later on
...     X_train = list(X_folds)
...     X_test  = X_train.pop(k)
...     X_train = np.concatenate(X_train)
...     y_train = list(y_folds)
...     y_test  = y_train.pop(k)
...     y_train = np.concatenate(y_train)
...     scores.append(svc.fit(X_train, y_train).score(X_test, y_test))

>>> print scores
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

This is called a KFold cross-validation.

2.2.3.2. Cross-validation generators


The code above to split data in train and test sets is tedious to write. The sklearn exposes cross-validation generators to generate lists of indices for this purpose:
>>> from sklearn import cross_validation
>>> k_fold = cross_validation.KFold(n=6, n_folds=3, indices=True)
>>> for train_indices, test_indices in k_fold:
...      print 'Train: %s | test: %s' % (train_indices, test_indices)
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]

The cross-validation can then be implemented easily:



>>> kfold = cross_validation.KFold(len(X_digits), n_folds=3)
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
...          for train, test in kfold]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

To compute the score method of an estimator, the sklearn exposes a helper function:
>>> cross_validation.cross_val_score(svc, X_digits, y_digits, cv=kfold, n_jobs=-1)
array([ 0.93489149,  0.95659432,  0.93989983])

n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.

Cross-validation generators:

KFold (n, k): splits the data into K folds, trains on K-1 and then tests on the left-out fold.
StratifiedKFold (y, k): like KFold, but preserves the class ratios / label distribution within each fold.
LeaveOneOut (n): leaves one observation out.
LeaveOneLabelOut (labels): takes a label array to group observations.
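A minimal sketch of two of these generators, using the cross_validation module imported above (the small y_small array is made up purely for illustration, and the printed folds are omitted):

>>> y_small = [0, 0, 1, 1, 0, 1]
>>> skf = cross_validation.StratifiedKFold(y_small, n_folds=3)
>>> for train, test in skf:
...     print 'Train: %s | test: %s' % (train, test)
>>> loo = cross_validation.LeaveOneOut(len(y_small))   # one observation left out per split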

[32]

Exercise

On the digits dataset, plot the cross-validation score of a SVC estimator with a linear kernel as a function of parameter C (use a logarithmic grid of points, from 1 to 10).
import numpy as np
from sklearn import cross_validation, datasets, svm

digits = datasets.load_digits()
X = digits.data
y = digits.target

svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)

scores = list()
scores_std = list()

Solution: Cross-validation on Digits Dataset Exercise

2.2.3.3. Grid-search and cross-validated estimators

2.2.3.3.1. Grid-search



The sklearn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator during the construction and exposes an estimator API:
>>> from sklearn.grid_search import GridSearchCV
>>> gammas = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.988991985997974
>>> clf.best_estimator_.gamma
9.9999999999999995e-07
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.94228356336260977

By default, the GridSearchCV uses a 3-fold cross-validation. However, if it detects that a classifier is passed, rather than a regressor, it uses a stratified 3-fold.

Nested cross-validation
>>> cross_validation.cross_val_score(clf, X_digits, y_digits)
array([ 0.97996661,  0.98163606,  0.98330551])

Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set gamma and the other one by cross_val_score to measure the prediction performance of the estimator. The resulting scores are unbiased estimates of the prediction score on new data.

Warning: You cannot nest objects with parallel computing (n_jobs different than 1).

2.2.3.3.2. Cross-validated estimators


Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for certain estimators, the sklearn exposes estimators that set their parameter automatically by cross-validation (see Cross-validation: evaluating estimator performance):
>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> diabetes = datasets.load_diabetes()
>>> X_diabetes = diabetes.data
>>> y_diabetes = diabetes.target
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, normalize=False, precompute='auto',
    tol=0.0001, verbose=False)
>>> # The estimator chose automatically its lambda:
>>> lasso.alpha_
0.01318...

These estimators are called similarly to their counterparts, with 'CV' appended to their name.
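For instance, a sketch with the ridge counterpart (the alpha grid is only an illustration, and outputs are omitted):

>>> ridge_cv = linear_model.RidgeCV(alphas=np.logspace(-4, -1, 6))
>>> ridge_cv.fit(X_diabetes, y_diabetes)   # fitted RidgeCV (repr omitted)
>>> ridge_cv.alpha_                        # the alpha retained by cross-validation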

Exercise

On the diabetes dataset, find the optimal regularization parameter alpha. Bonus: How much can you trust the selection of alpha?
import numpy as np
import pylab as pl
from sklearn import cross_validation, datasets, linear_model

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]

lasso = linear_model.Lasso()
alphas = np.logspace(-4, -.5, 30)

Solution: Cross-validation on diabetes Dataset Exercise

2.2.4. Unsupervised learning: seeking representations of the data

2.2.4.1. Clustering: grouping observations together


The problem solved in clustering

Given the iris dataset, if we knew that there were 3 types of iris, but did not have access to a taxonomist to label them, we could try a clustering task: split the observations into well-separated groups called clusters.

2.2.4.1.1. K-means clustering


Note that there exist a lot of different clustering criteria and associated algorithms. The simplest clustering algorithm is K-means.

[33]

>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> y_iris = iris.target

>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(copy_x=True, init='k-means++', ...
>>> print k_means.labels_[::10]
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
>>> print y_iris[::10]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]


Warning: There is absolutely no guarantee of recovering a ground truth. First, choosing the right number of clusters is hard. Second, the algorithm is sensitive to initialization, and can fall into local minima, although in the sklearn package we play many tricks to mitigate this issue.
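A sketch illustrating that sensitivity to initialization (the parameters are chosen only to make the effect easy to trigger, the resulting labeling will vary, and outputs are omitted):

>>> k_means_bad = cluster.KMeans(n_clusters=3, init='random', n_init=1, random_state=1)
>>> k_means_bad.fit(X_iris)      # repr output omitted
>>> k_means_bad.labels_[::10]    # may disagree with the run above beyond a label permutation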

[34]

[35]

[36]

Bad initialization / 8 clusters / Ground truth

Don't over-interpret clustering results

Application example: vector quantization

Clustering in general and KMeans, in particular, can be seen as a way of choosing a small number of exemplars to compress the information. The problem is sometimes known as vector quantization [37]. For instance, this can be used to posterize an image:
>>> import scipy as sp
>>> try:
...    lena = sp.lena()
... except AttributeError:
...    from scipy import misc
...    lena = misc.lena()

>>> X = lena.reshape((-1, 1))  # We need an (n_sample, n_feature) array
>>> k_means = cluster.KMeans(n_clusters=5, n_init=1)
>>> k_means.fit(X)
KMeans(copy_x=True, init='k-means++', ...
>>> values = k_means.cluster_centers_.squeeze()
>>> labels = k_means.labels_
>>> lena_compressed = np.choose(labels, values)
>>> lena_compressed.shape = lena.shape

[38]

[39]

[40]

[41]


Raw image / K-means quantization / Equal bins / Image histogram

2.2.4.1.2. Hierarchical agglomerative clustering: Ward


A hierarchical clustering method is a type of cluster analysis that aims to build a hierarchy of clusters. In general, the various approaches of this technique are either Agglomerative (bottom-up approaches) or Divisive (top-down approaches). For estimating a large number of clusters, top-down approaches are both statistically ill-posed and slow, since they start with all observations in one cluster, which is then split recursively. Agglomerative hierarchical clustering is a bottom-up approach that successively merges observations together and is particularly useful when the clusters of interest are made of only a few observations. Ward clustering minimizes a criterion similar to k-means in a bottom-up approach. When the number of clusters is large, it is much more computationally efficient than k-means.
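A minimal sketch of Ward clustering on the iris data loaded earlier, without any connectivity constraint yet (that is the subject of the next section; outputs are omitted):

>>> from sklearn.cluster import Ward
>>> ward = Ward(n_clusters=3).fit(X_iris)   # repr output omitted
>>> ward.labels_[::10]                      # cluster assignment for every 10th observation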

2.2.4.1.2.1. Connectivity-constrained clustering


With Ward clustering, it is possible to specify which samples can be clustered together by giving a connectivity graph. Graphs in the scikit are represented by their adjacency matrix. Often, a sparse matrix is used. This can be useful, for instance, to retrieve connected regions (sometimes also referred to as connected components) when clustering an image:

[42]

import time

import numpy as np
import scipy as sp

from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import Ward

###############################################################################
# Generate data
lena = sp.misc.lena()
# Downsample the image by a factor of 4
lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
X = np.reshape(lena, (-1, 1))

###############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*lena.shape)


###############################################################################
# Compute clustering
print "Compute structured hierarchical clustering..."
st = time.time()
n_clusters = 15  # number of regions
ward = Ward(n_clusters=n_clusters, connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, lena.shape)
print "Elapsed time: ", time.time() - st
print "Number of pixels: ", label.size
print "Number of clusters: ", np.unique(label).size

2.2.4.1.2.2. Feature agglomeration


We have seen that sparsity could be used to mitigate the curse of dimensionality, i.e. an insufficient amount of observations compared to the number of features. Another approach is to merge together similar features: feature agglomeration. This approach can be implemented by clustering in the feature direction, in other words clustering the transposed data.

[43]

>>> digits = datasets.load_digits()
>>> images = digits.images
>>> X = np.reshape(images, (len(images), -1))
>>> connectivity = grid_to_graph(*images[0].shape)

>>> agglo = cluster.WardAgglomeration(connectivity=connectivity,
...                                   n_clusters=32)
>>> agglo.fit(X)
WardAgglomeration(compute_full_tree='auto',...
>>> X_reduced = agglo.transform(X)
>>> X_approx = agglo.inverse_transform(X_reduced)
>>> images_approx = np.reshape(X_approx, images.shape)

transform and inverse_transform methods

Some estimators expose a transform method, for instance to reduce the dimensionality of the dataset.

2.2.4.2. Decompositions: from a signal to components and loadings


Components and loadings

If X is our multivariate data, then the problem that we are trying to solve is to

rewrite it on a different observational basis: we want to learn loadings L and a set of components C such that X = L C. Different criteria exist to choose the components.

2.2.4.2.1. Principal component analysis: PCA


Principal component analysis (PCA) selects the successive components that explain the maximum variance in the signal.
[44]

[45]

[46]

The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat.

When used to transform data, PCA can reduce the dimensionality of the data by projecting on a principal subspace.
>>> # Create a signal with only 2 useful dimensions
>>> x1 = np.random.normal(size=100)
>>> x2 = np.random.normal(size=100)
>>> x3 = x1 + x2
>>> X = np.c_[x1, x2, x3]
>>> from sklearn import decomposition
>>> pca = decomposition.PCA()
>>> pca.fit(X)
PCA(copy=True, n_components=None, whiten=False)
>>> print pca.explained_variance_
[  2.18565811e+00   1.19346747e+00   8.43026679e-32]

>>> pca.n_components = 2
>>> # As we can see, only the 2 first components are useful
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape
(100, 2)

2.2.4.2.2. Independent Component Analysis: ICA


Independent component analysis (ICA) selects components so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals:

[47]

>>> # Generate sample data
>>> time = np.linspace(0, 10, 2000)
>>> s1 = np.sin(2 * time)  # Signal 1 : sinusoidal signal
>>> s2 = np.sign(np.sin(3 * time))  # Signal 2 : square signal
>>> S = np.c_[s1, s2]
>>> S += 0.2 * np.random.normal(size=S.shape)  # Add noise
>>> S /= S.std(axis=0)  # Standardize data
>>> # Mix data
>>> A = np.array([[1, 1], [0.5, 2]])  # Mixing matrix
>>> X = np.dot(S, A.T)  # Generate observations
>>> # Compute ICA
>>> ica = decomposition.FastICA()
>>> S_ = ica.fit(X).transform(X)  # Get the estimated sources
>>> A_ = ica.get_mixing_matrix()  # Get estimated mixing matrix
>>> np.allclose(X, np.dot(S_, A_.T))
True

2.2.5. Putting it all together

2.2.5.1. Pipelining
We have seen that some estimators can transform data and that some estimators can predict variables. We can also create combined estimators:


[48]

import numpy as np
import pylab as pl

from sklearn import linear_model, decomposition, datasets

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
from sklearn.pipeline import Pipeline
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

###############################################################################
# Plot the PCA spectrum
pca.fit(X_digits)

pl.figure(1, figsize=(4, 3))
pl.clf()
pl.axes([.2, .2, .7, .7])
pl.plot(pca.explained_variance_, linewidth=2)
pl.axis('tight')
pl.xlabel('n_components')
pl.ylabel('explained_variance_')

###############################################################################
# Prediction
from sklearn.grid_search import GridSearchCV

n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)

# Parameters of pipelines can be set using __ separated parameter names:
estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_digits, y_digits)

pl.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
           linestyle=':', label='n_components chosen')
pl.legend(prop=dict(size=12))
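For reference, the combined pipe object above behaves like any other estimator; a short usage sketch (reusing the objects defined in the script, outputs omitted):

pipe.fit(X_digits, y_digits)   # runs the PCA step, then fits the logistic regression on the reduced data
pipe.predict(X_digits[:5])     # new data is passed through the same PCA before prediction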

2.2.5.2. Face recognition with eigenfaces


The dataset used in this example is a preprocessed excerpt of the Labeled Faces
in the Wild, also known as LFW [49]: http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz [50] (233MB)


" " " = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = F a c e sr e c o g n i t i o ne x a m p l eu s i n ge i g e n f a c e sa n dS V M s = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = T h ed a t a s e tu s e di nt h i se x a m p l ei sap r e p r o c e s s e de x c e r p to ft h e " L a b e l e dF a c e si nt h eW i l d " ,a k aL F W _ : h t t p : / / v i s w w w . c s . u m a s s . e d u / l f w / l f w f u n n e l e d . t g z( 2 3 3 M B ) . ._ L F W :h t t p : / / v i s w w w . c s . u m a s s . e d u / l f w / E x p e c t e dr e s u l t sf o rt h et o p5m o s tr e p r e s e n t e dp e o p l ei nt h ed a t a s e t : : p r e c i s i o n G e r h a r d _ S c h r o e d e r D o n a l d _ R u m s f e l d T o n y _ B l a i r C o l i n _ P o w e l l G e o r g e _ W _ B u s h a v g/t o t a l 0 . 9 1 0 . 8 4 0 . 6 5 0 . 7 8 0 . 9 3 0 . 8 6 r e c a l l f 1 s c o r e 0 . 7 5 0 . 8 2 0 . 8 2 0 . 8 8 0 . 8 6 0 . 8 4 0 . 8 2 0 . 8 3 0 . 7 3 0 . 8 3 0 . 9 0 0 . 8 5 s u p p o r t 2 8 3 3 3 4 5 8 1 2 9 2 8 2

" " " p r i n t_ _ d o c _ _ f r o mt i m ei m p o r tt i m e i m p o r tl o g g i n g i m p o r tp y l a ba sp l f r o ms k l e a r n . c r o s s _ v a l i d a t i o ni m p o r tt r a i n _ t e s t _ s p l i t f r o ms k l e a r n . d a t a s e t si m p o r tf e t c h _ l f w _ p e o p l e f r o ms k l e a r n . g r i d _ s e a r c hi m p o r tG r i d S e a r c h C V f r o ms k l e a r n . m e t r i c si m p o r tc l a s s i f i c a t i o n _ r e p o r t f r o ms k l e a r n . m e t r i c si m p o r tc o n f u s i o n _ m a t r i x f r o ms k l e a r n . d e c o m p o s i t i o ni m p o r tR a n d o m i z e d P C A f r o ms k l e a r n . s v mi m p o r tS V C #D i s p l a yp r o g r e s sl o g so ns t d o u t l o g g i n g . b a s i c C o n f i g ( l e v e l = l o g g i n g . I N F O ,f o r m a t = ' % ( a s c t i m e ) s% ( m e s s a g e ) s ' )

###############################################################################
# Download the data, if not already on disk and load it as numpy arrays
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape

# for machine learning we use the 2 data directly (as relative pixel
# positions info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]

# the label to predict is the id of the person

y = lfw_people.target


target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print "Total dataset size:"
print "n_samples: %d" % n_samples
print "n_features: %d" % n_features
print "n_classes: %d" % n_classes

###############################################################################
# Split into a training set and a test set using a stratified k fold

# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

###############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150

print "Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0])
t0 = time()
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)
print "done in %0.3fs" % (time() - t0)

eigenfaces = pca.components_.reshape((n_components, h, w))

print "Projecting the input data on the eigenfaces orthonormal basis"
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print "done in %0.3fs" % (time() - t0)

###############################################################################
# Train a SVM classification model

print "Fitting the classifier to the training set"
t0 = time()
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
clf = GridSearchCV(SVC(kernel='rbf', class_weight='auto'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_

###############################################################################
# Quantitative evaluation of the model quality on the test set

print "Predicting the people names on the testing set"
t0 = time()
y_pred = clf.predict(X_test_pca)
print "done in %0.3fs" % (time() - t0)

print classification_report(y_test, y_pred, target_names=target_names)
print confusion_matrix(y_test, y_pred, labels=range(n_classes))


###############################################################################
# Qualitative evaluation of the predictions using matplotlib

def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
        pl.xticks(())
        pl.yticks(())

# plot the result of the prediction on a portion of the test set

def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]

plot_gallery(X_test, prediction_titles, h, w)

# plot the gallery of the most significative eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

pl.show()

[51]

[52]

Prediction

Eigenfaces

Expected results for the top 5 most represented people in the dataset:
                   precision    recall  f1-score   support

Gerhard_Schroeder       0.91      0.75      0.82        28
  Donald_Rumsfeld       0.84      0.82      0.83        33
       Tony_Blair       0.65      0.82      0.73        34
     Colin_Powell       0.78      0.88      0.83        58
    George_W_Bush       0.93      0.86      0.90       129

      avg / total       0.86      0.84      0.85       282

2.2.5.3. Open problem: Stock Market Structure


Can we predict the variation in stock prices for Google over a given time frame?

Visualizing the stock market structure

2.2.6. Finding help

2.2.6.1. The project mailing list


If you encounter a bug with scikit-learn or something that needs clarification in the docstring or the online documentation, please feel free to ask on the Mailing List [53].

2.2.6.2. Q&A communities with Machine Learning practitioners


Metaoptimize/QA: A forum for Machine Learning, Natural Language Processing and other Data Analytics discussions (similar to what Stackoverflow is for developers): http://metaoptimize.com/qa
[54]

A good starting point is the discussion on good freely available textbooks on machine learning [55].

Quora.com: Quora has a topic for Machine Learning related questions that also features some interesting discussions: http://quora.com/Machine-Learning [56]. Have a look at the best questions section, e.g.: What are some good resources for learning about machine learning [57].

An excellent free online course for Machine Learning taught by Professor Andrew Ng of Stanford: https://www.coursera.org/course/ml [58]

Another excellent free online course that takes a more general approach to Artificial Intelligence: http://www.udacity.com/overview/Course/cs271/CourseRev/1 [59]

1. http://en.wikipedia.org/wiki/Machine_learning
2. http://en.wikipedia.org/wiki/Statistical_inference
3. http://www.scipy.org/
4. http://www.scipy.org/
5. http://matplotlib.sourceforge.net/
6. http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html
7. http://en.wikipedia.org/wiki/Estimator
8. http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm


9. http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
10. http://en.wikipedia.org/wiki/Curse_of_dimensionality
11. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
12. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge_variance.html
13. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge_variance.html
14. http://en.wikipedia.org/wiki/Overfitting
15. http://en.wikipedia.org/wiki/Regularization_%28machine_learning%29
16. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_3d.html
17. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_3d.html
18. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_3d.html
19. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_3d.html
20. http://en.wikipedia.org/wiki/Coordinate_descent
21. http://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html
22. http://en.wikipedia.org/wiki/Iris_flower_data_set
23. http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
24. http://scikit-learn.org/stable/_downloads/plot_digits_classification_exercise1.py
25. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html
26. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html
27. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_iris.html
28. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
29. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
30. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
31. http://scikit-learn.org/stable/_downloads/plot_iris_exercise1.py
32. http://scikit-learn.org/stable/auto_examples/exercises/plot_cv_digits.html
33. http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
34. http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
35. http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
36. http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
37. http://en.wikipedia.org/wiki/Vector_quantization
38. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_compress.html
39. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_compress.html
40. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_compress.html
41. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_compress.html
42. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html
43. http://scikit-learn.org/stable/auto_examples/cluster/plot_digits_agglomeration.html
44. http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html
45. http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html
46. http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html
47. http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_blind_source_separation.html
48. http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html
49. http://vis-www.cs.umass.edu/lfw/
50. http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz
51. http://scikit-learn.org/stable/_images/plot_face_recognition_1.png
52. http://scikit-learn.org/stable/_images/plot_face_recognition_2.png
53. http://scikit-learn.sourceforge.net/support.html
54. http://metaoptimize.com/qa

55. http://metaoptimize.com/qa/questions/186/good-freely-available-textbooks-on-machine-learning

56. http://quora.com/Machine-Learning
57. http://www.quora.com/What-are-some-good-resources-for-learning-about-machine-learning
58. https://www.coursera.org/course/ml
59. http://www.udacity.com/overview/Course/cs271/CourseRev/1

