2.2. A tutorial on statistical-learning for scientific data processing (scikit-learn 0.13.1 documentation)
scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy [3], scipy [4], matplotlib [5]).
2.2.1.1. Datasets
scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as lists of multidimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis. A simple example shipped with scikit-learn: the iris dataset.
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)
It is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width, as detailed in iris.DESCR. When the data is not initially in the (n_samples, n_features) shape, it needs to be preprocessed in order to be used by scikit-learn. An example of reshaping data is the digits dataset:
(Figure: the digits dataset [6])
To use this dataset with scikit-learn, we transform each 8x8 image into a feature vector of length 64:
>>> digits = datasets.load_digits()
>>> data = digits.images.reshape((digits.images.shape[0], -1))
Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the corresponding attribute:
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1
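For instance, a minimal sketch of the same pattern with a real estimator (KNeighborsClassifier, which appears later in this tutorial; the parameter values here are purely illustrative):

>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=3)  # set at instantiation
>>> knn.n_neighbors
3
>>> knn.n_neighbors = 5  # or modify the attribute afterwards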
Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending with an underscore:
>>> estimator.estimated_param_
The problem solved in supervised learning

Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually
called target or labels. Most often, y is a 1D array of length n_samples.

All supervised estimators [7] in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.

Vocabulary: classification and regression

If the prediction task is to classify the observations in a finite set of labels, in other words to name the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task. In scikit-learn, y is a vector of integers for classification tasks.

Note: See the Introduction to machine learning with scikit-learn tutorial for a quick run-through on the basic machine-learning vocabulary used within scikit-learn.
(Figure: nearest-neighbor classification on the iris dataset [9])

>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris_X = iris.data
>>> iris_y = iris.target
>>> # Split iris data in train and test data
>>> # A random permutation, to split the data randomly
>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test = iris_X[indices[-10:]]
>>> iris_y_test = iris_y[indices[-10:]]
>>> # Create and fit a nearest-neighbor classifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
           warn_on_equidistant=True, weights='uniform')
>>> knn.predict(iris_X_test)
array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
>>> iris_y_test
array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
Linear regression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.
(Figure: an ordinary least squares fit [11])

Linear models: y = Xβ + ε, where X is the observed data, y the target variable, β the coefficients, and ε the observation noise.
>>> from sklearn import linear_model
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X_train = diabetes.data[:-20]
>>> diabetes_X_test = diabetes.data[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]
>>> regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> print regr.coef_
[   0.30349955 -237.63931533  510.53060544  327.73698041 -814.13170937
  492.81458798  102.84845219  184.60648906  743.51961675   76.09517222]
>>> # The mean square error
>>> np.mean((regr.predict(diabetes_X_test) - diabetes_y_test)**2)
2004.56760268...
>>> # Explained variance score: 1 is perfect prediction
>>> # and 0 means that there is no linear relationship
>>> # between X and Y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...
2.2.2.2.2. Shrinkage
If there are few data points per dimension, noise in the observations induces high variance:
(Figure: linear fits on noisy resamplings of two points [12])

>>> X = np.c_[ .5, 1].T
>>> y = [.5, 1]
>>> test = np.c_[ 0, 2].T
>>> regr = linear_model.LinearRegression()
>>> import pylab as pl
>>> pl.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1*np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     pl.plot(test, regr.predict(test))
...     pl.scatter(this_X, y, s=3)
A solution in high-dimensional statistical learning is to shrink the regression coefficients to zero: any two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge regression:
(Figure: ridge fits on the same resamplings [13])

>>> regr = linear_model.Ridge(alpha=.1)
>>> pl.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1*np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     pl.plot(test, regr.predict(test))
...     pl.scatter(this_X, y, s=3)
This is an example of the bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower the variance. We can choose alpha to minimize left-out error, this time using the diabetes dataset rather than our synthetic data:
>>> alphas = np.logspace(-4, -1, 6)
>>> print [regr.set_params(alpha=alpha
...            ).fit(diabetes_X_train, diabetes_y_train,
...            ).score(diabetes_X_test, diabetes_y_test) for alpha in alphas]
[0.5851110683883..., 0.5852073015444..., 0.5854677540698...,
 0.5855512036503..., 0.5830717085554..., 0.57058999437...]
Capturing noise in the fitted parameters that prevents the model from generalizing to new data is called overfitting [14]. The bias introduced by the ridge regression is called a regularization [15].
2.2.2.2.3. Sparsity
(Figures: fitting only features 1 and 2 [16] [17] [18] [19])
A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions and one for the target variable). It is hard to develop an intuition on such a representation, but it may be useful to keep in mind that it would be a fairly empty space. We can see that, although feature 2 has a strong coefficient on the full model, it conveys little information on y when considered with feature 1. To improve the conditioning of the problem (i.e. to mitigate the curse of dimensionality [10]), it would be interesting to select only the informative features and set non-informative ones, like feature 2, to 0. Ridge regression will decrease their contribution, but not set them to zero. Another penalization approach, called Lasso (least absolute shrinkage and selection operator), can set some coefficients to zero. Such methods are called sparse methods, and sparsity can be seen as an application of Occam's razor: prefer simpler models.
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha
...            ).fit(diabetes_X_train, diabetes_y_train)
...            .score(diabetes_X_test, diabetes_y_test)
...           for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute='auto',
   tol=0.0001, warm_start=False)
>>> print regr.coef_
[   0.         -212.43764548  517.19478111  313.77959962 -160.83039827
   -0.         -187.19554705   69.38229038  508.66011217   71.84239008]
Different algorithms for the same problem

Different algorithms can be used to solve the same mathematical problem. For instance, the Lasso object in scikit-learn solves the lasso regression problem using a coordinate descent [20] method, which is efficient on large datasets. However, scikit-learn also provides the LassoLars object, using the LARS algorithm, which is very efficient for problems in which the estimated weight vector is very sparse (i.e. problems with very few observations).
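As a sketch of the alternative mentioned above (assuming only that LassoLars shares the estimator API shown for Lasso; the alpha value is illustrative and the constructor output is abbreviated):

>>> regr = linear_model.LassoLars(alpha=.1)
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LassoLars(alpha=0.1, ...)
>>> regr.score(diabetes_X_test, diabetes_y_test)  # R^2 on the held-out split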
2.2.2.2.4. Classification
(Figure: logistic regression [21])
For classification, as in the iris labeling task [22], linear regression is not the right approach, as it will give too much weight to data far from the decision frontier. A linear approach is to fit a sigmoid function or logistic function: y = sigmoid(Xβ - offset) + ε = 1 / (1 + exp(-Xβ + offset)) + ε.
(Figure: logistic regression on the iris dataset [23])
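A minimal sketch of fitting such a model on the iris training split defined earlier (the C value is illustrative; its meaning is explained in the box below, and the constructor output is abbreviated):

>>> logistic = linear_model.LogisticRegression(C=1e5)
>>> logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, ...)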
Multiclass classification

If you have several classes to predict, an option often used is to fit one-versus-all classifiers and then use a voting heuristic for the final decision.

Shrinkage and sparsity with logistic regression

The C parameter controls the amount of regularization in the LogisticRegression object: a large value for C results in less regularization. penalty='l2' gives shrinkage (i.e. non-sparse coefficients), while penalty='l1' gives sparsity.

Exercise

Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last 10% and test prediction performance on these observations.
from sklearn import datasets, neighbors, linear_model

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
(Figures: SVM maximum-margin separating hyperplanes, regularized and unregularized [25] [26]; linear SVM on the iris dataset [27])
SVMs can be used in regression (SVR, Support Vector Regression) or in classification (SVC, Support Vector Classification).
>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, shrinking=True, tol=0.001,
  verbose=False)
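A minimal regression sketch under the same API (reusing the diabetes split from the linear-model section; the kernel choice is illustrative and outputs are abbreviated):

>>> svr = svm.SVR(kernel='linear')
>>> svr.fit(diabetes_X_train, diabetes_y_train)
SVR(C=1.0, ...)
>>> svr.predict(diabetes_X_test)  # continuous values rather than class labels
array([...])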
Warning: Normalizing data. For many estimators, including the SVMs, having datasets with unit standard deviation for each feature is important to get good predictions.
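A sketch of one way to do this, assuming sklearn.preprocessing.StandardScaler is available in this version (the class was called Scaler in some earlier releases):

>>> from sklearn import preprocessing
>>> scaler = preprocessing.StandardScaler().fit(iris_X_train)  # learn per-feature mean and std
>>> iris_X_train_scaled = scaler.transform(iris_X_train)
>>> iris_X_test_scaled = scaler.transform(iris_X_test)  # apply the same transform to test data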
Polynomial kernel [29]:

>>> svc = svm.SVC(kernel='poly',
...               degree=3)
>>> # degree: polynomial degree
RBF kernel (Radial Basis Function) [30]:

>>> svc = svm.SVC(kernel='rbf')
>>> # gamma: inverse of size of
>>> # radial kernel
Interactive example

See the SVM GUI to download svm_gui.py; add data points of both classes with the right and left buttons, fit the model, and change parameters and data.

Exercise

Try classifying classes 1 and 2 from the iris dataset with SVMs, with the first 2 features. Leave out 10% of each class and test prediction performance on these observations.

Warning: the classes are ordered; do not leave out the last 10%, or you would be testing on only one class.

Hint: You can use the decision_function method on a grid to get intuitions.
iris = datasets.load_iris()
X = iris.data
y = iris.target

X = X[y != 0, :2]
y = y[y != 0]
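A sketch of the hint above (not part of the original exercise code; it assumes an SVC fitted on the two selected features, and the grid resolution is illustrative):

from sklearn import svm
import numpy as np

svc = svm.SVC(kernel='linear').fit(X, y)
# Evaluate the decision function on a dense grid covering the two features
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
Z = svc.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)  # one value per grid point, ready for pl.contour(xx, yy, Z)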
To get a better measure of prediction accuracy (which we can use as a proxy for goodness of fit of the model), we can successively split the data into folds that we use for training and testing:
>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
...     # We use 'list' to copy, in order to 'pop' later on
...     X_train = list(X_folds)
...     X_test = X_train.pop(k)
...     X_train = np.concatenate(X_train)
...     y_train = list(y_folds)
...     y_test = y_train.pop(k)
...     y_train = np.concatenate(y_train)
...     scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
To compute the score of an estimator on each successive fold, sklearn exposes a helper function:
>>> from sklearn import cross_validation
>>> kfold = cross_validation.KFold(len(X_digits), n_folds=3)
>>> cross_validation.cross_val_score(svc, X_digits, y_digits, cv=kfold, n_jobs=-1)
array([ 0.93489149,  0.95659432,  0.93989983])
n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.

Cross-validation generators
KFold(n, k): splits the data into K folds; train on K-1 of them, then test on the left-out fold.
StratifiedKFold(y, k): preserves the class ratios / label distribution within each fold.
LeaveOneOut(n): leaves one observation out.
LeaveOneLabelOut(labels): takes a label array to group observations.
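To see what a generator yields, a small sketch (the fold-count argument is passed positionally because its keyword name varied across versions; the index output assumes this version's defaults):

>>> from sklearn import cross_validation
>>> k_fold = cross_validation.KFold(6, 3)  # n=6 samples, 3 folds
>>> for train_indices, test_indices in k_fold:
...     print 'Train: %s | test: %s' % (train_indices, test_indices)
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]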
(Figure: cross-validation score of a linear-kernel SVC as a function of C [32])

Exercise
On the digits dataset, plot the cross-validation score of a SVC estimator with a linear kernel as a function of parameter C (use a logarithmic grid of points, from 1 to 10).
import numpy as np
from sklearn import cross_validation, datasets, svm

digits = datasets.load_digits()
X = digits.data
y = digits.target

svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)

scores = list()
scores_std = list()
The sklearn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator during construction and exposes an estimator API:
>>> from sklearn.grid_search import GridSearchCV
>>> gammas = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.988991985997974
>>> clf.best_estimator_.gamma
9.9999999999999995e-07
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.94228356336260977
By default, the GridSearchCV uses a 3-fold cross-validation. However, if it detects that a classifier is passed, rather than a regressor, it uses a stratified 3-fold.

Nested cross-validation
>>> cross_validation.cross_val_score(clf, X_digits, y_digits)
array([ 0.97996661,  0.98163606,  0.98330551])
Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set gamma and the other one by cross_val_score to measure the prediction performance of the estimator. The resulting scores are unbiased estimates of the prediction score on new data.

Warning: You cannot nest objects with parallel computing (n_jobs different than 1).
These estimators are called similarly to their counterparts, with 'CV' appended to their name (a short sketch follows after the exercise code below).

Exercise
On the diabetes dataset, find the optimal regularization parameter alpha. Bonus: How much can you trust the selection of alpha?
import numpy as np
import pylab as pl

from sklearn import cross_validation, datasets, linear_model

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]

lasso = linear_model.Lasso()
alphas = np.logspace(-4, -.5, 30)
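As referenced above, a minimal sketch of a cross-validated estimator (using the diabetes subset loaded in the snippet above; output is omitted, since the chosen alpha depends on the data):

>>> lasso_cv = linear_model.LassoCV().fit(X, y)
>>> lasso_cv.alpha_  # the regularization parameter chosen by internal cross-validation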
(Figure: K-means clustering on the iris dataset [33])

>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> y_iris = iris.target

>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(copy_x=True, init='k-means++', ...
>>> print k_means.labels_[::10]
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
>>> print y_iris[::10]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
Warning: There is absolutely no guarantee of recovering a ground truth. First, choosing the right number of clusters is hard. Second, the algorithm is sensitive to initialization and can fall into local minima, although scikit-learn employs several tricks to mitigate this issue.
(Figures: K-means on iris with a bad initialization [34], 8 clusters [35], and the ground truth [36]. Don't over-interpret clustering results.)

Application example: vector quantization
Clustering in general, and KMeans in particular, can be seen as a way of choosing a small number of exemplars to compress the information. This problem is sometimes known as vector quantization [37]. For instance, this can be used to posterize an image:
>>> import scipy as sp
>>> try:
...     lena = sp.lena()
... except AttributeError:
...     from scipy import misc
...     lena = misc.lena()
>>> X = lena.reshape((-1, 1))  # We need an (n_sample, n_feature) array
>>> k_means = cluster.KMeans(n_clusters=5, n_init=1)
>>> k_means.fit(X)
KMeans(copy_x=True, init='k-means++', ...
>>> values = k_means.cluster_centers_.squeeze()
>>> labels = k_means.labels_
>>> lena_compressed = np.choose(labels, values)
>>> lena_compressed.shape = lena.shape
(Figures: raw image [38], K-means quantization [39], equal bins [40], and image histogram [41])
(Example: Ward hierarchical clustering for image segmentation [42])

import time

import numpy as np
import scipy as sp

from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import Ward

###############################################################################
# Generate data
lena = sp.misc.lena()
# Downsample the image by a factor of 4
lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
X = np.reshape(lena, (-1, 1))

###############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*lena.shape)
###############################################################################
# Compute clustering
print "Compute structured hierarchical clustering..."
st = time.time()
n_clusters = 15  # number of regions
ward = Ward(n_clusters=n_clusters, connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, lena.shape)
print "Elapsed time: ", time.time() - st
print "Number of pixels: ", label.size
print "Number of clusters: ", np.unique(label).size
(Figure: feature agglomeration on the digits images [43])

>>> digits = datasets.load_digits()
>>> images = digits.images
>>> X = np.reshape(images, (len(images), -1))
>>> connectivity = grid_to_graph(*images[0].shape)

>>> agglo = cluster.WardAgglomeration(connectivity=connectivity,
...                                   n_clusters=32)
>>> agglo.fit(X)
WardAgglomeration(compute_full_tree='auto',...
>>> X_reduced = agglo.transform(X)
>>> X_approx = agglo.inverse_transform(X_reduced)
>>> images_approx = np.reshape(X_approx, images.shape)
transform and inverse_transform methods

Some estimators expose a transform method, for instance to reduce the dimensionality of the dataset.
If X is our data, one possibility is to rewrite it on a different observational basis: we want to learn loadings L and a set of components C such that X = L C. Different criteria exist to choose the components.
(Figures: a point cloud and its principal axes [45] [46])

The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat. When used to transform data, PCA can reduce the dimensionality of the data by projecting on a principal subspace.
>>> # Create a signal with only 2 useful dimensions
>>> x1 = np.random.normal(size=100)
>>> x2 = np.random.normal(size=100)
>>> x3 = x1 + x2
>>> X = np.c_[x1, x2, x3]

>>> from sklearn import decomposition
>>> pca = decomposition.PCA()
>>> pca.fit(X)
PCA(copy=True, n_components=None, whiten=False)
>>> print pca.explained_variance_
[  2.18565811e+00   1.19346747e+00   8.43026679e-32]
>>> # As we can see, only the 2 first components are useful
>>> pca.n_components = 2
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape
(100, 2)
(Figure: blind source separation with ICA [47])

>>> # Generate sample data
>>> time = np.linspace(0, 10, 2000)
>>> s1 = np.sin(2 * time)  # Signal 1 : sinusoidal signal
>>> s2 = np.sign(np.sin(3 * time))  # Signal 2 : square signal
>>> S = np.c_[s1, s2]
>>> S += 0.2 * np.random.normal(size=S.shape)  # Add noise
>>> S /= S.std(axis=0)  # Standardize data
>>> # Mix data
>>> A = np.array([[1, 1], [0.5, 2]])  # Mixing matrix
>>> X = np.dot(S, A.T)  # Generate observations

>>> # Compute ICA
>>> ica = decomposition.FastICA()
>>> S_ = ica.fit(X).transform(X)  # Get the estimated sources
>>> A_ = ica.get_mixing_matrix()  # Get estimated mixing matrix
>>> np.allclose(X, np.dot(S_, A_.T))
True
2.2.5.1. Pipelining
We have seen that some estimators can transform data and that some estimators can predict variables. We can also create combined estimators:
(Example: a pipeline chaining PCA and logistic regression [48])

import numpy as np
import pylab as pl

from sklearn import linear_model, decomposition, datasets

logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()

from sklearn.pipeline import Pipeline
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

###############################################################################
# Plot the PCA spectrum
pca.fit(X_digits)

pl.figure(1, figsize=(4, 3))
pl.clf()
pl.axes([.2, .2, .7, .7])
pl.plot(pca.explained_variance_, linewidth=2)
pl.axis('tight')
pl.xlabel('n_components')
pl.ylabel('explained_variance_')

###############################################################################
# Prediction
from sklearn.grid_search import GridSearchCV

n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)

# Parameters of pipelines can be set using '__' separated parameter names:
estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_digits, y_digits)

pl.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
           linestyle=':', label='n_components chosen')
pl.legend(prop=dict(size=12))
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print "Total dataset size:"
print "n_samples: %d" % n_samples
print "n_features: %d" % n_features
print "n_classes: %d" % n_classes
###############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150

print "Extracting the top %d eigenfaces from %d faces" % (
    n_components, X_train.shape[0])
t0 = time()
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)
print "done in %0.3fs" % (time() - t0)

eigenfaces = pca.components_.reshape((n_components, h, w))

print "Projecting the input data on the eigenfaces orthonormal basis"
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print "done in %0.3fs" % (time() - t0)
###############################################################################
# Train a SVM classification model

print "Fitting the classifier to the training set"
t0 = time()
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
clf = GridSearchCV(SVC(kernel='rbf', class_weight='auto'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_
###############################################################################
# Qualitative evaluation of the predictions using matplotlib

def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
        pl.xticks(())
        pl.yticks(())
(Figures: prediction results on the test set [51]; gallery of eigenfaces [52])
Expected results for the top 5 most represented people in the dataset:
                   precision    recall  f1-score   support

Gerhard_Schroeder       0.91      0.75      0.82        28
  Donald_Rumsfeld       0.84      0.82      0.83        33
       Tony_Blair       0.65      0.82      0.73        34
     Colin_Powell       0.78      0.88      0.83        58
    George_W_Bush       0.93      0.86      0.90       129

      avg / total       0.86      0.84      0.85       282
A good starting point is the discussion on good freely available textbooks on machine learning [55].

Quora.com: Quora has a topic for Machine Learning related questions that also features some interesting discussions: http://quora.com/Machine-Learning [56]. Have a look at the best questions section, e.g. "What are some good resources for learning about machine learning" [57].

An excellent free online course for Machine Learning taught by Professor Andrew Ng of Stanford: https://www.coursera.org/course/ml [58]

Another excellent free online course that takes a more general approach to Artificial Intelligence: http://www.udacity.com/overview/Course/cs271/CourseRev/1 [59]
9. http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
10. http://en.wikipedia.org/wiki/Curse_of_dimensionality
11. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
12. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge_variance.html
13. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge_variance.html
14. http://en.wikipedia.org/wiki/Overfitting
15. http://en.wikipedia.org/wiki/Regularization_%28machine_learning%29
16. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_3d.html
17. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_3d.html
18. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_3d.html
19. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_3d.html
20. http://en.wikipedia.org/wiki/Coordinate_descent
21. http://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html
22. http://en.wikipedia.org/wiki/Iris_flower_data_set
23. http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
24. http://scikit-learn.org/stable/_downloads/plot_digits_classification_exercise1.py
25. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html
26. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html
27. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_iris.html
28. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
29. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
30. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
31. http://scikit-learn.org/stable/_downloads/plot_iris_exercise1.py
32. http://scikit-learn.org/stable/auto_examples/exercises/plot_cv_digits.html
33. http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
34. http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
35. http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
36. http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
37. http://en.wikipedia.org/wiki/Vector_quantization
38. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_compress.html
39. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_compress.html
40. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_compress.html
41. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_compress.html
42. http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html
43. http://scikit-learn.org/stable/auto_examples/cluster/plot_digits_agglomeration.html
44. http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html
45. http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html
46. http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html
47. http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_blind_source_separation.html
48. http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html
49. http://vis-www.cs.umass.edu/lfw/
50. http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz
51. http://scikit-learn.org/stable/_images/plot_face_recognition_1.png
52. http://scikit-learn.org/stable/_images/plot_face_recognition_2.png
53. http://scikit-learn.sourceforge.net/support.html
54. http://metaoptimize.com/qa
55. http://metaoptimize.com/qa/questions/186/good-freely-available-textbooks-on-machine-learning