22. Decision tree advantages
i) Simple and intuitive
ii) Good computational performance
iii) Possible to validate a model using statistical tests

23. Decision tree Limitations
i) Greedy algorithms can get stuck in local optima.
ii) A simple decision structure does not represent all problems effectively.
- XOR / parity
- Produces large trees
iii) Can produce overly complex trees that do not generalise well
- Overfitting

24. Deep Learning
i) Uses a neural network with several layers of nodes between the input and output.
ii) The series of layers between the input and output performs feature detection and processing.
iii) Models the human visual system

29. Error rate
Fraction of misclassified data points: $1-p_{m}$

30. Evaluating a decision tree
i) Error rate
- Proportion of errors across all instances
ii) Resubstitution error
- Error rate on the training data
- Tends to be optimistic due to overfitting
iii) Test set error
- Performance on a previously unseen test set
iv) Holdout
- Reserve some data, often 20%, for testing
- N-fold cross validation
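
The error-rate and holdout ideas above in a minimal NumPy sketch; the dataset, the 80/20 split, and the stand-in predict function are illustrative assumptions, not part of the notes:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # hypothetical feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical labels

# Holdout: reserve roughly 20% of the data for testing
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

def predict(X):
    # stand-in for a trained decision tree: threshold on the first feature
    return (X[:, 0] > 0).astype(int)

# Resubstitution error (training data) vs test set error (unseen data)
resub_error = np.mean(predict(X[train_idx]) != y[train_idx])
test_error = np.mean(predict(X[test_idx]) != y[test_idx])
print(resub_error, test_error)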
31. Function approximation
Approximates a function from a subset of its inputs and outputs

32. Gini Impurity
$Gini(S) = 1-\sum_{c=1}^{l}p_{c}^{2}$
where $p_{c}$ is the proportion of instances in $S$ belonging to class $c$, summed over the $l$ classes.
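
A small sketch of the Gini impurity calculation, assuming the subset is given as a plain array of class labels:

import numpy as np

def gini_impurity(labels):
    # Gini(S) = 1 - sum_c p_c^2, where p_c is the class proportion in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))  # 0.5, maximally impure for two classes
print(gini_impurity([0, 0, 0, 0]))  # 0.0, pure subset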
33. Greedy layer-wise training
Steps for training:
1) Train the first layer unsupervised, without labels
2) Use abundant unlabelled data which is not part of the training set
3) Freeze the first layer parameters and start training the second layer, using the output of the first layer as unsupervised input to the second layer
4) Repeat
5) Unfreeze all weights and fine-tune the network

34. Greedy layer-wise training advantages
i) Avoids many of the problems of training deep networks in a supervised fashion
ii) Each layer gets full learning focus since it is the top layer
iii) Takes advantage of unlabelled data
iv) Helps with the problem of ineffective early-layer learning
v) Helps with deep-net local minima
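
A rough sketch of the training steps above, under the assumption that each layer is pretrained as a tied-weight autoencoder with plain gradient descent; the layer sizes, data, and optimiser details are illustrative, not the reference procedure:

import numpy as np

def train_autoencoder_layer(X, hidden_dim, epochs=200, lr=0.01, seed=0):
    # Train one tied-weight autoencoder layer by plain gradient descent.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], hidden_dim))
    for _ in range(epochs):
        H = np.tanh(X @ W)        # encode
        X_hat = H @ W.T           # decode with tied weights
        err = X_hat - X           # reconstruction error
        dH = (err @ W) * (1.0 - H ** 2)
        grad = X.T @ dH + err.T @ H   # encoder + decoder contributions
        W -= lr * grad / len(X)
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))    # abundant unlabelled data (toy stand-in)

weights, layer_input = [], X
for hidden_dim in (12, 6):        # two stacked layers
    W = train_autoencoder_layer(layer_input, hidden_dim)  # step 1: train unsupervised
    weights.append(W)                                      # step 3: freeze this layer
    layer_input = np.tanh(layer_input @ W)                 # feed its output to the next layer
# step 5: unfreeze all weights and fine-tune the whole network with labelled data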
35. Hidden state Identification
The match between STM and LTM provides an estimate of how close traces in LTM are to STM

36. Hierarchical Clustering
i) Breaks down the dataset into a series of nested clusters
ii) A single cluster at the top containing all the data
iii) N clusters at the bottom, one for each data point
iv) Can be displayed as a dendrogram

37. Hierarchical encoding of sequence data
i) Encode all observation-action-reward tuples in a SOM
ii) Record STM as decaying activation of augmented SOM nodes
iii) Use another SOM to learn the sequence information in the node activations
iv) Auto-encoding
38. Improvement in decision trees
Improvement $I(S_{i1},S_{i2})$ is the difference in quality between the original subset $S_{i}$ and the joint quality of the two new subsets $S_{i1},S_{i2}$:
$I(S_{i1},S_{i2}) = Q(S_{i}) - \sum_{n=1}^{2} \frac{|S_{in}|}{|S_{i}|} Q(S_{in})$
- Gini gain - improvement based on Gini impurity
- Information gain - improvement based on entropy
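
A small sketch of the improvement calculation, taking the quality measure Q to be Gini impurity so that the result is the Gini gain; the split used here is made up:

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def improvement(parent, left, right):
    # I(S_i1, S_i2) = Q(S_i) - sum_n |S_in|/|S_i| * Q(S_in), with Q = Gini impurity
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]      # a perfect split
print(improvement(parent, left, right))   # 0.5: full Gini gain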
39. Instance-Based Methods
i) Doesn't construct a theory
ii) Does classification/regression from the raw data each time
iii) Complexity is limited to the size of the training set
iv) Nearest sequence memory
- Match the current episode against past episodes

40. Instance-Based solutions
i) Nearest sequence memory
- Keep the current episode in STM
- Remember the last N episodes
ii) Can learn very quickly
- Potentially one-shot
iii) Fixed memory requirements
- Can forget solutions

41. Interpreting Covariance
if cov(x,y) > 0 => x, y are positively correlated
if cov(x,y) < 0 => x, y are inversely correlated
if cov(x,y) = 0 => x, y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not guarantee independence)
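
A quick numerical illustration of the three covariance cases, using arbitrary synthetic data:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)

pos = x + 0.5 * noise        # moves with x    -> positive covariance
neg = -x + 0.5 * noise       # moves against x -> negative covariance
ind = rng.normal(size=1000)  # unrelated to x  -> covariance near zero

for name, y in [("pos", pos), ("neg", neg), ("ind", ind)]:
    print(name, np.cov(x, y)[0, 1])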
42. Issues with stopping criteria
i) Decide on a stopping criterion
a) Too specific and it will produce small, underfitted trees - local minima
b) Too loose and it will produce large, overfitting trees

43. Kernel Trick
i) Transforming some datasets can make them linearly separable.
ii) This can be done more rigorously using a mapping function (kernel) which transforms one given space into another
iii) Transformation of functions for easier-to-solve problems
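
A minimal sketch of point i): a 1-D dataset that is not linearly separable becomes separable after the (assumed, hand-picked) feature map phi(x) = (x, x^2); kernels such as the RBF perform this kind of mapping implicitly:

import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = (np.abs(x) > 1).astype(int)   # class 1 on the outside, class 0 in the middle
# No single threshold on x separates the classes ...

phi = np.column_stack([x, x ** 2])        # map to (x, x^2)
# ... but in the transformed space the line x^2 = 1 separates them exactly
print((phi[:, 1] > 1).astype(int) == y)   # all True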
44. K-fold cross validation variants
K is commonly set to 5 or 10 based on experience
- Stratification
- Each subset has the same proportion of data from each class
- Leave-one-out CV
- Fold size 1
- Best use of the data, and the most expensive
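
A bare-bones sketch of K-fold splitting without stratification; setting k equal to the number of data points gives leave-one-out CV (fold size 1). The data size and folds are arbitrary:

import numpy as np

def k_fold_indices(n, k, seed=0):
    # Shuffle the indices and cut them into k roughly equal folds
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

n = 10
for k in (5, n):   # k = n gives leave-one-out (fold size 1)
    folds = k_fold_indices(n, k)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # here: train on train_idx, evaluate on test_idx, average the k error rates
        print(k, i, len(train_idx), len(test_idx))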
45. K-means clustering
K - the number of clusters; K is set by hand
1. Set a suitable number of clusters, K
2. Randomly assign the first K data points to be the centroids of the K clusters
3. Loop
3.1. For each data point, assign the closest cluster centre
3.2. Re-compute the cluster centres as the mean of the assigned points
3.3. Terminate the loop if no assignments change
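
A direct sketch of steps 1-3 in NumPy; the two-blob dataset and K = 2 are illustrative:

import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K data points (here at random) as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignment = np.full(len(X), -1)
    while True:   # Step 3: loop
        # 3.1: assign each point to its closest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # 3.3: terminate if no assignment changed
        if np.array_equal(new_assignment, assignment):
            return centroids, assignment
        assignment = new_assignment
        # 3.2: re-compute each centre as the mean of its assigned points
        for c in range(k):
            if np.any(assignment == c):
                centroids[c] = X[assignment == c].mean(axis=0)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centroids, labels = k_means(X, k=2)
print(centroids)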
Structure:
$P(x_{i} \mid a) = \frac{1}{\sqrt{2 \pi \sigma_{a}^{2}}} \exp\left(-\frac{(x_{i} - \mu_{a})^{2}}{2 \sigma_{a}^{2}}\right)$
$X_{i}$ - input vector features
$N_{j}$ - nodes / map
$W_{i,j}$ - the input vector is fully connected to each node
Node weights are similar to K-means centroids

50. Limitations of Back-propagation
i) Multiple hidden layers
ii) Gets stuck in local optima
- Start weights from random positions
iii) Slow convergence to the optimum
- Large training set needed
iv) Only uses labelled data
v) Error attenuation with deep nets
56. Models
i) Anything that can be used to predict the results of actions
- Simulated experience
ii) Distribution models
- Provide probability distributions over successor states
iii) Sample models
- Return a single successor state drawn from the appropriate probability distribution

57. Monte-Carlo Methods
Assume we don't know the transition function, T
- Model free
Evaluate the policy based on experience
- Obtain episodes by following policy $\pi$
- Start at state $s$ and take actions until the goal is reached
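
A compact sketch of model-free policy evaluation from episodes, using first-visit Monte-Carlo on a tiny made-up chain task; the environment, fixed policy, and undiscounted return are all assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
GOAL = 4          # states 0..4 on a small chain; state 4 is terminal

def step(state):
    # dynamics unknown to the agent: move right with prob 0.8, stay otherwise
    next_state = state + 1 if rng.random() < 0.8 else state
    return next_state, -1.0          # reward of -1 per step

def run_episode(start=0):
    # follow the fixed policy until the goal is reached, recording states and rewards
    states, rewards, s = [], [], start
    while s != GOAL:
        states.append(s)
        s, r = step(s)
        rewards.append(r)
    return states, rewards

# First-visit Monte-Carlo evaluation: average the observed returns per state
returns = {s: [] for s in range(GOAL)}
for _ in range(2000):
    states, rewards = run_episode()
    G = 0.0
    for t in reversed(range(len(states))):
        G = rewards[t] + G                  # undiscounted return (gamma = 1)
        if states[t] not in states[:t]:     # record only the first visit in the episode
            returns[states[t]].append(G)

V = {s: np.mean(g) for s, g in returns.items()}
print(V)   # roughly -(GOAL - s) / 0.8 from each state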
(Figure: K-means clustering)
Given by:
$\theta$ is an angle
$\mu, k$ are the mean and width
$I_{0}(k)$ is the modified Bessel function of order 0
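
These definitions match the von Mises distribution; assuming that is the card this fragment belongs to, the missing density would read
$p(\theta \mid \mu, k) = \frac{1}{2 \pi I_{0}(k)} e^{k \cos(\theta - \mu)}$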