
Machine Learning

Study online at quizlet.com/_49727f

1. Actor-Critic Algorithm
1) Produce an action, a, for the current state, s
2) Observe the next state, s', and the reward, r
3) Update the utility of state s (critic)
4) Update the probability of the action

2. Actor-Critic Methods
A meta-architecture that splits learning into two elements:
- Actor: a function that selects actions from states
- Critic: a function that provides a value estimate for a state and action

3. Actor Only methods
i) Search in policy space
- Run n policies for M episodes
- Keep the best one
- Evolve
ii) Ignore state-action value information
iii) Potentially less efficient

4. Addition Law of probability
$P(A \cup B) = P(A) + P(B) - P(A \land B)$

For example, picking an ace or a diamond:
$P(A) = \frac{4}{52}$
$P(B) = \frac{13}{52}$
$P(A \land B) = \frac{1}{52}$

Therefore:
$P(A \cup B) = \frac{4+13-1}{52} = \frac{16}{52} = \frac{4}{13}$

5. Advantages of Temporal Difference Learning
i) Do not need a model of the environment
ii) Bootstrap
- Learn a guess from a guess
iii) Work in an on-line fashion
- No need to wait for the terminal state
- Update every time step

6. Auto-encoder
Trained with a standard training algorithm and learns to map its input back onto itself.

7. Bayes Rule Formula
$P(A|X) = \frac{P(X|A)P(A)}{P(X)}$
where
$P(X) = P(X,A) + P(X,B)$
$P(X,A) = P(A)P(X|A)$
$P(X,B) = P(B)P(X|B)$

For a cluster $a$ given the data point $x_{i}$ it is:
$p(a|x_{i}) = \frac{p(x_{i}|a)p(a)}{p(x_{i}|a)p(a) + p(x_{i}|b)p(b)}$
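A minimal Python sketch of the two-cluster form of Bayes rule above; the prior and likelihood values are arbitrary illustrations.

```python
# Posterior for cluster a given a point x_i, using the two-cluster
# form of Bayes rule from card 7; the numbers are made up for illustration.

def posterior(likelihood_a, prior_a, likelihood_b, prior_b):
    """p(a|x_i) = p(x_i|a)p(a) / (p(x_i|a)p(a) + p(x_i|b)p(b))."""
    evidence = likelihood_a * prior_a + likelihood_b * prior_b
    return likelihood_a * prior_a / evidence

# Cluster a explains the point better (0.6 vs 0.2); both priors are 0.5.
print(posterior(0.6, 0.5, 0.2, 0.5))  # 0.75
```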


8. Belief state
Because the state, s, cannot be accessed directly, you introduce a belief state.
It is essentially a probability distribution over the possible states.

9. Belief Updates
We can update the estimated belief state using the probabilities in the observation and transition functions.
- Point-based value iteration

10. Bootstrap
i) A good method for small datasets
ii) Creates a training set by sampling with replacement
iii) Samples a dataset of N instances N times to create a new training set of N instances
- Test on all instances

11. Bootstrap properties
i) Pessimistic, due to the high probability of single instances not making it into the training set
ii) Combine with the re-substitution error for a realistic error estimate
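A minimal sketch of the resampling step behind cards 10-11, in plain Python; the toy dataset is an arbitrary illustration.

```python
import random

def bootstrap_sample(data):
    """Draw N instances from a dataset of N instances, with replacement (card 10)."""
    return [random.choice(data) for _ in data]

data = list(range(10))                       # toy dataset of N = 10 instances
train = bootstrap_sample(data)
test = [x for x in data if x not in train]   # instances left out, used for testing

# On average roughly 1/e (about 36.8%) of instances never make it into the
# training set, which is why the bootstrap estimate is pessimistic (card 11).
print(train, test)
```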
13. Classification i) Map training data classes to winning nodes
6. Auto-encoder Trained with a standard training
Using SOMS ii) Find winning node for new test data
algorithm and learns to map input
iii) Classify according to the node's majority
back onto itself.
class from the training data
14. Clustering Idea of grouping patterns so that their
vectors are similar to one another within the
same region.

All require a way to determine similarity


15. Constructing a decision tree
The dataset is split recursively into subsets based on a single input variable, until:
- All subsets have the same class
- Splitting no longer improves prediction
- Subsets are too small
This is top-down induction of decision trees, and it is greedy.

16. Constructivist approach
Doesn't require access to the state space or transition functions.

17. Cost function for a data point
$e_{i} = y_{i} - (mx_{i} + c)$

18. Cost function over a data set
$e = \sum_{i=1}^{n} (y_{i} - (mx_{i} + c))^{2}$
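A minimal sketch of the cost functions in cards 17-18 for a line $y = mx + c$; the data and the (m, c) values are arbitrary illustrations.

```python
def point_error(x_i, y_i, m, c):
    """Card 17: residual e_i = y_i - (m*x_i + c) for one data point."""
    return y_i - (m * x_i + c)

def dataset_cost(xs, ys, m, c):
    """Card 18: sum of squared residuals over the whole dataset."""
    return sum(point_error(x, y, m, c) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.9, 5.1, 7.0]                  # roughly y = 2x + 1
print(dataset_cost(xs, ys, m=2.0, c=1.0))  # small cost: this line fits well
```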
19. Covariance
Measures the strength of the linear relationship between 2 variables.

It is given by:
$E[(x - \mu_{x})(y - \mu_{y})]$
20. Critic Only methods
i) Value function estimation
ii) Trivial action selection
- TD-learning, Q-learning

21. Decision tree
i) A set of ordering rules for classifying data
ii) Each node addresses an input variable
iii) Leaves assign labels or values to the data

22. Decision tree advantages
i) Simple and intuitive
ii) Good computational performance
iii) Possible to validate the model using statistical tests

23. Decision tree Limitations
i) Greedy algorithms can get stuck in local optima
ii) The simple decision structure does not represent all problems effectively
- XOR parity
- Produces large trees
iii) Can produce overly complex trees that do not generalise well
- Overfitting

24. Deep Learning
i) Using a neural network with several layers of nodes between the input and output
ii) The series of layers between the input and output do feature detection and processing
iii) Models the human visual system

25. Dimensionality Reduction
i) Inputs are of dimension i
- The number of input features
ii) Outputs are of dimension 2
- Can be more
A SOM is an example of vector quantization (with 2 < i).

26. Discounted Future Reward
All sequence information is stored.
Can reconstruct all sequences and calculate the discounted reward for all bottom-level o/a/r tuples.

27. Distance Equations for clustering
Euclidean distance:
$d_{ij} = \sqrt{(x_{i}-c_{j})^{T} (x_{i}-c_{j})}$

Mahalanobis distance:
$\sqrt{(x-y)^{T} \Sigma^{-1} (x-y)}$
- Accounts for the fact that the variances in each direction are different
- Accounts for covariance between the variables
- Reduces to Euclidean distance for uncorrelated variables with unit variance

28. Eligibility Traces
i) A record of the most recently experienced states, actions, and rewards
- A sequence of tuples <s,a,r> in chronological order
- Updated on each step of the algorithm
ii) An improved mechanism for temporal credit assignment
- Spreads new information about state-action values through the value function

29. Error rate
The fraction of misclassified data points:
$1 - p_{m}$

30. Evaluating a decision tree
i) Error rate
- The proportion of errors across all instances
ii) Resubstitution error
- The error rate on the training data
- Tends to be optimistic due to overfitting
iii) Test set error
- Performance on a previously unseen test set
iv) Holdout
- Reserve some data, often 20%, for testing
- N-fold cross validation

31. Function approximation
Approximates a function from a subset of inputs and outputs.

32. Gini Impurity
$Gini(S) = 1 - \sum_{c=1}^{l} p_{c}^{2}$
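A minimal sketch of the two purity metrics used by decision trees: Gini impurity (this card) and Shannon entropy (card 81). The class counts are arbitrary illustrations.

```python
from math import log2

def gini(counts):
    """Gini(S) = 1 - sum_c p_c^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy(S) = -sum_c p_c * log2(p_c)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(gini([5, 5]), entropy([5, 5]))    # 0.5, 1.0: a maximally diverse set
print(gini([10, 0]), entropy([10, 0]))  # 0.0, 0.0: a pure set
```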
33. Greedy layer-wise training
Steps for training:
1) Train the first layer using unsupervised learning, without labels
2) Use abundant unlabelled data which is not part of the training set
3) Freeze the first layer's parameters and start training the second layer, using the output of the first layer as the unsupervised input to the second layer
4) Repeat for the remaining layers
5) Unfreeze all weights and fine-tune the network

34. Greedy layer-wise training advantages
i) Avoids many problems of training deep networks in a supervised fashion
ii) Each layer gets the full learning focus, since it is the top layer while being trained
iii) Takes advantage of unlabelled data
iv) Helps with the problem of ineffective early-layer learning
v) Helps with deep-net local minima

35. Hidden state Identification
The match between STM and LTM provides an estimate of how close the traces in LTM are to STM.

36. Hierarchical Clustering
i) Breaks the dataset down into a series of nested clusters
ii) A single cluster at the top holds all the data
iii) N clusters are at the bottom, one for each data point
iv) Can be displayed as a dendrogram

37. Hierarchical encoding of sequence data
i) Encode all observation-action-reward tuples in a SOM
ii) Record STM as decaying activation of the SOM nodes
iii) Use another SOM to learn the sequence information in the node activations
iv) Auto-encoding

38. Improvement in decision trees
The improvement $I(S_{i1},S_{i2})$ is the difference in quality between the original subset $S_{i}$ and the joint quality of the two new subsets $S_{i1}, S_{i2}$:

$I(S_{i1},S_{i2}) = Q(S_{i}) - \sum_{n=1}^{2} \frac{|S_{in}|}{|S_{i}|} Q(S_{in})$

- Gini gain: improvement based on Gini impurity
- Information gain: improvement based on entropy

39. Instance-Based Methods
i) Don't construct a theory
ii) Do classification/regression from the raw data each time
iii) Complexity is limited to the size of the training set
iv) Nearest sequence memory
- Match the current episode against past episodes

40. Instance-Based solutions
i) Nearest sequence memory
- Keep the current episode in STM
- Remember the last N episodes
ii) Can learn very quickly
- Potentially one-shot
iii) Fixed memory requirements
- Can forget solutions

41. Interpreting Covariance
If cov(x,y) > 0, x and y are positively correlated.
If cov(x,y) < 0, x and y are inversely correlated.
If cov(x,y) = 0, x and y are uncorrelated (independent variables always have zero covariance, but zero covariance alone does not guarantee independence).

42. Issues with stopping criteria
Decide on a stopping criterion:
a) Too specific and it will produce small, underfitted trees - local minima
b) Loose criteria produce large, overfitted trees

43. Kernel Trick
i) Transforming some datasets can make them linearly separable
ii) This can be done more rigorously using a mapping function (kernel) which transforms one given space into another
iii) Transformation of functions to give easier-to-solve problems

44. K-fold cross validation variants
K is commonly set to 5 or 10, based on experience.
- Stratification
- Each subset has the same proportion of data from each class
- Leave-one-out CV
- Fold size 1
- The best use of the data, and the most expensive
45. K-means clustering
K is the number of clusters; K is set by hand.

1. Set a suitable number, K, of clusters
2. Randomly assign the first K data points to be the centroids of the K clusters
3. Loop:
3.1. Assign each data point to the closest cluster centre
3.2. Re-compute the cluster centres as the mean of their assigned points
3.3. Terminate the loop if no assignment changes
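A minimal K-means sketch following the steps above, for 1-D data with K = 2; the data points are an arbitrary illustration.

```python
def kmeans(data, k):
    centroids = data[:k]                       # step 2: first K points as centroids
    while True:                                # step 3: loop
        # step 3.1: assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[j].append(x)
        # step 3.2: recompute each centroid as the mean of its cluster
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:                   # step 3.3: stop when nothing changes
            return centroids, clusters
        centroids = new

print(kmeans([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], k=2))
```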


46. K-Means Formula
Selecting which of 2 clusters to allocate the vector $x_{i}$ to.

Calculate the distance from the point to the mean of each cluster:

$d_{i1} = \sqrt{(x_{i}-c_{1})^{T} (x_{i}-c_{1})}$
$d_{i2} = \sqrt{(x_{i}-c_{2})^{T} (x_{i}-c_{2})}$

If $d_{i1} < d_{i2}$, allocate $x_{i}$ to $c_{1}$.
If $d_{i1} > d_{i2}$, allocate $x_{i}$ to $c_{2}$.

47. K-Nearest Neighbour Learning
i) Select the K points in the training set that are nearest to the data point you need to classify
- Requires a distance metric for deciding how close points are
- Euclidean distance:

$d(p,q) = \sqrt{\sum_{i=1}^{n}(q_{i} - p_{i})^{2}}$

48. KNN Sample Space
i) Distance comes from the distance metric
ii) Classification is the majority class of the nearest neighbours
iii) Can weight the neighbours by their proximity
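A minimal sketch of cards 47-48: Euclidean distance plus a majority vote. The labelled points and the query are arbitrary illustrations.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(train, query, k=3):
    """train: list of (point, label); returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
print(knn_classify(train, (0.2, 0.1)))  # "a": two of the three nearest are "a"
```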
49. Kohonen Self-Organizing Maps
i) Preserve similarity between examples
- Examples that are close in the n-dimensional input space will be close in the m-dimensional map space

Structure:
$x_{i}$ - input vector features
$N_{j}$ - nodes / map
$w_{i,j}$ - the input vector is fully connected to each node

The node weights are similar to k-means centroids.

50. Limitations of Back-propagation
i) Multiple hidden layers
ii) Gets stuck in local optima
- Start the weights from random positions
iii) Slow convergence to the optimum
- A large training set is needed
iv) Only uses labelled data
v) Error attenuation with deep nets

51. Linear Decision Boundary
Partitions the input space into discrete regions.
- Each region is associated with a class label
i) In 2D it is a line
ii) In 3D it is a plane
iii) In n-D it is a hyperplane

$g(x) = w^{T}x + b$ - where w is the weight vector and b is a constant

For boundary classification:
if $g(x) > 0$ then class 1
if $g(x) < 0$ then class 0

52. Markov decision problem consists of
S - set of states
A - set of actions
T - transition function
R - reward function

53. Model Based Algorithms
i) Explicitly estimate T
ii) Use T to improve the policy

54. Model Free Methods
i) Approximate value functions from experience
ii) Do not model the transition function explicitly

55. Modelling a 1D Gaussian
Given parameters with mean $\mu$ and variance $\sigma^{2}$, we can compute the probability of a data point under the Gaussian model.

The probability of data point $x_{i}$ for model $a$ is:

$P(x_{i} | a) = \frac{1}{\sqrt{2 \pi \sigma_{a}^{2}}} \exp\left(-\frac{(x_{i} - \mu_{a})^{2}}{2 \sigma_{a}^{2}}\right)$
56. Models
i) Anything that can be used to predict the results of actions
- Simulated experience
ii) Distribution models
- Provide probability distributions over successor states
iii) Sample models
- Return a single successor state drawn from the appropriate probability distribution

57. Monte-Carlo Methods
Assume we don't know the transition function, T.
- Model free

Evaluate the policy based on experience:
- Obtain episodes by following policy $\pi$
- Start at state $s$ and take actions until the goal is reached

Use the average discounted reward.


58. Monte-Carlo update problems
i) Different successor states with different probabilities and values
ii) Slower convergence, but more stable values

59. Neural Gases
i) Neural gases are connectionist clustering methods like SOMs, but without topology
ii) Also use feature vectors with weights $w_{i,j}$
iii) Update all feature vectors according to each sample in the training set
- Updates are based on a vector ranking index, k, reflecting relative Euclidean distance
- The neighbourhood range, $\lambda$, decreases over time
- $\Delta w_{i,k} = \alpha e^{-k/ \lambda} |x_{i}- w_{i,k}|$
iv) Converge more robustly than k-means and SOMs

60. N-fold cross validation
i) Divide the data randomly into N sets (folds) of equal size
ii) Leave each subset out during training and test on that subset
iii) Calculate the mean performance across the N folds

61. Partial Observability
Commonly we cannot uniquely identify the state of the world, because of:
- Limited sensors
- Sensor noise
- Historical events

So we need memory to tell states apart.
Replace states with observations.

62. Pattern Classification
i) Partitions the input space
ii) May have multiple input dimensions
iii) May have multiple output classes
iv) The decision boundary depends on the classifier

Trained with supervised learning.

63. Perceptron Algorithm
1. Initialise the weights
2. Loop and compute the perceptron output $z \in \{1,0\}$
3. After each presentation, update the weights:

$\Delta w_{i} = \eta (t-z) x_{i}$

where $\eta$ = learning rate, t = target, z = current output.

64. Perceptron algorithm comments
If the data is linearly separable:
- The decision boundary will classify all points
- The algorithm stops when no points are incorrectly classified
- Convergence to a solution is guaranteed

Two update schemes:
i) Batch
- All data is presented
- The weights are updated at the end
ii) Sequential
- Data is presented one sample at a time
- The weights are updated after each sample

65. Perceptron learning rule
An algorithm for supervised learning of binary classifiers.

$net = \sum_{i=1}^{n} w_{i} x_{i} = \bar{w}\bar{x}$
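A minimal sequential-perceptron sketch for cards 63-65; the AND-gate data, learning rate and epoch count are arbitrary illustrations.

```python
def train_perceptron(samples, eta=0.1, epochs=20):
    """samples: list of (inputs, target). Returns weights; the last one is the bias."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                      # step 1: initialise the weights
    for _ in range(epochs):                  # step 2: loop
        for x, t in samples:
            x = list(x) + [1.0]              # fixed input of 1 for the bias weight
            z = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            # step 3: delta w_i = eta * (t - z) * x_i after each presentation
            w = [wi + eta * (t - z) * xi for wi, xi in zip(w, x)]
    return w

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND: separable
print(train_perceptron(samples))  # a boundary that classifies all four points
```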
66. Planning
i) Interpreted differently in different areas
ii) Take a model
iii) Produce or improve a policy

67. Pole Balancing
i) MDP
- The state includes the cart position, pole angle, cart speed, and rotational speed of the pole
ii) POMDP
- The state includes only the position and angle
- The algorithm must implicitly estimate the speeds based on a memory of the observed positions and angles

68. Policies
The agent's probability of choosing a given action in a given state.

Commonly based on a value function and an action selection mechanism:

i) Greedy action selection
- Always select an action with the highest value
- $\epsilon$-greedy: a fixed probability of a random action
ii) Softmax
- Probabilistic
- The highest probability goes to the action with the highest value
- Non-zero probability for all other options
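A minimal sketch of card 68's two action-selection mechanisms; the Q-values, epsilon and temperature are arbitrary illustrations.

```python
import random
from math import exp

def epsilon_greedy(q, epsilon=0.1):
    """Greedy selection, with a fixed probability epsilon of a random action."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda i: q[i])

def softmax(q, tau=1.0):
    """Probabilistic: highest probability to the highest value, non-zero to all."""
    weights = [exp(v / tau) for v in q]
    return random.choices(range(len(q)), weights=weights)[0]

q = [0.2, 0.5, 0.1]        # toy action values for one state
print(epsilon_greedy(q))   # usually action 1, occasionally a random action
print(softmax(q))          # action 1 is most likely, but every action has a chance
```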
69. Policy Evaluation
Computing the value function $V_{\pi}$ for policy $\pi$.
Start with an arbitrary initial approximation, $V^{0}$.
Repeatedly apply the Bellman equation as an update rule.
- Dynamic programming

70. POMDP and memory
i) Constructivist approach
- Doesn't require access to the state space or transition functions
ii) LTM
- Previously seen sequences of observations, actions and rewards
- Previous episodes
iii) STM
- The recently seen sequence of observations, actions and rewards
iv) State identification
- Implicit
- Best matches between LTM and STM

71. Probability function
Maps all possible values of a variable to their respective probabilities.

Properties:
- P(x) is a number between 0 and 1.0
- The area under a probability function is always unity

72. Pruning
Pruning overcomes stopping-criteria issues by:
i) Using loose stopping criteria, or growing a full tree
ii) Removing sub-branches that are not contributing significantly
iii) It is desirable to trade accuracy for simplicity

73. Purity Metrics
i) Quality in a decision tree is related to the purity of a given set S
- Entropy and Gini measure the diversity of discrete data
- Variance measures the diversity of continuous data
ii) Purity is measured in terms of the probability $p_{c}$ of each class $c$ present in the given set $S$: $p_{c} = \frac{|c|}{|S|}$

74. Q-Learning Equation
$Q(s_{t},a_{t}) \leftarrow Q(s_{t},a_{t}) + \alpha [r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_{t},a_{t})]$
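A minimal sketch of card 74's update rule on a toy Q-table; the states, actions, reward and the alpha/gamma values are arbitrary illustrations.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Two states with two actions each, all values initialised to zero.
Q = {s: {a: 0.0 for a in ("left", "right")} for s in (0, 1)}
q_update(Q, s=0, a="right", r=1.0, s_next=1)
print(Q[0]["right"])  # 0.1: one temporal-difference step towards the target
```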
75. Radial Basis Functions
RBFs are a natural extension of coarse coding.
- A collection of Gaussians
- The value of each feature varies with the distance from the mean for that feature
- The value is between 0 and 1

76. Random forest
i) Create B training sets using tree bagging
ii) At each split, select a random set of features to split on (attribute/feature bagging)
iii) Training set size - N
iv) Number of trees - L
v) Number of features - D
vi) Choose a number of random features d < D
vii) Select the d features randomly (with replacement)

77. Receiver Operating Characteristic (ROC)
Considers the role of the threshold or bias in classification. Certain systems require a higher identification rate and care less about false positives (e.g. earthquake detection).

78. Recurrent Neural Networks for POMDPs
Use RNNs to encode Q-functions.
- The recurrent connection weights encode observed sequences (LTM)
- The feed-forward values represent recently observed sequences (STM)

79. Reinforcement Learning
The machine can generate actions $a_{1},a_{2},a_{3},...,a_{n}$ that affect its environment, and receives a reward or punishment for them. Its goal is to learn actions that maximise the long-term reward.

80. Replicated feature approach
Use many different copies of the same feature detector with different positions.
- Could also replicate across scale and orientation
- Replication greatly reduces the number of free parameters to be learned
- Use several different feature types, each with its own map of replicated detectors

81. Shannon Entropy
$Entropy(S) = -\sum_{c} p_{c} \log_{2} p_{c}$
(See the purity-metric sketch after card 32.)

82. SOM Accuracy
Representation/quantization error.
- The sum of the Euclidean distances between the inputs and the weights of the corresponding winning nodes:

$E = \frac{1}{NM} \sum_{j=1}^{N} \sqrt{(x_{1} - w_{1,j})^{2} + ... + (x_{m} - w_{m,j})^{2}}$

83. State space coding
Discrete representations of a problem state.
- Assumes the state space is continuous and two-dimensional
- One kind of feature could be a circle in this state space
- Coarse coding
- The value is between 0 and 1
84. Student T Testing
i) Shows the significance between two means
- The probability the difference happened by chance
ii) The p-value is the probability that the results from the sample data occurred by chance
- Low = good
iii) Unpaired
- Our samples are not pair-wise related
iv) One-tail
- Only interested in whether the higher curve is greater than the lower
v) Unequal variance

85. Supervised Learning
The machine is given $y_{1},y_{2},y_{3},...,y_{n}$ as outputs, and its goal is to learn to reproduce them from the inputs.

86. Supervised Learning Examples
i) Classifying input data
a) Character recognition
ii) Regression

87. Temporal Difference Learning
i) Use experience instead of a model, T
ii) Dynamic programming
- Estimate values based on other estimates
iii) Wait one step until the return following a visit to state $s$ and action $a$ is known
iv) Use the observation to update the value function

The temporal-difference update of the value function:

$V(s_{t}) \leftarrow V(s_{t}) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_{t})]$
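A minimal sketch of card 87's TD(0) update for one toy step; the states, reward and the alpha/gamma values are arbitrary illustrations.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) += alpha * (r + gamma * V(s') - V(s)): learn a guess from a guess."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {"A": 0.0, "B": 0.5}              # current value estimates
td_update(V, "A", r=1.0, s_next="B")  # uses the existing estimate of B
print(V["A"])                         # 0.145 = 0.1 * (1.0 + 0.9 * 0.5)
```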
88. Tile coding
Represents the feature space as a 2D continuous space.

89. Topological Maps
i) The neighbourhood function means all nodes are included in the solution
ii) Nodes with similar weights are close to each other in the map
iii) In colour space this creates uniform colour maps
iv) In Cartesian space, nodes that are close on the map are also close in space

90. Training SOMs - accommodation
The weights of the winning node are updated:

$\Delta w_{i,j} = \alpha (x_{i} - w_{i,j})$

91. Training SOMs - Competition
i) Competitive networks
ii) Individual nodes compete with each other
iii) Nodes compete on Euclidean distance in feature space
iv) One node is selected as the winner

92. Training SOMs - Neighbourhood
i) The nodes that adjoin the winner are also updated
ii) Recruits more nodes to cover dense areas of the input space
iii) The learning rate decreases with distance from the winner
iv) The learning rate and the size of the neighbourhood also decrease with time
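A minimal NumPy sketch of one training step for cards 90-92: competition by Euclidean distance, then accommodation of the winner and its map neighbours. The 1-D map, weights, input and rates are arbitrary illustrations.

```python
import numpy as np

def som_step(weights, x, alpha=0.5, sigma=1.0):
    """weights: (nodes, features) array for a 1-D map; x: one input vector."""
    # competition (card 91): the node closest to x in feature space wins
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))
    # neighbourhood (card 92): update strength decays with map distance
    map_dist = np.abs(np.arange(len(weights)) - winner)
    h = np.exp(-(map_dist ** 2) / (2 * sigma ** 2))
    # accommodation (card 90): delta w = alpha * (x - w), scaled by h
    weights += alpha * h[:, None] * (x - weights)
    return winner

weights = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
print(som_step(weights, np.array([0.9, 1.0])))  # node 2 wins
print(weights)                                  # nodes pulled towards the input
```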
93. Transition function
A model of the environment.

94. Tree Bagging
i) Ensemble learning
- Overcomes overfitting
ii) Creates B different trees
- From B random training sets of n samples drawn from the training set
iii) To classify:
- Classification: the majority vote from all B trees
- Regression: the mean value

95. Types of cluster models
i) K-means
- Clusters do not overlap
- Data belongs to one cluster only
ii) Mixture of Gaussians (soft clustering)
- Clusters may overlap
- Data may exhibit non-binary strengths of association to all clusters
- A probabilistic method
- Each cluster is a generative model
- Clusters have parameters

96. Types of data
i) Discrete data
- Integers
- Dice values
- A {H,T} coin
ii) Continuous: any value
- Real numbers
- Blood pressure

97. Types of decision trees
i) Classification trees have leaves with discrete classes
ii) Regression trees have leaves with numerical values

98. Types of Deep Networks
i) Convolutional neural networks
a) Alternating convolutional layers, each followed by a pooling layer
b) The output uses a traditional MLP
ii) Deep belief networks
a) Stacked Boltzmann machines
b) A classification output layer

99. Unsupervised Learning
The machine should build a representation of $x$ that can be used for decision making.
100. Unsupervised Learning Examples
i) Clustering - find a useful representation
ii) Data compression
iii) Probability density modelling
iv) Outlier detection
v) Unsupervised classification (recognising objects in the environment)

101. Value functions
i) They say how good a given state is
ii) Evaluated in terms of the expected reward

The Bellman equation for $V_{\pi}$:
- Expresses the relationship between the value of a state and the values of its successor states
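Card 101 refers to the Bellman equation without writing it out; a standard form, assuming the usual notation with transition function T and reward function R from card 52, is:

$V_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_{\pi}(s')]$

It expresses the value of a state as the expected immediate reward plus the discounted value of the successor state, averaged over the policy's action choices.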
102. Variance Reduction
i) For regression trees
ii) Data points do not have a discrete class but a continuous value
iii) Calculate the variance of the node before the split
iv) Compare it with the sum of the variances in the new nodes

103. Vector Quantization
i) Unsupervised learning
ii) Represents an n-dimensional space as an m-dimensional one
- m < n

K-means clustering:
1. Initialise k centroids (feature vectors)
2. While not converged:
2.1. For each training sample, $x_{i}$
2.2. Assign $x_{i}$ to the nearest centroid using Euclidean distance
2.3. Update all the centroids using the new training sample distribution
(See the sketch after card 45.)
104. Von Mises distribution
A circular continuous probability distribution, given by:

$p(\theta) = \frac{1}{2 \pi I_{0}(k)} \exp(k \cos(\theta - \mu))$

$\theta$ is an angle
$\mu, k$ are the mean and width
$I_{0}(k)$ is the modified Bessel function of order 0, evaluated at k

Useful for modelling circular data - angular tuning.
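A minimal sketch of the density above, assuming SciPy is available for the Bessel function; the mu and k values are arbitrary illustrations.

```python
from math import cos, exp, pi
from scipy.special import i0  # modified Bessel function of order 0

def von_mises_pdf(theta, mu=0.0, k=1.0):
    """p(theta) = exp(k * cos(theta - mu)) / (2 * pi * I0(k))."""
    return exp(k * cos(theta - mu)) / (2 * pi * i0(k))

print(von_mises_pdf(0.0))  # ~0.3418, the density at the mean direction
print(von_mises_pdf(pi))   # ~0.0463, on the opposite side of the circle
```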
