1. Convergence
Read this: https://www.researchgate.net/post/How_to_proof_the_convergence_properties_of_a_metaheuristic_algorithm

2. Data generating distribution
Now that we have defined our loss function, we need to consider where the data (training and test) comes from. The model that we will use is the probabilistic model of learning. Namely, there is a probability distribution D over input/output pairs. This is often called the data generating distribution. A useful way to think about D is that it gives high probability to reasonable (x, y) pairs, and low probability to unreasonable (x, y) pairs. See the sketch below.
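To make the data generating distribution concrete, here is a minimal Python sketch (not from the original notes; the function name sample_from_D, the Gaussian inputs, and the 10% label noise are all illustrative assumptions). The point is that both the training and the test set are drawn from the same D:

```python
# Sketch of a hypothetical data generating distribution D over (x, y)
# pairs: "reasonable" pairs (label matches the sign of x) get high
# probability, "unreasonable" pairs (flipped labels) get low probability.
import numpy as np

rng = np.random.default_rng(0)

def sample_from_D(n):
    """Draw n (x, y) pairs from the hypothetical distribution D."""
    x = rng.normal(size=n)        # inputs drawn from a Gaussian
    y = (x > 0).astype(int)       # reasonable labels: the sign of x...
    flip = rng.random(n) < 0.1    # ...corrupted by 10% label noise
    y[flip] = 1 - y[flip]
    return x, y

x_train, y_train = sample_from_D(100)  # training set drawn from D
x_test, y_test = sample_from_D(50)     # test set drawn from the same D
```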
3. "Divide & Conquer" algorithm
A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-problems of the same (or related) type (divide), until these become simple enough to be solved directly (conquer). See the sketch below.
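A minimal sketch of the idea, using merge sort as the classic example (illustrative code, not from the original notes):

```python
# Divide and conquer: merge sort splits the list in two (divide),
# recurses until a list of length <= 1 is trivially sorted, then
# combines the sorted halves (conquer).
def merge_sort(items):
    if len(items) <= 1:                # simple enough: solve directly
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])     # divide: two sub-problems
    right = merge_sort(items[mid:])    #         of the same type
    merged, i, j = [], 0, 0            # conquer: merge sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 4, 7, 1, 3, 2, 6]))  # [1, 2, 2, 3, 4, 5, 6, 7]
```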
4. Expected Loss
The loss l (loss function) we expect to incur, given a data generating distribution D. See the sketch below.
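Since D is unknown in practice, the expected loss can only be estimated when we can sample from a (hypothetical) D. A minimal sketch, reusing the illustrative sample_from_D from the item 2 sketch:

```python
# Monte Carlo estimate of the expected loss of a predictor f under D:
# the average of l(y, f(x)) over many fresh draws from D.
def zero_one_loss(y_true, y_pred):
    """l(y, y_hat): 1 if the prediction is wrong, 0 if it is right."""
    return float(y_true != y_pred)

def f(x):                 # some fixed predictor (illustrative)
    return int(x > 0)

xs, ys = sample_from_D(100_000)
expected_loss = sum(zero_one_loss(y, f(x)) for x, y in zip(xs, ys)) / len(xs)
print(expected_loss)      # approx 0.1, matching the 10% label noise
```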
5. Formal definition of induction machine learning
Given (i) a loss function l and (ii) a sample of data drawn from some unknown data generating distribution D, you must compute a function f that has low expected error e over D with respect to l.

6. Generalization
The ability to identify the rules, i.e. to generalize, allows the system to make predictions on unknown data.

7. Greedy algorithm
A greedy algorithm works by making the decision that seems most promising at any moment; it never reconsiders this decision, whatever situation may arise later. Greedy algorithms are shortsighted in their approach, in the sense that they take decisions on the basis of the information at hand, without worrying about the effect these decisions may have in the future.

8. Hypercube
A geometrical figure in four or more dimensions which is analogous to a cube in three dimensions.

9. Hyperparameter
A parameter that controls other parameters in the model. It cannot naively be adjusted using the training data, but needs a validation set (development data), because if we tune it on the training data we risk overfitting, whereas if we tune it on the test data we break the rule that test data always has to be unseen.

10. Hyperspheres
A geometrical figure in four or more dimensions which is analogous to a sphere in three dimensions.

11. Inductive bias / Learning bias
The set of assumptions that the learner uses to predict outputs given inputs that it has not encountered, e.g. the maximum margin bias (SVM), the nearest neighbors bias (knn), etc.

12. knn scaling
You should normalize when the scale of a feature is irrelevant or misleading, and not normalize when the scale is meaningful. K-means considers Euclidean distance to be meaningful. If a feature has a big scale compared to another, but the first feature truly represents greater diversity, then clustering in that dimension ought to be penalized. See the sketch below.
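A minimal numpy sketch of one common normalization choice, standardization, showing why the big-scale feature otherwise dominates the Euclidean distance (all numbers illustrative):

```python
# Standardize features to mean 0 and std 1 so that Euclidean distance
# is not dominated by whichever feature happens to have the big scale.
import numpy as np

X = np.array([[170.0, 1.2],   # e.g. a large-scale and a small-scale feature
              [160.0, 3.4],
              [180.0, 2.1]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # per-feature standardization

# Before scaling, the distance is dominated by the first column;
# after scaling, both features contribute comparably.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_std[0] - X_std[1]))
```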
13. Loss function
Tells us how 'bad' a system's prediction is in comparison to the truth, i.e. it can be seen as a measure of error.

14. Memorization
Memorizing given facts is an obvious task in learning. This can be done by storing the input samples explicitly, or by identifying the concept behind the input data and memorizing their general rules.

15. Reasons for failure in ML
*Noise* in the training data: 1) at feature level (e.g. incorrect values such as typos) or 2) at label level (e.g. the wrong label is assigned to a set of features). *Insufficient features*: there are not enough features / data available for a learning algorithm to work. *More than one correct answer*: there might exist more than one correct answer. *Inductive bias*: the bias is too far away from the concept that is being learned.

16. Regularization
Helps to avoid overfitting by reducing the magnitude of the weights the model assigns to certain features.

17. Shallow decision tree
We limit the depth of the decision tree, which keeps the model simpler and helps to prevent overfitting.

18. Training error
The training error is simply our average error over the training data. See the sketch below.
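A minimal sketch of the training error as an average loss, reusing the illustrative zero_one_loss, f, and training sample from the earlier sketches:

```python
# Training error: the average loss of predictor f over the training set.
train_error = sum(zero_one_loss(y, f(x))
                  for x, y in zip(x_train, y_train)) / len(x_train)
print(train_error)   # average zero-one loss on the training data
```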