Overview
 Introduction
 Decision Trees
 K-Nearest Neighbors
 Linear Regression
 Statistical Learning
 Mixture of Gaussians
 Logistic Regression
 Generalized Linear Models
 Artificial Neural Networks
 Kernel Methods
 Gaussian Processes
 Support Vector Machines
 Hidden Markov Models
 Deep Neural Networks
 Convolutional Neural Networks
 Recurrent and Recursive Neural Networks
 Ensemble Learning
 Stream Learning
These notes are a result of my preparation for a midterm exam in Pascal Poupart’s course on Machine Learning at the University of Waterloo in the Winter 2016 term. The contents are therefore based on the corresponding presentation slides available online.
Introduction
Definition by Tom Mitchell (1988): “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance for tasks in T, as measured by P, improves with experience E.”
Inductive Learning: given a training set of examples of the form $(x, f(x))$, return a function $h$ (hypothesis) that approximates $f$ (true underlying function).
The quality measure for hypotheses is generalization. A good hypothesis will generalize well, i.e., predict unseen examples correctly. Ockham’s razor suggests preferring the simplest hypothesis consistent with the input data.
Two different types of inductive learning:
 Classification: range/output space of $f$ is categorical (or discrete)
 Regression: range/output space of $f$ is continuous
Hypothesis space $H$: set of all possible $h$
Consistency: $h$ is consistent if it agrees with $f$ on all examples; it is not always possible to find a consistent $h$ (e.g., due to an insufficient $H$ or noisy input data)
Realizability: a learning problem is realizable if and only if $f$ is in $H$
In general, we will observe a tradeoff between expressiveness (i.e., the size of $H$) and the complexity of finding a good $h$.
Overfitting: given a hypothesis space $H$, a hypothesis $h \in H$ is said to overfit the training data if there exists some alternative hypothesis $h’ \in H$ such that $h$ has smaller error than $h’$ over the training examples, but $h’$ has smaller error than $h$ over the entire distribution of instances.
$k$-fold cross-validation: run $k$ experiments, each time putting aside $\frac{1}{k}$ of the data to test on and, finally, compute the average accuracy of the experiments
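A minimal sketch of this procedure in plain Python (`learn` and `accuracy` are placeholders for any training routine and metric, not from the slides):

```python
import random

def k_fold_cv(data, k, learn, accuracy):
    """Estimate generalization accuracy by k-fold cross-validation."""
    data = data[:]                           # copy so shuffling stays local
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal parts
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        h = learn(train)                     # train on the other k-1 folds
        scores.append(accuracy(h, test))
    return sum(scores) / k                   # average accuracy

# toy usage: "learn" the majority label, measure 0/1 accuracy
data = [(0, 'a')] * 6 + [(1, 'b')] * 4
learn = lambda train: max(set(l for _, l in train), key=[l for _, l in train].count)
accuracy = lambda h, test: sum(l == h for _, l in test) / len(test)
score = k_fold_cv(data, 5, learn, accuracy)
assert 0.0 <= score <= 1.0
```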
Decision Trees
Decision trees (a.k.a. Classification and Regression Trees [CART]) represent disjunctions (OR) of conjunctions (AND) of constraints on attribute values.
Decision trees have the following structure:
 nodes: attributes
 edges: attribute values
 leaves: classes / regression values
Quality measure for decision trees:
 small size
 high consistency
Greedy induction of a decision tree:
 depth-first-search-like construction
 a good attribute splits the examples into subsets that are ideally all from the same class; in other words, it minimizes the residual error
Residual error for classification:
|  | Error Frequency | Gini Index | Entropy |
| --- | --- | --- | --- |
| Definition | $Q_\tau=num(\tau)-\max_k num(k)$ | $Q_\tau=\sum_kp_\tau(k)(1-p_\tau(k))$ | $Q_\tau=-\sum_kp_\tau(k) \ \log_2 \ p_\tau(k)$ |
| Explanation | Number of examples in leaf $\tau$ minus maximum number of examples in leaf $\tau$ that belong to any class $k$ | Expected misclassification when choosing the class according to $p_\tau(k)$ | Expected # of bits to encode the class of an instance chosen at random according to $p_\tau(k)$ |
Variables:
 $\tau$: index for leaf $\tau$
 $k$: index for class $k$
 $num(\tau)$: number of examples in leaf $\tau$
 $num(k)$: number of examples in leaf $\tau$ belonging to class $k$
 $p_\tau(k) = \frac{num(k)}{num(\tau)}$
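The three classification measures can be sketched in Python from per-class counts in a leaf (the function name `impurities` is my own):

```python
import math

def impurities(counts):
    """Error frequency, Gini index and entropy for one leaf,
    given the number of examples of each class in that leaf."""
    n = sum(counts)                                  # num(tau)
    p = [c / n for c in counts]                      # p_tau(k)
    error = n - max(counts)                          # num(tau) - max_k num(k)
    gini = sum(pk * (1 - pk) for pk in p)
    entropy = -sum(pk * math.log2(pk) for pk in p if pk > 0)
    return error, gini, entropy

# a pure leaf has zero impurity under all three measures
assert impurities([10, 0]) == (0, 0.0, 0.0)
# a 50/50 leaf: error 5, Gini 0.5, entropy 1 bit
assert impurities([5, 5]) == (5, 0.5, 1.0)
```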
Residual error for regression:
 Let $t_n = f(x_n)$ be the target for the $n^{th}$ example.
 Let $y_\tau$ be the value returned by leaf $\tau$.
 Let $R_\tau$ be the set of examples in leaf $\tau$.
 Euclidean error for leaf $\tau$: $Q_\tau=\sum_{n \in R_\tau}(t_n-y_\tau)^2$
Choosing the best attribute for the next decision layer in the tree:
 In leaf $\tau$, choose attribute $A$ that reduces the residual error the most when expanded:
 $A^*=argmax_A \ Q_\tau-\sum_{a \in A}p_\tau(A=a)Q_{\tau a}$, where
 $p_\tau(A=a)=\frac{num(A=a)}{num(\tau)}$ denotes the proportion of examples with value $a$ (in attribute $A$) inside node $\tau$
 $\tau a$ indexes the node reached by following the edge for attribute value $a$, starting from node $\tau$
Techniques to avoid overfitting:
 Stop learning when the curve for testing accuracy, plotted against the tree size, has reached its global peak; in practice, this is sometimes hard to achieve because the curve might exhibit high fluctuation
 Pruning of statistically irrelevant nodes in a bottomup fashion
 Remove nodes that improve testing accuracy by less than some threshold
 Regularization: add a penalty term that reflects the tree complexity (e.g., $T=$ #leaves in the tree) and remove leaves with a negative “regularized” error reduction: $Q_\tau-\sum_{a \in A}p_\tau(A=a)Q_{\tau a}-\lambda T$
K-Nearest Neighbors
In the limit, single-attribute thresholding to construct decision trees for attributes with continuous inputs will lead to a full tree with one input example per leaf. Decision boundaries will always be axis-aligned. A better approach without this restriction is $K$-Nearest Neighbors.
Approach:
 Let $knn(x)$ be the $K$ nearest neighbors of $x$ according to distance $d$
 Label $y_x=mode(\{y_{x’} \mid x’ \in knn(x)\})$, i.e., the most frequent label among the $K$ nearest neighbors
 $K$ controls the degree of smoothing and can be optimized using $k$-fold cross-validation (if $K$ is too small this will lead to overfitting; if $K$ is too large this will lead to underfitting)
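A small stdlib-only sketch of this approach (`knn_classify` is a made-up name):

```python
import math
from collections import Counter

def knn_classify(train, x, K):
    """Label x with the mode of its K nearest neighbors (Euclidean distance)."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], x))[:K]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# two clusters of labeled points
train = [((0, 0), 'a'), ((0, 1), 'a'), ((5, 5), 'b'), ((5, 6), 'b'), ((6, 5), 'b')]
assert knn_classify(train, (1, 0), 3) == 'a'   # near the 'a' cluster
assert knn_classify(train, (5, 5), 3) == 'b'   # inside the 'b' cluster
```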
Comparison of complexity between decision trees and $K$-Nearest Neighbors with respect to:
 $N$: size of training set
 $M$: number of attributes
|  | Training | Testing |
| --- | --- | --- |
| Decision Tree | $O(M^2N)$ | $O(M)$ |
| $K$-Nearest Neighbors | $O(MN)$ | $O(MN)$ |
A good visualization of the decision boundaries for the two-dimensional 1-Nearest-Neighbor case assuming Euclidean distance is the Voronoi diagram.
Linear Regression
Data: $\{(x_1, t_1),(x_2,t_2),…,(x_N,t_N)\}$ where $x_n \in \mathbb{R}^D$ and $t_n \in \mathbb{R}$
Problem: find linear hypothesis $h$ that maps $x$ to $t$; in other words, try to find a weight vector $w \in \mathbb{R}^{D+1}$ so that the error for $h(x,w)=w^T\overline{x}$ with $\overline{x}=\begin{pmatrix} 1 \\ x \end{pmatrix}$ over all $x$ in the dataset is minimal
Solution: $ Aw=b $ with:
 $A=\sum_{n=1}^N\overline{x}_n\overline{x}_n^T$ (invertible if the training instances span $\mathbb{R}^{D+1}$)
 $b=\sum_{n=1}^Nt_n\overline{x}_n$
Tikhonov Regularization: a technique applied to avoid a form of overfitting where small changes to the input data lead to big changes in the learned weight vector: $(\lambda I+A)w=b$. The greater $\lambda$, the smaller the magnitude of $w$ will be.
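Both the plain solution and the regularized variant can be sketched with NumPy (`fit_linear` is my own name; this assumes the data fits in memory):

```python
import numpy as np

def fit_linear(X, t, lam=0.0):
    """Solve (lam*I + A) w = b with A = sum x̄ x̄ᵀ and b = sum t_n x̄_n,
    where x̄ = (1, x) prepends a bias feature."""
    Xb = np.hstack([np.ones((len(X), 1)), X])    # x̄ for every example
    A = Xb.T @ Xb                                # sum of outer products
    b = Xb.T @ t
    return np.linalg.solve(lam * np.eye(A.shape[0]) + A, b)

# noiseless data from t = 2x + 1 is recovered exactly without regularization
X = np.array([[0.], [1.], [2.], [3.]])
t = np.array([1., 3., 5., 7.])
w = fit_linear(X, t)
assert np.allclose(w, [1.0, 2.0])
# Tikhonov regularization shrinks the magnitude of w
w_reg = fit_linear(X, t, lam=10.0)
assert np.linalg.norm(w_reg) < np.linalg.norm(w)
```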
Statistical Learning
Idea: learning simply reduces the uncertainty in our knowledge of the world
Marginal Probability Distribution: specification of a probability for each event in our sample space; all probabilities must sum up to $1$
Joint Probability Distribution: specification of probabilities for all combinations of events: $Pr(A=a \wedge B=b)$ for all $a$, $b$
Marginalization (sumout rule): $Pr(A=a)=\sum_bPr(A=a \wedge B=b)$
Conditional Probability: fraction of worlds in which $B$ is true that also have $A$ true: $Pr(A \mid B)=\frac{Pr(A \wedge B)}{Pr(B)}$; also $Pr(A \wedge B)=Pr(A \mid B) \ Pr(B)$
Bayes’ Rule: $Pr(B \mid A)=\frac{Pr(A \mid B) \ Pr(B)}{Pr(A)}$
Bayesian Inference: $P(H \mid e)=kP(e \mid H) \ P(H)$
 $H$: hypothesis
 $e$: evidence
 $P(H)$: prior probability of $H$
 $P(e \mid H)=\prod_nP(e_n \mid H)$: likelihood of observing $e$, given $H$
 $P(H \mid e)$: posterior probability of $H$, given $e$
 $k$: normalizing factor, applied so that all posteriors sum up to $1$
Bayesian Prediction: $P(X \mid e)=\sum_iP(X \mid h_i) \ P(h_i \mid e)$
Properties of Bayesian Learning: optimal and not prone to overfitting, but potentially intractable if the hypothesis space is large
Approximations of Bayesian Learning:
 Maximum a Posteriori (MAP):
 making predictions, based on the most probable hypothesis $h_{MAP}=argmax_{h_i}P(h_i \mid e)$
 less accurate than Bayesian prediction, but both converge in accuracy as data increases
 controlled overfitting (prior can be used to penalize complex hypotheses)
 MAP for linear regression leads to regularized least square problem
 Maximum Likelihood (ML):
 MAP with uniform prior: $h_{ML}=argmax_{h_i}P(e \mid h_i)$
 less accurate than Bayesian and MAP prediction, but all three converge in accuracy as data increases
 prone to overfitting
 ML for linear regression leads to nonregularized least square problem
BiasVariance Decomposition for linear regression: $expected \ loss = (bias)^2 + variance + noise$
 bias increases as regularization parameter $\lambda$ increases
 variance decreases as regularization parameter $\lambda$ increases
 noise is constant
Mixture of Gaussians
Purpose: linear classification technique
Assumptions:
 $Pr(C=c_k)=\pi_k$: prior probability of class $k$
 $Pr(x \mid C=c_k) \propto e^{-0.5(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}$: likelihood of data point $x$, given class $k$, is a Gaussian distribution with the same covariance matrix $\Sigma$ for all classes
 $Pr(C=c_k \mid x)=kPr(x \mid C=c_k)Pr(C=c_k)$: posterior probability of class $k$, given data point $x$
For two classes $A$ and $B$ (using sigmoid):
$Pr(C=A \mid x)=\sigma(w^Tx+w_0)$ where
 $\sigma(a) = \frac{1}{1+e^{-a}}$ (sigmoid function)
 $w=\Sigma^{-1}(\mu_A-\mu_B)$
 $w_0=-0.5\mu_A^T\Sigma^{-1}\mu_A+0.5\mu_B^T\Sigma^{-1}\mu_B+ln\frac{\pi_A}{\pi_B}$
 $\pi_k:$ fraction of training examples that belong to class $k$ (via maximum likelihood)
 $\mu_k:$ empirical mean of all training examples that belong to class $k$ (via maximum likelihood)
 $\Sigma=\frac{S_A+S_B}{N}$: normalized sum of covariance matrices (via maximum likelihood)
 $S_k=\sum_{n \in c_k}(x_n-\mu_k)(x_n-\mu_k)^T$
For multiple classes (using softmax):
$Pr(C=c_k \mid x)=\frac{e^{w_k^T\overline{x}}}{\sum_je^{w_j^T\overline{x}}}$
Prediction: best class $k^*=argmax_kPr(c_k \mid x)$
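A sketch of the two-class case with NumPy, computing the ML estimates above and classifying through the sigmoid (`fit_gaussian_classes` is my own name):

```python
import numpy as np

def fit_gaussian_classes(XA, XB):
    """ML estimates for the two-class mixture-of-Gaussians classifier with a
    shared covariance matrix; returns (w, w0) for Pr(A|x) = sigma(wᵀx + w0)."""
    N = len(XA) + len(XB)
    piA, piB = len(XA) / N, len(XB) / N              # class priors
    muA, muB = XA.mean(axis=0), XB.mean(axis=0)      # empirical means
    S = sum((x - muA)[:, None] @ (x - muA)[None, :] for x in XA) \
      + sum((x - muB)[:, None] @ (x - muB)[None, :] for x in XB)
    Sinv = np.linalg.inv(S / N)                      # shared Sigma^{-1}
    w = Sinv @ (muA - muB)
    w0 = -0.5 * muA @ Sinv @ muA + 0.5 * muB @ Sinv @ muB + np.log(piA / piB)
    return w, w0

sigmoid = lambda a: 1 / (1 + np.exp(-a))

# two well-separated blobs: points near each mean get classified correctly
rng = np.random.default_rng(0)
XA = rng.normal([0, 0], 0.5, size=(50, 2))
XB = rng.normal([4, 4], 0.5, size=(50, 2))
w, w0 = fit_gaussian_classes(XA, XB)
assert sigmoid(w @ np.array([0, 0]) + w0) > 0.5   # near mu_A -> class A
assert sigmoid(w @ np.array([4, 4]) + w0) < 0.5   # near mu_B -> class B
```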
Logistic Regression
Purpose: linear classification technique (can be viewed as a regression where the goal is to estimate a posterior probability which is continuous)
Assumption: $Pr(x \mid C=c_k)$ are members of the exponential family: $Pr(x \mid \Theta_k)=exp(\Theta_k^TT(x)-A(\Theta_k)+B(x))$
Derivation: $Pr(C=c_k \mid x)=\sigma(w^T\overline{x})$: the posterior probability of class $k$ is a sigmoid logistic linear in $x$ (or softmax linear in $x$ for more than two classes)
Idea: learning $Pr(C=c_k \mid x)$ directly by maximum likelihood
Implementation for binary classification:
 $y \in \{0,1\}$
 $w^*=argmax_w\prod_n\sigma(w^T\overline{x}_n)^{y_n}(1-\sigma(w^T\overline{x}_n))^{1-y_n}$
 $ \ \ \ \ \ =argmin_w-\sum_n y_n \ ln(\sigma(w^T\overline{x}_n))+(1-y_n) \ ln(1-\sigma(w^T\overline{x}_n))$
 Derivative: $\nabla L(w)=\sum_n(\sigma(w^T\overline{x}_n)-y_n)\overline{x}_n$
 Solve derivative iteratively for $0$ using Newton’s method: $w_{i+1}=w_i-H^{-1}\nabla L(w_i)$ where
 $H=\overline{X}R\overline{X}^T$ is the Hessian matrix, $\overline{X}$ being the training data matrix in which each column represents one augmented input vector
 $R$ is a diagonal matrix of size $N \times N$ with entries $\sigma_n(1-\sigma_n)$
 $\sigma_n=\sigma(w_i^T\overline{x}_n)$
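A NumPy sketch of this Newton iteration (IRLS), assuming 1-D inputs for brevity; `logistic_newton` is my own name:

```python
import numpy as np

def logistic_newton(X, y, iters=15):
    """Fit logistic regression by Newton's method.
    Columns of Xb are the augmented inputs x̄ = (1, x)."""
    Xb = np.vstack([np.ones(len(X)), X])            # shape (D+1, N)
    w = np.zeros(Xb.shape[0])
    for _ in range(iters):
        s = 1 / (1 + np.exp(-w @ Xb))               # sigma_n for every example
        grad = Xb @ (s - y)                         # sum_n (sigma_n - y_n) x̄_n
        H = Xb @ np.diag(s * (1 - s)) @ Xb.T        # Hessian X̄ R X̄ᵀ
        w = w - np.linalg.solve(H, grad)            # w <- w - H^{-1} grad
    return w

# noisy 1-D labels with an increasing trend around x = 2.5
X = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0., 0., 1., 0., 1., 1.])
w = logistic_newton(X, y)
p = 1 / (1 + np.exp(-(w[0] + w[1] * 5.0)))
assert p > 0.9                                      # confident "1" far right
```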
Generalized Linear Models
Purpose: nonlinear classification/regression
Idea: map inputs to a different space using a set of basis functions and do linear classification/regression in that space
Common basis functions:
 Polynomial: $\theta_j(x)=x^j$
 Gaussian: $\theta_j(x)=e^{-\frac{(x-\mu_j)^2}{2s^2}}$
 Sigmoid: $\theta_j(x)=\sigma(\frac{x-\mu_j}{s})$
 Fourier, Wavelets etc.
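A sketch of the idea with Gaussian basis functions: map scalar inputs into feature space, then run ordinary least squares there (all names and the chosen centers/width are my own):

```python
import numpy as np

def gaussian_features(x, centers, s):
    """Map scalar inputs through Gaussian basis functions (plus a bias)."""
    Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

# nonlinear regression: linear least squares in feature space fits a sine wave
x = np.linspace(0, 2 * np.pi, 50)
t = np.sin(x)
Phi = gaussian_features(x, centers=np.linspace(0, 2 * np.pi, 8), s=0.8)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
pred = Phi @ w
assert np.max(np.abs(pred - t)) < 0.05          # close fit everywhere
```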
Artificial Neural Networks
Purpose: nonlinear classification/regression
Idea: network of units similar to neurons in a human brain
Implementation: the numerical output of unit $j$ is $h(a_j)$, where
 $a_j=\sum_{i}W_{ji}x_i+w_0=W_j\overline{x}$
 $x_i$ is the output of unit $i$
 $W_{ji}$ denotes the strength of the link from unit $i$ to unit $j$
 $h(x)$ is the activation function (e.g., threshold, sigmoid, Gaussian, hyperbolic tangent, identity)
Structures:
 feedforward network (directed acyclic graph)
 recurrent network (directed cyclic graph)
Perceptron: single-layer feedforward network
 Threshold Perceptron Learning:
 done separately for each unit $j$
 for each $(x,y)$ pair, correct weight $W_{ji}$ if incorrect output is produced:
 if output produced is $0$ instead of $1$: $W_{ji}=W_{ji}+x_i$
 if output produced is $1$ instead of $0$: $W_{ji}=W_{ji}-x_i$
 convergence if and only if the dataset is linearly separable
 Sigmoid Perceptron Learning:
 same hypothesis space as logistic regression
Multilayer neural networks: flexible nonlinear models by learning nonlinear basis functions
 examples of 2-layer feedforward networks:
 $h_1$ nonlinear and $h_2$ sigmoid: nonlinear classification
 $h_1$ nonlinear and $h_2$ identity: nonlinear regression
Back Propagation:
 Purpose: learning by iteratively adjusting the network’s weights to minimize output error
 Two phases:
 Forward phase: compute output $z_j$ for each unit $j$
 Backward phase: compute delta $\delta_j$ at each unit $j$:
 if $j$ is an output unit: $\delta_j=h’(a_j)(y_j-z_j)$
 if $j$ is a hidden unit: $\delta_j=h’(a_j)\sum_kw_{kj}\delta_k$ (recursion) where all units $k$ are the ones that receive input from unit $j$
 update weights: $w_{ji} \leftarrow w_{ji} + \alpha \ \delta_j z_i$ (gradient descent on the error, with $\delta_j$ defined as above)
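A sketch of both phases on a tiny 2-layer network ($h_1 = \tanh$, $h_2 =$ identity, squared error), fitting a simple nonlinear target; architecture and hyperparameters are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (4, 2))           # 4 hidden units, input + bias
W2 = rng.normal(0, 0.5, (1, 5))           # output unit, 4 hidden + bias
X = np.linspace(-1, 1, 21)
T = X ** 2                                # nonlinear regression target
alpha = 0.05

for _ in range(3000):
    for x, t in zip(X, T):
        # forward phase: compute output z_j of each unit j
        xb = np.array([x, 1.0])
        z1 = np.tanh(W1 @ xb)
        z1b = np.append(z1, 1.0)
        z2 = (W2 @ z1b)[0]
        # backward phase: deltas, then w_ji <- w_ji + alpha * delta_j * z_i
        d2 = t - z2                              # output unit: h'(a) = 1
        d1 = (1 - z1 ** 2) * (W2[0, :4] * d2)    # hidden deltas via recursion
        W2 += alpha * d2 * z1b
        W1 += alpha * np.outer(d1, xb)

preds = []
for x in X:
    z1 = np.tanh(W1 @ np.array([x, 1.0]))
    preds.append((W2 @ np.append(z1, 1.0))[0])
mse = np.mean((np.array(preds) - T) ** 2)
assert mse < 0.02                         # network learned the parabola
```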
Kernel Methods
Key idea: use a large (possibly infinite) set of fixed, nonlinear functions; normally, the computational complexity depends on the number of basis functions used, but by a “dual trick”, complexity depends on the amount of data.
Kernel function: $k(x, x’) = \phi(x)^T \phi(x’)$
Gram matrix: $K = \Phi^T \Phi$
Prediction for kernelized linear regression: $y(x’) = k(X, x’)^T (K + \lambda I)^{-1} y$ where $k(X, x’)$ is the vector with entries $k(x_n, x’)$ for all training inputs $x_n$ and $y$ is the vector of training targets
Constructing kernels $K$:
 Valid kernels must be positive semidefinite (= all eigenvalues must be $\geq 0$); in other words, $K$ must factor into a product of a transposed matrix with itself ($K = \Phi^T \Phi$)
 There are welldefined rules that can be applied to combine kernels with each other to create new kernels, preserving the property of positive semidefiniteness
 Kernels can also be defined with respect to objects other than vectors, such as sets, strings or graphs.
Common kernels ($k(x, x’) =$ …):
 Polynomial: $(x^T x’)^M$; feature space: all degree-$M$ products of entries in $x$
 General Polynomial: $(x^T x’ + c)^M$ with $c > 0$; feature space: all products of up to $M$ entries of $x$
 Gaussian: $\exp({-\frac{\mid x - x’ \mid^2}{2 \sigma^2}})$; feature space: infinite!
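A NumPy sketch of kernelized linear regression with the Gaussian kernel (the formula $y(x’) = k(X, x’)^T (K + \lambda I)^{-1} y$; names are my own):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Kernel matrix with entries k(a, b) = exp(-|a-b|^2 / (2 sigma^2))."""
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * sigma ** 2))

# training data: noiseless sine samples
X = np.linspace(0, 2 * np.pi, 30)
y = np.sin(X)
lam = 1e-3
K = gaussian_kernel(X, X)                               # Gram matrix
coef = np.linalg.solve(K + lam * np.eye(len(X)), y)     # (K + lam I)^{-1} y
# prediction at a new input x' = pi/2
y_star = gaussian_kernel(np.array([np.pi / 2]), X) @ coef
assert abs(y_star[0] - 1.0) < 0.05                      # sin(pi/2) = 1
```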
Gaussian Processes
Key idea: approximating the true function $f(x)$ by an infinitedimensional Gaussian distribution over functions $P(f)$
Distribution over functions: $f(x) \sim GP(m(x), k(x, x’)) \ \forall x, x’$ where
 $m(x) = E(f(x))$ is the mean (= zero if the prior over the weights is zero-centered, since then $E(f(x)) = \phi(x)^T E(w) = 0$)
 $k(x, x’) = E((f(x) - m(x)) (f(x’) - m(x’))) = \frac{\phi(x)^T \phi(x’)}{\alpha}$ is the kernel covariance function
Gaussian Process Regression: corresponds to kernelized Bayesian linear regression with a function view instead of a weight-space view, posteriors over $f$ instead of $w$, and a complexity that is cubic in the number of training points instead of cubic in the number of basis functions. Prediction:
 $P(y’ \mid x’, X, y) = N(\bar{f}(x’), k’(x’, x’))$ where
 $\bar{f}(\cdot) = k(\cdot, X) \ A \ y$
 $k’(\cdot, \cdot) = k(\cdot, \cdot)  k(\cdot, X) \ A \ k(X, \cdot)$
 $A = (K + \sigma^2 I)^{-1}$ (the inversion in this step is cubic in the number of training points)
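The prediction equations can be sketched directly with NumPy (`gp_predict` and the toy kernel are my own; `np.linalg.inv` is used for clarity, a Cholesky solve would be preferred in practice):

```python
import numpy as np

def gp_predict(X, y, x_star, kernel, noise=0.1):
    """GP regression: posterior mean and variance at test points x_star."""
    K = kernel(X, X)
    A = np.linalg.inv(K + noise ** 2 * np.eye(len(X)))  # cubic in len(X)
    k_s = kernel(x_star, X)                             # k(., X)
    mean = k_s @ A @ y                                  # f̄(.) = k(.,X) A y
    cov = kernel(x_star, x_star) - k_s @ A @ k_s.T      # k'(.,.)
    return mean, np.diag(cov)

# Gaussian kernel with unit lengthscale on scalar inputs
kernel = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2) / 2)

X = np.linspace(0, 2 * np.pi, 20)
y = np.sin(X)
mean, var = gp_predict(X, y, np.array([np.pi / 2]), kernel)
assert abs(mean[0] - 1.0) < 0.1    # posterior mean tracks sin at pi/2
assert var[0] < 0.1                # low uncertainty inside the data region
```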
Support Vector Machines
Key idea: find the decision hyperplane that maximizes the distance to the closest data points, resulting in a unique and globally optimal max-margin separator that can be found in polynomial time.
Optimization problem:
 $\max_w \frac{1}{\mid w \mid} \min_n y_n w^T \phi(x_n)$
 Alternatively, set the minimum distance to 1 and minimize $\mid w \mid$: $\min_w \frac{1}{2}{\mid w \mid}^2$ such that $y_n w^T \phi(x_n) \geq 1 \ \forall \ n$ (only the points where the distance is 1 are necessary to define the active constraints and are called support vectors)
 This optimization can be reformulated as a kernelized dual problem, given by: $\max_a \sum_n a_n  \frac{1}{2} \sum_n \sum_{n’} a_n a_{n’} y_n y_{n’} k(x_n, x_{n’})$ such that $\sum_n a_n y_n = 0$ and $a_n \geq 0 \ \forall \ n$ (many $a_n$ will be zero, they will be nonzero only for the support vectors)
Prediction:
 Primal problem: $y_* = sign(w^T \phi(x_*))$
 Dual problem: $y_* = sign(\sum_n a_n y_n k(x_n, x_*))$
Soft margin: for data that is not linearly separable, slack variables may be introduced into the optimization problem to handle minor misclassifications: $\min_w C \sum_n \xi_n + \frac{1}{2}{\mid w \mid}^2$ such that $y_n w^T \phi(x_n) \geq 1 - \xi_n$ and $\xi_n \geq 0 \ \forall \ n$. Slack variable $\xi_n$ will be $> 0$ for misclassified examples, and $C$ controls the tradeoff between the slack-variable penalty and the margin; $C$ can also be interpreted as a regularization parameter. When $C \rightarrow \infty$, we arrive again at the hard-margin problem. Support vectors are now all points that are in the margin or misclassified.
Multiclass SVMs:
 One-Against-All: for $K$ classes, train $K$ SVMs to distinguish each class from the rest; drawback: there will be regions that are either claimed by no class or by multiple classes
 Pairwise Comparison: train $O(K^2)$ SVMs to compare each pair of classes; drawbacks: computationally expensive and it is not obvious how the best class should be picked
 Continuous Ranking: single SVM that returns a continuous value to rank all classes (most popular approach today); idea: instead of computing the sign of a linear separator, compare the values of linear functions for each class $k$; classification: $y_* = \arg\max_k w_k^T \phi(x_*)$
Hidden Markov Models
Key idea: make use of sequential correlations between classes
Assumptions:
 Stationary process: transition and emission distributions are identical at each time step: $P(x_t \mid y_t) = P(x_{t+1} \mid y_{t+1})$ and $P(y_t \mid y_{t-1}) = P(y_{t+1} \mid y_t) \ \forall \ t$
 Markovian process: next state is independent of previous states given the current state: $P(y_{t+1} \mid y_{t}, y_{t-1}, …, y_{1}) = P(y_{t+1} \mid y_{t}) \ \forall \ t$
Parameterization:
 Initial state distribution: $\pi = P(y_1)$
 Transition distribution: $\theta = P(y_{t} \mid y_{t-1})$
 Emission distribution: $\phi = P(x_t \mid y_t)$
Inference:
 Monitoring: $P(y_t \mid x_{1..t})$; forward algorithm has linear complexity in $t$
 Prediction: $P(y_{t+k} \mid x_{1..t})$; forward algorithm has linear complexity in $t + k$
 Hindsight: $P(y_{k} \mid x_{1..t})$ where $k < t$; forwardbackward algorithm has linear complexity in $t$
 Most likely explanation: $\arg\max_{y_1,…y_t}P(y_{1..t} \mid x_{1..t})$; Viterbi algorithm has linear complexity in $t$
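The forward algorithm for monitoring can be sketched in a few lines (toy parameterization and names are my own):

```python
import numpy as np

def forward(pi, theta, phi, obs):
    """Forward algorithm: P(y_t | x_{1..t}), time linear in t.
    pi[i]      = P(y_1 = i)
    theta[i,j] = P(y_t = j | y_{t-1} = i)
    phi[i,k]   = P(x_t = k | y_t = i)"""
    alpha = pi * phi[:, obs[0]]
    alpha /= alpha.sum()                  # normalize -> filtering distribution
    for x in obs[1:]:
        alpha = (alpha @ theta) * phi[:, x]   # predict, then correct
        alpha /= alpha.sum()
    return alpha

# 2-state toy chain: sticky states, informative emissions
pi = np.array([0.5, 0.5])
theta = np.array([[0.9, 0.1], [0.1, 0.9]])
phi = np.array([[0.8, 0.2], [0.2, 0.8]])
belief = forward(pi, theta, phi, obs=[0, 0, 0])
assert belief[0] > 0.9                    # three "0" observations -> state 0
assert np.isclose(belief.sum(), 1.0)
```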
Maximum likelihood objectives:
 Supervised: $\pi’,\theta’,\phi’ = \arg\max_{\pi,\theta,\phi} P(y_{1..T},x_{1..T} \mid \pi,\theta,\phi)$
 Unsupervised: $\pi’,\theta’,\phi’ = \arg\max_{\pi,\theta,\phi} P(x_{1..T} \mid \pi,\theta,\phi)$
Deep Neural Networks
Definition: neural network with many hidden layers, providing a high level of expressivity (as the number of layers is increased, the number of units needed may decrease exponentially; example: parity function)
Problems:
 Gradient vanishing: deep neural networks of sigmoid and hyperbolic units often suffer from vanishing gradients (because the derivative of these activation functions is always $< 1$); for this problem, two popular solutions exist:
  Pretraining of weights
  Other types of activation functions:
   Rectified Linear Units:
    Hard version: $g(x) = \max(0, x)$
    Soft version (“Softplus”): $g(x) = \log(1 + e^x)$
   Maxout Units: operate on vector-typed weights and output the maximum value of all the linear combinations of different subsets of input weights
 Overfitting: high expressivity increases the risk of overfitting; for this problem, three popular solutions exist:
  Regularization
  Data augmentation (to ensure that the number of parameters in the deep neural net is lower than the amount of data)
  Dropout: randomly “drop” some (input and hidden) units from the network when training; during prediction, multiply the output of each unit by one minus its dropout probability; dropout can be seen as a form of ensemble learning for neural networks
Convolutional Neural Networks
Key idea: learn convolutional features in sequential, spatial or tensor data (e.g., pixels from an input image) in a flexible way. An equivariant representation of the learned model ensures (partial) translation invariance and handling of variablelength inputs.
Convolution: in neural networks, a convolution denotes the linear combination of a subset of units based on a specific pattern of weights. Convolutions are often combined with activation functions to produce a feature value.
Pooling: commutative mathematical operation that combines several units (e.g., max, sum, product, Euclidean norm etc.)
Convolutional Neural Network (CNN): any ANN that includes an alternation of convolution and pooling layers where some of the convolution weights are shared. CNNs are also trained using backpropagation, but gradients of shared weights are combined into a single gradient.
Recurrent and Recursive Neural Networks
Key idea: facilitate processing of variablelength data by instantiating recurrent or recursive patterns in the network.
Recurrent Neural Network: outputs can be fed back to the network as inputs, creating a recurrent structure that can be unrolled to handle variable-length data. Recurrent neural networks are trained by backpropagation on the unrolled network, combining gradients of shared weights into a single gradient. The Encoder-Decoder model maps input sequences to output sequences (e.g., for machine translation, question answering, dialog).
Recursive Neural Network: generalizes recurrent neural networks from chains to trees. Parse trees or dependency graphs are used as structures for recursive neural networks. The Long Short-Term Memory (LSTM) model is a special gated structure to control memorization and forgetting in recursive neural networks, facilitating long-term memory.
Ensemble Learning
Key idea: combine several imperfect hypotheses into a better hypothesis
Assumptions:
 each hypothesis $h_i$ makes error with probability $p$
 the hypotheses are independent
Probabilities of making errors using majority vote of $n$ hypotheses under these assumptions:
 exactly $k$ of the $n$ hypotheses make an error: $p_e(k) = \binom{n}{k} p^k (1-p)^{n-k}$
 majority makes an error (for odd $n$): $\sum_{k=\lceil n/2 \rceil}^n p_e(k)$
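A stdlib sketch of these two quantities (`majority_error` is my own name):

```python
from math import comb, ceil

def majority_error(n, p):
    """Probability that a majority of n independent hypotheses, each wrong
    with probability p, makes an error (for odd n, majority = k >= ceil(n/2))."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(ceil(n / 2), n + 1))

# a single hypothesis is just as wrong as its base error rate ...
assert majority_error(1, 0.3) == 0.3
# ... but 21 independent 30%-error hypotheses err together far less often
assert majority_error(21, 0.3) < 0.05
```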
Weighted majority: as the above two assumptions are rarely true, hypotheses can be weighted lower if:
 they are correlated
 they have a lower classification performance
Boosting: technique to “boost” a weak learner by training multiple parametrizations of the same base model using weighted variations of the training set. One popular implementation is AdaBoost.

Boosting algorithm:
 set all instance weights $w_x$ to 1
 repeat until sufficient number of hypotheses:
 $h_i \leftarrow$ learn(dataset, weights)
 increase $w_x$ of misclassified instances $x$
 ensemble hypothesis is the weighted majority of all $h_i$ with weights $z_i$ proportional to the accuracy of $h_i$
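The algorithm above can be sketched in an AdaBoost-style variant with decision stumps on 1-D data (the exponential reweighting and all names are one common concrete choice, not taken from the slides):

```python
import numpy as np

def adaboost(X, y, rounds=15):
    """Boost decision stumps on 1-D data; labels y in {-1, +1}.
    Misclassified instances get larger weights each round."""
    w = np.ones(len(X)) / len(X)               # instance weights
    ensemble = []                              # (z, threshold, sign) triples
    for _ in range(rounds):
        best = None
        for thr in X:                          # candidate stump thresholds
            for s in (1, -1):
                pred = s * np.sign(X - thr + 1e-12)
                err = w[pred != y].sum()       # weighted training error
                if best is None or err < best[0]:
                    best = (err, thr, s)
        err, thr, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        z = 0.5 * np.log((1 - err) / err)      # hypothesis weight
        pred = s * np.sign(X - thr + 1e-12)
        w *= np.exp(-z * y * pred)             # increase misclassified weights
        w /= w.sum()
        ensemble.append((z, thr, s))
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    return np.sign(sum(z * s * np.sign(x - thr + 1e-12) for z, thr, s in ensemble))

X = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([-1., -1., 1., -1., 1., 1.])      # no single stump is consistent
ens = adaboost(X, y)
assert all(predict(ens, x) == t for x, t in zip(X, y))
```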

Requirement: a weak learner (i.e., a model producing hypotheses that are at least better than random), e.g., rules of thumb, decision stumps (= decision trees of one node), perceptrons, naive Bayes models
Bagging (= bootstrap aggregation): technique to improve the accuracy of a weak learner by sampling a subset of both the samples and features to obtain independent classifiers; prediction is then done by a simple majority vote (without weighting). Random forests are bags of decision trees.

Bagging algorithm:
 repeat until sufficient number $K$ of hypotheses:
 bootstrap sampling: $D_k \leftarrow$ sample data subset
 random projection: $F_k \leftarrow$ sample feature subset
 training: $h_k \leftarrow$ learn($D_k$, $F_k$)
 ensemble hypothesis is the (nonweighted) majority of all $h_k$ for classification or the average output for regression
|  | Boosting | Bagging |
| --- | --- | --- |
| Sampling | reweight instances in dataset | sample subset of data and features |
| Classifiers obtained | complementary | independent |
| Prediction | weighted majority vote | majority vote (no weighting) |
Stream Learning
Key idea: instead of training one hypothesis on a fixed dataset, continuously update a preliminary hypothesis as new data points arrive.
Challenges:
 since the data is streaming, it cannot all be stored $\rightarrow$ old data cannot necessarily be revisited
 time to process new incoming data must be constant and less than the arrival time for the next batch of data
Solutions:
 Bayesian Learning: lends itself naturally to stream learning because it stores information about training data in a product of likelihood distributions which can be updated incrementally every time a new data point comes in

Stochastic Gradient Descent (SGD): for optimizationbased learners like least square regression, logistic regression, maximum likelihood, support vector machines and neural networks, gradient descent can be implemented as an incremental approach where the model parameters are updated based on every incoming data point $(x_n, y_n)$ in isolation (as opposed to all existing data points at once), weighting the corresponding loss gradient by a data point specific learning rate $\alpha_n$.

Robbins-Monro conditions: convergence is only guaranteed if the learning rate satisfies the Robbins-Monro conditions $\sum_{n=1}^\infty \alpha_n = \infty$ and $\sum_{n=1}^\infty \alpha_n^2 < \infty$; an example of an update rule for learning rate $\alpha_n$ that satisfies these conditions is: $\alpha_n = \frac{1}{(\tau + n)^k}$ where $\tau \geq 0$ and $k \in (0.5, 1]$
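A sketch of SGD for least-squares linear regression under these conditions, processing each point exactly once (`sgd_linear` and the toy stream are my own):

```python
import numpy as np

def sgd_linear(stream, D, tau=1.0, kappa=0.6):
    """Stream learning: one gradient step per incoming point, with the
    Robbins-Monro-satisfying rate alpha_n = 1 / (tau + n)^kappa."""
    w = np.zeros(D + 1)
    for n, (x, y) in enumerate(stream, start=1):
        xb = np.append(1.0, x)                 # bias feature
        alpha = 1.0 / (tau + n) ** kappa       # kappa in (0.5, 1]
        w += alpha * (y - w @ xb) * xb         # step along -grad of squared loss
        # old data is never revisited: constant work per point
    return w

# stream of noiseless samples from t = 2x + 1
rng = np.random.default_rng(0)
stream = []
for _ in range(5000):
    x = rng.uniform(-1, 1)
    stream.append((x, 2 * x + 1))
w = sgd_linear(stream, D=1)
assert np.allclose(w, [1.0, 2.0], atol=0.05)   # converged to the true weights
```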

AdaGrad: form of SGD that is often used in backpropagation. In AdaGrad, the model parameters $\theta$ are updated as follows: $\theta_m^{(n+1)} \leftarrow \theta_m^{(n)} - \frac{\alpha}{\tau + \sqrt{s_m^{(n)}}} \frac{\partial Loss(x_n, y_n, \theta^{(n)})}{\partial \theta_m^{(n)}}$ where $s_m^{(n)} \leftarrow s_m^{(n-1)} + \left(\frac{\partial Loss(x_n, y_n, \theta^{(n)})}{\partial \theta_m^{(n)}}\right)^2$
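The update can be sketched on a toy one-parameter objective (names and the choice $\alpha = 0.5$, small $\tau$ are my own illustration):

```python
import numpy as np

def adagrad_step(theta, s, grad, alpha=0.5, tau=1e-8):
    """One AdaGrad update: per-parameter rate alpha / (tau + sqrt(s)),
    where s accumulates the squared gradients seen so far."""
    s = s + grad ** 2
    theta = theta - alpha / (tau + np.sqrt(s)) * grad
    return theta, s

# minimize f(theta) = (theta - 3)^2, gradient 2*(theta - 3)
theta, s = np.array([0.0]), np.array([0.0])
for _ in range(500):
    grad = 2 * (theta - 3.0)
    theta, s = adagrad_step(theta, s, grad)
assert abs(theta[0] - 3.0) < 0.01   # converged to the minimum
```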
