Notes on classical ML math

April 2, 2023
Machine Learning, Math

Introduction #

There are two major schools of thought on the interpretation of probability: one is the frequentist school and the other is the Bayesian school.

Later, we will use the following notation for the observed set: \(X_{N\times p}=(x_{1},x_{2},\cdots,x_{N})^{T},x_{i}=(x_{i1},x_{i2},\cdots,x_{ip})^{T}\)

This notation indicates there are \(N\) samples, and each sample is a \(p\) -dimensional vector. Each observation is generated by \(p(x|\theta)\) .

Frequentist #

Frequentists hold that in \(p(x|\theta)\) , \(\theta\) as a parameter is a constant. For \(N\) observations, the probability of the observed set is \(p(X|\theta)\mathop{=}\limits _{iid}\prod\limits _{i=1}^{N}p(x_{i}|\theta)\) . To find the value of \(\theta\) , we use maximum likelihood estimation (MLE) on the log-likelihood:

\(\theta_{MLE}=\mathop{argmax}\limits _{\theta}\log p(X|\theta)\mathop{=}\limits _{iid}\mathop{argmax}\limits _{\theta}\sum\limits _{i=1}^{N}\log p(x_{i}|\theta)\)

Bayesian #

Bayesians believe that \(\theta\) in \(p(x|\theta)\) is not a constant. This \(\theta\) follows a preset prior distribution \(\theta\sim p(\theta)\) . So, according to Bayes' theorem, the parameter's posterior distribution based on the observed set can be written as:

\(p(\theta|X)=\frac{p(X|\theta)\cdot p(\theta)}{p(X)}=\frac{p(X|\theta)\cdot p(\theta)}{\int\limits _{\theta}p(X|\theta)\cdot p(\theta)d\theta} \propto p(X|\theta)\cdot p(\theta)\)

To find the value of \(\theta\) , we need to maximize this parameter's posterior distribution using the MAP (Maximum A Posteriori) method:

\(\theta_{MAP}=\mathop{argmax}\limits _{\theta}p(\theta|X)=\mathop{argmax}\limits _{\theta}p(X|\theta)\cdot p(\theta)\)

The second equality is due to the denominator being unrelated to \(\theta\) .

\(p(X|\theta)\) is called the likelihood and represents our model distribution. After obtaining the posterior distribution of the parameters, we can use it for Bayesian prediction:

\(p(x_{new}|X)=\int\limits _{\theta}p(x_{new}|\theta)\cdot p(\theta|X)d\theta\)

In this integral, the first factor is the model (likelihood), and the second factor is the posterior distribution of the parameters.

Here's the derivation:

  1. Start with the joint probability of observing \(x_{new}\) and \(\theta\) given \(X\) : \(p(x_{new}, \theta|X) = p(x_{new}|\theta, X) \cdot p(\theta|X)\)

  2. To find the marginal probability of observing \(x_{new}\) given \(X\) , integrate the joint probability over all possible values of \(\theta\) : \(p(x_{new}|X) = \int\limits_{\theta} p(x_{new}, \theta|X) d\theta\)

  3. Substitute the joint probability from step 1 into the integral: \(p(x_{new}|X) = \int\limits_{\theta} [p(x_{new}|\theta, X) \cdot p(\theta|X)] d\theta\)

  4. Since \(x_{new}\) is conditionally independent of \(X\) given \(\theta\) , we can simplify \(p(x_{new}|\theta, X)\) as \(p(x_{new}|\theta)\) : \(p(x_{new}|X) = \int\limits_{\theta} [p(x_{new}|\theta) \cdot p(\theta|X)] d\theta\)

The final result is the given formula:

\(p(x_{new}|X) = \int\limits_{\theta} p(x_{new}|\theta) \cdot p(\theta|X) d\theta\)
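To make this prediction pipeline concrete, here is a minimal numerical sketch. The model choice (a 1-d Gaussian likelihood with known unit variance and a standard Gaussian prior on \(\theta\)) and the grid approximation of the integral are my own illustrative assumptions, not part of the notes above:

```python
import numpy as np

# Illustrative model (not from the notes): p(x|theta) = N(theta, 1), p(theta) = N(0, 1).
rng = np.random.default_rng(0)
X = rng.normal(2.0, 1.0, size=20)          # observed set

theta = np.linspace(-10, 10, 2001)          # grid over the parameter
d_theta = theta[1] - theta[0]

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

prior = gauss(theta, 0.0, 1.0)
log_lik = np.sum([np.log(gauss(x, theta, 1.0)) for x in X], axis=0)
post = prior * np.exp(log_lik - log_lik.max())
post /= post.sum() * d_theta                # p(theta|X), normalized on the grid

x_new = 2.5
pred = np.sum(gauss(x_new, theta, 1.0) * post) * d_theta  # ∫ p(x_new|θ) p(θ|X) dθ
print(pred)
```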

Gaussian Distribution #

1-d MLE #

The parameter \(\theta\) of the Gaussian under 1-d MLE:

\[\begin{aligned} &\theta=(\mu,\Sigma)=(\mu,\sigma^{2}), \\ &\theta_{MLE}=\mathop{argmax}\limits_{\theta}\log p(X|\theta)\mathop{=}\limits_{iid}\mathop{argmax}\limits_{\theta}\sum\limits _{i=1}^{N}\log p(x_{i}|\theta) \end{aligned}\]

The probability density function (PDF) of the multivariate Gaussian distribution in \(p\) -d with mean vector \(\mu\) and covariance matrix \(\Sigma\) is written as: \(p(x|\mu,\Sigma)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\)

Plugging into MLE, we consider the one-dimensional case: \(\log p(X|\theta)=\sum\limits _{i=1}^{N}\log p(x_{i}|\theta)=\sum\limits _{i=1}^{N}\log\frac{1}{\sqrt{2\pi}\sigma}\exp(-(x_{i}-\mu)^{2}/2\sigma^{2})\)

To find \(\mu_{MLE}\) :

\(\mu_{MLE}=\mathop{argmax}\limits _{\mu}\log p(X|\theta)=\mathop{argmin}\limits _{\mu}\sum\limits _{i=1}^{N}(x_{i}-\mu)^{2}\) \(\frac{\partial}{\partial\mu}\sum\limits _{i=1}^{N}(x_{i}-\mu)^{2}=0\longrightarrow\mu_{MLE}=\frac{1}{N}\sum\limits _{i=1}^{N}x_{i}\) For \(\sigma\) , we have:

\[\begin{aligned} \sigma_{MLE}=\mathop{argmax}\limits _{\sigma}\log p(X|\theta)&=\mathop{argmax}\limits _{\sigma}\sum\limits _{i=1}^{N}[-\log\sigma-\frac{1}{2\sigma^{2}}(x_{i}-\mu)^{2}]\\ &=\mathop{argmin}\limits _{\sigma}\sum\limits _{i=1}^{N}[\log\sigma+\frac{1}{2\sigma^{2}}(x_{i}-\mu)^{2}] \end{aligned}\] \(\therefore\) \(\frac{\partial}{\partial\sigma}\sum\limits _{i=1}^{N}[\log\sigma+\frac{1}{2\sigma^{2}}(x_{i}-\mu)^{2}]=0\)

Given the log-likelihood function to optimize: \(\sum\limits_{i=1}^{N}[\log\sigma + \frac{1}{2\sigma^2}(x_i - \mu)^2]\)

  1. Differentiate the log-likelihood function with respect to \(\sigma\) : \(\frac{\partial}{\partial\sigma}\sum\limits_{i=1}^{N}[\log\sigma + \frac{1}{2\sigma^2}(x_i - \mu)^2]\)

  2. Apply the sum rule and chain rule for differentiation: \(N \cdot \frac{1}{\sigma} - \sum\limits_{i=1}^{N}\frac{(x_i - \mu)^2}{\sigma^3} = 0\)

  3. Solve for \(\sigma^2\) : \(\frac{1}{\sigma} = \frac{1}{N \sigma^3}\sum\limits_{i=1}^{N}(x_i - \mu)^2 \longrightarrow \sigma^2 = \frac{1}{N}\sum\limits_{i=1}^{N}(x_i - \mu)^2\)

The result is the MLE of the variance (plugging in \(\mu_{MLE}\) for \(\mu\) ): \(\sigma_{MLE}^2 = \frac{1}{N}\sum\limits_{i=1}^{N}(x_i - \mu_{MLE})^2\)

Since we have \(\sigma^2_{MLE}\) , we can check whether it is unbiased by computing \(\mathbb{E}_D [\sigma^2_{MLE}]\) :

\[\begin{aligned} \mathbb{E}_{\mathcal{D}}[\sigma_{MLE}^{2}]&=\mathbb{E}_{\mathcal{D}}[\frac{1}{N}\sum\limits _{i=1}^{N}(x_{i}-\mu_{MLE})^{2}]=\mathbb{E}_{\mathcal{D}}[\frac{1}{N}\sum\limits _{i=1}^{N}(x_{i}^{2}-2x_{i}\mu_{MLE}+\mu_{MLE}^{2})] \\&=\mathbb{E}_{\mathcal{D}}[\frac{1}{N}\sum\limits _{i=1}^{N}x_{i}^{2}-\mu_{MLE}^{2}]=\mathbb{E}_{\mathcal{D}}[\frac{1}{N}\sum\limits _{i=1}^{N}x_{i}^{2}-\mu^{2}+\mu^{2}-\mu_{MLE}^{2}]\\ &= \mathbb{E}_{\mathcal{D}}[\frac{1}{N}\sum\limits _{i=1}^{N}x_{i}^{2}-\mu^{2}]-\mathbb{E}_{\mathcal{D}}[\mu_{MLE}^{2}-\mu^{2}]=\sigma^{2}-(\mathbb{E}_{\mathcal{D}}[\mu_{MLE}^{2}]-\mu^{2})\\&=\sigma^{2}-(\mathbb{E}_{\mathcal{D}}[\mu_{MLE}^{2}]-\mathbb{E}_{\mathcal{D}}^{2}[\mu_{MLE}])=\sigma^{2}-Var[\mu_{MLE}]\\&=\sigma^{2}-Var[\frac{1}{N}\sum\limits _{i=1}^{N}x_{i}]=\sigma^{2}-\frac{1}{N^{2}}\sum\limits _{i=1}^{N}Var[x_{i}]=\frac{N-1}{N}\sigma^{2} \end{aligned}\] \(\therefore\) \(\sigma_{MLE}^{2}\) underestimates the true variance by a factor of \(\frac{N-1}{N}\) , so the unbiased estimator \(\hat{\sigma}^{2}=\frac{N}{N-1}\sigma_{MLE}^{2}\) is:

\(\hat{\sigma}^{2}=\frac{1}{N-1}\sum\limits _{i=1}^{N}(x_{i}-\mu_{MLE})^{2}\)
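A quick Monte Carlo check (illustrative, with arbitrary constants of my own choosing) of the \(\frac{N-1}{N}\) bias factor derived above:

```python
import numpy as np

# Check that sigma^2_MLE is biased by (N-1)/N while the N-1 estimator is not.
rng = np.random.default_rng(0)
N, trials, sigma2 = 5, 100_000, 4.0

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
mu_mle = samples.mean(axis=1, keepdims=True)
s2_mle = ((samples - mu_mle) ** 2).mean(axis=1)           # divide by N
s2_hat = ((samples - mu_mle) ** 2).sum(axis=1) / (N - 1)  # divide by N-1

print(s2_mle.mean())   # ~ (N-1)/N * sigma2 = 3.2
print(s2_hat.mean())   # ~ sigma2 = 4.0
```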

Multivariate Gaussian #

\(\because\) \(p(x|\mu,\Sigma)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}\)

Here, \(x,\mu\in\mathbb{R}^{p},\Sigma\in\mathbb{R}^{p\times p}\) , where \(\Sigma\) is the covariance matrix, which is in general positive semi-definite.

First, we deal with the quadratic form in the exponent, which can be interpreted as the squared Mahalanobis distance between \(x\) and \(\mu\) .

The Mahalanobis distance is a method for measuring the distance between data points, taking into account the covariance structure of the data. For two points \(x\) and \(y\) , their Mahalanobis distance is defined as:

\(D_{M}(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}\)

Here, \(\Sigma\) is the covariance matrix of the data. The difference between Mahalanobis distance and Euclidean distance lies in the fact that the former takes into account the shape of the data distribution, especially the correlation between the data. When the covariance matrix of the data is an identity matrix, the Mahalanobis distance is equal to the Euclidean distance.
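A small numpy sketch of this definition, with an arbitrary illustrative covariance matrix; it also confirms that the two distances coincide when \(\Sigma = I\):

```python
import numpy as np

# Mahalanobis vs Euclidean distance; Sigma is an arbitrary illustrative covariance.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, 0.5])
y = np.array([-0.5, 1.0])

diff = x - y
d_mahalanobis = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)
d_euclidean = np.linalg.norm(diff)
print(d_mahalanobis, d_euclidean)

# With Sigma = I the two distances coincide:
print(np.sqrt(diff @ np.linalg.inv(np.eye(2)) @ diff), d_euclidean)
```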

The covariance matrix of the data is used to represent the degree of joint variation between different dimensions in the data. For a dataset with \(n\) samples and \(p\) features, the covariance matrix \(\Sigma\) is a \(p \times p\) matrix, defined as:

\(\Sigma = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T\)

Here, \(x_i\) is the \(i\) -th sample point, \(\bar{x}\) is the sample mean, and \(n\) is the number of samples.

This equation defines a matrix because each summand \((x_i - \bar{x})(x_i - \bar{x})^T\) itself is a matrix. Let's explain this in more detail:

Suppose \(x_i\) and \(\bar{x}\) are \(p\) -dimensional column vectors; then \((x_i - \bar{x})\) is also a \(p\) -dimensional column vector. When calculating the outer product \((x_i - \bar{x})(x_i - \bar{x})^T\) , we obtain a \(p \times p\) matrix whose \((j,k)\) entry is the product of the \(j\) -th and \(k\) -th components of \((x_i - \bar{x})\) .

Summing these \(p \times p\) matrices over the sample size \(n\) , we obtain a cumulative \(p \times p\) matrix. Finally, the entire sum is divided by \((n-1)\) to obtain the unbiased covariance matrix \(\Sigma\) .

The denominator is \(n-1\) rather than \(n\) because of Bessel's correction, which provides an unbiased estimate of the population covariance when working with sample data.

For a symmetric covariance matrix, eigendecomposition can be performed.

\(\Sigma=U\Lambda U^{T}=(u_{1},u_{2},\cdots,u_{p})diag(\lambda_{i})(u_{1},u_{2},\cdots,u_{p})^{T}=\sum\limits _{i=1}^{p}u_{i}\lambda_{i}u_{i}^{T}\) \(\Sigma^{-1}=\sum\limits _{i=1}^{p}u_{i}\frac{1}{\lambda_{i}}u_{i}^{T}\)

In this context, \(u_i\) is the eigenvector of the covariance matrix \(\Sigma\) . Eigendecomposition decomposes the matrix into a product of eigenvectors \(u_i\) and eigenvalues \(\lambda_i\) . \(U\) is a matrix composed of \(u_i\) as column vectors corresponding to the eigenvectors. \(\Lambda\) is a diagonal matrix with elements on the diagonal being the corresponding eigenvalues \(\lambda_i\) .
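The following sketch (illustrative values) verifies both identities above numerically, rebuilding \(\Sigma\) and \(\Sigma^{-1}\) from the eigenpairs:

```python
import numpy as np

# Eigendecomposition of a symmetric covariance matrix: Sigma = U diag(lam) U^T.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

lam, U = np.linalg.eigh(Sigma)

# Sigma = sum_i lam_i u_i u_i^T  and  Sigma^{-1} = sum_i (1/lam_i) u_i u_i^T
Sigma_rebuilt = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(len(lam)))
Sigma_inv = sum((1 / lam[i]) * np.outer(U[:, i], U[:, i]) for i in range(len(lam)))

print(np.allclose(Sigma, Sigma_rebuilt))              # True
print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))   # True
```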

\[\begin{aligned} \Delta&=(x-\mu)^{T}\Sigma^{-1}(x-\mu) \\ &=\sum\limits _{i=1}^{p}(x-\mu)^{T}u_{i}\frac{1}{\lambda_{i}}u_{i}^{T}(x-\mu) \\ &=\sum\limits _{i=1}^{p}\frac{y_{i}^{2}}{\lambda_{i}} \tag{eq:1} \end{aligned}\] \( y_i = (x-\mu)^T u_i = u_i^T (x-\mu) \)

We notice that the scalar \(y_i\) is the projection of \(x-\mu\) onto the eigenvector \(u_i\) (and vice versa, since the dot product is symmetric).

Specifically, \(u_i^T(x - \mu)\) calculates the dot product between \((x - \mu)\) and \(u_i\) , which gives the component of \((x - \mu)\) in the direction of \(u_i\) . This is why we say \(y_i\) is the projection of \((x - \mu)\) onto the eigenvector \(u_i\) .

In Euclidean space, the dot product (also known as the inner product) is a fundamental operation between two vectors, used to calculate their components (or projections) in the same direction. The formula for calculating Euclidean dot product is:

\(u_i^T (x - \mu) = ||u_i|| ||x - \mu|| \cos{\theta}\)

Where \(||u_i||\) and \(||x - \mu||\) represent the lengths of the vectors \(u_i\) and \((x - \mu)\) , respectively, and \(\theta\) represents the angle between them. Since the eigenvector \(u_i\) is typically normalized to have unit length, i.e., \(||u_i|| = 1\) , the dot product simplifies to:

\(y_i = u_i^T (x - \mu) = ||x - \mu|| \cos{\theta}\)

\(\ref{eq:1}\) represents generalized concentric ellipsoids (ellipses in 2-d) for different values of \(\Delta\) .

Each term \(\frac{y_{i}^{2}}{\lambda_{i}}\) represents the contribution of the projection of the vector \((x - \mu)\) onto the eigenvector \(u_i\) to the overall shape of the ellipse. The eigenvalues \(\lambda_i\) determine the scaling of the ellipse along each eigenvector direction.

Now let's look at two issues in the practical application of multidimensional Gaussian models:

  1. The number of degrees of freedom for parameters \(\Sigma,\mu\) is \(O(p^{2})\) . For high-dimensional data, this leads to too many degrees of freedom.

    Solution: The high degrees of freedom come from \(\Sigma\) having \(\frac{p(p+1)}{2}\) free parameters. We can assume that it is a diagonal matrix, or even that the diagonal elements are all equal under the isotropic assumption. The corresponding dimensionality-reduction algorithms are Factor Analysis (FA) for the former and probabilistic PCA (p-PCA) for the latter.

  2. The second problem is that a single Gaussian distribution is unimodal, and it cannot achieve good results for data distributions with multiple peaks. Solution: Gaussian Mixture Model (GMM).

Next, we will introduce commonly used theorems for multidimensional Gaussian distributions.

Let \(x=(x_1, x_2,\cdots,x_p)^T=(x_{a,m\times 1}, x_{b,n\times1})^T,\mu=(\mu_{a,m\times1}, \mu_{b,n\times1})^T,\Sigma=\begin{pmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{pmatrix}\)

Theorem: Given \(x \sim \mathcal{N}(\mu, \Sigma)\) and \(y = Ax + b\) , then \(y \sim \mathcal{N}(A\mu + b, A\Sigma A^T)\) .

\(\mathbb{E}[y]=\mathbb{E}[Ax+b]=A\mathbb{E}[x]+b=A\mu+b\)

\(Var[y]=Var[Ax+b]=Var[Ax]=A\cdot Var[x]\cdot A^T\)

Now, using this theorem, we can obtain four quantities \(p(x_a), p(x_b), p(x_a|x_b), p(x_b|x_a)\)

  1. \(x_a = \begin{pmatrix}\mathbb{I}_{m\times m} & \mathbb{O}_{m\times n}\end{pmatrix}\begin{pmatrix}x_a \\ x_b\end{pmatrix}\) .

    Substituting into the theorem, we get:

    \[ \begin{aligned} \mathbb{E}[x_a] = \begin{pmatrix}\mathbb{I} & \mathbb{O}\end{pmatrix}\begin{pmatrix}\mu_a \\ \mu_b\end{pmatrix} = \mu_a \\ Var[x_a] = \begin{pmatrix}\mathbb{I} & \mathbb{O}\end{pmatrix}\begin{pmatrix}\Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb}\end{pmatrix}\begin{pmatrix}\mathbb{I} \\ \mathbb{O}\end{pmatrix} = \Sigma_{aa} \end{aligned} \] \(\therefore\)

    \(x_a \sim \mathcal{N}(\mu_a, \Sigma_{aa})\)
  2. Similarly, \(x_b \sim \mathcal{N}(\mu_b, \Sigma_{bb})\) .

  3. For the two conditional probabilities, we introduce three quantities:

    \( x_{b\cdot a} = x_b - \Sigma_{ba}\Sigma_{aa}^{-1}x_a\\ \mu_{b\cdot a} = \mu_b - \Sigma_{ba}\Sigma_{aa}^{-1}\mu_a\\ \Sigma_{bb\cdot a} = \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} \)

    In particular, the last quantity is called the Schur complement of \(\Sigma_{aa}\) in \(\Sigma\) . We can see that: \( x_{b\cdot a} = \begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1} & \mathbb{I}_{n\times n}\end{pmatrix}\begin{pmatrix}x_a \\ x_b\end{pmatrix} \)

    So

    \[ \begin{aligned} \mathbb{E}[x_{b\cdot a}] &= \begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1} & \mathbb{I}_{n\times n}\end{pmatrix}\begin{pmatrix}\mu_a \\ \mu_b\end{pmatrix} = \mu_{b\cdot a}\\ Var[x_{b\cdot a}] &= \begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1} & \mathbb{I}_{n\times n}\end{pmatrix}\begin{pmatrix}\Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb}\end{pmatrix}\begin{pmatrix}-\Sigma_{aa}^{-1}\Sigma_{ba}^T \\ \mathbb{I}_{n\times n}\end{pmatrix} = \Sigma_{bb\cdot a} \end{aligned} \] Using these three quantities, we can get \(x_b = x_{b\cdot a} + \Sigma_{ba}\Sigma_{aa}^{-1}x_a\) .

    Therefore: \( \mathbb{E}[x_b|x_a] = \mu_{b\cdot a} + \Sigma_{ba}\Sigma_{aa}^{-1}x_a \) \( Var[x_b|x_a] = \Sigma_{bb\cdot a} \)

  4. Similarly: \( x_{a\cdot b} = x_a - \Sigma_{ab}\Sigma_{bb}^{-1}x_b\\ \mu_{a\cdot b} = \mu_a - \Sigma_{ab}\Sigma_{bb}^{-1}\mu_b\\ \Sigma_{aa\cdot b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \)

    \(\therefore\) \( \mathbb{E}[x_a|x_b] = \mu_{a\cdot b} + \Sigma_{ab}\Sigma_{bb}^{-1}x_b \) \( Var[x_a|x_b] = \Sigma_{aa\cdot b} \)
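Collecting these results, here is a small numpy sketch of the conditional formulas; the block values and the partition size below are arbitrary illustrative choices:

```python
import numpy as np

# Partitioned Gaussian: given joint N(mu, Sigma) over (x_a, x_b), compute p(x_b | x_a).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m = 1                                     # dim of x_a; x_b takes the rest

mu_a, mu_b = mu[:m], mu[m:]
S_aa, S_ab = Sigma[:m, :m], Sigma[:m, m:]
S_ba, S_bb = Sigma[m:, :m], Sigma[m:, m:]

x_a = np.array([0.7])                     # conditioning value
cond_mean = mu_b + S_ba @ np.linalg.inv(S_aa) @ (x_a - mu_a)   # mu_{b.a} + S_ba S_aa^{-1} x_a
cond_cov = S_bb - S_ba @ np.linalg.inv(S_aa) @ S_ab            # Schur complement

print(cond_mean, cond_cov)
```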

Example: Given \(p(x) = \mathcal{N}(\mu, \Lambda^{-1}), p(y|x) = \mathcal{N}(Ax + b, L^{-1})\) , find: \(p(y), p(x|y)\) .

Solution: Let \(y = Ax + b + \epsilon, \epsilon \sim \mathcal{N}(0, L^{-1})\) , \(\therefore\)

\[\begin{aligned} \mathbb{E}[y] &= \mathbb{E}[Ax + b + \epsilon] = A\mu + b \\ Var[y] &= A\Lambda^{-1}A^T + L^{-1} \end{aligned}\]

\(\therefore\) \( p(y) = \mathcal{N}(A\mu + b, L^{-1} + A\Lambda^{-1}A^T)\) . Introduce \(z = \begin{pmatrix}x \\ y\end{pmatrix}\) ; we can obtain \(Cov[x, y] = \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])^T]\) . For this covariance, we can directly compute:

\[\begin{aligned} Cov(x, y) &= \mathbb{E}[(x - \mu)(Ax - A\mu + \epsilon)^T] = \mathbb{E}[(x - \mu)(x - \mu)^TA^T] = Var[x]A^T = \Lambda^{-1}A^T \end{aligned}\]

Note the symmetry of the covariance matrix: \(p(z) = \mathcal{N}\left(\begin{pmatrix}\mu \\ A\mu + b\end{pmatrix}, \begin{pmatrix}\Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^T\end{pmatrix}\right)\) . Applying the conditional formulas above:

\( \mathbb{E}[x|y] = \mu + \Lambda^{-1}A^T(L^{-1} + A\Lambda^{-1}A^T)^{-1}(y - A\mu - b) \) \( Var[x|y] = \Lambda^{-1} - \Lambda^{-1}A^T(L^{-1} + A\Lambda^{-1}A^T)^{-1}A\Lambda^{-1}\)

Linear Regression #

> If AI thinks in matrix, so should human. — anonymous

Suppose we have dataset \(\mathcal{D}\) : \(\mathcal{D}=\{(x_1, y_1),(x_2, y_2),\cdots,(x_N, y_N)\}\) We denote the samples by \(X\) and the targets by \(Y\) : \(X=(x_1,x_2,\cdots,x_N)^T,Y=(y_1,y_2,\cdots,y_N)^T\) We consider a linear function: \(f(w)=w^Tx\)

Least squares method #

For this problem, we use the squared error defined by the Euclidean norm to define the loss function: \(L(w) = \sum\limits_{i=1}^N ||w^Tx_i - y_i||^2_2\)

Expand it, we get:

\[\begin{aligned} L(w)&=(w^Tx_1-y_1,\cdots,w^Tx_N-y_N)\cdot (w^Tx_1-y_1,\cdots,w^Tx_N-y_N)^T\\ &=(w^TX^T-Y^T)\cdot (Xw-Y)=w^TX^TXw-Y^TXw-w^TX^TY+Y^TY\\ &=w^TX^TXw-2w^TX^TY+Y^TY \end{aligned}\] Now minimize \(L(w)\) to find \(\hat{w}\) :

\[\begin{aligned} \hat{w}=\mathop{argmin}\limits_wL(w)&\longrightarrow\frac{\partial}{\partial w}L(w)=0\\ &\longrightarrow2X^TX\hat{w}-2X^TY=0\\ &\longrightarrow \hat{w}=(X^TX)^{-1}X^TY=X^+Y \end{aligned}\] \((X^TX)^{-1}X^T\) is also called the pseudo-inverse, denoted \(X^+ = (X^TX)^{-1}X^T\) .

For \(X\) with full column rank, \(X^TX\) is invertible and the solution can be obtained directly (a symmetric formula exists for full row rank), but for non-full-rank sample sets, the singular value decomposition (SVD) is needed. Applying SVD to \(X\) , we get \(X=U\Sigma V^T\) .

To find the pseudoinverse \(X^+\) for a non-full rank matrix X, we can use the SVD decomposition. We first find the pseudoinverse of the diagonal matrix \(\Sigma\) . For every non-zero diagonal element \(\sigma_i\) , its pseudoinverse \(\sigma_i^+\) is \(\frac{1}{\sigma_i}\) . Then, the pseudoinverse of \(\Sigma\) , denoted as \(\Sigma^+\) , is a diagonal matrix with the reciprocals of the non-zero diagonal elements of \(\Sigma\) .

Now, using the property of pseudoinverse for orthogonal matrices:

\(U^+ = U^T\) \(V^+ = V^T\)

An orthogonal matrix is a square matrix where its transpose is also its inverse:

\(Q^TQ = QQ^T = I\)

In the context of SVD, the matrices \(U\) and \(V\) are orthogonal matrices. The pseudoinverse, denoted by the superscript +, has the property that for orthogonal matrices, the pseudoinverse is equal to the transpose:

  1. For an orthogonal matrix \(U\) , we have \(U^TU = UU^T = I\) , which implies \(U^+ = U^T\) .
  2. Similarly, for an orthogonal matrix \(V\) , we have \(V^TV = VV^T = I\) , which implies \(V^+ = V^T\) .

The pseudoinverse \(X^+\) can be computed as:

\(X^+ = (U\Sigma V^T)^+ = V\Sigma^+ U^T\)

So, we have:

\(X^+ = V\Sigma^{+}U^T\)

which reduces to \(V\Sigma^{-1}U^T\) when \(\Sigma\) is invertible.
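A short sketch (synthetic data of my own choosing) comparing the normal-equation solution with the SVD pseudo-inverse route; `np.linalg.pinv` implements the same construction:

```python
import numpy as np

# Least-squares weights via normal equations and via the SVD pseudo-inverse.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=50)

w_normal = np.linalg.inv(X.T @ X) @ X.T @ Y       # (X^T X)^{-1} X^T Y

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
s_plus = np.where(s > 1e-10, 1 / s, 0.0)          # reciprocals of non-zero singular values
w_svd = Vt.T @ (s_plus[:, None] * U.T) @ Y        # X^+ Y = V Sigma^+ U^T Y

print(np.allclose(w_normal, w_svd))               # True (X has full column rank)
print(np.allclose(w_svd, np.linalg.pinv(X) @ Y))  # True
```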

MLE with Gaussian noise #

For 1-d situation:

\(y=w^Tx+\epsilon,\epsilon\sim\mathcal{N}(0,\sigma^2)\) \(\therefore y\sim\mathcal{N}(w^Tx,\sigma^2)\)

Plug it into MLE:

\[\begin{aligned} L(w)=\log p(Y|X,w)&=\log\prod\limits_{i=1}^Np(y_i|x_i,w)\\ &=\sum\limits_{i=1}^N\log(\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i-w^Tx_i)^2}{2\sigma^2}})\\ \mathop{argmax}\limits_wL(w)&=\mathop{argmin}\limits_w\sum\limits_{i=1}^N(y_i-w^Tx_i)^2 \end{aligned}\] This expression is the same as the result obtained from the least squares estimation.

Gaussian weight prior in MAP (Maximum A Posteriori) estimation #

Take the prior distribution \(w \sim \mathcal{N}(0, \sigma_0^2)\) . Then:

\[\begin{aligned} \hat{w} = \mathop{argmax}\limits_w p(w|Y) &= \mathop{argmax}\limits_w p(Y|w) p(w) \\ &= \mathop{argmax}\limits_w \log p(Y|w) p(w) \\ &= \mathop{argmax}\limits_w (\log p(Y|w) + \log p(w)) \\ &= \mathop{argmin}\limits_w [(y - w^Tx)^2 + \frac{\sigma^2}{\sigma_0^2} w^Tw] \end{aligned}\] Here, we omit \(X\) as \(p(Y)\) is not related to \(w\) , and we also use the MLE result for Gaussian distribution from above.

We will see that the existence of the hyperparameter \(\sigma_0\) corresponds to the Ridge regularization term that will be introduced below. Similarly, if the prior distribution is taken as the Laplace distribution, a result similar to L1 regularization will be obtained.

Regularization #

In practical applications, if the sample size is not much larger than the feature dimension, overfitting may occur. For this situation, we have the following three solutions:

  1. Add more data
  2. Feature selection (reduce feature dimension), such as the PCA algorithm.
  3. Regularization

Regularization generally involves adding a regularization term (representing the penalty for the model's complexity) to the loss function (such as the least squares loss introduced above). Below, we introduce two regularization frameworks commonly used in general situations.

\[\begin{aligned} L1&:\mathop{argmin}\limits_w L(w)+\lambda||w||_1,\lambda > 0 \\ L2&:\mathop{argmin}\limits_w L(w)+\lambda||w||^2_2,\lambda > 0 \end{aligned}\]

L1 Lasso #

Caveat: L1 regularization can lead to sparse solutions.

From the perspective of minimizing the loss, the derivative of the L1 term is non-zero near 0 on both the left and right sides, making it easier to obtain a solution of 0.

On the other hand, L1 regularization is equivalent to: \(\mathop{argmin}\limits_w L(w) \\ s.t. ||w||_1 < C\) We have already seen that the squared error loss function is an ellipsoid in the \(w\) space. Therefore, the solution of the above equation is the tangent point between the ellipsoid and \(||w||_1 = C\) , making it more likely to be tangent on the coordinate axis. That means that some of the weight coefficients (w) become exactly zero. This property leads to sparsity in the solution, which can be seen as a form of automatic feature selection. However, in certain cases, it might also result in losing some potentially important information from the features whose coefficients become zero.

L2 Ridge #

\[\begin{aligned} \hat{w}=\mathop{argmin}\limits_wL(w)+\lambda w^Tw&\longrightarrow\frac{\partial}{\partial w}L(w)+2\lambda w=0\\ &\longrightarrow2X^TX\hat{w}-2X^TY+2\lambda \hat w=0\\ &\longrightarrow \hat{w}=(X^TX+\lambda \mathbb{I})^{-1}X^TY \end{aligned}\] As we can see, this regularization parameter coincides with the previous MAP result. Using the L2 norm for regularization not only allows the model to choose smaller values for \(w\) , but also avoids the issue of \(X^TX\) being non-invertible.
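A sketch of this closed-form Ridge solution on synthetic data (my construction); the duplicated column makes \(X^TX\) singular, yet \(X^TX+\lambda \mathbb{I}\) remains invertible:

```python
import numpy as np

# Closed-form Ridge: w = (X^T X + lambda*I)^{-1} X^T Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X[:, 4] = X[:, 3]                     # duplicated column -> X^T X is singular
Y = X[:, :3] @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=30)

lam = 0.5
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Y)
print(w_ridge)
```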

Summary #

The linear regression model is the simplest model, but it is a miniature of every machine learning problem. Here, we use the least squares error to obtain a closed-form solution. We also find that when the noise follows a Gaussian distribution, the MLE solution is equivalent to the least squares error. Moreover, after adding a regularization term, the least squares error combined with L2 regularization is equivalent to the MAP estimation under Gaussian noise prior, and with L1 regularization, it is equivalent to the Laplace noise prior.

Traditional machine learning methods more or less contain elements of linear regression models:

  1. Linear models often cannot fit the data well, so there are three strategies to overcome this drawback:
    1. Transform the feature dimensions, such as polynomial regression models, which add higher-order terms based on linear features.
    2. Add a non-linear transformation after the linear equation by introducing a non-linear activation function, such as in linear classification models like perceptrons.
    3. Stack multiple linear transformations so that a single feature is not governed by only a single coefficient, as in multilayer perceptrons (deep feedforward networks).
  2. Linear regression is linear across the entire sample space. We can modify this limitation by introducing different linear or non-linear functions in different regions, such as in linear spline regression and decision tree models.
  3. Linear regression uses all the samples, but preprocessing the data might yield better learning results (the so-called curse of dimensionality, where high-dimensional data is harder to learn from), such as in PCA algorithms and manifold learning.

Linear Classifiers #

For classification tasks, linear regression models are not directly applicable. However, we can add an activation function to the linear model, which is non-linear. The inverse of the activation function is called the link function. We have two types of linear classification methods:

  1. Hard classification, where we directly need to output the corresponding class of the observation. Representative models of this type include:
    1. Linear Discriminant Analysis (Fisher Discriminant)
    2. Perceptron
  2. Soft classification, which generates probabilities for different classes. These algorithms can be divided into two types based on the different probability methods used:
    1. Discriminative (directly modeling the conditional probability): Logistic Regression
    2. Generative (modeling the joint probability via a class prior): Gaussian Discriminant Analysis, Naive Bayes

Hard #

Perceptron #

Most intuitively, we can choose an activation function like: \(sign(a)=\left\{\begin{matrix}+1,a\ge0\\-1,a<0\end{matrix}\right.\) This way, we can map the result of linear regression to the binary classification result.

Define the loss function as the number of misclassified samples. Start with using the indicator function, but the indicator function is non-differentiable. Therefore, we can define: \(L(w)=\sum\limits_{x_i\in\mathcal{D}_{wrong}}-y_iw^Tx_i\) where \(\mathcal{D}_{wrong}\) is the set of misclassified samples. In each training step, we use the gradient descent algorithm. The partial derivative of the loss function with respect to \(w\) is: \(\frac{\partial}{\partial w}L(w)=\sum\limits_{x_i\in\mathcal{D}_{wrong}}-y_ix_i\) However, if there are a large number of samples, the computational complexity is high. In fact, we do not need the absolute direction of the loss function decrease. We only need the expected value of the loss function to decrease, but calculating the expectation requires knowing the true probability distribution. We can only estimate this probability distribution based on the training data sampling (empirical risk):

\[\mathbb{E}_{\mathcal{D}}[\mathbb{E}_{\hat{p}}[\nabla_wL(w)]] = \mathbb{E}_{\mathcal D}\left[\frac{1}{N}\sum\limits_{i=1}^N\nabla_wL(w)\right]\] We know that the larger the \(N\) , the more accurate the sample approximation of the true distribution. However, for data with standard deviation \(\sigma\) , the standard error of the estimate only shrinks as \(\sigma/\sqrt{N}\) , while the computation cost grows linearly in \(N\) . Therefore, we can use fewer samples each time, so that both the expected loss reduction and the calculation speed are improved. If we use only one misclassified sample each time, we have the following update strategy (a step along the negative gradient direction, by the Taylor expansion): \(w^{t+1}\leftarrow w^{t}+\lambda y_ix_i\) It is convergent, and using a single observation per update also adds some stochasticity, thereby reducing the possibility of falling into a local minimum. On larger-scale data, the commonly used method is mini-batch stochastic gradient descent.
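A minimal sketch of this single-sample update rule on a toy separable dataset (my own construction, with the bias term omitted for brevity):

```python
import numpy as np

# Perceptron: update w <- w + lr * y_i * x_i on a randomly chosen misclassified sample.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

w, lr = np.zeros(2), 0.1
for epoch in range(100):
    wrong = np.where(y * (X @ w) <= 0)[0]      # misclassified samples
    if len(wrong) == 0:
        break                                   # converged: no mistakes left
    i = rng.choice(wrong)                       # single-sample (stochastic) update
    w += lr * y[i] * X[i]

print(w, (np.sign(X @ w) == y).mean())
```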

LDA #

In LDA, our basic idea is to select a direction, project the experimental samples along this direction, and the projected data should satisfy two conditions to achieve better classification:

  1. The distances between experimental samples within the same class are close.
  2. The distances between different classes are larger.

First, let's consider the projection. We assume that the original data is a vector \(x\) ; then the projection along the direction \(w\) (taken as a unit vector) is the scalar: \(z=w^T\cdot x(=|w|\cdot|x|\cos\theta)\) For the first point, the samples within the same class should be close together. We assume that the numbers of experimental samples belonging to the two classes are \(N_1\) and \(N_2\) , respectively. We use the variance matrix to characterize the overall distribution within each class. Here we use the definition of covariance, denoted by \(S\) for the original data covariance:

\[\begin{aligned} C_1:Var_z[C_1]&=\frac{1}{N_1}\sum\limits_{i=1}^{N_1}(z_i-\overline{z_{c1}})(z_i-\overline{z_{c1}})^T\\ &=\frac{1}{N_1}\sum\limits_{i=1}^{N_1}(w^Tx_i-\frac{1}{N_1}\sum\limits_{j=1}^{N_1}w^Tx_j)(w^Tx_i-\frac{1}{N_1}\sum\limits_{j=1}^{N_1}w^Tx_j)^T\\ &=w^T\frac{1}{N_1}\sum\limits_{i=1}^{N_1}(x_i-\overline{x_{c1}})(x_i-\overline{x_{c1}})^Tw\\ &=w^TS_1w\\ C_2:Var_z[C_2]&=\frac{1}{N_2}\sum\limits_{i=1}^{N_2}(z_i-\overline{z_{c2}})(z_i-\overline{z_{c2}})^T\\ &=w^TS_2w \end{aligned}\] Therefore, the within-class distance can be denoted as:

\[\begin{aligned} Var_z[C_1]+Var_z[C_2]=w^T(S_1+S_2)w \end{aligned}\] For the second point, we can use the means of the two classes to represent this distance:

\[\begin{aligned} (\overline{z_{c1}}-\overline{z_{c2}})^2&=(\frac{1}{N_1}\sum\limits_{i=1}^{N_1}w^Tx_i-\frac{1}{N_2}\sum\limits_{i=1}^{N_2}w^Tx_i)^2\\ &=(w^T(\overline{x_{c1}}-\overline{x_{c2}}))^2\\ &=w^T(\overline{x_{c1}}-\overline{x_{c2}})(\overline{x_{c1}}-\overline{x_{c2}})^Tw \end{aligned}\] Considering both points, since the covariance is a matrix, we divide these two values to obtain our loss function and maximize this value:

\[\begin{aligned}\hat{w}=\mathop{argmax}\limits_wJ(w)&=\mathop{argmax}\limits_w\frac{(\overline{z_{c1}}-\overline{z_{c2}})^2}{Var_z[C_1]+Var_z[C_2]}\\ &=\mathop{argmax}\limits_w\frac{w^T(\overline{x_{c1}}-\overline{x_{c2}})(\overline{x_{c1}}-\overline{x_{c2}})^Tw}{w^T(S_1+S_2)w}\\ &=\mathop{argmax}\limits_w\frac{w^TS_bw}{w^TS_ww} \end{aligned}\] In this way, we have combined the loss function with the original dataset and parameters. Next, we will find the partial derivative of this loss function. Note that we actually have no requirement for the absolute value of \(w\) , only for its direction, so we can solve it with just one equation:

\[\begin{aligned} &\frac{\partial}{\partial w}J(w)=2S_bw(w^TS_ww)^{-1}-2w^TS_bw(w^TS_ww)^{-2}S_ww=0\\ &\Longrightarrow S_bw(w^TS_ww)=(w^TS_bw)S_ww\\ &\Longrightarrow w\propto S_w^{-1}S_bw=S_w^{-1}(\overline{x_{c1}}-\overline{x_{c2}})(\overline{x_{c1}}-\overline{x_{c2}})^Tw\propto S_w^{-1}(\overline{x_{c1}}-\overline{x_{c2}}) \end{aligned}\] Thus, \(S_w^{-1}(\overline{x_{c1}}-\overline{x_{c2}})\) is the direction we need to find. Finally, we can normalize it to obtain a unit \(w\) value.
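A sketch computing this direction on toy two-class data (illustrative construction):

```python
import numpy as np

# LDA direction: w ∝ S_w^{-1} (mean_1 - mean_2), then normalized to unit length.
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], [1.0, 0.3], size=(50, 2))   # class 1
X2 = rng.normal([3, 1], [1.0, 0.3], size=(50, 2))   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1) / len(X1)              # within-class covariance, class 1
S2 = (X2 - m2).T @ (X2 - m2) / len(X2)              # within-class covariance, class 2
Sw = S1 + S2

w = np.linalg.inv(Sw) @ (m1 - m2)
w /= np.linalg.norm(w)                              # only the direction matters
print(w)
```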

Soft #

Logistic regression #

Sometimes we just want to obtain the probability of a single class, so we need a function that can output values within the range of \([0,1]\) . Considering a binary classification model, we use a discriminative model and hope to model \(p(C|x)\) using Bayes' theorem: \(p(C_1|x)=\frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1)+p(x|C_2)p(C_2)}\) Let \(a=\ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}\) , then: \(p(C_1|x)=\frac{1}{1+\exp(-a)}\) The above formula is called the Logistic Sigmoid function, and its argument \(a\) represents the logarithm of the ratio of the joint probabilities of the two classes. In the discriminative model, we do not care about the specific value of this ratio; the model assumption is made directly on \(a\) .

The model assumption of Logistic Regression is: \(a=w^Tx\) Thus, by finding the best value of \(w\) , we can obtain the best model under this model assumption. The probability discriminative model often uses the maximum likelihood estimation method to determine the parameters.

For a single observation, the probability of obtaining class \(y\) is (assuming \(C_1=1, C_2=0\) ): \(p(y|x)=p_1^yp_0^{1-y}\)

Then, for \(N\) independent and identically distributed observations, the Maximum Likelihood Estimation (MLE) is: \(\hat{w}=\mathop{argmax}_wJ(w)=\mathop{argmax}_w\sum\limits_{i=1}^N(y_i\log p_1+(1-y_i)\log p_0)\) Notice that this expression is the negative of the cross-entropy expression multiplied by \(N\) . The logarithm in MLE also ensures compatibility with the exponential function, thereby obtaining stable gradients in large intervals.

Taking the derivative of this function, we notice that: \(p_1'=(\frac{1}{1+\exp(-a)})'=p_1(1-p_1)\) Then: \(J'(w)=\sum\limits_{i=1}^Ny_i(1-p_1)x_i-p_1x_i+y_ip_1x_i=\sum\limits_{i=1}^N(y_i-p_1)x_i\) Due to the nonlinearity of probability values, this expression cannot be solved directly when placed in the summation symbol. Therefore, in actual training, similar to the perceptron, we can also use batch stochastic gradient ascent (or gradient descent for minimization) with different sizes to obtain the maximum value of this function.
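A minimal sketch of batch gradient ascent using the gradient \(\sum_i(y_i-p_1)x_i\) derived above, on synthetic data of my own choosing:

```python
import numpy as np

# Logistic regression by batch gradient ascent on the log-likelihood.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([np.zeros(50), np.ones(50)])

w, lr = np.zeros(2), 0.1
for step in range(500):
    p1 = 1 / (1 + np.exp(-(X @ w)))       # sigmoid of the linear model
    w += lr * (y - p1) @ X                # gradient of sum_i (y_i - p1) x_i
print(w)
```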

Gaussian Discriminant Analysis #

In generative models, we model the joint probability distribution and then use Maximum A Posteriori (MAP) to obtain the best parameter values. For binary classification, we adopt the following assumption:

  1. \(y\sim Bernoulli(\phi)\)
  2. \(x|y=1\sim\mathcal{N}(\mu_1,\Sigma)\)
  3. \(x|y=0\sim\mathcal{N}(\mu_0,\Sigma)\)

Then, for an independent and identically distributed dataset, the maximum a posteriori probability can be expressed as:

\[\begin{aligned} \mathop{argmax}_{\phi,\mu_0,\mu_1,\Sigma}\log p(X|Y)p(Y)&=\mathop{argmax}_{\phi,\mu_0,\mu_1,\Sigma}\sum\limits_{i=1}^N (\log p(x_i|y_i)+\log p(y_i))\\ &=\mathop{argmax}_{\phi,\mu_0,\mu_1,\Sigma}\sum\limits_{i=1}^N((1-y_i)\log\mathcal{N}(\mu_0,\Sigma)\\&\quad+y_i\log \mathcal{N}(\mu_1,\Sigma)+y_i\log\phi+(1-y_i)\log(1-\phi)) \end{aligned}\]
  1. First, solve for \(\phi\) , and take the partial derivative of the expression with respect to \(\phi\) :

    \[ \begin{aligned}\sum\limits_{i=1}^N\frac{y_i}{\phi}+\frac{y_i-1}{1-\phi}=0\\ \Longrightarrow\phi=\frac{\sum\limits_{i=1}^Ny_i}{N}=\frac{N_1}{N} \end{aligned} \]
  2. Solve \(\mu_1\)

    \[ \begin{aligned}\hat{\mu_1}&=\mathop{argmax}_{\mu_1}\sum\limits_{i=1}^Ny_i\log\mathcal{N}(\mu_1,\Sigma)\\ &=\mathop{argmin}_{\mu_1}\sum\limits_{i=1}^Ny_i(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1) \end{aligned} \] \(\because\) \( \sum\limits_{i=1}^Ny_i(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)=\sum\limits_{i=1}^Ny_ix_i^T\Sigma^{-1}x_i-2y_i\mu_1^T\Sigma^{-1}x_i+y_i\mu_1^T\Sigma^{-1}\mu_1 \)

    Taking the derivative of the left side and multiplying by \(\Sigma\) gives:

    \[ \begin{aligned}\sum\limits_{i=1}^N-2y_i\Sigma^{-1}x_i+2y_i\Sigma^{-1}\mu_1=0\\ \Longrightarrow\mu_1=\frac{\sum\limits_{i=1}^Ny_ix_i}{\sum\limits_{i=1}^Ny_i}=\frac{\sum\limits_{i=1}^Ny_ix_i}{N_1} \end{aligned} \]
  3. Solve \(\mu_0\) : \( \mu_0=\frac{\sum\limits_{i=1}^N(1-y_i)x_i}{N_0} \)

  4. The most difficult part is solving for \(\Sigma\) . Our model assumes the same covariance matrix for positive and negative examples, although from the above solution, we can see that even if different matrices are used, it will not affect the previous three parameters. First, we have:

    \[ \begin{aligned} \sum\limits_{i=1}^N\log\mathcal{N}(\mu,\Sigma)&=\sum\limits_{i=1}^N\log(\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}})+(-\frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu))\\ &=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}\sum\limits_{i=1}^NTrace((x_i-\mu)^T\Sigma^{-1}(x_i-\mu))\\ &=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}\sum\limits_{i=1}^NTrace((x_i-\mu)(x_i-\mu)^T\Sigma^{-1})\\ &=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}NTrace(S\Sigma^{-1}) \end{aligned} \] In this expression, we wrap the scalar in a trace so that the order of the matrices can be exchanged (using \(Trace(AB)=Trace(BA)\) ), where \(S\) is the sample covariance of the class. For the derivative of expressions containing determinants and traces, we have:

    \[ \begin{aligned} \frac{\partial}{\partial A}(|A|)&=|A|A^{-1}\\ \frac{\partial}{\partial A}Trace(AB)&=B^T \end{aligned} \] \(\therefore\)

    \[ \begin{aligned}\sum\limits_{i=1}^N\left((1-y_i)\log\mathcal{N}(\mu_0,\Sigma)+y_i\log \mathcal{N}(\mu_1,\Sigma)\right) \\=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}N_1Trace(S_1\Sigma^{-1})-\frac{1}{2}N_2Trace(S_2\Sigma^{-1}) \end{aligned} \] In which, \(S_1, S_2\) are the covariance matrices within the two classes of data. Setting the derivative with respect to \(\Sigma\) to zero:

    \[ \begin{aligned}N\Sigma^{-1}-N_1S_1^T\Sigma^{-2}-N_2S_2^T\Sigma^{-2}=0 \\\Longrightarrow\Sigma=\frac{N_1S_1+N_2S_2}{N} \end{aligned} \] The symmetry of the class covariance matrices is applied here.

Thus, we have obtained all the parameters in our model assumptions using the maximum a posteriori method. According to the model, the joint distribution can be obtained, and hence the conditional distribution for inference can be derived.
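A sketch of these four closed-form estimates on synthetic two-class data (my construction):

```python
import numpy as np

# GDA parameter estimates: phi, mu_0, mu_1, and the shared covariance Sigma.
rng = np.random.default_rng(0)
X0 = rng.normal([-1, -1], 1.0, size=(40, 2))
X1 = rng.normal([1, 1], 1.0, size=(60, 2))
X = np.vstack([X0, X1])
y = np.hstack([np.zeros(40), np.ones(60)])

N, N1 = len(y), y.sum()
N0 = N - N1
phi = N1 / N                                        # p(y=1)
mu1 = (y[:, None] * X).sum(axis=0) / N1
mu0 = ((1 - y)[:, None] * X).sum(axis=0) / N0

S1 = (X[y == 1] - mu1).T @ (X[y == 1] - mu1) / N1   # class-conditional covariances
S0 = (X[y == 0] - mu0).T @ (X[y == 0] - mu0) / N0
Sigma = (N1 * S1 + N0 * S0) / N                     # shared covariance estimate

print(phi, mu0, mu1, Sigma)
```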

Naive Bayes #

The Gaussian discriminant analysis above assumes a Gaussian distribution for the dataset and introduces the Bernoulli distribution as the class prior, thereby obtaining the parameters within these assumptions using maximum a posteriori estimation.

The Naive Bayes model makes assumptions about the relationships between the properties of the data. Generally, we need to obtain the probability value \(p(x|y)\) . Since \(x\) has \(p\) dimensions, we need to sample the joint probability of these many dimensions. However, we know that a large number of samples are required to obtain a relatively accurate probability approximation in such a high-dimensional space.

In a general directed probabilistic graphical model, different assumptions are made about the conditional independence relationships between the dimensions of the attributes. The simplest assumption is the conditional independence assumption in the description of the Naive Bayes model: \(p(x|y)=\prod\limits_{i=1}^pp(x_i|y)\) That is: \(x_i\perp x_j|y,\forall\ i\ne j\) Thus, using Bayes' theorem, for a single observation: \(p(y|x)=\frac{p(x|y)p(y)}{p(x)}=\frac{\prod\limits_{i=1}^pp(x_i|y)p(y)}{p(x)}\) Further assumptions are made for the individual dimensions of conditional probabilities and class priors:

  1. \(x_i\) is a continuous variable: \(p(x_i|y)=\mathcal{N}(\mu_i,\sigma_i^2)\)
  2. \(x_i\) is a discrete variable with \(K\) categories (Categorical distribution): \(p(x_i=k|y)=\theta_k,\sum\limits_{k=1}^K\theta_k=1\)
  3. \(p(y)=\phi^y(1-\phi)^{1-y}\)

To estimate these parameters, the MLE method is often used directly on the dataset. Since there is no need to know the relationships between the dimensions, the required amount of data is greatly reduced. After estimating these parameters, they are substituted back into Bayes' theorem to obtain the posterior distribution of the categories.
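A minimal Gaussian Naive Bayes sketch under assumption 1 above (continuous features), with synthetic data and parameter choices of my own:

```python
import numpy as np

# Gaussian Naive Bayes: per-dimension MLE parameters, then Bayes' theorem.
rng = np.random.default_rng(0)
X0 = rng.normal([-1, 2], [1.0, 0.5], size=(40, 2))   # class 0 samples
X1 = rng.normal([1, -1], [0.5, 1.0], size=(60, 2))   # class 1 samples

phi = 60 / 100                                       # p(y=1)
mu = np.stack([X0.mean(axis=0), X1.mean(axis=0)])    # per-class, per-dimension means
var = np.stack([X0.var(axis=0), X1.var(axis=0)])     # per-class, per-dimension variances

def log_gauss(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

x = np.array([0.5, 0.0])
# Conditional independence: class-conditional log-likelihood is a sum over dimensions.
log_joint = np.array([log_gauss(x, mu[c], var[c]).sum() for c in (0, 1)])
log_joint += np.log([1 - phi, phi])
post = np.exp(log_joint - log_joint.max())
post /= post.sum()
print(post)                                          # p(y=0|x), p(y=1|x)
```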

Summary #

Classification tasks are divided into two types. For tasks that require directly outputting the class, in the perceptron algorithm, we add a sign function as an activation function to the linear model, which allows us to obtain the class. However, the sign function is not smooth, so we adopt an error-driven approach, introducing \(\sum\limits_{x_i\in\mathcal{D}_{wrong}}-y_iw^Tx_i\) as the loss function, and then minimize this error using the batch stochastic gradient descent method to obtain the optimal parameter values. In linear discriminant analysis, we consider the linear model as a projection of data points in a certain direction, and use the idea of minimizing within-class variance and maximizing between-class variance to define the loss function. The within-class variance is defined as the sum of the variances of the two classes, and the between-class variance is defined as the distance between the centroids of the two classes. Taking the derivative of the loss function gives the direction of the parameters, which is \(S_w^{-1}(\overline x_{c1}-\overline x_{c2})\) , where \(S_w\) is the sum of the variances of the two classes in the original dataset.

Another type of task is to output the probability of classification. For probability models, we have two schemes. The first is the discriminative model, which directly models the conditional probability of the class. By inserting the linear model into the Logistic function, we obtain the Logistic regression model. The probability interpretation here is that the logarithm of the joint probability ratio of the two classes is linear. We define the loss function as the cross-entropy (equivalent to MLE), and taking the derivative of this function gives \(\frac{1}{N}\sum\limits_{i=1}^N(y_i-p_1)x_i\) . We also use the batch stochastic gradient (ascent) method for optimization. The second is the generative model, which introduces the class prior. In Gaussian discriminant analysis, we assume the distribution of the dataset, where the class prior is a binomial distribution, and the likelihood of each class is a Gaussian distribution. Maximizing the log-likelihood of this joint distribution gives the parameters, \(\frac{\sum\limits_{i=1}^Ny_ix_i}{N_1},\frac{\sum\limits_{i=1}^N(1-y_i)x_i}{N_0},\frac{N_1S_1+N_2S_2}{N},\frac{N_1}{N}\) . In Naive Bayes, we further assume the dependence between the dimensions of the attributes, and the conditional independence assumption greatly reduces the data requirements.

Support Vector Machines (SVM) #

Support Vector Machines (SVM) play an important role in classification problems, with the main idea being to maximize the margin between two classes. Based on the characteristics of the dataset:

  1. Linearly separable problems, like those previously handled by the perceptron algorithm
  2. Mostly linearly separable, with only a few misclassified points, like the problems addressed by the Pocket algorithm, an evolution of the perceptron
  3. Nonlinear problems, completely non-separable, such as those tackled by multilayer perceptrons and deep learning developed from the perceptron problem

For these three situations, SVM has the following three methods:

  1. Hard-margin SVM
  2. Soft-margin SVM
  3. Kernel Method

In solving SVMs, the Lagrange multiplier method is widely used, so let's first review constrained optimization.

Constrained Optimization #

Generally, a constrained optimization problem (original problem) can be written as:

\[\begin{aligned} &\min_{x\in\mathbb{R^p}}f(x)\\ &s.t.\ m_i(x)\le0,i=1,2,\cdots,M\\ &\ \ \ \ \ \ \ \ n_j(x)=0,j=1,2,\cdots,N \end{aligned}\] Define Lagrange function:

\( L(x,\lambda,\eta) = f(x) + \sum_{i=1}^{M} \lambda_i m_i(x) + \sum_{j=1}^{N} \eta_j n_j(x) \)

Then, the original problem can be equivalently transformed into an unconstrained form:

\(\min_x \max_{\lambda, \eta} L(x,\lambda,\eta)\)

where \(\lambda\) is a vector of Lagrange multipliers for the inequality constraints and \(\eta\) is a vector of Lagrange multipliers for the equality constraints. Note that \(\lambda\) must be non-negative.

This unconstrained form allows us to solve the optimization problem without explicitly considering the constraints, as they are incorporated into the Lagrange function.

This is because, when the inequality constraints of the original problem are satisfied, the maximum value can be obtained when \(\lambda_i=0\) , which is directly equivalent to the original problem. If the inequality constraints of the original problem are not satisfied, the maximum value becomes \(+\infty\) . Since the minimum value is required, this situation will not be chosen.

The dual form of this problem is: \(\max_{\lambda,\eta}\min_{x\in\mathbb{R}^p}L(x,\lambda,\eta)\ s.t.\ \lambda_i\ge0\) The dual problem is a maximization problem with respect to \(\lambda, \eta\) .

\(\because\) \(\max_{\lambda_i,\eta_j}\min_{x}L(x,\lambda_i,\eta_j)\le\min_{x}\max_{\lambda_i,\eta_j}L(x,\lambda_i,\eta_j)\)

The solution of the dual problem is less than or equal to the original problem, with two situations:

  1. Strong duality: The equality can be achieved
  2. Weak duality: The equality cannot be achieved

For a convex optimization problem, the following theorem holds: if the convex optimization problem satisfies certain conditions, such as the Slater condition, then it and its dual problem satisfy a strong duality relationship. Let the problem's domain be defined as: \(\mathcal{D}=dom\,f(x)\cap dom\,m_i(x)\cap dom\,n_j(x)\) . The Slater condition is: \( \exists \hat{x} \in relint\,\mathcal{D}\ such\ that\ \forall i=1,2,\cdots,M,\ m_i(\hat{x}) < 0 \) where \(relint\) denotes the relative interior (the interior not containing the boundary).

For most convex optimization problems, the Slater condition holds. A relaxed version also exists: if \(K\) of the \(M\) inequality constraints are affine functions, then it suffices that the remaining constraints satisfy the Slater condition.

The above introduced the duality relationship between the original problem and the dual problem, but in practice, it is necessary to solve for the parameters. The solution method uses the KKT (Karush-Kuhn-Tucker) conditions:

The KKT conditions and the strong duality relationship are equivalent. The KKT conditions for the optimal solution are:

  1. Feasible domain:

    \[ \begin{aligned} m_i(x^*)\le0\\ n_j(x^*)=0\\ \lambda^*\ge0 \end{aligned} \]
  2. Complementary slackness \(\lambda^*m_i(x^*)=0,\forall m_i\) . The optimal value of the dual problem is \(d^*\) , and the original problem is \(p^*\)

    \[ \begin{aligned} d^*&=\max_{\lambda,\eta}g(\lambda,\eta)=g(\lambda^*,\eta^*)\\ &=\min_{x}L(x,\lambda^*,\eta^*)\\ &\le L(x^*,\lambda^*,\eta^*)\\ &=f(x^*)+\sum\limits_{i=1}^M\lambda^*m_i(x^*)\\ &\le f(x^*)=p^* \end{aligned} \]

To satisfy the equality, both inequalities must hold. Therefore, for the first inequality, the gradient must be 0; for the second inequality, the complementary slackness condition must be satisfied.

  1. Gradient is 0: \(\frac{\partial L(x,\lambda^*,\eta^*)}{\partial x}|_{x=x^*}=0\)

Hard-margin SVM #

Support Vector Machine (SVM) is also a kind of hard classification model. In the previous perceptron model, we added a sign function on top of the linear model. From a geometric intuition, we can see that if the two classes are well separated, there will actually be infinitely many lines that can separate them. In SVM, we introduce the concept of maximizing the margin, where the margin refers to the minimum distance between the data and the dividing line. Maximizing this value reflects the tendency of our model.

The separating hyperplane can be written as: \(0=w^Tx+b\) Then, maximize the margin (subject to the constraints of the classification task): \(\mathop{argmax}_{w,b}[\min_i\frac{|w^Tx_i+b|}{||w||}]\ s.t.\ y_i(w^Tx_i+b)>0\\ \Longrightarrow\mathop{argmax}_{w,b}[\min_i\frac{y_i(w^Tx_i+b)}{||w||}]\ s.t.\ y_i(w^Tx_i+b)>0\) For this constraint \(y_i(w^Tx_i+b)>0\) , we can fix \(\min y_i(w^Tx_i+b)=1>0\) , since scaling the coefficients of the hyperplane separating the two classes does not change the plane. This is equivalent to constraining the coefficients of the hyperplane. The simplified expression can be represented as: \(\mathop{argmin}_{w,b}\frac{1}{2}w^Tw\ s.t.\ \min_iy_i(w^Tx_i+b)=1\\ \Rightarrow\mathop{argmin}_{w,b}\frac{1}{2}w^Tw\ s.t.\ y_i(w^Tx_i+b)\ge1,i=1,2,\cdots,N\) This is a convex optimization problem with \(N\) constraints, and there are many software tools for solving such problems.

However, if the sample size or dimension is very high, solving the problem directly can be difficult or even infeasible, so further processing is needed. Introduce the Lagrange function: \(L(w,b,\lambda)=\frac{1}{2}w^Tw+\sum\limits_{i=1}^N\lambda_i(1-y_i(w^Tx_i+b))\) The original problem is equivalent to: \(\mathop{argmin}_{w,b}\max_{\lambda}L(w,b,\lambda_i)\ s.t.\ \lambda_i\ge0\) We swap the minimum and maximum symbols to obtain the dual problem: \(\max_{\lambda_i}\min_{w,b}L(w,b,\lambda_i)\ s.t.\ \lambda_i\ge0\) Since the inequality constraint is an affine function, the dual problem is equivalent to the original problem:

  • \(b\) : \(\frac{\partial}{\partial b}L=0\Rightarrow\sum\limits_{i=1}^N\lambda_iy_i=0\)

  • \(w\) :plug in \(b\) \( L(w,b,\lambda_i)=\frac{1}{2}w^Tw+\sum\limits_{i=1}^N\lambda_i(1-y_iw^Tx_i-y_ib)=\frac{1}{2}w^Tw+\sum\limits_{i=1}^N\lambda_i-\sum\limits_{i=1}^N\lambda_iy_iw^Tx_i \) \(\therefore\) \( \frac{\partial}{\partial w}L=0\Rightarrow w=\sum\limits_{i=1}^N\lambda_iy_ix_i \)

  • Plug in both parameters \( L(w,b,\lambda_i)=-\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j+\sum\limits_{i=1}^N\lambda_i \)

Therefore, the dual problem is: \(\max_{\lambda}-\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j+\sum\limits_{i=1}^N\lambda_i,\ s.t.\ \lambda_i\ge0,\ \sum\limits_{i=1}^N\lambda_iy_i=0\) The parameters of the hyperplane can be obtained from the KKT conditions:

The necessary and sufficient conditions for the strong duality relationship between the original problem and the dual problem are that they satisfy the KKT conditions:

\[\begin{aligned} &\frac{\partial L}{\partial w}=0,\frac{\partial L}{\partial b}=0 \\&\lambda_k(1-y_k(w^Tx_k+b))=0\ (complementary\ slackness)\\ &\lambda_i\ge0\\ &1-y_i(w^Tx_i+b)\le0 \end{aligned}\] Based on these conditions, we can obtain the corresponding optimal parameters:

\[\begin{aligned} \hat{w}&=\sum\limits_{i=1}^N\lambda_iy_ix_i\\ \hat{b}&=y_k-w^Tx_k=y_k-\sum\limits_{i=1}^N\lambda_iy_ix_i^Tx_k, \\ &\exists k,1-y_k(w^Tx_k+b)=0 \end{aligned}\] Thus, the parameter \(w\) of the hyperplane is a linear combination of data points, and the final parameter values are the linear combination of some vectors that satisfy \(y_i(w^Tx_i+b)=1\) (given by the complementary slackness condition). These vectors are also called support vectors.
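As an illustration, the dual can be solved numerically with a generic constrained optimizer; the sketch below uses `scipy.optimize.minimize` with SLSQP (an illustrative choice, not a production SVM solver) on toy separable data, then recovers \(w\) and \(b\) from the KKT conditions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (illustrative, not from the notes).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j

def neg_dual(lam):
    # Negative dual objective: we maximize sum(lam) - 1/2 lam^T G lam.
    return 0.5 * lam @ G @ lam - lam.sum()

cons = {"type": "eq", "fun": lambda lam: lam @ y}   # sum_i lambda_i y_i = 0
res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y), constraints=[cons])
lam = res.x

w = ((lam * y)[:, None] * X).sum(axis=0)    # w = sum_i lambda_i y_i x_i
sv = lam > 1e-6                             # support vectors (lambda_k > 0)
b = np.mean(y[sv] - X[sv] @ w)              # from complementary slackness
print(w, b)
```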

Soft-margin SVM #

Hard-margin SVM is only solvable for separable data. If the data is not separable, our basic idea is to introduce the possibility of misclassification into the loss function. The number of misclassifications can be written as: \(error=\sum\limits_{i=1}^N\mathbb{I}\{y_i(w^Tx_i+b)<1\}\) This function is discontinuous, so we can rewrite it as: \(error=\sum\limits_{i=1}^N\max\{0,1-y_i(w^Tx_i+b)\}\) The term inside the summation is called the Hinge Function.

By adding this error term into the hard-margin SVM, we have:

\[\begin{aligned} \mathop{argmin}_{w,b}\frac{1}{2}w^Tw+C\sum\limits_{i=1}^N\max\{0,1-y_i(w^Tx_i+b)\} \end{aligned}\] In this expression, the constant \(C\) can be considered as the allowed error level. To further eliminate the \(\max\) symbol, for each observation in the dataset, we can assume that most of them satisfy the constraint, but some of them violate it; for those, the constraint becomes \(y_i(w^Tx+b)\ge1-\xi_i\) , where \(\xi_i=1-y_i(w^Tx_i+b)\) . Further simplification gives: \(\mathop{argmin}_{w,b}\frac{1}{2}w^Tw+C\sum\limits_{i=1}^N\xi_i\ s.t.\ y_i(w^Tx_i+b)\ge1-\xi_i,\xi_i\ge0,i=1,2,\cdots,N\)

Kernel Method #

Kernel methods can be applied to many problems. In classification problems, for strictly non-separable problems, we introduce a feature transformation function to transform the original non-separable dataset into a separable dataset, and then apply the existing model. Often, when transforming a low-dimensional dataset into a high-dimensional dataset, the data becomes separable (the data becomes sparser):

High-dimensional spaces are more likely to be linearly separable than low-dimensional spaces.

When applied to SVM, we observe the dual problem of SVM: \(\max_{\lambda}-\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j+\sum\limits_{i=1}^N\lambda_i,\ s.t.\ \lambda_i\ge0,\ \sum\limits_{i=1}^N\lambda_iy_i=0\) When solving, we only need inner products, so after a feature transformation of non-separable data, we need the inner products of the transformed data. Computing the transformation explicitly and then taking inner products is often difficult. Hence, we directly introduce a function that computes the transformed inner product: \(\forall x,x'\in\mathcal{X},\exists\phi\in\mathcal{H}:x\rightarrow z\ s.t.\ k(x,x')=\phi(x)^T\phi(x')\) The function \(k(x,x')\) is called a positive definite kernel function, where \(\mathcal{H}\) is a Hilbert space (a complete linear inner product space). If we remove the inner product condition, we simply call it a kernel function.

\(k(x,x')=\exp(-\frac{(x-x')^2}{2\sigma^2})\) is a kernel function.

\[\begin{aligned} \exp(-\frac{(x-x')^2}{2\sigma^2})&=\exp(-\frac{x^2}{2\sigma^2})\exp(\frac{xx'}{\sigma^2})\exp(-\frac{x'^2}{2\sigma^2})\\ &=\exp(-\frac{x^2}{2\sigma^2})\sum\limits_{n=0}^{+\infty}\frac{x^nx'^n}{\sigma^{2n}n!}\exp(-\frac{x'^2}{2\sigma^2})\\ &=\exp(-\frac{x^2}{2\sigma^2})\varphi(x)^T\varphi(x')\exp(-\frac{x'^2}{2\sigma^2})\\&=\phi(x)^T\phi(x') \end{aligned}\] A positive definite kernel function has the following equivalent definition:

If the kernel function satisfies:

  1. Symmetry
  2. Positive definiteness

Then this kernel function is a positive definite kernel function.

Proof:

  1. Symmetry \(\Leftrightarrow\) \(k(x,z)=k(z,x)\) , which obviously satisfies the definition of an inner product.
  2. Positive definiteness \(\Leftrightarrow\) \(\forall N,x_1,x_2,\cdots,x_N\in\mathcal{X}\) , the corresponding Gram Matrix \(K=[k(x_i,x_j)]\) is positive semi-definite.

To prove: \(k(x,z)=\phi(x)^T\phi(z)\Leftrightarrow K\) is positive semi-definite and symmetric.

  1. \(\Rightarrow\) : Firstly, symmetry is obvious. For positive definiteness: \( K=\begin{pmatrix}k(x_1,x_1)&\cdots&k(x_1,x_N)\\\vdots&\vdots&\vdots\\k(x_N,x_1)&\cdots&k(x_N,x_N)\end{pmatrix} \) Take any \(\alpha\in\mathbb{R}^N\) ; we need to prove \(\alpha^TK\alpha\ge0\) : \( \alpha^TK\alpha=\sum\limits_{i,j}\alpha_i\alpha_jK_{ij}=\sum\limits_{i,j}\alpha_i\phi^T(x_i)\phi(x_j)\alpha_j=(\sum\limits_{i}\alpha_i\phi(x_i))^T(\sum\limits_{j}\alpha_j\phi(x_j)) \) This expression is an inner product of a vector with itself, hence non-negative. Hilbert space satisfies linearity, so the proof of positive semi-definiteness is complete.

  2. \(\Leftarrow\) : Decompose \(K\) , for the symmetric matrix \(K=V\Lambda V^T\) , then let \(\phi(x_i)=\sqrt{\lambda_i}V_i\) , where \(V_i\) is the eigenvector, and we construct \(k(x,z)=\sqrt{\lambda_i\lambda_j}V_i^TV_j\) .
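A numerical sanity check of these two properties for the Gaussian kernel from the earlier example (random illustrative inputs):

```python
import numpy as np

# Build the RBF (Gaussian) kernel Gram matrix and check symmetry and PSD-ness.
rng = np.random.default_rng(0)
x = rng.normal(size=15)                       # 1-d inputs
sigma = 1.0

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

print(np.allclose(K, K.T))                    # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # eigenvalues (numerically) >= 0
```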

Summary #

For a long time, classification problems relied on SVM. For strictly separable datasets, Hard-margin SVM selects a hyperplane that maximizes the distance to all data points. A constraint is applied to this plane, fixing \(y_i(w^Tx_i+b)=1\) , resulting in a convex optimization problem with all constraint conditions as affine functions. This satisfies the Slater condition, and the problem is transformed into a dual problem, obtaining an equivalent solution and constraint parameters:

\(\max_{\lambda}-\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j+\sum\limits_{i=1}^N\lambda_i,\ s.t.\ \lambda_i\ge0\)

The required hyperplane parameters are solved using the KKT conditions of the strong dual problem:

\[\begin{aligned} &\frac{\partial L}{\partial w}=0,\frac{\partial L}{\partial b}=0 \\&\lambda_k(1-y_k(w^Tx_k+b))=0(slackness\ complementary)\\ &\lambda_i\ge0\\ &1-y_i(w^Tx_i+b)\le0 \end{aligned}\]

The solution is:

\[\begin{aligned} \hat{w}=\sum\limits_{i=1}^N\lambda_iy_ix_i\\ \hat{b}=y_k-w^Tx_k=y_k-\sum\limits_{i=1}^N\lambda_iy_ix_i^Tx_k \\ \exists k,1-y_k(w^Tx_k+b)=0 \end{aligned}\]

When allowing for some errors, an error term can be added to the Hard-margin SVM. The Hinge Function represents the size of the error term, resulting in: \(\mathop{argmin}_{w,b}\frac{1}{2}w^Tw+C\sum\limits_{i=1}^N\xi_i\ s.t.\ y_i(w^Tx_i+b)\ge1-\xi_i,\xi_i\ge0,i=1,2,\cdots,N\)

For completely non-separable problems, we use feature transformation. In SVM, we introduce a positive definite kernel function to directly transform the inner product. As long as this transformation satisfies symmetry and positive definiteness, it can be used as a kernel function.

Exponential Family #

The exponential family is a class of distributions that includes Gaussian distribution, Bernoulli distribution, binomial distribution, Poisson distribution, Beta distribution, Dirichlet distribution, Gamma distribution, and a series of other distributions. Exponential family distributions can be written in a unified form:

\(p(x|\eta)=h(x)\exp(\eta^T\phi(x)-A(\eta))=\frac{1}{\exp(A(\eta))}h(x)\exp(\eta^T\phi(x))\)

\(\eta\) is the parameter vector, and \(A(\eta)\) is the log partition function (normalization factor).

In this expression, \(\phi(x)\) is called the sufficient statistic, containing all the information of the sample set, such as the mean and variance in the Gaussian distribution. Sufficient statistics have applications in online learning; for a dataset, it is only necessary to record the sufficient statistics of the samples.

For a model distribution assumption (likelihood), we often need to find a conjugate prior in the solution, so that the prior and posterior forms are the same. For example, if the likelihood is a binomial distribution, the prior can be chosen as a Beta distribution, and the posterior is also a Beta distribution. Exponential family distributions often have conjugate properties, making model selection and inference much more convenient.

The properties of conjugate priors facilitate computation, and at the same time, exponential family distributions satisfy the idea of maximum entropy (uninformative prior). That is, the distribution derived from the empirical distribution using the principle of maximum entropy is the exponential family distribution.

Noticing that the expression of the exponential family distribution resembles a linear model, the exponential family in fact naturally gives rise to the generalized linear model:

\[y=f(w^Tx),\quad y|x\sim \text{ExpFamily}\]

In more complex probabilistic graphical models, such as undirected models like the Restricted Boltzmann Machine, exponential family distributions also play an important role.

In inference algorithms, such as variational inference, exponential family distributions greatly simplify the computation.

1-d Gaussian distribution #

The 1-d Gaussian distribution can be written as: \(p(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x-\mu)^2}{2\sigma^2})\)

Now transform the above:

\[\begin{aligned} p(x|\theta)&=\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{1}{2\sigma^2}(x^2-2\mu x+\mu^2))\\ &=\exp(\log(2\pi\sigma^2)^{-1/2})\exp(-\frac{1}{2\sigma^2}\begin{pmatrix}-2\mu&1\end{pmatrix}\begin{pmatrix}x\\x^2\end{pmatrix}-\frac{\mu^2}{2\sigma^2}) \end{aligned}\] \(\therefore\) \(\phi(x)=\begin{pmatrix}x\\x^2\end{pmatrix}\) , \(\eta=\begin{pmatrix}\frac{\mu}{\sigma^2}\\-\frac{1}{2\sigma^2}\end{pmatrix}=\begin{pmatrix}\eta_1\\\eta_2\end{pmatrix}\) , \(A(\eta)=-\frac{\eta_1^2}{4\eta_2}+\frac{1}{2}\log(-\frac{\pi}{\eta_2})\)
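A short numeric check of this mapping, assuming arbitrary \(\mu,\sigma\) : \(A(\eta)\) computed from \((\eta_1,\eta_2)\) should equal the log normalizer \(\frac{\mu^2}{2\sigma^2}+\frac{1}{2}\log(2\pi\sigma^2)\) , and the exponential family form should reproduce the Gaussian density:

```python
import numpy as np

mu, sigma = 1.3, 0.7  # arbitrary Gaussian parameters

# Natural parameters and log partition function from the derivation above.
eta1, eta2 = mu / sigma**2, -1 / (2 * sigma**2)
A = -eta1**2 / (4 * eta2) + 0.5 * np.log(-np.pi / eta2)

# A(eta) should equal the log normalizer mu^2/(2 sigma^2) + (1/2) log(2 pi sigma^2).
print(np.isclose(A, mu**2 / (2 * sigma**2) + 0.5 * np.log(2 * np.pi * sigma**2)))

# The exponential family form should reproduce the Gaussian pdf.
x = np.linspace(-3, 5, 9)
pdf_exp_family = np.exp(eta1 * x + eta2 * x**2 - A)
pdf_gaussian = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(np.allclose(pdf_exp_family, pdf_gaussian))
```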

Sufficient statistics and log partition functions #

Integrate the probability density function:

\[\begin{aligned} \exp(A(\eta))&=\int h(x)\exp(\eta^T\phi(x))dx \end{aligned}\] Take the derivative with respect to the parameter on both sides:

\[\begin{aligned} \exp(A(\eta))A'(\eta)&=\int h(x)\exp(\eta^T\phi(x))\phi(x)dx\\ \Longrightarrow A'(\eta)&=\mathbb{E}_{p(x|\eta)}[\phi(x)] \end{aligned}\] Similarly: \(A''(\eta)=Var_{p(x|\eta)}[\phi(x)]\)

Since the variance is positive, \(A(\eta)\) must be a convex function.
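
These two identities are easy to verify numerically. A sketch using the Bernoulli distribution, for which \(\phi(x)=x\) , \(\eta=\log\frac{\mu}{1-\mu}\) , and \(A(\eta)=\log(1+e^\eta)\) : finite-difference derivatives of \(A\) should match the mean and variance:

```python
import numpy as np

def A(eta):
    # Log partition function of the Bernoulli distribution.
    return np.log1p(np.exp(eta))

eta, h = 0.8, 1e-4
mu = 1 / (1 + np.exp(-eta))  # mean of the Bernoulli, E[phi(x)] = E[x]

A1 = (A(eta + h) - A(eta - h)) / (2 * h)            # finite-difference A'(eta)
A2 = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # finite-difference A''(eta)

print(np.isclose(A1, mu))             # A'(eta)  == E[x]
print(np.isclose(A2, mu * (1 - mu)))  # A''(eta) == Var[x]
```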

Sufficient statistics and MLE #

For an independently and identically distributed ( \(iid\) ) dataset \(\mathcal{D}=\{x_1,x_2,\cdots,x_N\}\) , the MLE of \(\eta\) is:

\[\begin{aligned} \eta_{MLE}&=\mathop{argmax}_\eta\sum\limits_{i=1}^N\log p(x_i|\eta)\\ &=\mathop{argmax}_\eta\sum\limits_{i=1}^N(\eta^T\phi(x_i)-A(\eta))\\ &\Longrightarrow A'(\eta_{MLE})=\frac{1}{N}\sum\limits_{i=1}^N\phi(x_i) \end{aligned}\] From this, we can see that in order to estimate the parameters, it is sufficient to know the sufficient statistics.
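
This is what makes the online-learning use mentioned earlier cheap. A minimal sketch with arbitrary simulated data: only the running sums \(\sum_ix_i\) and \(\sum_ix_i^2\) (the sufficient statistics of the 1-d Gaussian) are stored, and the MLE is recovered from them:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_sigma = 2.0, 1.5

# Accumulate only the sufficient statistics, never the raw samples.
n, sum_x, sum_x2 = 0, 0.0, 0.0
for _ in range(100):
    batch = rng.normal(true_mu, true_sigma, size=100)  # a new mini-batch arrives
    n += batch.size
    sum_x += batch.sum()
    sum_x2 += (batch**2).sum()

# MLE from the averaged sufficient statistics: A'(eta_MLE) = (1/N) sum phi(x_i).
mu_hat = sum_x / n
var_hat = sum_x2 / n - mu_hat**2
print(mu_hat, np.sqrt(var_hat))  # should be close to (2.0, 1.5)
```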

Maximum entropy #

The information entropy is defined as: \(Entropy=\int-p(x)\log(p(x))dx\)

In general, entropy is maximized when the variable is completely random, i.e., all outcomes are equally likely.

Our assumption is based on the principle of maximum entropy. Assuming the data follows a discrete distribution over \(K\) values, with \(p_k\) the probability of value \(k\) , the principle of maximum entropy can be formulated as: \( \max\{H(p)\}=\min\{\sum\limits_{k=1}^Kp_k\log p_k\}\ s.t.\ \sum\limits_{k=1}^Kp_k=1 \)

Using the Lagrange multiplier method: \( L(p,\lambda)=\sum\limits_{k=1}^Kp_k\log p_k+\lambda(1-\sum\limits_{k=1}^Kp_k) \)

Setting \(\frac{\partial L}{\partial p_k}=\log p_k+1-\lambda=0\) gives the same value for every \(k\) , hence: \( p_1=p_2=\cdots=p_K=\frac{1}{K} \)

Therefore, the entropy is maximized when all probabilities are equal.

For a dataset \(\mathcal{D}\) , the empirical distribution on this dataset is \(\hat{p}(x)=\frac{Count(x)}{N}\) . In practice, it is impossible to satisfy all empirical probabilities being equal. Therefore, we need to add the constraint of this empirical distribution in the principle of maximum entropy.

For any function \(f\) , its expectation under the empirical distribution is a known quantity; denote \(\mathbb{E}_{\hat{p}}[f(x)]=\Delta\) .

\(\therefore \max\{H(p)\}=\min\{\sum\limits_{k=1}^Np_k\log p_k\}\ s.t.\ \sum\limits_{k=1}^Np_k=1,\mathbb{E}_p[f(x)]=\Delta\)

The Lagrange function is: \(L(p,\lambda_0,\lambda)=\sum\limits_{k=1}^Np_k\log p_k+\lambda_0(1-\sum\limits_{k=1}^Np_k)+\lambda^T(\Delta-\mathbb{E}_p[f(x)])\)

Taking the derivative with respect to \(p(x)\) , we get:

\(\frac{\partial L}{\partial p(x)}=\log p(x)+1-\lambda_0-\lambda^Tf(x)=0\)

Since this stationarity condition holds for every \(x\) , solving it gives: \(p(x)=\exp(\lambda^Tf(x)+\lambda_0-1)\) This is exactly an exponential family distribution.
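
This result can also be checked numerically. A sketch with an arbitrary discrete support and moment target: maximize entropy under \(\sum_kp_k=1\) and \(\mathbb{E}_p[f(x)]=\Delta\) with \(f(x)=x\) , then verify that \(\log p(x)\) is affine in \(f(x)\) , i.e., the solution has the exponential family form:

```python
import numpy as np
from scipy.optimize import minimize

x = np.arange(5.0)  # arbitrary discrete support {0, 1, 2, 3, 4}
delta = 1.5         # arbitrary target for E_p[f(x)] with f(x) = x

def neg_entropy(p):
    return np.sum(p * np.log(p))

cons = [
    {"type": "eq", "fun": lambda p: p.sum() - 1},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p @ x - delta},  # moment constraint E[x] = delta
]
res = minimize(neg_entropy, x0=np.full(5, 0.2),
               bounds=[(1e-9, 1)] * 5, constraints=cons)
p = res.x

# Exponential family form: log p(x) = lambda * f(x) + const,
# so log p should be affine in x (constant successive differences).
slope = np.diff(np.log(p))
print("E[x]:", p @ x)
print("log p affine in x:", np.allclose(slope, slope[0], atol=1e-4))
```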

Expectation–Maximization Algorithm #

The purpose of the Expectation-Maximization (EM) algorithm is to solve the parameter estimation (maximum likelihood estimation) for mixture models with latent variables. The MLE for the parameter estimation of \(p(x|\theta)\) is denoted as: \(\theta_{MLE}=\mathop{argmax}\limits_\theta\log p(x|\theta)\) . The EM algorithm solves this problem using an iterative method:

\(\theta^{t+1} = \mathop{argmax}\limits_{\theta}\ \mathbb{E}_{z|x,\theta^{t}}[\log p(x,z|\theta)] = \mathop{argmax}\limits_{\theta}\int_{z} \log [p(x,z|\theta)]\,p(z|x,\theta^{t})\,dz\)

This formula includes two iterative steps:

  1. E step: Calculate the expectation of \(\log p(x,z|\theta)\) under the probability distribution \(p(z|x,\theta^t)\)
  2. M step: Calculate the parameters that maximize this expectation to obtain the input for the next EM step
We now show that each iteration does not decrease the likelihood, that is, \(\log p(x|\theta^t)\le\log p(x|\theta^{t+1})\) . Start from the identity: \(\log p(x|\theta)=\log p(z,x|\theta)-\log p(z|x,\theta)\)

Take the expectation with respect to \(p(z|x,\theta^t)\) on both sides:

\[\begin{aligned} Left:&\int_zp(z|x,\theta^t)\log p(x|\theta)dz=\log p(x|\theta)\\ Right:&\int_zp(z|x,\theta^t)\log p(x,z|\theta)dz-\int_zp(z|x,\theta^t)\log p(z|x,\theta)dz=Q(\theta,\theta^t)-H(\theta,\theta^t) \end{aligned}\]

\(\therefore \log p(x|\theta)=Q(\theta,\theta^t)-H(\theta,\theta^t)\)

\[\begin{aligned} \because\ Q(\theta,\theta^t)&=\int_zp(z|x,\theta^t)\log p(x,z|\theta)dz\\ \theta^{t+1}&=\mathop{argmax}\limits_{\theta}\int_z\log [p(x,z|\theta)]p(z|x,\theta^t)dz\\ \therefore\ Q(\theta^{t+1},\theta^t)&\ge Q(\theta^t,\theta^t) \end{aligned}\]

To have \(\log p(x|\theta^t)\le\log p(x|\theta^{t+1})\)

We need \(H(\theta^t,\theta^t)\ge H(\theta^{t+1},\theta^t)\)

\[\begin{aligned} H(\theta^{t+1},\theta^t)-H(\theta^{t},\theta^t)&=\int_zp(z|x,\theta^{t})\log p(z|x,\theta^{t+1})dz-\int_zp(z|x,\theta^t)\log p(z|x,\theta^{t})dz\\ &=\int_zp(z|x,\theta^t)\log\frac{p(z|x,\theta^{t+1})}{p(z|x,\theta^t)}dz\\ &=-KL(p(z|x,\theta^t),p(z|x,\theta^{t+1}))\le0 \end{aligned}\]

Combining everything we have got: \( \log p(x|\theta^t)\le\log p(x|\theta^{t+1}) \)

Based on the proof above, we see that the likelihood function increases at each step. Furthermore, we look at how the formula in the EM iteration process is derived:

\(\log p(x|\theta)=\log p(z,x|\theta)-\log p(z|x,\theta)=\log \frac{p(z,x|\theta)}{q(z)}-\log \frac{p(z|x,\theta)}{q(z)}\)

Take the expectation \(\mathbb{E}_{q(z)}\) on both sides:

\[\begin{aligned} &Left:\int_zq(z)\log p(x|\theta)dz=\log p(x|\theta)\\ &Right:\int_zq(z)\log \frac{p(z,x|\theta)}{q(z)}dz-\int_zq(z)\log \frac{p(z|x,\theta)}{q(z)}dz=ELBO+KL(q(z),p(z|x,\theta)) \end{aligned}\] In the equation above, the first term on the right is the Evidence Lower Bound (ELBO), and since the KL divergence is nonnegative, \(\log p(x|\theta)\ge ELBO\) , with equality exactly when \(q(z)=p(z|x,\theta)\) . The purpose of the EM algorithm is to maximize the ELBO: according to the proof above, the ELBO is maximized at each step of EM, and the maximizing parameters are used as input for the next step:

\(\hat{\theta}=\mathop{argmax}_{\theta}ELBO=\mathop{argmax}_\theta\int_zq(z)\log\frac{p(x,z|\theta)}{q(z)}dz\)

Since the maximum value can be achieved when \( q(z)=p(z|x,\theta^t)\) , we have:

\[\begin{aligned} \hat{\theta}&=\mathop{argmax}_{\theta}ELBO \\ &=\mathop{argmax}_\theta\int_zq(z)\log\frac{p(x,z|\theta)}{q(z)}dz \\ &=\mathop{argmax}_\theta\int_zp(z|x,\theta^t)\log\frac{p(x,z|\theta)}{p(z|x,\theta^t)}dz \\ &=\mathop{argmax}_\theta\int_z p(z|x,\theta^t)\log p(x,z|\theta)dz \end{aligned}\] This is exactly the formula used in the EM iteration above (the \(-\log p(z|x,\theta^t)\) term is dropped in the last step because it does not depend on \(\theta\) ).

Starting from Jensen's inequality, this formula can also be derived:

\[\begin{aligned} \log p(x|\theta)&=\log\int_zp(x,z|\theta)dz \\ &=\log\int_z\frac{p(x,z|\theta)q(z)}{q(z)}dz \\ &=\log \mathbb{E}_{q(z)}[\frac{p(x,z|\theta)}{q(z)}] \\ &\ge \mathbb{E}_{q(z)}[\log\frac{p(x,z|\theta)}{q(z)}] \end{aligned}\] In this case, the right side of the equation is the ELBO, and the equality holds when \( p(x,z|\theta)=Cq(z)\) .

Thus: \(\int_zq(z)dz=\frac{1}{C}\int_zp(x,z|\theta)dz=\frac{1}{C}p(x|\theta)=1\) , which gives \(q(z)=\frac{1}{p(x|\theta)}p(x,z|\theta)=p(z|x,\theta)\)

We find that this process is the condition for the maximum value to hold as discussed above.
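
The decomposition \(\log p(x|\theta)=ELBO+KL(q,p(z|x,\theta))\) can be verified directly on a tiny discrete model. In the sketch below (an arbitrary made-up joint table over three latent states), any \(q(z)\) satisfies the identity, and setting \(q\) to the exact posterior makes the bound tight:

```python
import numpy as np

# Arbitrary unnormalized joint p(x, z) for a fixed observation x and 3 latent states.
p_xz = np.array([0.10, 0.25, 0.05])
p_x = p_xz.sum()     # evidence p(x)
post = p_xz / p_x    # exact posterior p(z | x)

q = np.array([0.5, 0.3, 0.2])  # an arbitrary variational distribution

elbo = np.sum(q * np.log(p_xz / q))
kl = np.sum(q * np.log(q / post))

# log p(x) = ELBO + KL(q, p(z|x)) must hold for any q.
print(np.isclose(np.log(p_x), elbo + kl))  # True
# Setting q equal to the posterior makes KL zero and the ELBO tight.
print(np.isclose(np.sum(post * np.log(p_xz / post)), np.log(p_x)))  # True
```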

Generalized EM #

The EM algorithm solves parameter estimation for probabilistic generative models by introducing a latent variable \(z\) to learn \(\theta\) ; specific models make different assumptions about \(z\) . For the learning task \(p(x|\theta)\) , we use the identity \(p(x|\theta)=\frac{p(x,z|\theta)}{p(z|x,\theta)}\) . In the derivation above, we assumed in the E step that \(q(z)=p(z|x,\theta)\) . However, if \(p(z|x,\theta)\) is intractable, sampling (MCMC) or variational inference must be used to approximate this posterior. Looking at the decomposition of the ELBO and the KL divergence, maximizing the ELBO with \(\theta\) fixed means minimizing the KL divergence:

\(\hat{q}(z)=\mathop{argmin}_qKL(q,p(z|x,\theta))=\mathop{argmax}_qELBO\)

This is the basic idea of the generalized EM:

  1. E step: \( \hat{q}^{t+1}(z)=\mathop{argmax}_q\int_zq(z)\log\frac{p(x,z|\theta^t)}{q(z)}dz, \text{ with }\theta^t\text{ fixed} \)

  2. M step: \( \theta^{t+1}=\mathop{argmax}_\theta \int_zq^{t+1}(z)\log\frac{p(x,z|\theta)}{q^{t+1}(z)}dz, \text{ with }q^{t+1}\text{ fixed} \)

For the integral above: \(ELBO=\int_zq(z)\log\frac{p(x,z|\theta)}{q(z)}dz=\mathbb{E}_{q(z)}[\log p(x,z|\theta)]+Entropy(q(z))\)

Therefore, we see that the generalized EM is equivalent to adding an entropy term to the original formula.

Generalization of EM #

The EM algorithm is similar to the coordinate ascent method, where some coordinates are fixed and others are optimized, and then iterated repeatedly. If the posterior probability of \(z\) cannot be solved within the EM framework, some variations of EM need to be adopted to estimate this posterior.

  1. Variational inference based on mean-field, VBEM/VEM
  2. Monte Carlo-based EM, MCEM

Gaussian Mixture Model #

To solve the unimodality problem of the Gaussian model, we introduce the weighted average of multiple Gaussian models to fit the multimodal data:

\(p(x)=\sum\limits_{k=1}^K\alpha_k\mathcal{N}(\mu_k,\Sigma_k),\quad \alpha_k\ge0,\ \sum\limits_{k=1}^K\alpha_k=1\)

We introduce the latent variable \(z\) , which represents which Gaussian distribution the corresponding sample \(x\) belongs to. This variable is a discrete random variable:

\(p(z=i)=p_i,\ \sum\limits_{i=1}^Kp(z=i)=1\)

As a generative model, the Gaussian mixture model generates samples through the distribution of the latent variable \(z\) . It can be represented with a probability graph:

(Figure: the directed probabilistic graph \(z\rightarrow x\) .)

In this graph, node \(z\) is the latent variable that selects a component, and node \(x\) is the observed sample generated by the selected Gaussian.

Therefore, for \(p(x)\) : \(p(x)=\sum\limits_zp(x,z)=\sum\limits_{k=1}^Kp(x,z=k)=\sum\limits_{k=1}^Kp(z=k)p(x|z=k)\)

Thus: \(p(x)=\sum\limits_{k=1}^Kp_k\mathcal{N}(x|\mu_k,\Sigma_k)\)
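
Generation follows the graph directly: first draw \(z\) from the categorical distribution \((p_1,\cdots,p_K)\) , then draw \(x\) from the selected Gaussian. A minimal sketch with arbitrary 1-d components:

```python
import numpy as np

rng = np.random.default_rng(42)

# Arbitrary 3-component 1-d mixture.
p = np.array([0.3, 0.5, 0.2])      # mixing weights p_k
mu = np.array([-2.0, 0.0, 3.0])    # component means
sigma = np.array([0.5, 1.0, 0.8])  # component standard deviations

# Ancestral sampling: z ~ Categorical(p), then x ~ N(mu_z, sigma_z^2).
z = rng.choice(len(p), size=1000, p=p)
x = rng.normal(mu[z], sigma[z])

print(x[:5], np.bincount(z) / len(z))  # empirical weights should approach p
```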

MLE #

The samples are \(X=(x_1,x_2,\cdots,x_N)\) , \((X,Z)\) is the complete data, and the parameters are \(\theta=\{p_1,p_2,\cdots,p_K,\mu_1,\mu_2,\cdots,\mu_K,\Sigma_1,\Sigma_2,\cdots,\Sigma_K\}\)

We obtain the value of \(\theta\) through maximum likelihood estimation:

\[\begin{aligned}\theta_{MLE}&=\mathop{argmax}\limits_{\theta}\log p(X)=\mathop{argmax}_{\theta}\sum\limits_{i=1}^N\log p(x_i)\\ &=\mathop{argmax}_\theta\sum\limits_{i=1}^N\log \sum\limits_{k=1}^Kp_k\mathcal{N}(x_i|\mu_k,\Sigma_k) \end{aligned}\] Because the logarithm contains a sum over components, setting the derivative to zero has no closed-form solution, so the EM algorithm is needed.

Solve GMM using EM #

The basic expression of the EM algorithm is: \(\theta^{t+1}=\mathop{argmax}\limits_{\theta}\mathbb{E}_{z|x,\theta^t}[\log p(x,z|\theta)]\) . Applying it to the GMM over the dataset, we get:

\[\begin{aligned} Q(\theta,\theta^t)&=\sum\limits_z[\log\prod\limits_{i=1}^Np(x_i,z_i|\theta)]\prod \limits_{i=1}^Np(z_i|x_i,\theta^t)\\ &=\sum\limits_z[\sum\limits_{i=1}^N\log p(x_i,z_i|\theta)]\prod \limits_{i=1}^Np(z_i|x_i,\theta^t) \end{aligned}\]

For the summation in the middle, expanding the first term, we have:

\[\begin{aligned} \sum\limits_z\log p(x_1,z_1|\theta)\prod\limits_{i=1}^Np(z_i|x_i,\theta^t)&=\sum\limits_z\log p(x_1,z_1|\theta)p(z_1|x_1,\theta^t)\prod\limits_{i=2}^Np(z_i|x_i,\theta^t)\\ &=\sum\limits_{z_1}\log p(x_1,z_1|\theta) p(z_1|x_1,\theta^t)\sum\limits_{z_2,\cdots,z_N}\prod\limits_{i=2}^Np(z_i|x_i,\theta^t)\\ &=\sum\limits_{z_1}\log p(x_1,z_1|\theta)p(z_1|x_1,\theta^t) \end{aligned}\] since the sum over \(z_2,\cdots,z_N\) equals 1 (each posterior sums to 1). Similarly, \(Q\) can be written as: \(Q(\theta,\theta^t)=\sum\limits_{i=1}^N\sum\limits_{z_i}\log p(x_i,z_i|\theta)p(z_i|x_i,\theta^t)\)

For \(p(x,z|\theta)\) : \(p(x,z|\theta)=p(z|\theta)p(x|z,\theta)=p_z\mathcal{N}(x|\mu_z,\Sigma_z)\)

For \(p(z|x,\theta^t)\) : \(p(z|x,\theta^t)=\frac{p(x,z|\theta^t)}{p(x|\theta^t)}=\frac{p_z^t\mathcal{N}(x|\mu_z^t,\Sigma_z^t)}{\sum\limits_kp_k^t\mathcal{N}(x|\mu_k^t,\Sigma_k^t)}\)

Plug in \(Q\) :

\(Q=\sum\limits_{i=1}^N\sum\limits_{z_i}\log [p_{z_i}\mathcal{N}(x_i|\mu_{z_i},\Sigma_{z_i})]\frac{p_{z_i}^t\mathcal{N}(x_i|\mu_{z_i}^t,\Sigma_{z_i}^t)}{\sum\limits_kp_k^t\mathcal{N}(x_i|\mu_k^t,\Sigma_k^t)}\)

Exchanging the order of summation, \(Q\) can be written as: \(Q=\sum\limits_{k=1}^K\sum\limits_{i=1}^N[\log p_k+\log \mathcal{N}(x_i|\mu_k,\Sigma_k)]p(z_i=k|x_i,\theta^t)\) . We now maximize \(Q\) over each parameter:

  1. \(p_k^{t+1}\) :

    \( p_k^{t+1}=\mathop{argmax}_{p_k}\sum\limits_{k=1}^K\sum\limits_{i=1}^N[\log p_k+\log \mathcal{N}(x_i|\mu_k,\Sigma_k)]p(z_i=k|x_i,\theta^t)\ s.t.\ \sum\limits_{k=1}^Kp_k=1 \)

    Since the Gaussian term does not depend on \(p_k\) , this reduces to: \( p_k^{t+1}=\mathop{argmax}_{p_k}\sum\limits_{k=1}^K\sum\limits_{i=1}^N\log p_k\,p(z_i=k|x_i,\theta^t)\ s.t.\ \sum\limits_{k=1}^Kp_k=1 \)

    Introduce a Lagrange multiplier:

    \(L(p_k,\lambda)=\sum\limits_{k=1}^K\sum\limits_{i=1}^N\log p_k\,p(z_i=k|x_i,\theta^t)-\lambda(1-\sum\limits_{k=1}^Kp_k)\)

    Setting the derivative to zero: \(\frac{\partial L}{\partial p_k}=\sum\limits_{i=1}^N\frac{1}{p_k}p(z_i=k|x_i,\theta^t)+\lambda=0\) . Multiplying through by \(p_k\) and summing over \(k\) (the posterior sums to 1 over \(k\) , and \(\sum_kp_k=1\) ): \(\sum\limits_{k=1}^K\sum\limits_{i=1}^Np(z_i=k|x_i,\theta^t)+\lambda\sum\limits_{k=1}^Kp_k=0\Rightarrow N+\lambda=0\Rightarrow\lambda=-N\)

    \(\therefore\ p_k^{t+1}=\frac{1}{N}\sum\limits_{i=1}^Np(z_i=k|x_i,\theta^t)\)
  2. \(\mu_k,\Sigma_k\) : these parameters are unconstrained and can be obtained by direct differentiation, giving the responsibility-weighted sample mean and covariance (see the sketch below).
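
Putting the pieces together: the E step computes the responsibilities \(p(z_i=k|x_i,\theta^t)\) , and the M step applies the \(p_k\) update derived above together with the weighted mean and variance for the Gaussian parameters. A compact 1-d sketch with arbitrary simulated data (in 1-d the covariance reduces to a scalar variance):

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K, iters=100):
    # Crude initialization: uniform weights, spread-out means, global variance.
    N = x.size
    p = np.full(K, 1 / K)
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))
    var = np.full(K, x.var())
    for _ in range(iters):
        # E step: responsibilities gamma_ik = p(z_i = k | x_i, theta^t).
        dens = p * normal_pdf(x[:, None], mu, var)  # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: p_k = (1/N) sum_i gamma_ik; mu_k, var_k are weighted moments.
        Nk = gamma.sum(axis=0)
        p = Nk / N
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return p, mu, var

# Arbitrary simulated data from a two-component mixture.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 700)])
print(em_gmm_1d(x, K=2))  # weights near (0.3, 0.7), means near (-2, 2)
```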