Reviewing Time! Mathematics for Deep Learning
As I was reading through some SOTA papers, I noticed I needed to freshen up some concepts to better understand what I was reading. This work was mostly based on the book “Dive into Deep Learning”, although I used other sources as well, which will be cited throughout the text.
Linear Algebra
Linear Algebra is one of the foundations of Deep Learning. Deep Learning is essentially a series of matrix multiplications, and understanding how these operations work, and how they can be visualized as changes in coordinate space, is fundamental to understanding how these models work. One of the best references I could find for this subject was the 3Blue1Brown series on Linear Algebra, which I highly recommend.
Dot Product
The dot product has two common forms:
\[\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^\top \mathbf{v} = \mathbf{v}^\top\mathbf{u}\]The angle between two vectors is defined by the dot product:
\[\theta = \text{arccos}\left(\frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}||||\mathbf{v}||}\right)\]
Hyperplanes
A hyperplane is an object in \(\mathbb{R}^n\) with \(n-1\) dimensions. This object divides the space into two half-spaces. Hyperplanes in \(n\)-dimensional space are analogous to lines and planes in 2- and 3-dimensional spaces, respectively. These objects are useful for linear classification models, where they are used to separate target classes. In this context, hyperplanes can be referred to as decision planes. Most deep learning classification networks end with a linear layer fed into a softmax, so one can interpret these models as finding a non-linear embedding such that the target classes can be cleanly separated by hyperplanes.
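As a minimal sketch of this idea (with an arbitrary, illustrative \(\mathbf{w}\) and \(b\)), classifying a point reduces to checking which side of the hyperplane \(\mathbf{w} \cdot \mathbf{x} + b = 0\) it falls on:

```python
import torch

# Minimal sketch of a hyperplane w·x + b = 0 as a decision boundary
# in R^2; w, b, and the test points are illustrative.
w = torch.tensor([1.0, -2.0])
b = 0.5

def side(x: torch.Tensor) -> int:
    """Return +1 or -1 depending on the half-space x falls into."""
    return 1 if torch.dot(w, x) + b > 0 else -1

print(side(torch.tensor([3.0, 0.0])))  # +1
print(side(torch.tensor([0.0, 3.0])))  # -1
```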
Geometry of Linear Transformations
We can decompose the product of an arbitrary matrix with a vector into its action on the basis vectors:
\[\mathbf{Av} = \begin{bmatrix}a & b \\ c & d \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}\] \[\mathbf{Av} = x \left( \mathbf{A} \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right) + y \left( \mathbf{A} \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right)\]From this, we can infer that a matrix only linearly alters the basis of the space. This means that matrices can only skew, rotate, and scale the original space.
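This can be checked numerically in PyTorch (with arbitrary illustrative values for \(\mathbf{A}\) and \(\mathbf{v}\)): applying the matrix is the same as combining the transformed basis vectors.

```python
import torch

# Sketch showing that A v equals x·(A e1) + y·(A e2): the transform is
# fully determined by where the basis vectors land (values illustrative).
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
v = torch.tensor([5.0, 6.0])  # x = 5, y = 6

direct = A @ v
via_basis = v[0] * A[:, 0] + v[1] * A[:, 1]  # A's columns are A e1, A e2
print(torch.allclose(direct, via_basis))  # True
```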
Some matrices can have redundant columns, i.e., are composed of column vectors that are linearly dependent. Linear dependence between vectors \(\mathbf{v_1}, \mathbf{v_2}, ..., \mathbf{v_n}\) is defined by the existence of coefficients \(a_1, a_2, ..., a_n\), not all equal to zero, such that:
\[\sum^n_{i=1} a_i \mathbf{v_i} = 0\]The rank of a matrix is the maximal number of linearly independent columns.
The determinant is the ratio between the areas/volumes/hypervolumes defined by the basis of the transformed space and those of the original space (with a negative sign when the transformation flips the orientation of the space).
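Both ideas can be checked numerically. In the sketch below (matrix values illustrative), a matrix with a redundant column has reduced rank, and its determinant is zero because it collapses areas to zero:

```python
import torch

# A redundant (linearly dependent) column reduces the rank and makes
# the determinant zero, since the space is collapsed (values illustrative).
A = torch.tensor([[1.0, 2.0], [2.0, 4.0]])  # second column = 2 × first
print(torch.linalg.matrix_rank(A))  # tensor(1)
print(torch.linalg.det(A))          # tensor(0.)
```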
Einsum
Einstein summation is a generalization of matrix multiplication. With einsum, one specifies the index dimensions of the inputs and of the desired output; indices absent from the output are multiplied together and summed out, and the result follows accordingly.
An example of this in PyTorch is:
torch.einsum("ijk, il, j -> kl", B, A, v)
In this example, a final kl-dimensioned tensor is the objective, and it is created based on B (ijk-dimensioned), A (il-dimensioned), and v (j-dimensioned).
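Here is a runnable version of that call, assuming illustrative sizes \(i=2\), \(j=3\), \(k=4\), \(l=5\); the indices \(i\) and \(j\) are summed out because they are absent from the output specification:

```python
import torch

# Runnable sketch of the einsum call above, with illustrative sizes.
B = torch.randn(2, 3, 4)   # ijk
A = torch.randn(2, 5)      # il
v = torch.randn(3)         # j

out = torch.einsum("ijk, il, j -> kl", B, A, v)
print(out.shape)  # torch.Size([4, 5])
```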
Eigendecompositions
If a value \(\lambda\) and a vector \(\mathbf{v}\) can be found such that:
\[\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\]it can be stated that \(\mathbf{v}\) is an eigenvector of \(\mathbf{A}\) with corresponding eigenvalue \(\lambda\).
To find these, one can solve the following equation:
\[(\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = 0\]For this equation to have a non-zero solution \(\mathbf{v}\), the matrix \(\mathbf{A} - \lambda \mathbf{I}\) must compress the space to a lower dimension, i.e., the column vectors that compose the matrix must be linearly dependent. Therefore, it can be inferred that:
\[\text{det}(\mathbf{A} - \lambda \mathbf{I}) = 0\]Let \(\mathbf{W}\) be the matrix whose columns are the eigenvectors of the matrix \(\mathbf{A}\) (assuming \(\mathbf{A}\) has a full set of linearly independent eigenvectors), and \(\mathbf{\Sigma}\) the diagonal matrix with the associated eigenvalues on its diagonal. The original matrix can be decomposed as:
\[\mathbf{A} = \mathbf{W \Sigma W}^{-1}\]From this, various interesting consequences can be derived. One example is the exponentiation of \(\mathbf{A}\):
\[\mathbf{A}^n = \mathbf{W \Sigma}^n \mathbf{W}^{-1}\]Since \(\mathbf{\Sigma}\) is diagonal, raising it to a power only requires raising its diagonal entries, which eases the calculation considerably.
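A small numerical check of both identities, using a symmetric matrix so the decomposition stays real (values illustrative):

```python
import torch

# Check A = W Σ W⁻¹ and A³ = W Σ³ W⁻¹ on a symmetric matrix, so that
# eigh applies and everything stays real (values illustrative).
A = torch.tensor([[2.0, 1.0], [1.0, 3.0]])
eigvals, W = torch.linalg.eigh(A)  # eigenvectors are the columns of W
Sigma = torch.diag(eigvals)
W_inv = torch.linalg.inv(W)

print(torch.allclose(A, W @ Sigma @ W_inv, atol=1e-5))             # True
print(torch.allclose(A @ A @ A, W @ Sigma**3 @ W_inv, atol=1e-4))  # True
```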
Another example is the definition of the determinant of \(\mathbf{A}\). As we decompose this matrix, we observe that the distortion applied by \(\mathbf{W}\) is undone by \(\mathbf{W}^{-1}\). Therefore, the determinant of the matrix is the product of all its eigenvalues.
If one applies an arbitrary matrix operation \(\mathbf{B}\) to an arbitrary vector \(\mathbf{v}\) a large number of times, one notes that the norm of the resulting vector scales roughly by the highest, or principal, eigenvalue. This is because, at every iteration, although the vector is stretched in every direction, it is stretched a bit more in the direction of the principal eigenvector.
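This observation is the basis of power iteration. Below is a minimal sketch (matrix and iteration count illustrative, assuming a unique dominant eigenvalue) that recovers the principal eigenvalue and eigenvector:

```python
import torch

# Power iteration sketch: repeatedly applying B stretches a random
# vector most along the principal eigenvector, so the norm of B v
# (for unit-norm v) converges to the dominant eigenvalue.
torch.manual_seed(0)
B = torch.tensor([[2.0, 1.0], [1.0, 3.0]])
v = torch.randn(2)
v = v / v.norm()

for _ in range(50):
    w = B @ v
    eigval = w.norm()  # ≈ principal eigenvalue, since ||v|| = 1
    v = w / w.norm()   # renormalize to avoid overflow

print(eigval, v)  # compare with torch.linalg.eigh(B)
```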
Gradient Descent
Gradient descent is a method used to minimize a function. It is based on the idea that the function decreases most rapidly in the direction of the negative gradient. The gradient is the vector of partial derivatives of the function, and it points in the direction of steepest ascent; therefore, the negative gradient points in the direction of steepest descent.
A small difference \(\mathbf{\epsilon}\) in independent variables \(\mathbf{w}\) can be represented by the following equation, derived from the Taylor Series:
\[L(\mathbf{w+ \epsilon}) \approx L(\mathbf{w}) + \mathbf{\epsilon} \cdot \nabla_{\mathbf{w}} L(\mathbf{w})\]To run gradient descent, one must:
- Initialize \(\mathbf{w}\) randomly.
- Find a direction \(\mathbf{v}\) that decreases \(L\) most rapidly at \(\mathbf{w}\).
- Take a “small” step in that direction, \(\mathbf{w} + \epsilon\mathbf{v}\), and repeat the previous two steps.
This direction vector can be substituted into the first equation and further developed to obtain:
\[L(\mathbf{w+v}) \approx L(\mathbf{w}) + \mathbf{v} \cdot \nabla_{\mathbf{w}} L(\mathbf{w}) = L(\mathbf{w}) + ||\nabla_{\mathbf{w}} L(\mathbf{w})|| \cos(\theta)\]Note that \(\mathbf{v}\) has length one, by convention. To decrease \(L\) as much as possible, one must minimize \(\cos(\theta)\). This happens when \(\theta = \pi\), meaning the direction vector \(\mathbf{v}\) must point in the opposite direction to \(\nabla_{\mathbf{w}}L(\mathbf{w})\). Therefore, the procedure above can be restructured to:
- Initialize \(\mathbf{w}\) randomly.
- Compute \(\nabla_{\mathbf{w}}L(\mathbf{w})\).
- Take a “small” step in the opposite direction, \(\mathbf{w} - \epsilon \nabla_{\mathbf{w}}L(\mathbf{w})\), and repeat the previous two steps.
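Putting the procedure together, here is a minimal sketch of the loop in PyTorch, minimizing the illustrative function \(L(w) = (w-3)^2\) with an assumed step size \(\epsilon = 0.1\):

```python
import torch

# Minimal gradient descent sketch: minimize L(w) = (w - 3)^2,
# whose minimum is at w = 3. Function and step size are illustrative.
w = torch.randn(1, requires_grad=True)  # initialize w randomly
lr = 0.1                                # the "small" step size epsilon

for _ in range(100):
    loss = (w - 3.0) ** 2      # L(w)
    loss.backward()            # compute grad_w L(w)
    with torch.no_grad():
        w -= lr * w.grad       # step opposite the gradient
    w.grad.zero_()             # reset the accumulated gradient

print(w.item())  # ~3.0
```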
Statistics
Statistics is the science of collecting, analyzing, and interpreting data. It is used to understand the world and make decisions based on data. In Deep Learning, statistics is used to understand the data and the models that are built on top of it.
Probability Density Functions
As probabilities move from discrete to continuous, the probability of any single outcome approaches zero, since there are infinitely many possible outcomes. To allow for a continuous notion of probability, one must model the probability density, which measures probability per unit of length. Although probability densities do not represent probabilities themselves, they can be interpreted as the relative likelihood between events plotted on a graph. To recover a probability from a probability density function (pdf) \(p(x)\), one integrates it over an interval:
\[P \left( X \in \left[ a,b \right] \right) = \int_a^b p\left( x \right) dx\]Therefore, it can be inferred that:
\[\int_{-\infty}^{\infty} p\left( x \right) dx = 1\]
Cumulative Distribution Functions
To represent probabilities in the continuum, one can use a cumulative distribution function. This function represents the total (cumulative) probability that a random event \(X\) has a value less than or equal to \(x\). More explicitly:
\[F(x) = P(X \leq x) = \int_{-\infty}^{x} p\left( t \right) dt\]
Metrics
Random variables can be difficult to interpret by themselves. Therefore, metrics were developed to ease that interpretation. The metrics briefly discussed here are mean, variance, and standard deviation.
Mean
The mean of a discrete distribution is the sum, over all possible outcomes, of each value multiplied by its probability. Analogously, in the continuum, the mean of a random variable \(X\), \(\mu_X\), can be formally expressed as:
\[\mu_X = E\left[ X \right] = \int_{-\infty}^{\infty} x p\left( x \right) dx\]
Variance and Standard Deviation
Variance is defined as the expected square deviation of the random variable from its mean:
\[\text{Var}(X) = E\left[(X-\mu_X)^2\right] = E[X^2] - 2E[X]\mu_X + \mu_X^2 = E[X^2] - \mu_X^2\]Therefore, it can be stated that:
\[\text{Var}(X) = E[X^2] - \mu_X^2 = \int_{-\infty}^{\infty} x^2 p\left( x \right) dx - \left( \int_{-\infty}^{\infty} x p\left( x \right) dx \right)^2\]Standard deviation, \(\sigma_X\), is by definition the square root of the variance:
\[\text{Var}(X) = \sigma_X^2\]
Joint Density Functions
The discussion so far has covered a single random variable. However, these ideas extend to two or more random variables that might depend on each other. To represent them, joint density functions can be used. For random variables \(X\) and \(Y\), the joint density \(p(x,y)\) satisfies, for any region \(\mathcal{D}\):
\[P \left( (X, Y) \in \mathcal{D} \right) = \int_{\mathcal{D}} p\left( x,y \right) dx\, dy\]To reduce the distribution to a single variable, one can marginalize out the unneeded variables to obtain a marginal distribution:
\[p_X(x) = \int_{-\infty}^{\infty} p\left( x,y \right) dy\]
Covariance and Correlation
With multiple variables, one can measure their relationship through the covariance. Covariance measures the degree to which two variables fluctuate together: it is the expected value of the product of the deviations of each variable from its mean:
\[\text{Cov}(X,Y) = \sigma_{X,Y} = E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]E[Y]\]In the continuum:
\[\text{Cov}(X,Y) = \int_{\mathbb{R}^2} \left(x - \mu_X\right) \left(y - \mu_Y\right) p\left( x,y \right) dx\, dy\]Covariance can also be used to expand the variance of a sum of two variables:
\[\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X,Y)\]Although covariance is an important metric, it is still bound to the units the variables are described in. That is, a larger covariance does not necessarily mean a tighter relationship between variables. To compare relationships, a unitless metric is needed, which is where correlation comes in. Correlation is a value between \(-1\) and \(1\), with the sign indicating a negative or a positive relationship. When the correlation is \(0\), there is no linear relationship between the variables (although they may still be dependent in a non-linear way). Formally, the correlation coefficient can be written as:
\[\rho \left( X, Y \right) = \frac{\text{Cov} \left( X,Y \right) }{\sigma_X \sigma_Y}\]Note that this correlation can only measure linear correlation. If two variables have a non-linear dependency, it can be missed by this metric.
An interesting fact about correlation is its close relation to the cosine of the angle between two vectors: the covariance plays the role of the inner product of the two vectors, and the standard deviations play the role of their norms.
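A quick numerical sketch of this connection (data generated purely for illustration): for mean-centered samples, the Pearson correlation equals the cosine of the angle between the two sample vectors.

```python
import torch

# For mean-centered samples, Pearson correlation == cosine of the
# angle between the two sample vectors (data illustrative).
torch.manual_seed(0)
x = torch.randn(1000)
y = 0.5 * x + 0.5 * torch.randn(1000)

xc, yc = x - x.mean(), y - y.mean()  # center both samples
corr = (xc * yc).mean() / (xc.std(unbiased=False) * yc.std(unbiased=False))
cosine = torch.dot(xc, yc) / (xc.norm() * yc.norm())
print(corr, cosine)  # the two values match
```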
Common Distributions
Some random variable distributions are commonly studied and used in Deep Learning. This section will explore them briefly.
Bernoulli
This is one of the simplest distributions. It behaves like a coin flip with an arbitrary heads/tails probability. Let us say that \(X\) follows this distribution, formally:
\[X \sim Bernoulli(p)\]\(X\) is \(0\) with probability \(1-p\) and \(1\) with probability \(p\).
Uniform
The uniform distribution consists of, in a specified interval \([a,b]\), all values being equally probable. The probability density of each value is \(\frac{1}{b-a}\). To say \(X\) follows this uniform distribution, one can formally write:
\[X \sim U \left( a,b \right)\]
Binomial
The binomial distribution counts the number of successes in a sequence of Bernoulli experiments. Explicitly, it is \(n\) independent runs of a Bernoulli random variable with probability \(p\). To say \(X\) follows this distribution is to say:
\[X \sim Binomial(n,p)\]Its probability mass function is:
\[P\left(X = k \right) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}\]where \(k\) is the number of successes in the \(n\) runs of the Bernoulli distribution.
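A direct sketch of this formula (with illustrative values \(n=3\), \(p=0.5\), \(k=2\)):

```python
import math

# Binomial probability mass function, as in the formula above.
def binomial_pmf(k: int, n: int, p: float) -> float:
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

print(binomial_pmf(2, 3, 0.5))  # 3 * 0.5**2 * 0.5 = 0.375
```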
Poisson
The Poisson distribution models the probability of \(k\) events happening in a specified amount of time, where \(\lambda\) events are expected to occur in that amount of time. This can be represented by
\[X \sim Poisson \left( \lambda \right)\]The probability of \(k\) events happening is:
\[p_k = \frac{\lambda^k e^{-\lambda}}{k!}\]
Gaussian
The Gaussian distribution is one of the most important distributions in statistics. Its probability density function can be described as:
\[p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e ^ {-\frac{(x- \mu) ^2}{2\sigma^2}}\]The reason behind the ubiquitous nature of this distribution is the Central Limit Theorem. This theorem states that the (suitably normalized) sum of \(n\) independent samples of a random variable \(X\) approaches a Gaussian distribution as \(n \rightarrow \infty\). Intuitively, the density of a sum of variables is the convolution of their densities, and repeated convolution converges on the normal distribution.
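A quick simulation sketch of the theorem (sample sizes illustrative): averages of \(n\) uniform samples concentrate around the true mean with a roughly Gaussian spread.

```python
import torch

# CLT sketch: each row averages n samples of U(0, 1); the resulting
# distribution of averages is approximately Gaussian.
torch.manual_seed(0)
n, trials = 100, 10_000
means = torch.rand(trials, n).mean(dim=1)

print(means.mean())  # ~0.5, the mean of U(0, 1)
print(means.std())   # ~(1 / sqrt(12)) / sqrt(n), roughly 0.029
```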
\(X\) can be stated as belonging to a Gaussian distribution as:
\[X \sim \mathcal{N} \left( \mu, \sigma^2 \right)\]
Estimators
In statistics, an estimator \(\hat{\theta}\) is a function that estimates a value after observing samples:
\[\hat{\theta}_n = \hat{f} \left( x_1, ..., x_n \right)\]The simplest metric to evaluate an estimator is the Mean Square Error (MSE):
\[\text{MSE}\left(\hat {\theta}_n, \theta \right) = E \left[ (\hat{\theta}_n - \theta)^2 \right]\]Although this is a natural metric, it can be influenced by various factors, one of which is statistical bias. Statistical bias represents the difference between the expected estimator value and the real value:
\[\text{bias}\left( \hat{\theta}_n \right) = E \left( \hat{\theta}_n \right) - \theta\]This makes statistical bias great for measuring systematic errors in the estimator.
To measure random errors, standard deviation/variance is more adequate:
\[\text{Var}\left( \hat{\theta}_n \right) = E\left[ \left( \hat{\theta}_n - E \left( \hat{\theta}_n \right) \right) ^2 \right]\]There is, however, a bias-variance trade-off. We can decompose the MSE into three sources of error: error from high bias, error from high variance, and irreducible error. Formally:
\[\text{MSE}\left( \hat{\theta}_n, \theta \right) = \left( \text{bias} \left[ \hat{\theta}_n \right] \right)^2 + \text{Var} \left( \hat{\theta}_n \right) + \text{Var} \left[ \theta \right]\]Bias error appears in models too simple to capture the relation between the features and the outputs. Therefore, if a model has high bias error, it can be said to be underfitting. Variance error appears in complex models that correspond too closely to the training data without accurately depicting the true relation between features and outputs. Thus, if a model has high variance error, it can be said to be overfitting. The irreducible error is the variance inherent to \(\theta\) itself.
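Statistical bias can be demonstrated with a classic example (simulation sizes illustrative): the variance estimator that divides by \(n\) is systematically low, while the version that divides by \(n-1\) is unbiased.

```python
import torch

# Bias sketch: dividing by n underestimates the variance of N(0, 1)
# (true value 1) by a factor (n-1)/n; dividing by n-1 is unbiased.
torch.manual_seed(0)
n, trials = 10, 100_000
samples = torch.randn(trials, n)

print(samples.var(dim=1, unbiased=False).mean())  # ~0.9, biased low
print(samples.var(dim=1, unbiased=True).mean())   # ~1.0, unbiased
```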
Hypothesis Testing
A hypothesis test is a way of evaluating a statement about a population based on data from a sample. This is done by creating a null hypothesis \(H_0\) that one wants to disprove. Using the sample available, the strategy is to evaluate how extreme (or absurd) this sample would be if \(H_0\) were true. To do so, one must define various variables.
Statistical significance is the probability of correctly accepting \(H_0\):
\[\text{statistical significance} = 1 - \alpha = 1 - P \left( \text{reject } H_0 | H_0 \text{ is true} \right)\]where \(\alpha\) is called the significance level and is the probability of a type I error, i.e., a false positive. \(\alpha\) is usually \(0.05\). Statistical significance is usually used to verify if a sample corresponds to a hypothesis.
Statistical power or sensitivity is the probability of correctly rejecting \(H_0\):
\[\text{statistical power} = 1 - \beta = 1 - P \left( \text{failure to reject } H_0 | H_0 \text{ is false} \right)\]where \(\beta\) is the probability of a type II error, i.e., a false negative. \(\beta\) is usually \(0.2\). Statistical power is usually used to determine if a certain sample size is sufficient to support a hypothesis.
A test statistic \(T(x)\) is a scalar used to summarize a certain characteristic of the sample data. The objective is to easily compare two or more distributions. The most common test statistic is the mean.
The p-value is the probability that \(T(X)\) (test statistic of random variable \(X\)), if \(H_0\) is true, is as extreme (or absurd) as \(T(x)\) (test statistic of measured data \(x\)):
\[\text{p-value} = P_{H_0} \left(T \left( X \right) \geq T \left( x \right) \right)\]If the p-value is lower than \(\alpha\), we may reject the null hypothesis.
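A simulation sketch of these ideas (scenario and numbers illustrative): take \(H_0\) to be "the coin is fair", the test statistic to be the number of heads in 100 flips, and an observed count of 60 heads.

```python
import torch

# P-value sketch: H0 says the coin is fair; T(x) is the head count in
# 100 flips; the observed count is 60 (all numbers illustrative).
torch.manual_seed(0)
observed = 60
heads = (torch.rand(100_000, 100) < 0.5).sum(dim=1)  # simulate under H0

p_value = (heads >= observed).float().mean()
print(p_value)  # ~0.028 < alpha = 0.05, so we may reject H0
```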
If you found this useful, please cite this as:
Gomes, Manuel (Jun 2024). Reviewing Time! Mathematics for Deep Learning. https://manuelgitgomes.github.io.
or as a BibTeX entry:
@article{gomes2024reviewing-time-mathematics-for-deep-learning,
title = {Reviewing Time! Mathematics for Deep Learning},
author = {Gomes, Manuel},
year = {2024},
month = {Jun},
url = {https://manuelgitgomes.github.io/blog/2024/deep-learning-math/}
}