This page contains a list of topics, definitions, and results from the Machine Learning course at the University of Chicago taught by this guy. The proofs are sketchy and given without much intuition, mainly because I'm lazy. I would appreciate all kinds of help to make these notes complete.
Week 1: Introduction and OLS
The material this week is just about the framework. We show that the problem of minimizing the squared error with respect to a distribution is the same as the problem of learning the conditional expectation $f_p(x) = \mathbb{E}_p[y \mid x]$. If the labels come from a discrete domain, this is a classification problem; in the case of a continuous domain, it is a regression problem.
After giving the necessary definitions, linear regression is presented as an example of how regression is done.
Learning problem
Given a distribution $p$ on $X \times Y$, we want to learn the objective function $f_p(x) = \mathbb{E}_p[y \mid x]$ (the conditional expectation of $y$ with respect to the distribution $p$).
Learning Algorithms
Let $Z = X \times Y$ be the set of possible samples. A learning algorithm is a function that maps a finite sequence of samples to a measurable function,
$$A : \bigcup_{n \ge 1} Z^n \to \mathcal{F},$$
where $\mathcal{F}$ denotes the class of all measurable functions. Sometimes we consider a class of computable functions instead.
Learning errors
Suppose the learning algorithm outputs $h$. The learning error can be measured by
$$\int\!\!\int (y - h(x))^2\, p(x, y)\, dy\, dx.$$
One can prove that minimizing this quantity can be reduced to the problem of minimizing the following quantity:
$$\int (f_p(x) - h(x))^2\, p(x)\, dx.$$
And that is the reason why we try to learn $f_p$. In other words, we claim that
$$\int\!\!\int (y - h(x))^2\, p(x, y)\, dy\, dx = \int\!\!\int (y - f_p(x))^2\, p(x, y)\, dy\, dx + \int (f_p(x) - h(x))^2\, p(x)\, dx.$$
The proof is easy. Writing $y - h(x) = (y - f_p(x)) + (f_p(x) - h(x))$ and expanding the square, we get
$$\int\!\!\int (y - h(x))^2\, p\, dy\, dx = \int\!\!\int (y - f_p(x))^2\, p\, dy\, dx + \int\!\!\int (f_p(x) - h(x))^2\, p\, dy\, dx + 2\int\!\!\int (y - f_p(x))(f_p(x) - h(x))\, p\, dy\, dx.$$
Then observe that:
- The first term only depends on the distribution $p$, not on $h$.
- The third term is zero:
$$\int\!\!\int (y - f_p(x))(f_p(x) - h(x))\, p(x, y)\, dy\, dx = \int p(x)(f_p(x) - h(x))\left[\int (y - f_p(x))\, p(y \mid x)\, dy\right] dx.$$
Observe also that the inner term is $\int (y - f_p(x))\, p(y \mid x)\, dy = \mathbb{E}_p[y \mid x] - f_p(x)$, which is zero.
- The second term is equal to $\int (f_p(x) - h(x))^2\, p(x)\, dx$, since the integrand does not depend on $y$.
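As a sanity check, here is a small Monte Carlo sketch of the decomposition above; the particular distribution, $f_p$, and $h$ are illustrative choices, not from the lecture.

```python
# Numerically check E[(y - h(x))^2] = E[(y - f_p(x))^2] + E_x[(f_p(x) - h(x))^2]
# on a simple synthetic distribution (chosen only for illustration).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.uniform(-1.0, 1.0, size=n)        # x ~ Uniform[-1, 1]
f_p = 2.0 * x + 1.0                       # f_p(x) = E[y | x]
y = f_p + rng.normal(0.0, 0.5, size=n)    # y = f_p(x) + Gaussian noise

h = 1.5 * x                               # an arbitrary hypothesis h(x)

lhs = np.mean((y - h) ** 2)
rhs = np.mean((y - f_p) ** 2) + np.mean((f_p - h) ** 2)
print(lhs, rhs)                           # equal up to Monte Carlo error
```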
Example 1
When the label set is $Y = \{-1, +1\}$, i.e. a classification problem, our objective $f_p$ reduces to
$$f_p(x) = \mathbb{E}_p[y \mid x] = \Pr[y = 1 \mid x] - \Pr[y = -1 \mid x].$$
One can show that the function
$$h_*(x) = \operatorname{sgn} \mathbb{E}_p[y \mid x]$$
minimizes the loss (proof omitted).
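A quick sketch of why, under the assumption that $h$ is restricted to take values in $\{-1, +1\}$: since $y^2 = h(x)^2 = 1$,
$$\mathbb{E}\!\left[(y - h(x))^2 \mid x\right] = \mathbb{E}[y^2 \mid x] - 2\, h(x)\, \mathbb{E}_p[y \mid x] + h(x)^2 = 2 - 2\, h(x)\, \mathbb{E}_p[y \mid x],$$
which is minimized pointwise by taking $h(x) = \operatorname{sgn} \mathbb{E}_p[y \mid x]$.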
Example 2
When $Y = \mathbb{R}$, the problem is just regression, where we try to regress the conditional expectation $f_p(x) = \mathbb{E}_p[y \mid x]$.
Learning lower bound
Let $\mathcal{P}$ be a class of probability distributions of interest. No algorithm achieves error smaller than

Ordinary Least Squares
If the relation is linear, we model the labels as
$$y = Xw,$$
where the rows of $X \in \mathbb{R}^{n \times d}$ are the sample points and $y \in \mathbb{R}^n$ collects the labels.
OLS provably gives the minimum squared error.
Consider the error
$$\varepsilon(w) = \|Xw - y\|^2.$$
Differentiating the error and setting it to zero, we get
$$X^T X w = X^T y.$$
If $X^T X$ has no nullspace (or is full rank), then it has an inverse. We get
$$w = (X^T X)^{-1} X^T y.$$
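A minimal sketch of the full-rank case in code, using synthetic data (the dimensions and noise level are arbitrary choices):

```python
# OLS via the normal equations: w = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0.0, 0.1, size=n)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)      # X^T X is full rank here
print(w_ols)                                   # close to w_true
print(np.linalg.lstsq(X, y, rcond=None)[0])    # same solution, numerically safer route
```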
But if the matrix $X^T X$ is not full rank, we can still show that
$$w = (X^T X)^{+} X^T y$$
minimizes the error, where $(X^T X)^{+}$ is the Moore-Penrose pseudoinverse of the matrix.
First, let $\lambda_1, \dots, \lambda_d$ be the eigenvalues of $X^T X$ and let $v_1, \dots, v_d$ be the corresponding eigenvectors.
Note that:
- Since $X^T X$ is positive semidefinite, all eigenvalues are non-negative.
- We can decompose $X^T X$ into $\sum_{i=1}^{d} \lambda_i v_i v_i^T$ (because $X^T X$ is symmetric, so the eigenvectors can be chosen orthogonal to one another).
Denote the column space of $X^T X$ by $V$, i.e. the span of the eigenvectors with nonzero eigenvalues. Assume WLOG that $\lambda_1, \dots, \lambda_r > 0$ and $\lambda_{r+1} = \dots = \lambda_d = 0$.
We will first show that $X^T y$, in fact, lives in the subspace $V$. Observe that since the set of eigenvectors spans $\mathbb{R}^d$, it suffices to show that $X^T y$ is orthogonal to every vector in $V^{\perp}$, i.e. to every eigenvector $v_i$ with $\lambda_i = 0$.
Let $v$ be an eigenvector with eigenvalue zero, that is, $X^T X v = 0$. We get
$$\|Xv\|^2 = v^T X^T X v = 0.$$
This implies that $Xv = 0$ and thus, $\langle X^T y, v \rangle = y^T X v = 0$.
From the fact that $X^T y$ lives in $V$, we can verify that the following $w$ is what we need:
$$w = \left(\sum_{i \,:\, \lambda_i > 0} \frac{1}{\lambda_i} v_i v_i^T\right) X^T y.$$
Indeed, $X^T X w = \sum_{i \,:\, \lambda_i > 0} v_i v_i^T X^T y = X^T y$, since $X^T y \in V$.
And it is straightforward to see that the solution for $w$ is not unique, since $w + u$ for any $u$ in the nullspace of $X^T X$ would also satisfy the equation.
Note that the term $\sum_{i \,:\, \lambda_i > 0} \frac{1}{\lambda_i} v_i v_i^T$ is a Moore-Penrose pseudoinverse of a singular matrix.
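A small sketch of the rank-deficient case; the duplicated column below is just an artificial way to make $X^T X$ singular:

```python
# Least squares via the Moore-Penrose pseudoinverse: w = (X^T X)^+ X^T y.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x0 = rng.normal(size=n)
X = np.column_stack([x0, x0, rng.normal(size=n)])   # two identical columns -> rank 2
y = 3.0 * x0 + rng.normal(0.0, 0.1, size=n)

G = X.T @ X
w = np.linalg.pinv(G) @ X.T @ y
print(np.linalg.matrix_rank(G))                 # 2, so G is singular
print(np.allclose(G @ w, X.T @ y))              # the normal equations still hold

# Non-uniqueness: adding a nullspace vector of X^T X leaves the error unchanged.
u = np.array([1.0, -1.0, 0.0])                  # X @ u = 0 because the first two columns agree
print(np.allclose(np.linalg.norm(X @ (w + u) - y), np.linalg.norm(X @ w - y)))
```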
Tikhonov Regularization
Instead of solving
$$\min_w \|Xw - y\|^2,$$
we look at
$$\min_w \|Xw - y\|^2 + \gamma \|w\|^2.$$
If $\gamma$ tends to zero, the problem becomes just like OLS. If $\gamma$ is a positive number, differentiating (and equating to zero) gives
$$(X^T X + \gamma I)\, w = X^T y.$$
The eigenvalues of $X^T X + \gamma I$ are $\lambda_i + \gamma$, where $\lambda_i$ are the eigenvalues of $X^T X$.
Observe that all of these eigenvalues are positive, so the nullspace is zero. Thus, the matrix on the left-hand side is invertible and the solution
$$w = (X^T X + \gamma I)^{-1} X^T y$$
is unique.
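A minimal sketch of the regularized solution, reusing the kind of rank-deficient data from the previous sketch ($\gamma = 0.1$ is an arbitrary choice):

```python
# Tikhonov (ridge) regularization: w = (X^T X + gamma I)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x0 = rng.normal(size=n)
X = np.column_stack([x0, x0, rng.normal(size=n)])   # X^T X is singular
y = 3.0 * x0 + rng.normal(0.0, 0.1, size=n)

gamma = 0.1
d = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)
print(w_ridge)          # unique solution even though X^T X is not invertible
```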
Week 2: PCA, LDA, and Introduction to SVM
This week, we cover another family of techniques, namely Principal Component Analysis (PCA) and Fisher's Linear Discriminant Analysis (LDA). Compared to last week's topics, which involve solving linear systems, this week's topics involve solving for eigenvalues and eigenvectors of a matrix.
Since the theory of positive semidefinite matrices will be crucial throughout the course, we give a quick introduction.
A quick introduction to the theory of positive semidefinite matrices
A matrix $A$ is said to be positive semidefinite if $w^T A w \ge 0$ for all $w$.
Observation: Any matrix of the form $X^T X$ is positive semidefinite.
Proof. For any $w$, $w^T X^T X w = \|Xw\|^2 \ge 0$.
Property 1: A positive semidefinite matrix has non-negative eigenvalues.
Proof. If $Av = \lambda v$ with $v \ne 0$, then
$$0 \le v^T A v = \lambda \|v\|^2,$$
so $\lambda \ge 0$. (This is because $v^T A v$ can be thought of as a generalized dot product, so it always has a non-negative value.)
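A quick numerical illustration of both facts (the random matrix below is only for demonstration):

```python
# Matrices of the form X^T X are positive semidefinite and have non-negative eigenvalues.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))
A = X.T @ X

w = rng.normal(size=4)
print(w @ A @ w >= 0)            # w^T A w = ||Xw||^2 >= 0
print(np.linalg.eigvalsh(A))     # all eigenvalues are >= 0 (up to round-off)
```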
PCA
Given $n$ data points $x_1, \dots, x_n$ in $\mathbb{R}^d$, we want to find a direction $w$ along which the projections of these $n$ points are best spread out. The notion of 'best' is measured by the variance of the projected data. Note that we can assume WLOG that the data are centered, $\sum_i x_i = 0$. That is, we want to maximize
$$\frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2 = w^T C w, \qquad \text{where } C = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T,$$
where $C$ is positive semidefinite.
The problem is given as follows.
- maximize $w^T C w$
- s.t. $w^T w = 1$
Using the Lagrangian,
$$L(w, \lambda) = w^T C w - \lambda (w^T w - 1).$$
Differentiating the above,
$$2Cw - 2\lambda w = 0,$$
$$Cw = \lambda w.$$
This tells us that the solution must be among the eigenvectors $v_i$ of $C$. For an eigenvector $w$ with eigenvalue $\lambda$, the objective value is $w^T C w = \lambda$.
The solution becomes clear now, i.e. choose $w$ to be the eigenvector corresponding to the largest eigenvalue of $C$.
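A minimal sketch of PCA as this eigenvalue problem, on synthetic correlated 2-D data (the mixing matrix is an arbitrary choice):

```python
# Top principal direction = eigenvector of the covariance matrix with the largest eigenvalue.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data

Xc = X - X.mean(axis=0)               # center the data (the WLOG assumption above)
C = Xc.T @ Xc / len(Xc)               # covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)  # C is symmetric PSD; eigenvalues in ascending order
w = eigvecs[:, -1]                    # direction of maximum projected variance
print(w, eigvals[-1])
```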
Fisher LDA
Another method along the same lines was proposed by Fisher. The idea is to find the direction that best separates the means of the different classes, in contrast to PCA, which finds the direction in which the projected data have the highest variance. Fisher's criterion is as follows (note that we consider the task of classification with two labels, 1 and -1):
$$J(w) = \frac{\left(w^T \mu_1 - w^T \mu_{-1}\right)^2}{s_1^2 + s_{-1}^2},$$
where $\mu_1, \mu_{-1}$ are the class means and $s_1^2, s_{-1}^2$ are the variances of the projected data in each class.
The numerator can be written as
$$w^T (\mu_1 - \mu_{-1})(\mu_1 - \mu_{-1})^T w = w^T S_B w,$$
where we denote the middle term by $S_B$. Also, the denominator reduces to
$$w^T (\Sigma_1 + \Sigma_{-1})\, w = w^T S_W w,$$
where we denote the middle term by $S_W$, with $\Sigma_1, \Sigma_{-1}$ the class covariances. The problem becomes
$$\max_w \; \frac{w^T S_B w}{w^T S_W w}.$$
This is equivalent to
- maximize $w^T S_B w$
- s.t. $w^T S_W w = 1$
Again, consider the Lagrangian and differentiate the formula; we get
$$S_B w = \lambda S_W w,$$
$$S_W^{-1} S_B w = \lambda w.$$
This would have been done if $S_W^{-1} S_B$ were symmetric and positive semidefinite; then we could use the eigenvector solutions like before. We can work it out as follows:
- Observe that $S_B$ is a rank-one symmetric, positive semidefinite matrix (rank-one is not important, though). So, the notation $S_B^{1/2}$ makes sense because we can consider changing to the eigenbasis, $S_B = \sum_i \sigma_i u_i u_i^T$. So $S_B^{1/2} = \sum_i \sqrt{\sigma_i}\, u_i u_i^T$.
Replacing $S_B^{1/2} w$ with $v$, we get
$$S_B^{1/2} S_W^{-1} S_B^{1/2}\, v = \lambda v.$$
And one can check that the linear map on the left-hand side is symmetric and positive semidefinite. Thus, we can find an eigenvector solution $v$ corresponding to an eigenvalue $\lambda$ and compute $w = \frac{1}{\lambda} S_W^{-1} S_B^{1/2} v$.
One can also verify that, plugging this solution back into the original equation, we get
$$J(w) = \frac{w^T S_B w}{w^T S_W w} = \lambda.$$
So, we can maximize the objective by choosing the vector $v$ corresponding to the largest eigenvalue.
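A minimal sketch of this recipe on synthetic two-class Gaussian data (the class means, spread, and sample sizes are arbitrary choices); the last two lines compare it against the classical closed form $S_W^{-1}(\mu_1 - \mu_{-1})$, which it should match up to sign and scale:

```python
# Fisher LDA via the symmetrized eigenvalue problem S_B^{1/2} S_W^{-1} S_B^{1/2} v = lambda v.
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(200, 2))   # class +1
X2 = rng.normal(loc=[0.0, 1.5], scale=1.0, size=(200, 2))   # class -1

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_B = np.outer(mu1 - mu2, mu1 - mu2)                        # between-class scatter (rank one)
S_W = np.cov(X1.T, bias=True) + np.cov(X2.T, bias=True)     # within-class scatter

# Matrix square root of S_B via its eigendecomposition (S_B is symmetric PSD).
vals, vecs = np.linalg.eigh(S_B)
S_B_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

M = S_B_half @ np.linalg.inv(S_W) @ S_B_half                # symmetric PSD
lam, V = np.linalg.eigh(M)
v = V[:, -1]                                                # top eigenvector
w = np.linalg.inv(S_W) @ S_B_half @ v / lam[-1]             # recover the LDA direction

w_closed = np.linalg.inv(S_W) @ (mu1 - mu2)                 # classical closed form
print(w / np.linalg.norm(w))
print(w_closed / np.linalg.norm(w_closed))                  # same direction up to sign
```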
Week 3: Finishing up SVM
This week, we talk about support vector machines, which involve solving quadratic programs. At the same time, we start the theory of VC dimension.
See this.
Please see chapter 10 in Vapnik's book Statistical Learning Theory for reference.
Week 4: A theory of reproducing kernels
This week, we start looking at kernel methods, which will be a main focus for this quarter. The first class gives several definitions and concepts relating to reproducing kernel Hilbert spaces. We talk about two constructions of the r.k. Hilbert space (i.e. via a completion of linear combinations and via a measure-theoretic view). These two methods (and some others in the literature) are equivalent. Examples of the construction are given in the finite-domain case.
See here.
Please see the reference for more detail.
Week 5: More on kernels
The plan for this week is to talk about this paper
Useful Links
General Theory
Famous Applications of PCA in machine learning
- Eigenface for face recognition