# Deep Learning Book Series · 2.12 Example Principal Components Analysis

This content is part of a series following chapter 2 on linear algebra from the Deep Learning Book by Goodfellow, I., Bengio, Y., and Courville, A. (2016). It aims to provide intuitions, drawings, and Python code on mathematical theories and is constructed as my understanding of these concepts. You can check the syllabus in the introduction post; the other articles of the series are listed at the end of this post.

# Introduction

This is the last chapter of this series on linear algebra! It is about Principal Components Analysis (PCA). We will use some knowledge that we acquired along the preceding chapters to understand this important data analysis tool! Feel free to check out the preceding chapters!

# 2.12 Example - Principal Components Analysis

Dimensions are a crucial topic in data science. The dimensions are all the features of the dataset. For instance, if you are looking at a dataset containing pieces of music, dimensions could be the genre, the length of the piece, the number of instruments, the presence of a singer, etc. You can imagine all these dimensions as different columns. When there are only two dimensions, it is very convenient to plot: you can use the $x$- and $y$-axis. Add color and you can represent a third dimension. It is similar if you have tens or hundreds of dimensions; it will just be harder to visualize.

When you have that many dimensions, it happens that some of them are correlated. For instance, we can reasonably think that the genre dimension will correlate with the instruments dimension in our previous example. One way to reduce dimensionality is simply to keep only some of the dimensions. The problem is that you lose good information. It would be nice to have a way to reduce the number of dimensions while keeping as much as possible of the information present in the dataset.

The aim of principal components analysis (PCA) is generally to reduce the number of dimensions of a dataset whose dimensions are not completely decorrelated. PCA provides us with a new set of dimensions, the principal components (PC). They are ordered: the first PC is the dimension having the largest variance. In addition, each PC is orthogonal to the preceding ones. Remember that two vectors are orthogonal when their dot product is equal to $0$ (see 2.6). This means that each PC is decorrelated from the preceding ones. It is way better than feature selection, where you lose a lot of information.

### Example 1.

The standard basis vectors (the unit vectors along each axis) are an example of orthogonal vectors:

*Orthogonal vectors*
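To make the definition concrete, here is a minimal check with NumPy that the two standard basis vectors of the plane have a dot product of $0$ and a norm of $1$. The names `e1` and `e2` are only illustrative.

```python
import numpy as np

# Standard basis vectors of the 2D plane (illustrative names)
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

# Orthogonal vectors have a dot product equal to 0
dot = e1.dot(e2)

# Both vectors also have unit norm
norms = (np.linalg.norm(e1), np.linalg.norm(e2))

print(dot, norms)
```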

## Describing the problem

The problem can be expressed as finding a function that converts a set of data points from $\mathbb{R}^n$ to $\mathbb{R}^l$. This means that we change the number of dimensions of our dataset. We also need a function that can decode back from the transformed dataset to the initial one:

*Principal components analysis as a change of coordinate system*

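The first step is to understand the shape of the data. $\bs{x}^{(i)}$ is one data point containing $n$ dimensions. Let’s have $m$ data points organized as column vectors (one column per point):

$$
\begin{bmatrix}
    \bs{x}^{(1)} & \bs{x}^{(2)} & \cdots & \bs{x}^{(m)}
\end{bmatrix}
$$

If we deploy the $n$ dimensions of our data points we will have:

$$
\begin{bmatrix}
    x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)}\\
    x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(m)}\\
    \vdots & \vdots & \ddots & \vdots\\
    x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)}
\end{bmatrix}
$$

For a single data point, we can also write:

$$
\bs{x}=
\begin{bmatrix}
    x_1\\
    x_2\\
    \vdots\\
    x_n
\end{bmatrix}
$$

$\bs{c}$ will have the shape:

$$
\bs{c}=
\begin{bmatrix}
    c_1\\
    c_2\\
    \vdots\\
    c_l
\end{bmatrix}
$$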

## Adding some constraints: the decoding function

The encoding function $f(\bs{x})$ transforms $\bs{x}$ into $\bs{c}$ and the decoding function transforms back $\bs{c}$ into an approximation of $\bs{x}$. To keep things simple, PCA will respect some constraints:

### Constraint 1.

The decoding function has to be a simple matrix multiplication:

$$
g(\bs{c})=\bs{Dc}
$$

By applying the matrix $\bs{D}$ to the dataset from the new coordinate system we should get back to the initial coordinate system.

### Constraint 2.

The columns of $\bs{D}$ must be orthogonal (see 2.6).

### Constraint 3.

The columns of $\bs{D}$ must have unit norm (see 2.6).
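Taken together, constraints 2 and 3 say that $\bs{D}^\text{T}\bs{D}=\bs{I}_l$. Here is a small sketch to see what that looks like numerically. The sizes `n` and `l` are hypothetical, and building `D` from a QR decomposition is just a convenient way to get orthonormal columns, not the way PCA finds $\bs{D}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 3, 2  # hypothetical sizes: 3 original dimensions, 2 kept

# Reduced QR of a random (n x l) matrix gives a factor with orthonormal columns
D, _ = np.linalg.qr(rng.normal(size=(n, l)))

gram = D.T @ D    # (l x l): the identity, since the columns are orthonormal
outer = D @ D.T   # (n x n): a projection, not the identity when l < n

print(np.round(gram, 6))
print(np.allclose(outer, np.eye(n)))
```

Note that only $\bs{D}^\text{T}\bs{D}$ is the identity here: when $l < n$ the product $\bs{D}\bs{D}^\text{T}$ is not.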

## Finding the encoding function

Important: For now we will consider only **one data point**. Thus we will have the following dimensions for these matrices (note that $\bs{x}$ and $\bs{c}$ are column vectors):

*The decoding function*

We want a decoding function which is a simple matrix multiplication. For that reason, we have $g(\bs{c})=\bs{Dc}$. We will then find the encoding function from the decoding function. We want to minimize the error between the decoded data point and the actual data point. With our previous notation, this means reducing the distance between $\bs{x}$ and $g(\bs{c})$. As an indicator of this distance, we will use the squared $L^2$ norm (see 2.5):

$$
\norm{\bs{x} - g(\bs{c})}_2^2
$$

This is what we want to minimize. Let’s call $\bs{c}^*$ the optimal $\bs{c}$. Mathematically it can be written:

$$
\bs{c}^* = \underset{\bs{c}}{\arg\min}\ \norm{\bs{x} - g(\bs{c})}_2^2
$$

This means that we want to find the values of the vector $\bs{c}$ such that $\norm{\bs{x} - g(\bs{c})}_2^2$ is as small as possible.

If you have a look back to 2.5 you can see that the squared $L^2$ norm can be expressed as:

$$
\norm{\bs{y}}_2^2 = \bs{y}^\text{T}\bs{y}
$$

We have named the variable $\bs{y}$ to avoid confusion with our $\bs{x}$. Here $\bs{y}=\bs{x} - g(\bs{c})$

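Thus the equation that we want to minimize becomes:

$$
(\bs{x} - g(\bs{c}))^\text{T}(\bs{x} - g(\bs{c}))
$$

Since the transpose respects addition we have:

$$
(\bs{x}^\text{T} - g(\bs{c})^\text{T})(\bs{x} - g(\bs{c}))
$$

By the distributive property (see 2.2) we can develop:

$$
\bs{x}^\text{T}\bs{x} - \bs{x}^\text{T}g(\bs{c}) - g(\bs{c})^\text{T}\bs{x} + g(\bs{c})^\text{T}g(\bs{c})
$$

The commutative property (see 2.2) tells us that $ \bs{x^\text{T}y} = \bs{y^\text{T}x} $. We can use that in the previous equation: we have $ \bs{x}^\text{T}g(\bs{c}) = g(\bs{c})^\text{T}\bs{x} $. So the equation becomes:

$$
\bs{x}^\text{T}\bs{x} - 2\bs{x}^\text{T}g(\bs{c}) + g(\bs{c})^\text{T}g(\bs{c})
$$

The first term $\bs{x^\text{T}x}$ does not depend on $\bs{c}$, and since we want to minimize the function according to $\bs{c}$ we can just drop this term. We simplify to:

$$
\bs{c}^* = \underset{\bs{c}}{\arg\min}\ - 2\bs{x}^\text{T}g(\bs{c}) + g(\bs{c})^\text{T}g(\bs{c})
$$

Since $g(\bs{c})=\bs{Dc}$:

$$
\bs{c}^* = \underset{\bs{c}}{\arg\min}\ - 2\bs{x}^\text{T}\bs{Dc} + (\bs{Dc})^\text{T}\bs{Dc}
$$

With $(\bs{Dc})^\text{T}=\bs{c}^\text{T}\bs{D}^\text{T}$ (see 2.2), we have:

$$
\bs{c}^* = \underset{\bs{c}}{\arg\min}\ - 2\bs{x}^\text{T}\bs{Dc} + \bs{c}^\text{T}\bs{D}^\text{T}\bs{Dc}
$$

As we saw in 2.6, $\bs{D}^\text{T}\bs{D}=\bs{I}_l$ because $\bs{D}$ is orthogonal (actually, it is semi-orthogonal if $n \neq l$) and its columns have unit norm. We can replace in the equation:

$$
\bs{c}^* = \underset{\bs{c}}{\arg\min}\ - 2\bs{x}^\text{T}\bs{Dc} + \bs{c}^\text{T}\bs{I}_l\bs{c} = \underset{\bs{c}}{\arg\min}\ - 2\bs{x}^\text{T}\bs{Dc} + \bs{c}^\text{T}\bs{c}
$$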
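If you want to convince yourself that nothing was lost along the way, here is a quick numerical check that $\norm{\bs{x} - \bs{Dc}}_2^2$ is equal to $\bs{x}^\text{T}\bs{x} - 2\bs{x}^\text{T}\bs{Dc} + \bs{c}^\text{T}\bs{c}$ when the columns of $\bs{D}$ are orthonormal. The sizes and the random `D`, `x` and `c` are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 4, 2  # hypothetical sizes

# A matrix with orthonormal columns (built with QR for convenience)
D, _ = np.linalg.qr(rng.normal(size=(n, l)))
x = rng.normal(size=n)
c = rng.normal(size=l)  # any code vector, not necessarily the optimal one

# Left-hand side: squared L2 distance between x and its decoding
squared_error = np.linalg.norm(x - D @ c) ** 2

# Right-hand side: the expanded form, using D^T D = I
expanded = x @ x - 2 * x @ D @ c + c @ c

print(squared_error, expanded)
```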

### Minimizing the function

So far so good! Now the goal is to find the minimum of the function $- 2\bs{x}^\text{T}\bs{Dc} + \bs{c}^\text{T}\bs{c}$. One widely used way of doing that is to use the **gradient descent** algorithm. It is not the focus of this chapter but we will say a word about it (see 4.3 of the Deep Learning Book for more details). The main idea is that the sign of the derivative of the function at a specific value of $x$ tells you if you need to increase or decrease $x$ to reach the minimum. When the slope is near $0$, the minimum should have been reached.

*Gradient descent*

However, functions with local minima can trouble the descent:

*Gradient descent can get stuck in local minima*

These examples are in 2 dimensions but the principle stands for higher dimensional functions. The gradient is a vector containing the partial derivatives of all dimensions. Its mathematical notation is $\nabla_xf(\bs{x})$.

### Calculating the gradient of the function

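Here we want to minimize through each dimension of $\bs{c}$. We are looking for a slope of $0$. The equation is:

$$
\nabla_{\bs{c}}(- 2\bs{x}^\text{T}\bs{Dc} + \bs{c}^\text{T}\bs{c}) = 0
$$

Let’s take these terms separately to calculate the derivative according to $\bs{c}$.

The first term is $- 2\bs{x}^\text{T}\bs{Dc}$. It is linear in $\bs{c}$, so its gradient is:

$$
\nabla_{\bs{c}}(- 2\bs{x}^\text{T}\bs{Dc}) = - 2\bs{D}^\text{T}\bs{x}
$$

The second term is $\bs{c}^\text{T}\bs{c}$. We can develop the vector $\bs{c}$ and calculate the derivative for each element:

$$
\nabla_{\bs{c}}(\bs{c}^\text{T}\bs{c}) =
\begin{bmatrix}
    \dfrac{\partial (c_1^2 + c_2^2 + \cdots + c_l^2)}{\partial c_1}\\
    \vdots\\
    \dfrac{\partial (c_1^2 + c_2^2 + \cdots + c_l^2)}{\partial c_l}
\end{bmatrix}
=
\begin{bmatrix}
    2c_1\\
    \vdots\\
    2c_l
\end{bmatrix}
= 2\bs{c}
$$

So we can progress in our derivatives:

$$
- 2\bs{D}^\text{T}\bs{x} + 2\bs{c} = 0 \quad\Leftrightarrow\quad \bs{c} = \bs{D}^\text{T}\bs{x}
$$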

Great! We found the encoding function! Here are its dimensions:

*The encoding function*
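As a sanity check, we can let a plain gradient descent loop minimize $- 2\bs{x}^\text{T}\bs{Dc} + \bs{c}^\text{T}\bs{c}$ and compare the result with the closed form $\bs{c}=\bs{D}^\text{T}\bs{x}$. This is only a sketch: the gradient used is $- 2\bs{D}^\text{T}\bs{x} + 2\bs{c}$, and the sizes, the learning rate and the number of steps are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, l = 4, 2  # hypothetical sizes

# A matrix with orthonormal columns and one data point
D, _ = np.linalg.qr(rng.normal(size=(n, l)))
x = rng.normal(size=n)

# Gradient descent on h(c) = -2 x^T D c + c^T c, starting from c = 0
c = np.zeros(l)
learning_rate = 0.1
for _ in range(200):
    gradient = -2 * D.T @ x + 2 * c
    c = c - learning_rate * gradient

# Closed-form solution obtained by setting the gradient to 0
c_closed_form = D.T @ x

print(c, c_closed_form)
```

This function has a single minimum, so the descent cannot get stuck and both results agree.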

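To go back from $\bs{c}$ to $\bs{x}$ we use $g(\bs{c})=\bs{Dc}$. Chaining the encoding and decoding functions gives the reconstruction function:

$$
r(\bs{x}) = g(f(\bs{x})) = \bs{D}\bs{D}^\text{T}\bs{x}
$$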

*The reconstruction function*
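Here is a small sketch of the encode/decode round trip on one data point. The reconstruction $\bs{D}\bs{D}^\text{T}\bs{x}$ is the projection of $\bs{x}$ onto the space spanned by the columns of $\bs{D}$, so what is lost (the residual) is orthogonal to these columns. The random `D` and `x` are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, l = 4, 2  # hypothetical sizes

D, _ = np.linalg.qr(rng.normal(size=(n, l)))  # orthonormal columns
x = rng.normal(size=n)

c = D.T @ x           # encoding: from n to l dimensions
x_hat = D @ c         # decoding: back to n dimensions, r(x) = D D^T x
residual = x - x_hat  # information lost by the round trip

# The residual is orthogonal to every column of D
print(np.round(D.T @ residual, 10))

# Encoding the reconstruction gives back the same code
print(np.allclose(D.T @ x_hat, c))
```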

## Finding $\bs{D}$

The next step is to find the matrix $\bs{D}$. Recall that the purpose of PCA is to change the coordinate system in order to maximize the variance along the first dimensions of the projected space. This is equivalent to minimizing the error between data points and their reconstruction (cf here). See the covariance matrix section below for more details.

Maximizing the variance corresponds to minimizing the error of the reconstruction.

### The Frobenius norm

Since we have to take all points into account (the same matrix $\bs{D}$ will be used for all points) we will use the Frobenius norm of the errors (see 2.5), which is the equivalent of the $L^2$ norm for matrices. Here is the formula of the Frobenius norm:

$$
\norm{\bs{A}}_F=\sqrt{\sum_{i,j}A_{i,j}^2}
$$

It is as if you unrolled the matrix into a one-dimensional vector and took the $L^2$ norm of this vector.
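A tiny example of this "unrolling" idea with NumPy (the matrix `A` is arbitrary): the Frobenius norm of the matrix and the $L^2$ norm of the flattened vector are the same number.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Frobenius norm of the matrix
fro = np.linalg.norm(A, 'fro')

# L2 norm of the matrix unrolled into a 1D vector
flat = np.linalg.norm(A.ravel())

# Same thing written by hand: square root of the sum of squared entries
manual = np.sqrt((A ** 2).sum())

print(fro, flat, manual)
```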

We will call $\bs{D}^*$ the optimal $\bs{D}$ (in the sense that the error is as small as possible). We have:

$$
\bs{D}^* = \underset{\bs{D}}{\arg\min}\ \sqrt{\sum_{i,j}\left(x_j^{(i)} - r(\bs{x}^{(i)})_j\right)^2}
$$

with the constraint that $\bs{D}^\text{T}\bs{D}=\bs{I}_l$, because we have required the columns of $\bs{D}$ to be orthogonal and to have unit norm.

### The first principal component

We will start to find only the first principal component (PC). For that reason, we will have $l=1$. So the matrix $\bs{D}$ will have the shape $(n \times 1)$: it is a simple column vector. Since it is a vector we will call it $\bs{d}$:

*The first principal component*

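We can therefore remove the sum over $j$ and the square root since we will take the squared $L^2$ norm:

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ \sum_i \norm{\bs{x}^{(i)} - r(\bs{x}^{(i)})}_2^2
$$

We have also seen that:

$$
r(\bs{x}) = \bs{D}\bs{D}^\text{T}\bs{x}
$$

Since we are looking only for the first PC:

$$
r(\bs{x}) = \bs{d}\bs{d}^\text{T}\bs{x}
$$

We can plug $r(\bs{x})$ into the equation:

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ \sum_i \norm{\bs{x}^{(i)} - \bs{dd}^\text{T}\bs{x}^{(i)}}_2^2
$$

Because of constraint 3 (the columns of $\bs{D}$ have unit norm) we have $\norm{\bs{d}}_2 = 1$: $\bs{d}$ is one of the columns of $\bs{D}$ and thus has unit norm.

Instead of using the sum along the $m$ data points $\bs{x}$ we can use the matrix $\bs{X}$, which gathers all the observations (one row per data point, so $\bs{X}$ has the shape $(m \times n)$):

$$
\bs{X} =
\begin{bmatrix}
    \bs{x}^{(1)\text{T}}\\
    \bs{x}^{(2)\text{T}}\\
    \vdots\\
    \bs{x}^{(m)\text{T}}
\end{bmatrix}
$$

We want $\bs{x}^{(i)\text{T}}$ instead of $\bs{x}^{(i)}$ in our expression of $\bs{d}^*$. We can transpose the content of the norm (this does not change its value):

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ \sum_i \norm{(\bs{x}^{(i)} - \bs{dd}^\text{T}\bs{x}^{(i)})^\text{T}}_2^2 = \underset{\bs{d}}{\arg\min}\ \sum_i \norm{\bs{x}^{(i)\text{T}} - \bs{x}^{(i)\text{T}}\bs{dd}^\text{T}}_2^2
$$

and

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ \norm{\bs{X} - \bs{X}\bs{dd}^\text{T}}_F^2
$$

with the constraint that $\bs{d}^\text{T}\bs{d}=1$.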

### Using the Trace operator

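We will now use the Trace operator (see 2.10) to simplify the equation to minimize. Recall that:

$$
\norm{\bs{A}}_F=\sqrt{\text{Tr}(\bs{AA}^\text{T})}
$$

So here $\bs{A}=\bs{X}-\bs{X}\bs{dd}^\text{T}$. Since we minimize the squared norm, the square root disappears and we have:

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ \text{Tr}\left((\bs{X}-\bs{X}\bs{dd}^\text{T})(\bs{X}-\bs{X}\bs{dd}^\text{T})^\text{T}\right)
$$

Since we can cycle the order of the matrices in a Trace (see 2.10) we can write:

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ \text{Tr}\left((\bs{X}-\bs{X}\bs{dd}^\text{T})^\text{T}(\bs{X}-\bs{X}\bs{dd}^\text{T})\right) = \underset{\bs{d}}{\arg\min}\ \text{Tr}\left((\bs{X}^\text{T}-(\bs{X}\bs{dd}^\text{T})^\text{T})(\bs{X}-\bs{X}\bs{dd}^\text{T})\right)
$$

And $(\bs{Xdd}^\text{T})^\text{T}=(\bs{d}^\text{T})^\text{T}\bs{d}^\text{T}\bs{X}^\text{T}=\bs{d}\bs{d}^\text{T}\bs{X}^\text{T}$. Let’s plug that into our equation and develop:

$$
\begin{aligned}
\bs{d}^* &= \underset{\bs{d}}{\arg\min}\ \text{Tr}\left((\bs{X}^\text{T}-\bs{dd}^\text{T}\bs{X}^\text{T})(\bs{X}-\bs{X}\bs{dd}^\text{T})\right)\\
&= \underset{\bs{d}}{\arg\min}\ \text{Tr}(\bs{X}^\text{T}\bs{X}) - \text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T}) - \text{Tr}(\bs{dd}^\text{T}\bs{X}^\text{T}\bs{X}) + \text{Tr}(\bs{dd}^\text{T}\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T})
\end{aligned}
$$

We can remove the first term, which does not depend on $\bs{d}$:

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ - \text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T}) - \text{Tr}(\bs{dd}^\text{T}\bs{X}^\text{T}\bs{X}) + \text{Tr}(\bs{dd}^\text{T}\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T})
$$

Still because of the cycling property of a trace, we have:

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ - \text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T}) - \text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T}) + \text{Tr}(\bs{dd}^\text{T}\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T})
$$

We can simplify to:

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ - 2\text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T}) + \text{Tr}(\bs{dd}^\text{T}\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T})
$$

and then

$$
\bs{d}^* = \underset{\bs{d}}{\arg\min}\ - 2\text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T}) + \text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T}\bs{dd}^\text{T})
$$

Because of the constraint $\bs{d}^\text{T}\bs{d}=1$:

$$
\begin{aligned}
\bs{d}^* &= \underset{\bs{d}}{\arg\min}\ - 2\text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T}) + \text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T})\\
&= \underset{\bs{d}}{\arg\min}\ - \text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T})\\
&= \underset{\bs{d}}{\arg\max}\ \text{Tr}(\bs{X}^\text{T}\bs{X}\bs{dd}^\text{T})
\end{aligned}
$$

and with the cycling property:

$$
\bs{d}^* = \underset{\bs{d}}{\arg\max}\ \text{Tr}(\bs{d}^\text{T}\bs{X}^\text{T}\bs{X}\bs{d}) \quad \text{subject to} \quad \bs{d}^\text{T}\bs{d}=1
$$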
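The whole simplification can be checked numerically: for any unit vector $\bs{d}$, the squared reconstruction error $\norm{\bs{X} - \bs{X}\bs{dd}^\text{T}}_F^2$ should be equal to $\text{Tr}(\bs{X}^\text{T}\bs{X}) - \bs{d}^\text{T}\bs{X}^\text{T}\bs{X}\bs{d}$, so minimizing the error is the same as maximizing $\bs{d}^\text{T}\bs{X}^\text{T}\bs{X}\bs{d}$. A minimal sketch with random data (the shapes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))  # 5 data points as rows, 3 dimensions

# A random direction, rescaled to respect the unit norm constraint
d = rng.normal(size=(3, 1))
d /= np.linalg.norm(d)

# Squared Frobenius norm of the reconstruction error
error = np.linalg.norm(X - X @ d @ d.T, 'fro') ** 2

# Constant term minus the quantity we end up maximizing
via_trace = np.trace(X.T @ X) - (d.T @ X.T @ X @ d).item()

print(error, via_trace)
```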

### Eigendecomposition

We will see that we can find the maximum of the function by calculating the eigenvectors of $\bs{X^\text{T}X}$.
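Before the real example, here is a rough numerical illustration of this claim on random data: the eigenvector of $\bs{X^\text{T}X}$ with the largest eigenvalue reaches a value of $\bs{d}^\text{T}\bs{X}^\text{T}\bs{X}\bs{d}$ equal to that eigenvalue, and no random unit vector does better. The data, the helper `objective` and the number of candidates are assumptions for this sketch. I use `np.linalg.eigh` because $\bs{X^\text{T}X}$ is symmetric and it returns the eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(5)

# Random data with different spreads along each axis, then centered
X = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)

# Eigendecomposition of the symmetric matrix X^T X (ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
d_best = eigvecs[:, -1]  # eigenvector of the largest eigenvalue


def objective(d):
    """Value of d^T X^T X d for a 1D vector d."""
    return d @ X.T @ X @ d


# Many random unit vectors to compare against
candidates = rng.normal(size=(1000, 3))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best_random = max(objective(d) for d in candidates)

print(objective(d_best), eigvals[-1], best_random)
```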

### Covariance matrix

As we wrote above, the optimization problems of maximizing the variance of the components and minimizing the error between the reconstructed and the actual data are equivalent. Actually, if you look at the formula of $\bs{d}$ you can see that there is the term $\bs{X^\text{T}X}$ in the middle.

If we have centered our data around 0 (see below for more details about centering), $\bs{X^\text{T}X}$ is the covariance matrix up to a constant factor $\frac{1}{m-1}$ (see this Quora question).

The covariance matrix is an $n$ by $n$ matrix ($n$ being the number of dimensions). Its diagonal contains the variance of the corresponding dimensions and the other cells are the covariance between the two corresponding dimensions (the amount of redundancy).

This means that the larger the covariance between two dimensions, the more redundancy exists between these dimensions. This also means that the best-fit line is associated with small errors if the covariance is high. To maximize the variance and minimize the covariance (in order to decorrelate the dimensions) means that the ideal covariance matrix is a diagonal matrix (non-zero values on the diagonal only). Therefore the diagonalization of the covariance matrix will give us the optimal solution.
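To see the link between $\bs{X^\text{T}X}$ and the covariance matrix, here is a short check on made-up correlated data: once the data is centered, $\bs{X^\text{T}X}$ divided by $m-1$ matches what `np.cov` returns (with `rowvar=False` since our dimensions are on the columns).

```python
import numpy as np

rng = np.random.default_rng(6)
m = 200

# Two correlated dimensions (illustrative data)
X = rng.normal(size=(m, 2))
X[:, 1] += 2 * X[:, 0]

# Center each column around 0
X_centered = X - X.mean(axis=0)

# X^T X scaled by 1 / (m - 1)
scaled_gram = X_centered.T @ X_centered / (m - 1)

# NumPy's covariance matrix, with one column per dimension
cov = np.cov(X, rowvar=False)

print(scaled_gram)
print(cov)
```

The scaling factor does not change the eigenvectors, which is why we can work directly with $\bs{X^\text{T}X}$.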

### Example 2.

As an example we will create again a 2D data set (like in 2.9). To see the effect of the PCA we will introduce some correlations between the two dimensions. Let’s create 100 data points with 2 dimensions:

```
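import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
# 100 points: x is uniform in [0, 5), y depends linearly on x plus Gaussian noise
x = 5*np.random.rand(100)
y = 2*x + 1 + np.random.randn(100)
x = x.reshape(100, 1)
y = y.reshape(100, 1)
# One row per data point, one column per dimension
X = np.hstack([x, y])
X.shape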
```

(100, 2)

Let’s plot the data:

```
plt.plot(X[:,0], X[:,1], '*')
plt.show()
```

*Toy dataset with correlated features*

Highly correlated data means that the dimensions are redundant. It is possible to predict one from the other without losing much information.

The first processing step is to center the data around 0. PCA is a regression model without intercept (see here) and the first component thus necessarily passes through the origin.

Here is a simple function that subtracts the mean of each column from each data point of this column. It can be used to center the data points around 0.

```
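def centerData(X):
    # Work on a copy so the original array is not modified
    X = X.copy()
    # Subtract the mean of each column (dimension) from that column
    X -= np.mean(X, axis = 0)
    return X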
```

So let’s center our data $\bs{X}$ around 0 for both dimensions:

```
X_centered = centerData(X)
plt.plot(X_centered[:,0], X_centered[:,1], '*')
plt.show()
```

*The dataset is now centered in $0$*

That’s better!
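If you prefer numbers to plots, a quick way to check the centering is to look at the column means. This sketch uses its own made-up data rather than the dataset above: after centering, the means are $0$ (up to floating point error) while the variances are unchanged.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative data: two dimensions with different means and spreads
X = rng.normal(loc=[5.0, -2.0], scale=[1.0, 3.0], size=(100, 2))

# Same operation as the centering function: subtract each column's mean
X_centered = X - X.mean(axis=0)

col_means = X_centered.mean(axis=0)

print(col_means)
print(X.var(axis=0), X_centered.var(axis=0))
```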

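We can now look for PCs. We saw that they correspond to values taken by $\bs{d}$ that maximize the following function:

$$
\bs{d}^* = \underset{\bs{d}}{\arg\max}\ \text{Tr}(\bs{d}^\text{T}\bs{X}^\text{T}\bs{X}\bs{d}) \quad \text{subject to} \quad \bs{d}^\text{T}\bs{d}=1
$$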

To find $\bs{d}$ we can calculate the eigenvectors of $\bs{X^\text{T}X}$ (see 2.7 for more details about eigendecomposition). So let’s do that:

```
eigVals, eigVecs = np.linalg.eig(X_centered.T.dot(X_centered))
eigVecs
```

array([[-0.91116273, -0.41204669], [ 0.41204669, -0.91116273]])

These are the vectors maximizing our function. Each column vector is associated with an eigenvalue. The vector associated with the larger eigenvalue tells us the direction associated with the larger variance in our data. To check that, we will plot these vectors along with the data.

```
orange = '#FF9A13'
blue = '#1190FF'
# plotVectors is the plotting helper from the notebooks of this series (not defined in this post)
plotVectors(eigVecs.T, [orange, blue])
plt.plot(X_centered[:,0], X_centered[:,1], '*')
plt.xlim(-3, 3)
plt.ylim(-3, 3)
plt.show()
```

*Eigenvectors of the covariance matrix*

We can see that the blue vector direction corresponds to the oblique shape of our data. The idea is that if you project the data points on the line corresponding to the blue vector direction you will end up with the largest variance. This vector has the direction that maximizes variance of projected data. Have a look at the following figure:

*Projection of the data point: this line direction is the one with the largest variance*

When you project data points on the pink line there is more variance. This line has the direction that maximizes the variance of the data points. It is the same for the figure above: our blue vector has the direction of the line where data point projection has the higher variance. Then the second eigenvector is orthogonal to the first.

In our figure above, the blue vector is the second eigenvector so let’s check that it is the one associated with the bigger eigenvalue:

```
eigVals
```

array([ 18.04730409, 798.35242844])

So yes, the second vector corresponds to the biggest eigenvalue.
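Note that `np.linalg.eig` does not return the eigenvalues in any particular order, which is why the first PC ended up in the second column here. A possible way to avoid checking by hand is sketched below: it rebuilds the same toy dataset, uses `np.linalg.eigh` (suited to symmetric matrices like $\bs{X^\text{T}X}$) and sorts the eigenpairs by decreasing eigenvalue. The names `eigVals_sorted`, `eigVecs_sorted` and `first_pc` are mine, and the signs of the eigenvectors can differ from the output above.

```python
import numpy as np

# Rebuild the toy dataset of this example
np.random.seed(123)
x = 5 * np.random.rand(100)
y = 2 * x + 1 + np.random.randn(100)
X = np.column_stack([x, y])
X_centered = X - X.mean(axis=0)

# eigh is meant for symmetric matrices; eigenvalues come out in ascending order
eigVals, eigVecs = np.linalg.eigh(X_centered.T @ X_centered)

# Reorder so that the first column is the first principal component
order = np.argsort(eigVals)[::-1]
eigVals_sorted = eigVals[order]
eigVecs_sorted = eigVecs[:, order]
first_pc = eigVecs_sorted[:, 0]

print(eigVals_sorted)
print(first_pc)
```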

Now that we have found the matrix $\bs{D}$ we will use the encoding function to rotate the data. The goal of the rotation is to end up with a new coordinate system where data is uncorrelated and thus where the basis axes gather all the variance. It is then possible to keep only a few axes: this is the purpose of dimensionality reduction.

Recall that the encoding function is:

$$
\bs{c}=\bs{D}^\text{T}\bs{x}
$$

$\bs{D}$ is the matrix containing the eigenvectors that we have calculated before. In addition, this formula corresponds to only one data point where dimensions are the rows of $\bs{x}$. In our case, we will apply it to all data points and since $\bs{X}$ has dimensions on the columns we need to transpose it.

```
# Apply the encoding function c = D^T x to all the centered data points
X_new = eigVecs.T.dot(X_centered.T)
plt.plot(X_new[0, :], X_new[1, :], '*')
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.show()
```

*We rotated the data in order to have the largest variance on one axis*

It worked! The rotation transformed our dataset, which now has most of its variance along one of the basis axes. You could keep only this dimension and have a fairly good representation of the data.
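We can put numbers on "fairly good". The sketch below rebuilds the same toy dataset and checks two things: in the rotated coordinates the two dimensions are decorrelated (the matrix of their products is diagonal, with the eigenvalues on the diagonal), and if we keep only the first PC, the squared reconstruction error is exactly the eigenvalue we dropped. The variable names (`scatter_new`, `codes`, `X_rec`, `rec_error`) are mine, and `np.linalg.eigh` is used so that the eigenvalues come out in ascending order.

```python
import numpy as np

# Rebuild and center the toy dataset of this example
np.random.seed(123)
x = 5 * np.random.rand(100)
y = 2 * x + 1 + np.random.randn(100)
X = np.column_stack([x, y])
X_centered = X - X.mean(axis=0)

# Eigenvalues in ascending order: the last column is the first PC
eigVals, eigVecs = np.linalg.eigh(X_centered.T @ X_centered)

# Rotate the data, then look at the products between the new dimensions
X_new = eigVecs.T @ X_centered.T  # shape (2, 100)
scatter_new = X_new @ X_new.T     # diagonal: the eigenvalues

# Keep only the first PC, then decode back to 2 dimensions
d = eigVecs[:, [-1]]              # shape (2, 1)
codes = X_centered @ d            # shape (100, 1): one number per point
X_rec = codes @ d.T               # shape (100, 2)
rec_error = np.linalg.norm(X_centered - X_rec, 'fro') ** 2

print(np.round(scatter_new, 6))
print(rec_error, eigVals[0])
```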

### About the unit norm constraint

We saw that the maximization is subject to $\bs{d}^\text{T}\bs{d}=1$. This means that the solution vector has to be a unit vector. Without this constraint, you could scale $\bs{d}$ up to infinity to increase the function to maximize (see here). For instance, let’s see a vector $\bs{d}$ that gives a large value of the function:

```
d = np.array([[12], [26]])
d.T.dot(X.T).dot(X).dot(d)
```

array([[ 4165298.04389264]])

However, this $\bs{d}$ does not have a unit norm. Since $\bs{d}$ is a column vector, its squared norm is $\bs{d}^\text{T}\bs{d}$ (see 2.2):

```
d.T.dot(d)
```

array([[820]])

The eigenvectors have unit norm and thus respect the constraint:

```
eigVecs[:,0].dot(eigVecs[:,0].T)
```

1.0

and

```
eigVecs[:,1].dot(eigVecs[:,1].T)
```

1.0
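To see why the constraint matters, here is a small sketch (with random data and a hand-picked unit vector, both assumptions for the illustration): multiplying $\bs{d}$ by a factor $k$ multiplies $\bs{d}^\text{T}\bs{X}^\text{T}\bs{X}\bs{d}$ by $k^2$, so without the unit norm constraint the function has no maximum.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(20, 2))  # illustrative data


def objective(d):
    """Value of d^T X^T X d for a column vector d."""
    return (d.T @ X.T @ X @ d).item()


# A unit vector: 0.6^2 + 0.8^2 = 1
d = np.array([[0.6], [0.8]])

# Scaling d by k scales the objective by k^2
k = 10.0
ratio = objective(k * d) / objective(d)

print((d.T @ d).item(), ratio)
```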

And… This is the end! We have gone through a lot of things during this series on linear algebra! I hope that it was a useful introduction to this topic which is of large importance in the data science/machine learning/deep learning fields.

Feel free to drop me an email or a comment. The syllabus of this series can be found in the introduction post. All the notebooks can be found on Github.

1. Scalars, Vectors, Matrices and Tensors

2. Multiplying Matrices and Vectors

3. Identity and Inverse Matrices

4. Linear Dependence and Span

5. Norms

6. Special Kinds of Matrices and Vectors

7. Eigendecomposition

8. Singular Value Decomposition

9. The Moore-Penrose Pseudoinverse

10. The Trace Operator

11. The Determinant

12. Principal Components Analysis (PCA)