We will see an important concept for machine learning and deep learning. The norm is what is generally used to evaluate the error of a model. For instance it is used to calculate the error between the output of a neural network and what is expected (the actual label or value). You can think of the norm as the length of a vector. It is a function that maps a vector to a positive value. Different functions can be used and we will see few examples.

2.5 Norms

Norms are any functions that are characterized by the following properties:

1- Norms are non-negative values. If you think of the norms as a length, you easily see why it can’t be negative.

2- Norms are $0$ if and only if the vector is a zero vector

3- Norms respect the triangle inequity. See bellow.

4- $\norm{\bs{k}\cdot \bs{u}}= | \bs{k} | \cdot\norm{\bs{u}}$. The norm of a vector multiplied by a scalar is equal to the absolute value of this scalar multiplied by the norm of the vector.

It is usually written with two horizontal bars: $\norm{\bs{x}}$

The triangle inequity

The norm of the sum of some vectors is less than or equal to the sum of the norms of these vectors.

$ \norm{\bs{u}+\bs{v}} \leq \norm{\bs{u}}+\norm{\bs{v}} $

Example 1.

$ \bs{u}= \begin{bmatrix} 1 & 6 \end{bmatrix} $


$ \bs{v}= \begin{bmatrix} 4 & 2 \end{bmatrix} $

$ \norm{\bs{u}+\bs{v}} = \sqrt{(1+4)^2+(6+2)^2} = \sqrt{89} \approx 9.43 $

$ \norm{\bs{u}}+\norm{\bs{v}} = \sqrt{1^2+6^2}+\sqrt{4^2+2^2} = \sqrt{37}+\sqrt{20} \approx 10.55 $

Let’s check these results:

u = np.array([1, 6])
array([1, 6])
v = np.array([4, 2])
array([4, 2])
array([5, 8])
u = [0,0,1,6]
v = [0,0,4,2]
u_bis = [1,6,v[2],v[3]]
w = [0,0,5,8]
plt.quiver([u[0], u_bis[0], w[0]],
           [u[1], u_bis[1], w[1]],
           [u[2], u_bis[2], w[2]],
           [u[3], u_bis[3], w[3]],
           angles='xy', scale_units='xy', scale=1, color=sns.color_palette())
# plt.rc('text', usetex=True)
plt.xlim(-2, 6)
plt.ylim(-2, 9)
plt.axvline(x=0, color='grey')
plt.axhline(y=0, color='grey')

plt.text(-1, 3.5, r'$||\vec{u}||$', color=sns.color_palette()[0], size=20)
plt.text(2.5, 7.5, r'$||\vec{v}||$', color=sns.color_palette()[1], size=20)
plt.text(2, 2, r'$||\vec{u}+\vec{v}||$', color=sns.color_palette()[2], size=20)

Vector illustration of the triangle inequity Vector illustration of the triangle inequity

Geometrically, this simply means that the shortest path between two points is a line

Essential Math for Data Science
If you want to learn more math for data science and machine learning using Python, check my book Essential Math for Data Science.

P-norms: general rules

Here is the recipe to get the $p$-norm of a vector:

  1. Calculate the absolute value of each element
  2. Take the power $p$ of these absolute values
  3. Sum all these powered absolute values
  4. Take the power $\frac{1}{p}$ of this result

This is more condensly expressed with the formula:

$ \norm{\bs{x}}_p=(\sum_i|\bs{x}_i|^p)^{1/p} $

This will be clear with examples using these widely used $p$-norms.

The $L^0$ norm

All positive values will get you a $1$ if you calculate its power $0$ except $0$ that will get you another $0$. Therefore this norm corresponds to the number of non-zero elements in the vector. It is not really a norm because if you multiply the vector by $\alpha$, this number is the same (rule 4 above).

The $L^1$ norm

$p=1$ so this norm is simply the sum of the absolute values:

$ \norm{\bs{x}}1=\sum{i} |\bs{x}_i| $

The Euclidean norm ($L^2$ norm)

The Euclidean norm is the $p$-norm with $p=2$. This may be the more used norm with the squared $L^2$ norm.

$ \norm{\bs{x}}_2=(\sum_i \bs{x}_i^2)^{1/2}\Leftrightarrow \sqrt{\sum_i \bs{x}_i^2} $

Let’s see an example of this norm:

Example 2.

Graphically, the Euclidean norm corresponds to the length of the vector from the origin to the point obtained by linear combination (like applying Pythagorean theorem).

$ \bs{u}= \begin{bmatrix} 3 \\ 4 \end{bmatrix} $

$ \begin{aligned} \norm{\bs{u}}_2 &=\sqrt{|3|^2+|4|^2}\\ &=\sqrt{25}\\ &=5 \end{aligned} $

So the $L^2$ norm is $5$.

The $L^2$ norm can be calculated with the linalg.norm function from numpy. We can check the result:

np.linalg.norm([3, 4])

Here is the graphical representation of the vectors:

u = [0,0,3,4]

           angles='xy', scale_units='xy', scale=1)

plt.xlim(-2, 4)
plt.ylim(-2, 5)
plt.axvline(x=0, color='grey')
plt.axhline(y=0, color='grey')

plt.annotate('', xy = (3.2, 0), xytext = (3.2, 4),
             arrowprops=dict(edgecolor='black', arrowstyle = '<->'))
plt.annotate('', xy = (0, -0.2), xytext = (3, -0.2),
             arrowprops=dict(edgecolor='black', arrowstyle = '<->'))

plt.text(1, 2.5, r'$\vec{u}$', size=18)
plt.text(3.3, 2, r'$\vec{u}_y$', size=18)
plt.text(1.5, -1, r'$\vec{u}_x$', size=18)

Vector illustration of the l2 norm Vectors of the example 2.

In this case, the vector is in a 2-dimensional space but this stands also for more dimensions.

$ u= \begin{bmatrix} u_1\\ u_2\\ \cdots \\ u_n \end{bmatrix} $

$ ||u||_2 = \sqrt{u_1^2+u_2^2+\cdots+u_n^2} $

The squared Euclidean norm (squared $L^2$ norm)

$ \sum_i|\bs{x}_i|^2 $

The squared $L^2$ norm is convenient because it removes the square root and we end up with the simple sum of every squared values of the vector.

The squared Euclidean norm is widely used in machine learning partly because it can be calculated with the vector operation $\bs{x}^\text{T}\bs{x}$. There can be performance gain due to the optimization See here and here for more details.

Example 3.

$ \bs{x}= \begin{bmatrix} 2 \\ 5 \\ 3 \\ 3 \end{bmatrix} $

$ \bs{x}^\text{T}= \begin{bmatrix} 2 & 5 & 3 & 3 \end{bmatrix} $

$ \begin{aligned} \bs{x}^\text{T}\bs{x}&= \begin{bmatrix} 2 & 5 & 3 & 3 \end{bmatrix} \times \begin{bmatrix} 2 \\ 5 \\ 3 \\ 3 \end{bmatrix}\\ &= 2\times 2 + 5\times 5 + 3\times 3 + 3\times 3= 47 \end{aligned} $

x = np.array([[2], [5], [3], [3]])
euclideanNorm =

It works!

Derivative of the squared $L^2$ norm

Another advantage of the squared $L^2$ norm is that its partial derivative is easily computed:

$ u= \begin{bmatrix} u_1\\ u_2\\ \cdots \\ u_n \end{bmatrix} $

$ \norm{u}_2 = u_1^2+u_2^2+\cdots+u_n^2 $

$ \begin{cases} \dfrac{d\norm{u}_2}{du_1} = 2u_1\\ \dfrac{d\norm{u}_2}{du_2} = 2u_2\\ \cdots\\ \dfrac{d\norm{u}_2}{du_n} = 2u_n \end{cases} $

Derivative of the $L^2$ norm

In the case of the $L^2$ norm, the derivative is more complicated and takes every elements of the vector into account:

$ \norm{u}_2 = \sqrt{(u_1^2+u_2^2+\cdots+u_n^2)} = (u_1^2+u_2^2+\cdots+u_n^2)^{\frac{1}{2}} $

$ \begin{aligned} \dfrac{d\norm{u}_2}{du_1} &= \dfrac{1}{2}(u_1^2+u_2^2+\cdots+u_n^2)^{\frac{1}{2}-1}\cdot \dfrac{d}{du_1}(u_1^2+u_2^2+\cdots+u_n^2)\\ &=\dfrac{1}{2}(u_1^2+u_2^2+\cdots+u_n^2)^{-\frac{1}{2}}\cdot \dfrac{d}{du_1}(u_1^2+u_2^2+\cdots+u_n^2)\\ &=\dfrac{1}{2}\cdot\dfrac{1}{(u_1^2+u_2^2+\cdots+u_n^2)^{\frac{1}{2}}}\cdot \dfrac{d}{du_1}(u_1^2+u_2^2+\cdots+u_n^2)\\ &=\dfrac{1}{2}\cdot\dfrac{1}{(u_1^2+u_2^2+\cdots+u_n^2)^{\frac{1}{2}}}\cdot 2\cdot u_1\\ &=\dfrac{u_1}{\sqrt{(u_1^2+u_2^2+\cdots+u_n^2)}}\\ \end{aligned} $

$ \begin{cases} \dfrac{d\norm{u}_2}{du_1} = \dfrac{u_1}{\sqrt{(u_1^2+u_2^2+\cdots+u_n^2)}}\\ \dfrac{d\norm{u}_2}{du_2} = \dfrac{u_2}{\sqrt{(u_1^2+u_2^2+\cdots+u_n^2)}}\\ \cdots\\ \dfrac{d\norm{u}_2}{du_n} = \dfrac{u_n}{\sqrt{(u_1^2+u_2^2+\cdots+u_n^2)}}\\ \end{cases} $

One problem of the squared $L^2$ norm is that it hardly discriminates between 0 and small values because the increase of the function is slow.

We can see this by graphically comparing the squared $L^2$ norm with the $L^2$ norm. The $z$-axis corresponds to the norm and the $x$- and $y$-axis correspond to two parameters. The same thing is true with more than 2 dimensions but it would be hard to visualize it.

$L^2$ norm:

Representation of the L2 norm The L2 norm

Squared $L^2$ norm:

Representation of the squared L2 norm The squared L2 norm

$L^1$ norm:

Representation of the L1 norm The L1 norm

These plots are done with the help of this website. Go and plot these norms if you need to move them in order to catch their shape.

The max norm

It is the $L^\infty$ norm and corresponds to the absolute value of the greatest element of the vector.

$ \norm{\bs{x}}_\infty = \max\limits_i|x_i| $

Matrix norms: the Frobenius norm

\(\norm{\bs{A}}_{F} = \sqrt{\sum_{i,j}A^{2}_{i,j}}\)

This is equivalent to take the $L^2$ norm of the matrix after flattening.

The same Numpy function can be use:

A = np.array([[1, 2], [6, 4], [3, 2]])
array([[1, 2],
       [6, 4],
       [3, 2]])

Expression of the dot product with norms

$ \bs{x}^\text{T}\bs{y} = \norm{\bs{x}}_2\cdot\norm{\bs{y}}_2\cos\theta $

Example 4.

$ \bs{x}= \begin{bmatrix} 0 \\ 2 \end{bmatrix} $


$ \bs{y}= \begin{bmatrix} 2 \\ 2 \end{bmatrix} $

x = [0,0,0,2]
y = [0,0,2,2]

plt.xlim(-2, 4)
plt.ylim(-2, 5)
plt.axvline(x=0, color='grey', zorder=0)
plt.axhline(y=0, color='grey', zorder=0)

plt.quiver([x[0], y[0]],
           [x[1], y[1]],
           [x[2], y[2]],
           [x[3], y[3]],
           angles='xy', scale_units='xy', scale=1)

plt.text(-0.5, 1, r'$\vec{x}$', size=18)
plt.text(1.5, 0.5, r'$\vec{y}$', size=18)

Expression of the dot product with norms Expression of the dot product with norms

We took this example for its simplicity. As we can see, the angle $\theta$ is equal to 45°.

$ \bs{x^\text{T}y}= \begin{bmatrix} 0 & 2 \end{bmatrix} \cdot \begin{bmatrix} 2 \\ 2 \end{bmatrix} = 0\times2+2\times2 = 4 $


$ \norm{\bs{x}}_2=\sqrt{0^2+2^2}=\sqrt{4}=2 $

$ \norm{\bs{y}}_2=\sqrt{2^2+2^2}=\sqrt{8} $

$ 2\times\sqrt{8}\times cos(45)=4 $

Here are the operations using numpy:

# Note: np.cos take the angle in radian


Feel free to drop me an email or a comment. The syllabus of this series can be found in the introduction post. All the notebooks can be found on Github.

This content is part of a series following the chapter 2 on linear algebra from the Deep Learning Book by Goodfellow, I., Bengio, Y., and Courville, A. (2016). It aims to provide intuitions/drawings/python code on mathematical theories and is constructed as my understanding of these concepts. You can check the syllabus in the introduction post.

✠  Previous Deep Learning Book Series · 2.4 Linear Dependence and Span ✠  Next Deep Learning Book Series · 2.6 Special Kinds of Matrices and Vectors