# 18.1. Geometry and Linear Algebraic Operations¶

In Section 2.3, we encountered the basics of linear algebra and saw how it could be used to express common operations for transforming our data. Linear algebra is one of the key mathematical pillars underlying much of the work that we do in deep learning and in machine learning more broadly. While Section 2.3 contained enough machinery to communicate the mechanics of modern deep learning models, there is a lot more to the subject. In this section, we will go deeper, highlighting some geometric interpretations of linear algebra operations, and introducing a few fundamental concepts, including of eigenvalues and eigenvectors.

## 18.1.1. Geometry of Vectors¶

First, we need to discuss the two common geometric interpretations of vectors, as either points or directions in space. Fundamentally, a vector is a list of numbers such as the Python list below.

```
v = [1, 7, 0, 1]
```

Mathematicians most often write this as either a *column* or *row*
vector, which is to say either as

or

These often have different interpretations, where data examples are column vectors and weights used to form weighted sums are row vectors. However, it can be beneficial to be flexible. As we have described in Section 2.3, though a single vector’s default orientation is a column vector, for any matrix representing a tabular dataset, treating each data example as a row vector in the matrix is more conventional.

Given a vector, the first interpretation that we should give it is as a
point in space. In two or three dimensions, we can visualize these
points by using the components of the vectors to define the location of
the points in space compared to a fixed reference called the *origin*.
This can be seen in Fig. 18.1.1.

This geometric point of view allows us to consider the problem on a more abstract level. No longer faced with some insurmountable seeming problem like classifying pictures as either cats or dogs, we can start considering tasks abstractly as collections of points in space and picturing the task as discovering how to separate two distinct clusters of points.

In parallel, there is a second point of view that people often take of vectors: as directions in space. Not only can we think of the vector \(\mathbf{v} = [3,2]^\top\) as the location \(3\) units to the right and \(2\) units up from the origin, we can also think of it as the direction itself to take \(3\) steps to the right and \(2\) steps up. In this way, we consider all the vectors in figure Fig. 18.1.2 the same.

One of the benefits of this shift is that we can make visual sense of the act of vector addition. In particular, we follow the directions given by one vector, and then follow the directions given by the other, as is seen in Fig. 18.1.3.

Vector subtraction has a similar interpretation. By considering the identity that \(\mathbf{u} = \mathbf{v} + (\mathbf{u}-\mathbf{v})\), we see that the vector \(\mathbf{u}-\mathbf{v}\) is the direction that takes us from the point \(\mathbf{v}\) to the point \(\mathbf{u}\).

## 18.1.2. Dot Products and Angles¶

As we saw in Section 2.3, if we take two column vectors \(\mathbf{u}\) and \(\mathbf{v}\), we can form their dot product by computing:

Because (18.1.3) is symmetric, we will mirror the notation of classical multiplication and write

to highlight the fact that exchanging the order of the vectors will yield the same answer.

The dot product (18.1.3) also admits a geometric interpretation: it is closely related to the angle between two vectors. Consider the angle shown in Fig. 18.1.4.

To start, let us consider two specific vectors:

The vector \(\mathbf{v}\) is length \(r\) and runs parallel to the \(x\)-axis, and the vector \(\mathbf{w}\) is of length \(s\) and at angle \(\theta\) with the \(x\)-axis. If we compute the dot product of these two vectors, we see that

With some simple algebraic manipulation, we can rearrange terms to obtain

In short, for these two specific vectors, the dot product combined with the norms tell us the angle between the two vectors. This same fact is true in general. We will not derive the expression here, however, if we consider writing \(\|\mathbf{v} - \mathbf{w}\|^2\) in two ways: one with the dot product, and the other geometrically using the law of cosines, we can obtain the full relationship. Indeed, for any two vectors \(\mathbf{v}\) and \(\mathbf{w}\), the angle between the two vectors is

This is a nice result since nothing in the computation references two-dimensions. Indeed, we can use this in three or three million dimensions without issue.

As a simple example, let us see how to compute the angle between a pair of vectors:

```
%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import gluon, np, npx
npx.set_np()
def angle(v, w):
return np.arccos(v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w)))
angle(np.array([0, 1, 2]), np.array([2, 3, 4]))
```

```
array(0.41899002)
```

```
%matplotlib inline
from d2l import torch as d2l
from IPython import display
import torch
from torchvision import transforms
import torchvision
def angle(v, w):
return torch.acos(v.dot(w) / (torch.norm(v) * torch.norm(w)))
angle(torch.tensor([0, 1, 2], dtype=torch.float32), torch.tensor([2.0, 3, 4]))
```

```
tensor(0.4190)
```

```
%matplotlib inline
from d2l import tensorflow as d2l
from IPython import display
import tensorflow as tf
def angle(v, w):
return tf.acos(tf.tensordot(v, w, axes=1) / (tf.norm(v) * tf.norm(w)))
angle(tf.constant([0, 1, 2], dtype=tf.float32), tf.constant([2.0, 3, 4]))
```

```
<tf.Tensor: shape=(), dtype=float32, numpy=0.41899002>
```

We will not use it right now, but it is useful to know that we will
refer to vectors for which the angle is \(\pi/2\) (or equivalently
\(90^{\circ}\)) as being *orthogonal*. By examining the equation
above, we see that this happens when \(\theta = \pi/2\), which is
the same thing as \(\cos(\theta) = 0\). The only way this can happen
is if the dot product itself is zero, and two vectors are orthogonal if
and only if \(\mathbf{v}\cdot\mathbf{w} = 0\). This will prove to be
a helpful formula when understanding objects geometrically.

It is reasonable to ask: why is computing the angle useful? The answer
comes in the kind of invariance we expect data to have. Consider an
image, and a duplicate image, where every pixel value is the same but
\(10\%\) the brightness. The values of the individual pixels are in
general far from the original values. Thus, if one computed the distance
between the original image and the darker one, the distance can be
large. However, for most ML applications, the *content* is the same—it
is still an image of a cat as far as a cat/dog classifier is concerned.
However, if we consider the angle, it is not hard to see that for any
vector \(\mathbf{v}\), the angle between \(\mathbf{v}\) and
\(0.1\cdot\mathbf{v}\) is zero. This corresponds to the fact that
scaling vectors keeps the same direction and just changes the length.
The angle considers the darker image identical.

Examples like this are everywhere. In text, we might want the topic being discussed to not change if we write twice as long of document that says the same thing. For some encoding (such as counting the number of occurrences of words in some vocabulary), this corresponds to a doubling of the vector encoding the document, so again we can use the angle.

### 18.1.2.1. Cosine Similarity¶

In ML contexts where the angle is employed to measure the closeness of
two vectors, practitioners adopt the term *cosine similarity* to refer
to the portion

The cosine takes a maximum value of \(1\) when the two vectors point in the same direction, a minimum value of \(-1\) when they point in opposite directions, and a value of \(0\) when the two vectors are orthogonal. Note that if the components of high-dimensional vectors are sampled randomly with mean \(0\), their cosine will nearly always be close to \(0\).

## 18.1.3. Hyperplanes¶

In addition to working with vectors, another key object that you must
understand to go far in linear algebra is the *hyperplane*, a
generalization to higher dimensions of a line (two dimensions) or of a
plane (three dimensions). In an \(d\)-dimensional vector space, a
hyperplane has \(d-1\) dimensions and divides the space into two
half-spaces.

Let us start with an example. Suppose that we have a column vector \(\mathbf{w}=[2,1]^\top\). We want to know, “what are the points \(\mathbf{v}\) with \(\mathbf{w}\cdot\mathbf{v} = 1\)?” By recalling the connection between dot products and angles above (18.1.8), we can see that this is equivalent to

If we consider the geometric meaning of this expression, we see that this is equivalent to saying that the length of the projection of \(\mathbf{v}\) onto the direction of \(\mathbf{w}\) is exactly \(1/\|\mathbf{w}\|\), as is shown in Fig. 18.1.5. The set of all points where this is true is a line at right angles to the vector \(\mathbf{w}\). If we wanted, we could find the equation for this line and see that it is \(2x + y = 1\) or equivalently \(y = 1 - 2x\).

If we now look at what happens when we ask about the set of points with \(\mathbf{w}\cdot\mathbf{v} > 1\) or \(\mathbf{w}\cdot\mathbf{v} < 1\), we can see that these are cases where the projections are longer or shorter than \(1/\|\mathbf{w}\|\), respectively. Thus, those two inequalities define either side of the line. In this way, we have found a way to cut our space into two halves, where all the points on one side have dot product below a threshold, and the other side above as we see in Fig. 18.1.6.

The story in higher dimension is much the same. If we now take \(\mathbf{w} = [1,2,3]^\top\) and ask about the points in three dimensions with \(\mathbf{w}\cdot\mathbf{v} = 1\), we obtain a plane at right angles to the given vector \(\mathbf{w}\). The two inequalities again define the two sides of the plane as is shown in Fig. 18.1.7.

While our ability to visualize runs out at this point, nothing stops us
from doing this in tens, hundreds, or billions of dimensions. This
occurs often when thinking about machine learned models. For instance,
we can understand linear classification models like those from
Section 3.4, as methods to find hyperplanes that separate
the different target classes. In this context, such hyperplanes are
often referred to as *decision planes*. The majority of deep learned
classification models end with a linear layer fed into a softmax, so one
can interpret the role of the deep neural network to be to find a
non-linear embedding such that the target classes can be separated
cleanly by hyperplanes.

To give a hand-built example, notice that we can produce a reasonable model to classify tiny images of t-shirts and trousers from the Fashion MNIST dataset (seen in Section 3.5) by just taking the vector between their means to define the decision plane and eyeball a crude threshold. First we will load the data and compute the averages.

```
# Load in the dataset
train = gluon.data.vision.FashionMNIST(train=True)
test = gluon.data.vision.FashionMNIST(train=False)
X_train_0 = np.stack([x[0] for x in train if x[1] == 0]).astype(float)
X_train_1 = np.stack([x[0] for x in train if x[1] == 1]).astype(float)
X_test = np.stack(
[x[0] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
y_test = np.stack(
[x[1] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
# Compute averages
ave_0 = np.mean(X_train_0, axis=0)
ave_1 = np.mean(X_train_1, axis=0)
```

```
# Load in the dataset
trans = []
trans.append(transforms.ToTensor())
trans = transforms.Compose(trans)
train = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=True, download=True)
test = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=False, download=True)
X_train_0 = torch.stack(
[x[0] * 256 for x in train if x[1] == 0]).type(torch.float32)
X_train_1 = torch.stack(
[x[0] * 256 for x in train if x[1] == 1]).type(torch.float32)
X_test = torch.stack(
[x[0] * 256 for x in test if x[1] == 0 or x[1] == 1]).type(torch.float32)
y_test = torch.stack([torch.tensor(x[1]) for x in test
if x[1] == 0 or x[1] == 1]).type(torch.float32)
# Compute averages
ave_0 = torch.mean(X_train_0, axis=0)
ave_1 = torch.mean(X_train_1, axis=0)
```

```
# Load in the dataset
((train_images, train_labels), (
test_images, test_labels)) = tf.keras.datasets.fashion_mnist.load_data()
X_train_0 = tf.cast(tf.stack(train_images[[i for i, label in enumerate(
train_labels) if label == 0]] * 256), dtype=tf.float32)
X_train_1 = tf.cast(tf.stack(train_images[[i for i, label in enumerate(
train_labels) if label == 1]] * 256), dtype=tf.float32)
X_test = tf.cast(tf.stack(test_images[[i for i, label in enumerate(
test_labels) if label == 0]] * 256), dtype=tf.float32)
y_test = tf.cast(tf.stack(test_images[[i for i, label in enumerate(
test_labels) if label == 1]] * 256), dtype=tf.float32)
# Compute averages
ave_0 = tf.reduce_mean(X_train_0, axis=0)
ave_1 = tf.reduce_mean(X_train_1, axis=0)
```

It can be informative to examine these averages in detail, so let us plot what they look like. In this case, we see that the average indeed resembles a blurry image of a t-shirt.

```
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
```

```
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
```

```
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(tf.reshape(ave_0, (28, 28)), cmap='Greys')
d2l.plt.show()
```

In the second case, we again see that the average resembles a blurry image of trousers.

```
# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
```

```
# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
```

```
# Plot average trousers
d2l.plt.imshow(tf.reshape(ave_1, (28, 28)), cmap='Greys')
d2l.plt.show()
```

In a fully machine learned solution, we would learn the threshold from the dataset. In this case, I simply eyeballed a threshold that looked good on the training data by hand.

```
# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
predictions = X_test.reshape(2000, -1).dot(w.flatten()) > -1500000
# Accuracy
np.mean(predictions.astype(y_test.dtype) == y_test, dtype=np.float64)
```

```
array(0.801, dtype=float64)
```

```
# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
# '@' is Matrix Multiplication operator in pytorch.
predictions = X_test.reshape(2000, -1) @ (w.flatten()) > -1500000
# Accuracy
torch.mean(predictions.type(y_test.dtype) == y_test, dtype=torch.float64)
```

```
tensor(0.7870, dtype=torch.float64)
```

```
# Print test set accuracy with eyeballed threshold
w = tf.transpose(ave_1 - ave_0)
predictions = tf.reduce_sum(X_test * tf.nest.flatten(w), axis=0) > -1500000
# Accuracy
tf.reduce_mean(
tf.cast(tf.cast(predictions, y_test.dtype) == y_test, tf.float32))
```

```
<tf.Tensor: shape=(), dtype=float32, numpy=0.4602704>
```

## 18.1.4. Geometry of Linear Transformations¶

Through Section 2.3 and the above discussions, we have a solid understanding of the geometry of vectors, lengths, and angles. However, there is one important object we have omitted discussing, and that is a geometric understanding of linear transformations represented by matrices. Fully internalizing what matrices can do to transform data between two potentially different high dimensional spaces takes significant practice, and is beyond the scope of this appendix. However, we can start building up intuition in two dimensions.

Suppose that we have some matrix:

If we want to apply this to an arbitrary vector \(\mathbf{v} = [x, y]^\top\), we multiply and see that

This may seem like an odd computation, where something clear became
somewhat impenetrable. However, it tells us that we can write the way
that a matrix transforms *any* vector in terms of how it transforms *two
specific vectors*: \([1,0]^\top\) and \([0,1]^\top\). This is
worth considering for a moment. We have essentially reduced an infinite
problem (what happens to any pair of real numbers) to a finite one (what
happens to these specific vectors). These vectors are an example a
*basis*, where we can write any vector in our space as a weighted sum of
these *basis vectors*.

Let us draw what happens when we use the specific matrix

If we look at the specific vector \(\mathbf{v} = [2, -1]^\top\), we see this is \(2\cdot[1,0]^\top + -1\cdot[0,1]^\top\), and thus we know that the matrix \(A\) will send this to \(2(\mathbf{A}[1,0]^\top) + -1(\mathbf{A}[0,1])^\top = 2[1, -1]^\top - [2,3]^\top = [0, -5]^\top\). If we follow this logic through carefully, say by considering the grid of all integer pairs of points, we see that what happens is that the matrix multiplication can skew, rotate, and scale the grid, but the grid structure must remain as you see in Fig. 18.1.8.

This is the most important intuitive point to internalize about linear transformations represented by matrices. Matrices are incapable of distorting some parts of space differently than others. All they can do is take the original coordinates on our space and skew, rotate, and scale them.

Some distortions can be severe. For instance the matrix

compresses the entire two-dimensional plane down to a single line. Identifying and working with such transformations are the topic of a later section, but geometrically we can see that this is fundamentally different from the types of transformations we saw above. For instance, the result from matrix \(\mathbf{A}\) can be “bent back” to the original grid. The results from matrix \(\mathbf{B}\) cannot because we will never know where the vector \([1,2]^\top\) came from—was it \([1,1]^\top\) or \([0, -1]^\top\)?

While this picture was for a \(2\times2\) matrix, nothing prevents us from taking the lessons learned into higher dimensions. If we take similar basis vectors like \([1,0, \ldots,0]\) and see where our matrix sends them, we can start to get a feeling for how the matrix multiplication distorts the entire space in whatever dimension space we are dealing with.

## 18.1.5. Linear Dependence¶

Consider again the matrix

This compresses the entire plane down to live on the single line
\(y = 2x\). The question now arises: is there some way we can detect
this just looking at the matrix itself? The answer is that indeed we
can. Let us take \(\mathbf{b}_1 = [2,4]^\top\) and
\(\mathbf{b}_2 = [-1, -2]^\top\) be the two columns of
\(\mathbf{B}\). Remember that we can write everything transformed by
the matrix \(\mathbf{B}\) as a weighted sum of the columns of the
matrix: like \(a_1\mathbf{b}_1 + a_2\mathbf{b}_2\). We call this a
*linear combination*. The fact that
\(\mathbf{b}_1 = -2\cdot\mathbf{b}_2\) means that we can write any
linear combination of those two columns entirely in terms of say
\(\mathbf{b}_2\) since

This means that one of the columns is, in a sense, redundant because it does not define a unique direction in space. This should not surprise us too much since we already saw that this matrix collapses the entire plane down into a single line. Moreover, we see that the linear dependence \(\mathbf{b}_1 = -2\cdot\mathbf{b}_2\) captures this. To make this more symmetrical between the two vectors, we will write this as

In general, we will say that a collection of vectors
\(\mathbf{v}_1, \ldots, \mathbf{v}_k\) are *linearly dependent* if
there exist coefficients \(a_1, \ldots, a_k\) *not all equal to
zero* so that

In this case, we can solve for one of the vectors in terms of some
combination of the others, and effectively render it redundant. Thus, a
linear dependence in the columns of a matrix is a witness to the fact
that our matrix is compressing the space down to some lower dimension.
If there is no linear dependence we say the vectors are *linearly
independent*. If the columns of a matrix are linearly independent, no
compression occurs and the operation can be undone.

## 18.1.6. Rank¶

If we have a general \(n\times m\) matrix, it is reasonable to ask
what dimension space the matrix maps into. A concept known as the *rank*
will be our answer. In the previous section, we noted that a linear
dependence bears witness to compression of space into a lower dimension
and so we will be able to use this to define the notion of rank. In
particular, the rank of a matrix \(\mathbf{A}\) is the largest
number of linearly independent columns amongst all subsets of columns.
For example, the matrix

has \(\mathrm{rank}(B)=1\), since the two columns are linearly dependent, but either column by itself is not linearly dependent. For a more challenging example, we can consider

and show that \(\mathbf{C}\) has rank two since, for instance, the first two columns are linearly independent, however any of the four collections of three columns are dependent.

This procedure, as described, is very inefficient. It requires looking at every subset of the columns of our given matrix, and thus is potentially exponential in the number of columns. Later we will see a more computationally efficient way to compute the rank of a matrix, but for now, this is sufficient to see that the concept is well defined and understand the meaning.

## 18.1.7. Invertibility¶

We have seen above that multiplication by a matrix with linearly dependent columns cannot be undone, i.e., there is no inverse operation that can always recover the input. However, multiplication by a full-rank matrix (i.e., some \(\mathbf{A}\) that is \(n \times n\) matrix with rank \(n\)), we should always be able to undo it. Consider the matrix

which is the matrix with ones along the diagonal, and zeros elsewhere.
We call this the *identity* matrix. It is the matrix which leaves our
data unchanged when applied. To find a matrix which undoes what our
matrix \(\mathbf{A}\) has done, we want to find a matrix
\(\mathbf{A}^{-1}\) such that

If we look at this as a system, we have \(n \times n\) unknowns (the
entries of \(\mathbf{A}^{-1}\)) and \(n \times n\) equations
(the equality that needs to hold between every entry of the product
\(\mathbf{A}^{-1}\mathbf{A}\) and every entry of \(\mathbf{I}\))
so we should generically expect a solution to exist. Indeed, in the next
section we will see a quantity called the *determinant*, which has the
property that as long as the determinant is not zero, we can find a
solution. We call such a matrix \(\mathbf{A}^{-1}\) the *inverse*
matrix. As an example, if \(\mathbf{A}\) is the general
\(2 \times 2\) matrix

then we can see that the inverse is

We can test to see this by seeing that multiplying by the inverse given by the formula above works in practice.

```
M = np.array([[1, 2], [1, 4]])
M_inv = np.array([[2, -1], [-0.5, 0.5]])
M_inv.dot(M)
```

```
array([[1., 0.],
[0., 1.]])
```

```
M = torch.tensor([[1, 2], [1, 4]], dtype=torch.float32)
M_inv = torch.tensor([[2, -1], [-0.5, 0.5]])
M_inv @ M
```

```
tensor([[1., 0.],
[0., 1.]])
```

```
M = tf.constant([[1, 2], [1, 4]], dtype=tf.float32)
M_inv = tf.constant([[2, -1], [-0.5, 0.5]])
tf.matmul(M_inv, M)
```

```
<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1., 0.],
[0., 1.]], dtype=float32)>
```

### 18.1.7.1. Numerical Issues¶

While the inverse of a matrix is useful in theory, we must say that most
of the time we do not wish to *use* the matrix inverse to solve a
problem in practice. In general, there are far more numerically stable
algorithms for solving linear equations like

than computing the inverse and multiplying to get

Just as division by a small number can lead to numerical instability, so can inversion of a matrix which is close to having low rank.

Moreover, it is common that the matrix \(\mathbf{A}\) is *sparse*,
which is to say that it contains only a small number of non-zero values.
If we were to explore examples, we would see that this does not mean the
inverse is sparse. Even if \(\mathbf{A}\) was a \(1\) million by
\(1\) million matrix with only \(5\) million non-zero entries
(and thus we need only store those \(5\) million), the inverse will
typically have almost every entry non-negative, requiring us to store
all \(1\text{M}^2\) entries—that is \(1\) trillion entries!

While we do not have time to dive all the way into the thorny numerical issues frequently encountered when working with linear algebra, we want to provide you with some intuition about when to proceed with caution, and generally avoiding inversion in practice is a good rule of thumb.

## 18.1.8. Determinant¶

The geometric view of linear algebra gives an intuitive way to interpret
a fundamental quantity known as the *determinant*. Consider the grid
image from before, but now with a highlighted region
(Fig. 18.1.9).

Look at the highlighted square. This is a square with edges given by \((0, 1)\) and \((1, 0)\) and thus it has area one. After \(\mathbf{A}\) transforms this square, we see that it becomes a parallelogram. There is no reason this parallelogram should have the same area that we started with, and indeed in the specific case shown here of

it is an exercise in coordinate geometry to compute the area of this parallelogram and obtain that the area is \(5\).

In general, if we have a matrix

we can see with some computation that the area of the resulting
parallelogram is \(ad-bc\). This area is referred to as the
*determinant*.

Let us check this quickly with some example code.

```
import numpy as np
np.linalg.det(np.array([[1, -1], [2, 3]]))
```

```
5.000000000000001
```

```
torch.det(torch.tensor([[1, -1], [2, 3]], dtype=torch.float32))
```

```
tensor(5.)
```

```
tf.linalg.det(tf.constant([[1, -1], [2, 3]], dtype=tf.float32))
```

```
<tf.Tensor: shape=(), dtype=float32, numpy=5.0>
```

The eagle-eyed amongst us will notice that this expression can be zero or even negative. For the negative term, this is a matter of convention taken generally in mathematics: if the matrix flips the figure, we say the area is negated. Let us see now that when the determinant is zero, we learn more.

Let us consider

If we compute the determinant of this matrix, we get \(2\cdot(-2 ) - 4\cdot(-1) = 0\). Given our understanding above, this makes sense. \(\mathbf{B}\) compresses the square from the original image down to a line segment, which has zero area. And indeed, being compressed into a lower dimensional space is the only way to have zero area after the transformation. Thus we see the following result is true: a matrix \(A\) is invertible if and only if the determinant is not equal to zero.

As a final comment, imagine that we have any figure drawn on the plane. Thinking like computer scientists, we can decompose that figure into a collection of little squares so that the area of the figure is in essence just the number of squares in the decomposition. If we now transform that figure by a matrix, we send each of these squares to parallelograms, each one of which has area given by the determinant. We see that for any figure, the determinant gives the (signed) number that a matrix scales the area of any figure.

Computing determinants for larger matrices can be laborious, but the intuition is the same. The determinant remains the factor that \(n\times n\) matrices scale \(n\)-dimensional volumes.

## 18.1.9. Tensors and Common Linear Algebra Operations¶

In Section 2.3 the concept of tensors was introduced. In this section, we will dive more deeply into tensor contractions (the tensor equivalent of matrix multiplication), and see how it can provide a unified view on a number of matrix and vector operations.

With matrices and vectors we knew how to multiply them to transform data. We need to have a similar definition for tensors if they are to be useful to us. Think about matrix multiplication:

or equivalently

This pattern is one we can repeat for tensors. For tensors, there is no one case of what to sum over that can be universally chosen, so we need specify exactly which indices we want to sum over. For instance we could consider

Such a transformation is called a *tensor contraction*. It can represent
a far more flexible family of transformations that matrix multiplication
alone.

As a often-used notational simplification, we can notice that the sum is
over exactly those indices that occur more than once in the expression,
thus people often work with *Einstein notation*, where the summation is
implicitly taken over all repeated indices. This gives the compact
expression:

### 18.1.9.1. Common Examples from Linear Algebra¶

Let us see how many of the linear algebraic definitions we have seen before can be expressed in this compressed tensor notation:

\(\mathbf{v} \cdot \mathbf{w} = \sum_i v_iw_i\)

\(\|\mathbf{v}\|_2^{2} = \sum_i v_iv_i\)

\((\mathbf{A}\mathbf{v})_i = \sum_j a_{ij}v_j\)

\((\mathbf{A}\mathbf{B})_{ik} = \sum_j a_{ij}b_{jk}\)

\(\mathrm{tr}(\mathbf{A}) = \sum_i a_{ii}\)

In this way, we can replace a myriad of specialized notations with short tensor expressions.

### 18.1.9.2. Expressing in Code¶

Tensors may flexibly be operated on in code as well. As seen in Section 2.3, we can create tensors as is shown below.

```
# Define tensors
B = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = np.array([[1, 2], [3, 4]])
v = np.array([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
```

```
((2, 2), (2, 2, 3), (2,))
```

```
# Define tensors
B = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = torch.tensor([[1, 2], [3, 4]])
v = torch.tensor([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
```

```
(torch.Size([2, 2]), torch.Size([2, 2, 3]), torch.Size([2]))
```

```
# Define tensors
B = tf.constant([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = tf.constant([[1, 2], [3, 4]])
v = tf.constant([1, 2])
# Print out the shapes
A.shape, B.shape, v.shape
```

```
(TensorShape([2, 2]), TensorShape([2, 2, 3]), TensorShape([2]))
```

Einstein summation has been implemented directly. The indices that occurs in the Einstein summation can be passed as a string, followed by the tensors that are being acted upon. For instance, to implement matrix multiplication, we can consider the Einstein summation seen above (\(\mathbf{A}\mathbf{v} = a_{ij}v_j\)) and strip out the indices themselves to get the implementation:

```
# Reimplement matrix multiplication
np.einsum("ij, j -> i", A, v), A.dot(v)
```

```
(array([ 5, 11]), array([ 5, 11]))
```

```
# Reimplement matrix multiplication
torch.einsum("ij, j -> i", A, v), A@v
```

```
(tensor([ 5, 11]), tensor([ 5, 11]))
```

```
# Reimplement matrix multiplication
tf.einsum("ij, j -> i", A, v), tf.matmul(A, tf.reshape(v, (2, 1)))
```

```
(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([ 5, 11], dtype=int32)>,
<tf.Tensor: shape=(2, 1), dtype=int32, numpy=
array([[ 5],
[11]], dtype=int32)>)
```

This is a highly flexible notation. For instance if we want to compute what would be traditionally written as

it can be implemented via Einstein summation as:

```
np.einsum("ijk, il, j -> kl", B, A, v)
```

```
array([[ 90, 126],
[102, 144],
[114, 162]])
```

```
torch.einsum("ijk, il, j -> kl", B, A, v)
```

```
tensor([[ 90, 126],
[102, 144],
[114, 162]])
```

```
tf.einsum("ijk, il, j -> kl", B, A, v)
```

```
<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[ 90, 126],
[102, 144],
[114, 162]], dtype=int32)>
```

This notation is readable and efficient for humans, however bulky if for
whatever reason we need to generate a tensor contraction
programmatically. For this reason, `einsum`

provides an alternative
notation by providing integer indices for each tensor. For example, the
same tensor contraction can also be written as:

```
np.einsum(B, [0, 1, 2], A, [0, 3], v, [1], [2, 3])
```

```
array([[ 90, 126],
[102, 144],
[114, 162]])
```

```
# PyTorch doesn't support this type of notation.
```

```
# TensorFlow doesn't support this type of notation.
```

Either notation allows for concise and efficient representation of tensor contractions in code.

## 18.1.10. Summary¶

Vectors can be interpreted geometrically as either points or directions in space.

Dot products define the notion of angle to arbitrarily high-dimensional spaces.

Hyperplanes are high-dimensional generalizations of lines and planes. They can be used to define decision planes that are often used as the last step in a classification task.

Matrix multiplication can be geometrically interpreted as uniform distortions of the underlying coordinates. They represent a very restricted, but mathematically clean, way to transform vectors.

Linear dependence is a way to tell when a collection of vectors are in a lower dimensional space than we would expect (say you have \(3\) vectors living in a \(2\)-dimensional space). The rank of a matrix is the size of the largest subset of its columns that are linearly independent.

When a matrix’s inverse is defined, matrix inversion allows us to find another matrix that undoes the action of the first. Matrix inversion is useful in theory, but requires care in practice owing to numerical instability.

Determinants allow us to measure how much a matrix expands or contracts a space. A nonzero determinant implies an invertible (non-singular) matrix and a zero-valued determinant means that the matrix is non-invertible (singular).

Tensor contractions and Einstein summation provide for a neat and clean notation for expressing many of the computations that are seen in machine learning.

## 18.1.11. Exercises¶

What is the angle between

(18.1.35)¶\[\begin{split}\vec v_1 = \begin{bmatrix} 1 \\ 0 \\ -1 \\ 2 \end{bmatrix}, \qquad \vec v_2 = \begin{bmatrix} 3 \\ 1 \\ 0 \\ 1 \end{bmatrix}?\end{split}\]True or false: \(\begin{bmatrix}1 & 2\\0&1\end{bmatrix}\) and \(\begin{bmatrix}1 & -2\\0&1\end{bmatrix}\) are inverses of one another?

Suppose that we draw a shape in the plane with area \(100\mathrm{m}^2\). What is the area after transforming the figure by the matrix

(18.1.36)¶\[\begin{split}\begin{bmatrix} 2 & 3\\ 1 & 2 \end{bmatrix}.\end{split}\]Which of the following sets of vectors are linearly independent?

\(\left\{\begin{pmatrix}1\\0\\-1\end{pmatrix}, \begin{pmatrix}2\\1\\-1\end{pmatrix}, \begin{pmatrix}3\\1\\1\end{pmatrix}\right\}\)

\(\left\{\begin{pmatrix}3\\1\\1\end{pmatrix}, \begin{pmatrix}1\\1\\1\end{pmatrix}, \begin{pmatrix}0\\0\\0\end{pmatrix}\right\}\)

\(\left\{\begin{pmatrix}1\\1\\0\end{pmatrix}, \begin{pmatrix}0\\1\\-1\end{pmatrix}, \begin{pmatrix}1\\0\\1\end{pmatrix}\right\}\)

Suppose that you have a matrix written as \(A = \begin{bmatrix}c\\d\end{bmatrix}\cdot\begin{bmatrix}a & b\end{bmatrix}\) for some choice of values \(a, b, c\), and \(d\). True or false: the determinant of such a matrix is always \(0\)?

The vectors \(e_1 = \begin{bmatrix}1\\0\end{bmatrix}\) and \(e_2 = \begin{bmatrix}0\\1\end{bmatrix}\) are orthogonal. What is the condition on a matrix \(A\) so that \(Ae_1\) and \(Ae_2\) are orthogonal?

How can you write \(\mathrm{tr}(\mathbf{A}^4)\) in Einstein notation for an arbitrary matrix \(A\)?