Understanding Why a Single Linear Layer Does Not Classify Well

In deep neural networks, the most frequently used component may be the linear layer. However, a linear layer by itself does not work well as a classifier. In this article, I intend to explain why from my own point of view.

A linear layer is a linear function of the form $$F(A) = A \vec{x},$$ where $A$ is a weight matrix and $\vec{x}$ is the input vector. Training a DNN amounts to finding an optimal $A$ that maximizes an objective function. In this process, $\frac{\partial{F}}{\partial{A}}$ is computed and used to update $A$. To view $\frac{\partial{F}}{\partial{A}}$, we can rewrite $F$ as $$
F'(\mathrm{vec}(A)) = M \, \mathrm{vec}(A),
$$ where $\mathrm{vec}(A)$ is the vectorized $A$ (its columns stacked into one long vector), and $M$ is a corresponding matrix containing suitably arranged entries of $\vec{x}$ (concretely, $M = \vec{x}^{\top} \otimes I$), so that $$
F(A) = F'(\mathrm{vec}(A)) = M \, \mathrm{vec}(A).
$$
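The construction of $M$ can be checked numerically. Below is a minimal NumPy sketch, taking $\mathrm{vec}$ to stack the columns of $A$ and taking $M = \vec{x}^{\top} \otimes I$ (one concrete arrangement of the entries of $\vec{x}$ that satisfies the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4                      # A is n x m, x has length m
A = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# Column-stacking vectorization: vec(A) stacks the columns of A.
vec_A = A.flatten(order="F")

# M = x^T (kron) I rearranges the entries of x so that M vec(A) = A x.
M = np.kron(x.reshape(1, -1), np.eye(n))

assert np.allclose(M @ vec_A, A @ x)
```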
Now, it is clear that $$
\frac{\partial{F}}{\partial{A}} = \frac{\partial{F'}}{\partial{\mathrm{vec}(A)}} = M.
$$ When a gradient-descent optimizer updates $A$, the chain rule applies: if $\vec{\Lambda}$ is the gradient backpropagated from the layers above, the update is $\mathrm{vec}(A) \leftarrow \mathrm{vec}(A) - \eta \, M^{\top} \vec{\Lambda}$. Since $M^{\top} \vec{\Lambda} = (\vec{x} \otimes I) \vec{\Lambda} = \mathrm{vec}(\vec{\Lambda} \vec{x}^{\top})$, every update to $A$ is a rank-one matrix whose rows are multiples of $\vec{x}^{\top}$. Hence $A$ eventually only changes along the row space of $M$, that is, along directions spanned by the training inputs. When the optimal $A$ lies outside that span, no matter how $A$ is updated, the optimal point will never be reached.
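The rank-one structure of the update can be made concrete with a small sketch (the squared-error loss and all variable names here are my own choices, not from the article): the gradient of the loss with respect to $A$ is an outer product of the backpropagated vector and the input, so every row of the update is a multiple of $\vec{x}^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2, 5
A = rng.standard_normal((n, m))
x = rng.standard_normal(m)
y = rng.standard_normal(n)       # a target, just to form a loss

# Squared-error loss L = 0.5 * ||A x - y||^2; its gradient with respect
# to the layer output F = A x is the backpropagated vector Lambda.
lam = A @ x - y

# dL/dA = Lambda x^T: a rank-one matrix whose rows are multiples of x.
grad_A = np.outer(lam, x)

assert np.linalg.matrix_rank(grad_A) == 1
for i in range(n):
    assert np.allclose(grad_A[i], lam[i] * x)
```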

The above findings can help us improve the design of even simple networks. For example, consider a layer $G(\vec{x}) = A \vec{x}$, where $A$ is a $2 \times m$ matrix. Since every update to $A$ has rows proportional to some input $\vec{x}^{\top}$, when the $\vec{x}$s are almost collinear, $A$ can only move along that single shared direction. Hence it is quite difficult to achieve an effect like swapping two entries of $A$.
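A minimal numerical sketch of this limitation (the input values are arbitrary choices of mine): with nearly collinear training inputs, the directions available to any row of $A$ collapse to essentially one line, which can be seen from the singular values of the input matrix.

```python
import numpy as np

x0 = np.array([1.0, -1.0, 2.0, 0.5])

# Almost-collinear training inputs: scalar multiples of x0, each with a
# tiny fixed perturbation along one coordinate axis.
X = np.stack([c * x0 + 1e-3 * e
              for c, e in zip((1.0, 0.5, -1.0, 2.0), np.eye(4))])

# Every gradient update to a row of A is a multiple of one input, so all
# reachable changes of that row lie in the span of the rows of X.
s = np.linalg.svd(X, compute_uv=False)

# The second singular value is negligible: the inputs span essentially
# one direction, so a target like "swap two entries of A" is unreachable.
assert s[1] / s[0] < 1e-2
```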

To solve the problem, the effective dimension of the $\vec{x}$s must be increased, either directly, by adding more varied samples, or indirectly, by expanding the model to multiple layers. For example, define $$
F(\vec{x}) = A \, v( B \vec{x} ),
$$ where $v$ is some activation function. This is in fact two linear layers. $B \vec{x}$ first expands $\vec{x}$ into more varied values, then $v$ introduces more linear independence before the result is multiplied by $A$. From this result, ReLU is clearly a good choice of activation function, because it behaves far from a linear function and therefore produces higher-dimensional feature spaces, which lets the columns of $A$ combine in more varied ways.
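The dimension-raising effect of ReLU can be observed numerically. In the sketch below (the matrices are arbitrary choices of mine), the inputs are exactly collinear, yet the ReLU features $v(B\vec{x})$ span two directions:

```python
import numpy as np

# Collinear inputs: every column of X is a scalar multiple of x0,
# so X itself carries only one direction of information.
x0 = np.array([1.0, -1.0])
X = np.column_stack([c * x0 for c in (1.0, 2.0, -1.0, -2.0)])

# An arbitrary expansion matrix B for this sketch.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 1.0],
              [-1.0, 3.0]])

H = np.maximum(B @ X, 0.0)               # ReLU hidden features v(Bx)

assert np.linalg.matrix_rank(X) == 1     # raw inputs span one direction
assert np.linalg.matrix_rank(H) == 2     # ReLU features span two
```

The key is that ReLU treats positively and negatively scaled copies of the same input differently, so features that were scalar multiples of each other become linearly independent.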

On the contrary, if $v$ is a sigmoid function, it changes smoothly everywhere and behaves almost linearly near the origin. Hence its ability to add dimensions is low compared to ReLU.
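This contrast can also be checked numerically. In the sketch below (matrices again arbitrary choices of mine), collinear inputs are scaled down to sit near the origin, where $\mathrm{sigmoid}(z) \approx 0.5 + z/4$ is almost linear; the centered sigmoid features are then numerically almost rank one, while ReLU features of the same inputs keep a genuine second direction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Collinear inputs scaled to sit near the origin, where sigmoid is
# almost linear: sigmoid(z) ~ 0.5 + z / 4.
x0 = np.array([1.0, -1.0])
X = np.column_stack([c * x0 for c in (0.01, 0.02, -0.01, -0.02)])
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 1.0],
              [-1.0, 3.0]])            # an arbitrary expansion matrix

H_sig = sigmoid(B @ X) - 0.5           # centered sigmoid features
H_relu = np.maximum(B @ X, 0.0)        # ReLU features of the same inputs

s_sig = np.linalg.svd(H_sig, compute_uv=False)
s_relu = np.linalg.svd(H_relu, compute_uv=False)

# Sigmoid features are numerically almost rank one near the origin,
# while ReLU retains a substantial second singular value.
assert s_sig[1] / s_sig[0] < 1e-2
assert s_relu[1] / s_relu[0] > 0.1
```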
